Live Stream: https://youtu.be/9ZQxvhdOTlA

PySpark is the Python API for Apache Spark, a distributed data processing engine widely used in Data Engineering and Data Science. Another way to think of PySpark is as a library for processing large amounts of data on a single machine or a cluster of machines. We will go through the basic concepts and operations so that you leave the workshop ready to continue learning on your own.
Workshop steps:
- Introduction: Motivation, intro to parallel data processing, Spark's main concepts (transformations versus actions, DataFrames versus RDDs), and overall architecture, focusing on Spark SQL.
- Setup environment: There are two ways to run the notebook with the exercises. The first is to create an account on Databricks Community Edition and clone the notebook. The alternative is to run the notebook locally, as described in the instructions.
- Exercises: A series of exercises covering Spark's main transformations (filter, select, groupBy) and ways to visualize them. Participants get some time to complete each exercise, and then we solve it together interactively.
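
As a preview of the "transformations versus actions" idea from the introduction, here is a toy model in plain Python. It is only a sketch of lazy evaluation, not Spark's implementation: `LazyDataset`, `is_adult`, and the sample rows are hypothetical names invented for this illustration.

```python
# Toy model of lazy evaluation. LazyDataset is a hypothetical class
# written for this illustration; it is not part of PySpark.
class LazyDataset:
    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded transformations, not yet executed

    # Transformations: return a new dataset with a longer plan.
    # No row is read or processed here.
    def filter(self, predicate):
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        return LazyDataset(self._rows, self._plan + (("select", columns),))

    # Actions: execute the recorded plan and return concrete results.
    def collect(self):
        rows = self._rows
        for kind, arg in self._plan:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:  # "select"
                rows = [{c: r[c] for c in arg} for r in rows]
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []  # records every row the predicate actually inspects


def is_adult(row):
    calls.append(row["name"])
    return row["age"] >= 18


people = [
    {"name": "Ana", "age": 31},
    {"name": "Ben", "age": 12},
    {"name": "Cleo", "age": 45},
]

# Building the pipeline only records the steps.
adults = LazyDataset(people).filter(is_adult).select("name")
evaluated_before_action = len(calls)

# The action triggers the actual work.
result = adults.collect()
```

In PySpark the same split applies: DataFrame methods such as `filter`, `select`, and `groupBy` are transformations that only describe the computation, while methods such as `collect`, `count`, and `show` are actions that trigger it.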
Speakers: Natalia Pipas