
Workshop: Spark for Data Engineers

Workshop material for "Spark for Data Engineers"

Data analysts, data scientists, business intelligence analysts and many other roles require data on demand. Fighting with data silos, scattered databases, Excel files, CSV files, JSON files, APIs and potentially different flavours of cloud storage can be tedious, nerve-wracking and time-consuming.

An automated process that follows a defined set of steps and procedures, taking subsets of data, columns from databases and binary files, and merging them together to serve business needs, is and will remain a core task for many organizations and teams.

Apache Spark™ is designed to build faster and more reliable data pipelines. It covers both the low-level and structured APIs and brings tools and packages for streaming data, machine learning, data engineering and pipeline building, extending the Spark ecosystem.
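As a minimal sketch of the structured API in PySpark: the file name and column names below (sales.csv, region, amount) are illustrative assumptions, not part of the workshop material.

    from pyspark.sql import SparkSession, functions as F

    # Start a local Spark session for experimentation
    spark = (
        SparkSession.builder
        .appName("workshop-sketch")
        .master("local[*]")
        .getOrCreate()
    )

    # Read a CSV file with the structured API ('sales.csv' is a hypothetical file)
    sales = (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("sales.csv")
    )

    # A simple structured transformation: total amount per region
    revenue = (
        sales.groupBy("region")
        .agg(F.sum("amount").alias("total_amount"))
        .orderBy(F.desc("total_amount"))
    )

    revenue.show()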

Spark is an absolute winner for these tasks and a great choice for adoption.

Data engineering should have the breadth and capability to cover:

  • System architecture
  • Programming
  • Database design and configuration
  • Interface and sensor configuration

In addition, as important as familiarity with the technical tools is, the concepts of data architecture and pipeline design are even more important. The tools are worthless without a solid conceptual understanding of:

  • Data models
  • Relational and non-relational database design
  • Information flow
  • Query execution and optimisation
  • Comparative analysis of data stores
  • Logical operations

Apache Spark has all of this technology built in to cover these topics, and the capacity to assemble these pieces into functional systems that achieve concrete goals.
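As one concrete illustration of the query execution and optimisation point above, Spark's explain() exposes the plans Catalyst produces. This is a minimal, self-contained sketch with made-up data (the mode argument assumes Spark 3.x):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # A tiny in-memory dataset, purely for illustration
    sales = spark.createDataFrame(
        [("EMEA", 120.0), ("APAC", 80.0)],
        ["region", "amount"],
    )

    # Show the parsed, analysed, optimised and physical plans Catalyst produces
    sales.filter(F.col("amount") > 100).select("region").explain(mode="extended")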

Workshop Title: "Spark for Data Engineers"

Target Audience: Data engineers, BI engineers, cloud data engineers

Broader Audience: Analysts, BI analysts, big data analysts, DevOps data engineers, machine learning engineers, statisticians, data scientists, database administrators, data orchestrators, data architects

Prerequisite knowledge for attendees (data engineering tasks):

  • analyzing and organizing raw data (with T-SQL, Python, R, or Scala)
  • building data transformations and pipelines (with T-SQL, Python, R, or Scala), roughly at the level of the sketch after this list
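A PySpark sketch of the kind of transformation work assumed as a prerequisite; the dataset and column names are invented for illustration.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Invented raw data, including a missing timestamp to clean up
    orders = spark.createDataFrame(
        [(1, "2022-09-01 10:15:00", "shipped"),
         (2, None, "cancelled")],
        ["order_id", "order_ts", "status"],
    )

    orders_clean = (
        orders
        .dropna(subset=["order_ts"])                      # organise raw data
        .withColumn("order_date", F.to_date("order_ts"))  # derive a typed column
        .filter(F.col("status") == "shipped")             # basic filtering
    )

    orders_clean.show()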

Technical prerequisites for attendees:

  • a working laptop with the ability to install Apache Spark and other tools
  • access to the internet
  • credentials and credit (free credit) for accessing the Azure portal

DSP 2022 - September 12th & 13th 2022 - Agenda for two days (12.30 - 16.30 CEST; start and end times, as well as coffee breaks, can vary and will be finalised with the organizer)

September 12th 2022

  1. Module 1 (12.30 – 13.30): Getting to know Apache Spark, installation and setting up the environment
  2. Coffee Break 15'
  3. Module 2 (13.45 – 15.00): Creating datasets, organising raw data and working with structured APIs
  4. Coffee Break 15'
  5. Module 3 (15.15 – 16.30): Designing and building pipelines, moving data and building data models with Spark

September 13th 2022

  1. Module 3 (12.30 – 12.45): Designing and building pipelines, moving data and building data models with Spark
  2. Module 4 (12.45 – 14.00): Data and process orchestration, deployment and Spark Applications
  3. Coffee Break 15'
  4. Module 5 (14.15 – 15.15): Data Streaming with Spark
  5. Break 5'
  6. Module 6 (15.20 – 16.10): Ecosystem, tooling and community

All modules have hands-on material that will be given to attendees at the beginning of the training.
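For a taste of that material, here is a minimal Structured Streaming sketch of the kind Module 5 covers, using Spark's built-in rate source so it runs without any external input; all names and numbers are illustrative.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # The built-in 'rate' source emits timestamped rows; handy for demos
    stream = (
        spark.readStream
        .format("rate")
        .option("rowsPerSecond", 5)
        .load()
    )

    # Count events in 10-second windows
    counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

    # Print each updated result table to the console
    query = (
        counts.writeStream
        .outputMode("complete")
        .format("console")
        .start()
    )
    query.awaitTermination(30)  # let the demo run for ~30 seconds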
