smartsammler / pyconde2018

Build a modern data infrastructure

Data is the new oil 🛢️ So you want to build a data-driven company or create machine learning models? Guess what: you need a proper data infrastructure to support it! During this workshop you will learn how to create one using OSS tools and best practices that fit both batch and stream processing.

Description

During this tutorial you will learn how to create a scalable and reliable data infrastructure using OSS libraries.

Given the limited amount of time, we will focus mainly on the batch side, but some useful insights about stream processing and OLAP will be shared as well.

The first part will introduce the main concepts, while the second will focus on building things.

Tutorial outline

First part

Why bother?
Unified data warehouse
Stream VS Batch (Fast VS Slow data)
Kafka? I'd rather use Redis
Short term storage VS Long term storage
Airflow

Second part

Build a Python consumer
Use Airflow to move things around
Use Airflow to train machine learning models
What's next?
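
To give a feel for the "Build a Python consumer" step, here is a minimal sketch of reading from a Redis stream in batches. It is not the code used in the tutorial. The stream name `events`, the helper `consume_once` and the in-memory `FakeStreamClient` are hypothetical names made up for this illustration; the only real API assumed is redis-py's `xread(streams, count, block)`, which takes a `{stream: last_seen_id}` mapping and returns a list of `[stream_name, [(message_id, fields), ...]]` entries.

```python
from typing import Any, Dict, List, Optional, Tuple

Message = Tuple[Any, Dict[Any, Any]]


def consume_once(client, stream: str, last_id: str = "0",
                 count: int = 10, block_ms: Optional[int] = None):
    """Read one batch from a stream and return (messages, new_last_id).

    `client` is anything exposing an xread() method shaped like redis-py's.
    """
    response = client.xread({stream: last_id}, count=count, block=block_ms)
    messages: List[Message] = []
    for _stream_name, entries in response or []:
        for message_id, fields in entries:
            messages.append((message_id, fields))
            # Remember the newest ID so the next call resumes after it.
            last_id = message_id
    return messages, last_id


class FakeStreamClient:
    """Hypothetical in-memory stand-in for a Redis client (no server needed)."""

    def __init__(self, streams: Dict[str, List[Message]]):
        self._streams = streams

    @staticmethod
    def _key(message_id) -> Tuple[int, int]:
        # Stream IDs look like "<milliseconds>-<sequence>"; "0" means "0-0".
        parts = str(message_id).split("-")
        return int(parts[0]), int(parts[1]) if len(parts) > 1 else 0

    def xread(self, streams, count=None, block=None):
        result = []
        for name, last_id in streams.items():
            newer = [(i, f) for i, f in self._streams.get(name, [])
                     if self._key(i) > self._key(last_id)]
            if count is not None:
                newer = newer[:count]
            if newer:
                result.append([name, newer])
        return result


if __name__ == "__main__":
    fake = FakeStreamClient({"events": [("1-0", {"user": "a"}),
                                        ("2-0", {"user": "b"})]})
    batch, cursor = consume_once(fake, "events")
    for message_id, fields in batch:
        print(message_id, fields)
```

Against a real server, a client created with `redis.Redis(decode_responses=True)` could be passed instead of the fake one, and `block_ms` set so that the call waits for new messages rather than returning immediately.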

Tools that we are going to use

Redis streams
Scylla
Airflow
Docker
Pandas and Parquet
Python ❤️

Prerequisites

If I say "clone this repo locally and download these Docker images", you don't freak out
A good knowledge of Python
