smartsammler / pyconde2018

Build a modern data infrastructure

Data is the new oil 🛢️ So you want to build a data-driven company or create machine learning models? Guess what: you need a proper data infrastructure to support it! During this workshop you will learn how to create one using OSS tools and best practices that fit both batch and stream processing.

Description

During this tutorial you will learn how to create a scalable and reliable data infrastructure using OSS libraries.

Given the limited amount of time, we will focus mainly on the batch side, but we will also share some useful insights about stream processing and OLAP.

The first part will introduce the main concepts, while the second will focus on building things.

Tutorial outline

First part

  • Why bother?
  • Unified data warehouse
  • Stream VS Batch (Fast VS Slow data)
  • Kafka? I'd rather use Redis
  • Short term storage VS Long term storage
  • Airflow

Second part

  • Build a Python consumer
  • Use Airflow to move things around
  • Use Airflow to train machine learning models
  • What's next?
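
To give a flavour of the "Build a Python consumer" step, here is a minimal sketch of reading from a Redis stream in batches. It is not the workshop's actual code: the function name `consume_batch`, its parameters, and the stream name used in the usage comment are hypothetical. It assumes a client object that exposes redis-py's `xread(streams, count=..., block=...)` method.

```python
from typing import Any, Dict, List, Tuple

# One stream entry as returned by XREAD: (entry id, field/value mapping).
Entry = Tuple[Any, Dict[Any, Any]]


def consume_batch(
    client: Any,
    stream: str,
    last_id: Any = "0-0",
    count: int = 100,
    block_ms: int = 1000,
) -> Tuple[List[Entry], Any]:
    """Read up to `count` entries newer than `last_id`.

    Returns the entries plus the cursor to pass to the next call.
    `client` is assumed to behave like a redis-py client (hypothetical setup).
    """
    # XREAD yields [[stream_name, [(entry_id, fields), ...]], ...],
    # or an empty result when nothing arrives within `block_ms`.
    response = client.xread({stream: last_id}, count=count, block=block_ms)

    entries: List[Entry] = []
    for _name, stream_entries in response or []:
        entries.extend(stream_entries)

    # Advance the cursor only if something was read.
    if entries:
        last_id = entries[-1][0]
    return entries, last_id


# Possible usage with redis-py (assumes a local Redis server and a stream
# named "events"; both are illustrative):
#
#   import redis
#   client = redis.Redis(decode_responses=True)
#   cursor = "0-0"
#   while True:
#       batch, cursor = consume_batch(client, "events", cursor)
#       for entry_id, fields in batch:
#           print(entry_id, fields)
```

Returning the cursor instead of storing it inside the function keeps the consumer stateless, so the caller decides where to persist the last processed ID between runs.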

Tools that we are going to use

  • Redis streams
  • Scylla
  • Airflow
  • Docker
  • Pandas and Parquet
  • Python ❤️

Prerequisites

  • If I say "clone this repo locally and download these Docker images", you don't freak out
  • A good knowledge of Python

Languages

  • Python 87.2%
  • Dockerfile 12.8%