houqp / aws-concurrent-data-orchestration-pipeline-emr-livy

This code demonstrates the architecture featured on the AWS Big Data blog (https://aws.amazon.com/blogs/big-data/), which builds a concurrent data pipeline using Amazon EMR and Apache Livy. The pipeline is orchestrated by Apache Airflow.

AWS Concurrent Data Orchestration Pipeline EMR Livy

Description of the project folders

cloudformation

This folder contains the CloudFormation template that spins up the Airflow infrastructure.

dags/airflowlib

This folder contains reusable code for Amazon EMR and Apache Livy.
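As a rough illustration of what reusable Livy code can look like, the sketch below builds the two HTTP requests a client typically sends to Livy's REST API: one to open an interactive session and one to submit a code statement to it. The function names, the `example-master` host, and the use of `urllib` are assumptions made for this example; they are not taken from `dags/airflowlib`. The endpoints (`/sessions`, `/sessions/{id}/statements`) and the default port 8998 come from the Livy REST API.

```python
import json
import urllib.request

LIVY_PORT = 8998  # Livy's default port


def livy_url(master_dns, path):
    """Build a Livy endpoint URL for the given EMR master node."""
    return f"http://{master_dns}:{LIVY_PORT}{path}"


def build_request(url, payload):
    """Build (but do not send) a JSON POST request."""
    data = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def create_session_request(master_dns, kind="spark"):
    """Request that opens an interactive Livy session (hypothetical helper)."""
    return build_request(livy_url(master_dns, "/sessions"), {"kind": kind})


def submit_statement_request(master_dns, session_id, code):
    """Request that runs a code snippet inside an existing session (hypothetical helper)."""
    path = f"/sessions/{session_id}/statements"
    return build_request(livy_url(master_dns, path), {"code": code})


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the requests needs no running cluster; only send() touches the network.
session_request = create_session_request("example-master")
statement_request = submit_statement_request("example-master", 0, "1 + 1")
print(session_request.full_url)
print(statement_request.full_url)
```

Keeping request construction separate from `send` lets the payloads be inspected or unit-tested without an EMR cluster.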

dags/transform

This folder contains sample Scala transformation code that converts the MovieLens data files from CSV to Parquet.

dags/movielens_dag.py

This script contains the DAG definition, which defines the Airflow pipeline.
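To make the idea of a concurrent pipeline concrete, here is a minimal sketch of a fan-out/fan-in dependency shape, written in plain Python with the standard-library `graphlib` module (Python 3.9+) rather than Airflow. The task names are hypothetical and the actual DAG in `dags/movielens_dag.py` may be structured differently; the sketch only shows how independent transform tasks end up in the same stage and can therefore run at the same time.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; the real DAG may use different ones.
TRANSFORMS = ["transform_movies", "transform_ratings", "transform_tags"]


def build_dependencies(transforms):
    """Map each task to the set of tasks it must wait for."""
    deps = {"create_cluster": set()}
    for task in transforms:
        deps[task] = {"create_cluster"}
    deps["terminate_cluster"] = set(transforms)
    return deps


def concurrent_stages(deps):
    """Group tasks into stages; tasks in one stage do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = concurrent_stages(build_dependencies(TRANSFORMS))
for number, tasks in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(tasks)}")
```

In Airflow the same shape is usually declared with the `>>` operator between operators, and the scheduler decides which ready tasks to run in parallel.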

License

This library is licensed under the Apache 2.0 License.

Languages

Python 90.3%, Scala 9.7%