KaterynaD / aws_data_pipeline_samples

A few AWS Data Pipeline samples that demonstrate exporting data from MS SQL to a file in an S3 bucket, loading a DynamoDB table into Redshift, and handling multiple dependencies in a flow.

AWS Data Pipeline Samples

MsSqlRdsToS3Template is a template that connects to AWS RDS MS SQL and exports data to a file in an S3 bucket. To run this template, you need to upload the sqljdbc4.jar driver to an S3 bucket. The driver can be found here: https://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
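
The driver can be uploaded with the AWS CLI; a minimal sketch (the bucket name and key below are placeholders for your own):


    $> aws s3 cp sqljdbc4.jar s3://your-bucket/drivers/sqljdbc4.jar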

DynamoDBTableToRedshiftTemplate loads data from a DynamoDB table to a Redshift table.

MutilpleDependencies is an example where Action1 and Action2 can run in parallel, but Action3 must wait for the completion of both actions.
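
In a pipeline definition, this kind of fan-in is expressed with the dependsOn field. The fragment below is a minimal illustrative sketch, not the contents of the sample file: the ShellCommandActivity objects, their commands, and the MyEc2Resource reference are placeholders, and the Default and Ec2Resource objects are omitted.


    {
      "objects": [
        {
          "id": "Action1",
          "name": "Action1",
          "type": "ShellCommandActivity",
          "command": "echo action1",
          "runsOn": { "ref": "MyEc2Resource" }
        },
        {
          "id": "Action2",
          "name": "Action2",
          "type": "ShellCommandActivity",
          "command": "echo action2",
          "runsOn": { "ref": "MyEc2Resource" }
        },
        {
          "id": "Action3",
          "name": "Action3",
          "type": "ShellCommandActivity",
          "command": "echo action3",
          "runsOn": { "ref": "MyEc2Resource" },
          "dependsOn": [ { "ref": "Action1" }, { "ref": "Action2" } ]
        }
      ]
    }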

SupportDW_Create is an example of building a Support Data Warehouse from historical files saved in an S3 bucket. The process loads the D_Calendar, D_Priorities, Products, Analysts, Cases and Logs tables in parallel. At the next stage, the D_Products dimensional table is created with the products hierarchy flattened in the load SQL script, and the D_Analysts dimensional table is created with the analysts' historical data loaded as a slowly changing dimension type 2. F_Cases (the fact table) is created and loaded at the last stage, when the dimensional data are ready (surrogate keys are used) and the logs are available to calculate the time a case spent in each status.

SupportDW_Update is similar to SupportDW_Create, but at the first stage the data are loaded from the application tables in MS SQL and the log data are loaded from a DynamoDB table.

Installation

  1. If you have never used AWS Data Pipeline before, you need to create the default AWS IAM roles required to run the samples. You can do this with the AWS CLI.


    $> aws datapipeline create-default-roles

  2. Create the pipelineId by calling the aws datapipeline create-pipeline command.


    $> aws datapipeline create-pipeline --name MsSqlRdsToS3Template --unique-id MsSqlRdsToS3Template

    You will receive a pipelineId like this:


    {
        "pipelineId": "df-078827623PVY9KS3XNLM"
    }

  3. Download the MsSqlRdsToS3Template.json sample pipeline definition and adjust the parameter values in the file to your environment. Alternatively, you can provide your parameter values in the aws datapipeline put-pipeline-definition command (see the sketch after this list).
  4. Upload and validate your pipeline definition by calling the aws datapipeline put-pipeline-definition command.


    $> aws datapipeline put-pipeline-definition --pipeline-id df-078827623PVY9KS3XNLM --pipeline-definition file://MsSqlRdsToS3Template.json

    If your pipeline definition is valid, you will receive a message like this. Otherwise, correct the file and repeat the command.

    {
        "validationErrors": [],
        "errored": false,
        "validationWarnings": []
    }

  5. Activate the pipeline by calling the aws datapipeline activate-pipeline command. This will cause the pipeline to start running.


    $> aws datapipeline activate-pipeline --pipeline-id df-078827623PVY9KS3XNLM

    This command produces no output.

  6. Check the status of your pipeline by calling the aws datapipeline list-runs command.


    $> aws datapipeline list-runs --pipeline-id df-078827623PVY9KS3XNLM
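
As an alternative to editing the parameter values in the JSON file (step 3), they can be passed on the command line with --parameter-values. This is a minimal sketch with hypothetical parameter ids: myRdsUsername, myRdsPassword and myOutputS3Path are placeholders, so substitute the ids declared in the template's parameters section and your own values.


    $> aws datapipeline put-pipeline-definition --pipeline-id df-078827623PVY9KS3XNLM \
       --pipeline-definition file://MsSqlRdsToS3Template.json \
       --parameter-values myRdsUsername=my_user myRdsPassword=my_password myOutputS3Path=s3://your-bucket/export/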
