cr21 / GithubArchiveIngestor

AWS Lambda to ingest data incrementally from the GitHub Archive website

GithubArchiveIngestor

  • Demonstration of AWS Lambda functions deployed from a container image. The functions perform the tasks below.

  • High-level overview of the Lambda functions:

  1. The Lambda function Ghactivity_ingestor gets data from GHArchive and saves the JSON files to S3 under landing/ghactivity.
  2. The Lambda function Ghactivity_transfomer is triggered automatically by an S3 PUT event on landing/ghactivity.
  3. Ghactivity_transfomer reads the JSON files, converts them to Parquet, and stores the result in S3 under raw/ghactivity/.
  4. A Glue crawler crawls the incremental data into an Athena table, which can then be used for ad hoc queries.
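
The hand-off between steps 1 to 3 can be sketched with a few pure helper functions. This is a minimal sketch, not the repository's code: it assumes GH Archive serves hourly files at https://data.gharchive.org/ named YYYY-MM-DD-H.json.gz (hour not zero-padded), and the helper names (next_file_name, source_url, landing_key, keys_from_s3_event, raw_key) are hypothetical. Downloading, uploading, and the Parquet conversion itself are left out.

```python
from datetime import datetime, timedelta
from urllib.parse import unquote_plus

# Assumption: GH Archive publishes one file per hour under this base URL.
BASE_URL = "https://data.gharchive.org"


def next_file_name(last_file_name: str) -> str:
    """Return the hourly file that follows the last one ingested."""
    stamp = last_file_name.split(".")[0]  # e.g. "2021-01-29-23"
    last_hour = datetime.strptime(stamp, "%Y-%m-%d-%H")
    nxt = last_hour + timedelta(hours=1)
    # The hour is written without zero padding in GH Archive file names.
    return f"{nxt:%Y-%m-%d}-{nxt.hour}.json.gz"


def source_url(file_name: str) -> str:
    """URL the ingestor would download (step 1)."""
    return f"{BASE_URL}/{file_name}"


def landing_key(file_name: str, folder: str = "landing/ghactivity") -> str:
    """S3 key the ingestor would write to (step 1)."""
    return f"{folder}/{file_name}"


def keys_from_s3_event(event: dict) -> list:
    """Extract (bucket, key) pairs from an S3 PUT event (step 2).

    Object keys arrive URL-encoded in S3 notifications, so decode them.
    """
    return [
        (rec["s3"]["bucket"]["name"], unquote_plus(rec["s3"]["object"]["key"]))
        for rec in event.get("Records", [])
    ]


def raw_key(source_key: str, tgt_folder: str = "raw/ghactivity") -> str:
    """Map a landing key to the Parquet key written in step 3."""
    file_name = source_key.rsplit("/", 1)[-1]
    stem = file_name.split(".")[0]
    return f"{tgt_folder}/{stem}.parquet"
```

Keeping the naming logic separate from the network and S3 calls means the incremental behaviour (which file comes next, where it lands) can be checked locally before the container image is deployed.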

  • Create the Lambda function in the AWS console and set the following environment variables as needed:
    
    
    BUCKET_NAME: <YOUR_VAL>
    FOLDER: <YOUR_VAL>
    JOB_ID: <YOUR_VAL>
    SOURCE_FOLDER: <YOUR_VAL>
    TGT_FOLDER: <YOUR_VAL>
    JOB_ID_1: <YOUR_VAL>
    PYTHONPATH: /var/task/app
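
Inside a handler, these variables could be read and validated once at start-up. The following is a minimal sketch under assumptions: the helper name load_settings is hypothetical, and it treats all six variables as required, whereas each function in the repository may only need a subset. PYTHONPATH is consumed by the Python runtime itself and is not read here.

```python
import os
from typing import Mapping

# Variable names taken from the list above; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_VARS = (
    "BUCKET_NAME",
    "FOLDER",
    "JOB_ID",
    "SOURCE_FOLDER",
    "TGT_FOLDER",
    "JOB_ID_1",
)


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return the configured values, failing early if any are unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing with the full list of missing names makes a misconfigured function easier to diagnose from its first log line than a KeyError raised midway through a run.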
    
    

About

License: MIT License


Languages

Python 58.5%, Shell 38.6%, Dockerfile 2.9%