Set Up:
- Spawn an AWS EMR 6.3.1 cluster - the big data stack should include Spark.
- Spawn an m5.xlarge / m5.2xlarge Ubuntu 18.04 VM on AWS.
Prerequisites Setup on AWS EMR:
- ssh to the AWS EMR cluster master node using 'ssh -i xx.pem hadoop@ip-address'
- Copy pipeline_prerequisites.sh to this node, as well as the private key for the remote server where the CSVs are located.
- Run 'chmod +x pipeline_prerequisites.sh'
- Run './pipeline_prerequisites.sh'
- Change the driver memory to 15 GB in '/usr/lib/spark/conf/spark-defaults.conf' to avoid OOM errors.
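  The relevant setting, assuming the stock EMR spark-defaults.conf (tune the value to your instance type):

    spark.driver.memory    15g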
Prerequisites to set up the Elasticsearch DB and configure the table schema on the m5.xlarge / m5.2xlarge VM (an optional index-mapping sketch follows these steps):
- ssh to VM
- Copy 'elastic.sh' to the VM
- Run 'chmod +x elastic.sh'
- Run './elastic.sh'
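If you want to create the index and mapping by hand instead of (or in addition to) 'elastic.sh', a minimal sketch using the elasticsearch Python client (7.x) is shown below; the index name matches the es.resource used later, but the field names are illustrative assumptions, not the actual schema.

  # create_index.py - sketch only; field names below are illustrative assumptions
  from elasticsearch import Elasticsearch

  es = Elasticsearch(["http://x.x.x.x:9200"])  # public IP of the ES node

  mapping = {
      "mappings": {
          "properties": {
              "date": {"type": "date"},
              "headline": {"type": "text"},
              "text": {"type": "text"},
              "region": {"type": "keyword"},  # field used for shard routing
          }
      }
  }

  # Create the index only if it does not already exist
  if not es.indices.exists(index="thomreuters"):
      es.indices.create(index="thomreuters", body=mapping)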
Airflow Scheduling:
- 'airflow_dag.py' automates fetching the tar files from the remote server, untarring them, and running the PySpark application on a @daily schedule (a minimal DAG sketch follows the steps below).
- Make sure sqlite3 is at version > 3.15.0; the default AWS EMR image may ship an older sqlite version.
- Steps to Run:
Copy airflow_dag.py and file.py / completeCSVetlFile.py to the AWS EMR cluster at '/home/hadoop/airflow_dag.py' and '/home/hadoop/file.py'.
Set these variables: ip = 'ip-address-remote-server', pvt_key_name = '/location/to/private-key.pem', user = 'username', fileName = '2013-07-01.csv.gz'
Run the file with 'python airflow_dag.py'
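For reference, here is a stripped-down sketch of such a DAG (Airflow 1.10-style imports; the DAG id, task ids, and shell commands are assumptions for illustration - the real airflow_dag.py may differ):

  # airflow_dag sketch - minimal daily fetch -> extract -> spark-submit pipeline
  from datetime import datetime
  from airflow import DAG
  from airflow.operators.bash_operator import BashOperator

  ip = 'ip-address-remote-server'
  pvt_key_name = '/location/to/private-key.pem'
  user = 'username'
  fileName = '2013-07-01.csv.gz'

  dag = DAG(
      'reuters_etl',                    # hypothetical DAG id
      start_date=datetime(2021, 1, 1),
      schedule_interval='@daily',
      catchup=False,
  )

  # Fetch the archive from the remote server
  fetch = BashOperator(
      task_id='fetch_csv',
      bash_command=f'scp -i {pvt_key_name} {user}@{ip}:~/{fileName} /home/hadoop/',
      dag=dag,
  )

  # Decompress the fetched file
  extract = BashOperator(
      task_id='extract_csv',
      bash_command=f'gunzip -f /home/hadoop/{fileName}',
      dag=dag,
  )

  # Run the PySpark application on the extracted CSV
  run_spark = BashOperator(
      task_id='run_spark_job',
      bash_command='spark-submit --master yarn --deploy-mode client /home/hadoop/file.py',
      dag=dag,
  )

  fetch >> extract >> run_spark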
Pyspark application: 'file.py' processes the partial CSV and 'completeCSVetlFile.py' processes the complete CSV. Both expect the jar to be untarred and sitting in the same folder.
To run either:
- Copy this file to AWS EMR master node at '/home/hadoop/file.py'.
- Set up these variables:
  "es.nodes", "x.x.x.x"        // public IP of the ES node
  "es.port", "9200"
  "es.resource", "thomreuters/2013-07-01"
- CSV file name - '2013-07-01.csv' on line 18
- Run 'spark-submit --master yarn --deploy-mode client file.py' (see the write-step sketch below)
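For context, these options are what the elasticsearch-hadoop (elasticsearch-spark) connector consumes when the final DataFrame is written out; a sketch of that write step, using a placeholder DataFrame name 'result_df', looks like this:

  # Write-step sketch - assumes the elasticsearch-spark jar is on the classpath
  (result_df.write
      .format('org.elasticsearch.spark.sql')
      .option('es.nodes', 'x.x.x.x')                    # public IP of the ES node
      .option('es.port', '9200')
      .option('es.resource', 'thomreuters/2013-07-01')
      .mode('append')
      .save())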
This PySpark file does the following (a simplified sketch follows the list):
- It reads the CSV file and, using a regex, records the indices where the <"date","time"> pattern matches.
- It divides the CSV file into text blobs based on the above indices and converts them into a partitioned DataFrame.
- The partitioned DataFrame goes through a UDF parser, which parses each text blob and converts it into a structured hash (19 fields + 1 field for the shard-routing region).
- The partitioned DataFrames are brought back to the driver, where the "headline" and "text" fields are converted to English using Spark NLP.
- The resultant DataFrame is saved to Elasticsearch under the thomreuters index.
This application takes time to run, so please be patient!
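The split-and-parse steps are easier to picture with a concrete example; the sketch below uses a simplified <date,time> regex and a two-field parser purely for illustration (the real file.py emits 19 fields plus the shard-routing region and uses its own pattern):

  # Simplified sketch of the split-and-parse idea; pattern and output fields are illustrative
  import re
  from pyspark.sql import SparkSession, Row
  from pyspark.sql.functions import udf
  from pyspark.sql.types import StructType, StructField, StringType

  spark = SparkSession.builder.appName('csv-blob-parse-sketch').getOrCreate()

  raw = open('2013-07-01.csv').read()

  # Record the indices where a <date,time> pattern starts a new record
  pattern = re.compile(r'\d{4}-\d{2}-\d{2},\d{2}:\d{2}:\d{2}')
  starts = [m.start() for m in pattern.finditer(raw)]

  # Slice the raw text into one blob per record and parallelize into a DataFrame
  blobs = [raw[s:e] for s, e in zip(starts, starts[1:] + [len(raw)])]
  blob_df = spark.createDataFrame([Row(blob=b) for b in blobs])

  # UDF parser: turn each text blob into structured fields
  out_schema = StructType([
      StructField('timestamp', StringType()),
      StructField('body', StringType()),
  ])

  @udf(returnType=out_schema)
  def parse_blob(blob):
      date_time = pattern.search(blob).group(0)
      return (date_time, blob)

  parsed_df = blob_df.withColumn('parsed', parse_blob('blob')).select('parsed.*')
  parsed_df.show(5, truncate=False)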