Clone the repository:

```
git clone git@github.com:krithikabalu/CommunityPricingProject.git
```
## Prerequisites

- Install pipenv:

  ```
  pip install pipenv
  ```

- Set up the virtual environment and install dependencies (needed only once; takes a few minutes):

  ```
  python3 -m venv venv && source venv/bin/activate && pipenv install
  ```
## Start Cluster

- Run

  ```
  cluster/build-image.sh
  ```

  to build the hadoop image (run again only if there are updates to the image).

- Run

  ```
  cluster/start.sh
  ```

  to start the hadoop cluster.
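After the start script completes, it can help to confirm the cluster containers are actually running before moving on. A minimal sketch, assuming only that the master container is named `hadoop-master` (as it is in the `docker exec` commands later in this README):

```shell
# Sanity check: confirm the hadoop-master container is up.
# Assumption: cluster/start.sh names the master container "hadoop-master".
CLUSTER_CHECK="docker ps --filter name=hadoop-master --format {{.Names}}"
echo "Checking cluster with: $CLUSTER_CHECK"
if command -v docker >/dev/null 2>&1; then
    $CLUSTER_CHECK || echo "docker daemon not reachable" >&2
fi
```

If `hadoop-master` does not appear in the output, re-run `cluster/start.sh` before continuing.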
## Data Ingestion

- Run

  ```
  db/start.sh
  ```

  to start the postgres database.

- Optionally, to connect to the psql command line:

  ```
  psql -U postgres --host=localhost --dbname=pricing
  ```

- To dump the data into postgres and subsequently convert it to HDFS, run

  ```
  db/import.sh
  ```

- Run

  ```
  db/stop.sh
  ```

  from outside the container to stop the postgres database.
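After `db/import.sh` has run, a quick row count can confirm the dump reached postgres. A sketch under two assumptions: the database listens on the default port, and the table is named `product` (inferred from the `/user/root/product` HDFS path used later, so treat the name as a guess):

```shell
# Verify the import by counting rows in the (assumed) product table.
COUNT_SQL="SELECT COUNT(*) FROM product;"
echo "Verification query: $COUNT_SQL"
if command -v psql >/dev/null 2>&1; then
    # -w never prompts for a password; PGCONNECT_TIMEOUT keeps the
    # check from hanging if postgres is down.
    PGCONNECT_TIMEOUT=3 psql -w -U postgres --host=localhost --dbname=pricing \
        -c "$COUNT_SQL" || echo "postgres not reachable" >&2
fi
```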
## Data Processing

- Run

  ```
  ./run.sh
  ```

  to run the spark job and generate the output.

- To view the output:

  ```
  docker exec -it hadoop-master hadoop fs -copyToLocal /Output .
  docker cp hadoop-master:/root/Output .
  ```
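Of the two commands above, the first copies `/Output` from HDFS into the master container's local filesystem, and the second copies it from the container to the host. A small check that the copy reached the host (the directory name matches the commands above):

```shell
# The copy commands above should leave the job output in ./Output on the host.
OUTPUT_DIR="./Output"
if [ -d "$OUTPUT_DIR" ]; then
    ls "$OUTPUT_DIR"
else
    echo "$OUTPUT_DIR not found; run the copy commands above first"
fi
```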
## Data Visualization

- If you are outside the container, first run

  ```
  docker exec -it hadoop-master bash
  ```

- Change to the Hive home directory:

  ```
  cd $HIVE_HOME
  ```

- Run

  ```
  hive
  ```

  and execute the following statements sequentially:

  ```sql
  CREATE EXTERNAL TABLE product_hdfs (product_id int, description string, cost string, markup string)
  STORED AS AVRO
  LOCATION '/user/root/product';

  CREATE EXTERNAL TABLE product_es (id bigint, description string, cost float, markup float)
  STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
  TBLPROPERTIES('es.resource' = 'pricing/product', 'es.nodes' = 'elasticsearch');

  INSERT OVERWRITE TABLE product_es SELECT * FROM product_hdfs;
  ```

- Check the imported data in elastic search: http://localhost:9200/pricing/_search

- Finally, create the visualisation in Kibana: http://localhost:5601/
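Before building the Kibana visualisation, the document count in the index is a quick way to confirm the `INSERT` worked. A sketch assuming Elasticsearch is exposed on `localhost:9200`, as in the `_search` URL above (`_count` is the standard Elasticsearch count endpoint):

```shell
# Count the documents that the Hive INSERT pushed into the pricing index.
ES_COUNT_URL="http://localhost:9200/pricing/_count"
echo "Document count check: $ES_COUNT_URL"
if command -v curl >/dev/null 2>&1; then
    curl -s --max-time 3 "$ES_COUNT_URL" || echo "elasticsearch not reachable" >&2
fi
```

The returned `count` should match the number of rows in `product_hdfs`.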
## Errors/Resolutions

- `ERROR tool.ImportTool: Import failed: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://hadoop-master:9000/user/root/product already exists`

  Run

  ```
  hadoop fs -rm -r /user/root/product
  ```
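To avoid hitting this error at all, the cleanup can be made conditional so it is safe to run before every import. A sketch using `hadoop fs -test`, intended to run inside the hadoop-master container:

```shell
# Delete the HDFS output directory only when it actually exists,
# so the import can be re-run without the FileAlreadyExistsException.
TARGET_DIR="/user/root/product"
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -test -e "$TARGET_DIR" && hadoop fs -rm -r "$TARGET_DIR" || true
else
    echo "run inside the container: hadoop fs -rm -r $TARGET_DIR"
fi
```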