sunling / Big-Data-Facebook-Posts-Analyzer

This is a course project for Big Data Technologies: analyzing streaming Facebook posts data.

Facebook Posts Analyzer

Facebook Posts Analysis - Inspirations

- For a single account, what are the most common words used across all of the posts?
- For a single account, how many posts were posted each year?
- For a single account, what are the most commented posts?
- What are the most liked posts?
- Who are my 'best friends'?

Development Environment

- OS: macOS High Sierra 10.13.4
- Hadoop 2.6.5
- Apache Spark 2.6
- Hive 2.3.1
- Python 3.6.4
- Kafka 1.1.0

Install Python libraries

pip install facebook-sdk
pip install pandas
pip install plotly
pip install matplotlib
pip install pyspark
pip install numpy
pip install kafka-python

Data Flow

- Using the Facebook Graph API over HTTP to request real-time data
- Receiving and parsing the data
- Sending the data to Kafka
- Using Spark Streaming to read the data from Kafka
- Saving the data to Hive
- Using Spark SQL to query and analyze the data
- Using Plotly to visualize the results (a sketch of the last two steps follows this list)
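
For example, the query-and-visualize steps might look roughly like the minimal sketch below. It assumes the streamed posts land in a Hive table named posts with message and comments_count columns; all of these names are placeholders, since the actual schema depends on how BGConsumer.py stores the parsed posts.

from pyspark import SparkContext
from pyspark.sql import HiveContext
import plotly.offline as py
import plotly.graph_objs as go

# query Hive for the ten most-commented posts (table and column names are placeholders)
sc = SparkContext(appName="PostsAnalysis")
hc = HiveContext(sc)
top = hc.sql("SELECT message, comments_count FROM posts "
             "ORDER BY comments_count DESC LIMIT 10").toPandas()

# draw a bar chart and write it to a standalone HTML file
fig = go.Figure(data=[go.Bar(x=top["message"], y=top["comments_count"])])
py.plot(fig, filename="top_commented_posts.html")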

How does it work

1. First, make sure all the components are correctly installed and started, focusing mainly on Kafka and Hive.
- Create a topic to test that Kafka is working:
#create a topic (kafka-topics talks to ZooKeeper, default port 2181)
bash /usr/local/Cellar/kafka/1.1.0/bin/kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic bgposts
#read msg
/usr/local/Cellar/kafka/1.1.0/bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic bgposts
#produce msg
/usr/local/Cellar/kafka/1.1.0/bin/kafka-console-producer --broker-list localhost:9092 --topic bgposts
- To test that Hive is working, type 'hive' in a terminal window, check that it starts properly, then run 'show tables' to verify it lists the tables in the default database:
#start the hive CLI, then list tables in the default database
hive
hive> show tables;

2. The Kafka topic used here is 'bgposts'. Run the BGConsumer.py script.
While it is running, it reads data from Kafka. The Spark Streaming batch interval is set to 10 seconds, and each interval's result is stored in Hive for further analysis. In the meantime, we extract data from Hive, run further analysis with Spark SQL, and visualize the final result with Plotly. In the main program, Spark SQL analyzes the historical data and updates the visualization every 10 seconds with the latest data.
The main program BGConsumer.py needs to be launched with spark-submit:
$ spark-submit --jars /usr/local/Cellar/kafka/1.1.0/libexec/libs/spark-streaming-kafka-assembly_2.11-1.6.1.jar /..../BGConsumer.py
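
Internally, the consumer could look roughly like this minimal sketch. It assumes each Kafka message value is a flat JSON-encoded post and appends each batch to a hypothetical Hive table named posts; the actual BGConsumer.py may structure this differently.

import json
from pyspark import SparkContext
from pyspark.sql import HiveContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="BGConsumer")
hc = HiveContext(sc)
ssc = StreamingContext(sc, 10)  # 10-second batch interval, as described above

# read the 'bgposts' topic directly from the Kafka broker
stream = KafkaUtils.createDirectStream(
    ssc, ["bgposts"], {"metadata.broker.list": "localhost:9092"})

def save_batch(time, rdd):
    # each Kafka record is a (key, value) pair; the value is a JSON-encoded post
    if not rdd.isEmpty():
        rows = rdd.map(lambda kv: Row(**json.loads(kv[1])))
        hc.createDataFrame(rows).write.mode("append").saveAsTable("posts")

stream.foreachRDD(save_batch)
ssc.start()
ssc.awaitTermination()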

3. Run the BGProducer.py script; it requests data from Facebook and sends it to Kafka.
$ python code/BGProducer.py
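
A minimal sketch of the producer side, assuming the facebook-sdk and kafka-python packages; the access token is a placeholder, and the actual BGProducer.py may request different fields or paginate through more posts.

import json
import facebook  # provided by the facebook-sdk package
from kafka import KafkaProducer

# serialize each post dict as JSON before sending it to Kafka
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

# request the account's posts from the Graph API (placeholder access token)
graph = facebook.GraphAPI(access_token="YOUR_ACCESS_TOKEN")
posts = graph.get_connections("me", "posts")

# forward every post to the 'bgposts' topic
for post in posts["data"]:
    producer.send("bgposts", post)
producer.flush()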

4. Visualization Results
The generated charts are saved in the output directory.
