Spark Streaming Project with Hadoop, HBase, Spark Streaming, Spark SQL + Hive
(See “How to build and run” on the last page)
Twitter is a well-known social network. This project aims to extract insights from Twitter feeds for research purposes.
Have you ever wondered how people talk on Twitter about what you are thinking of?
Your thoughts may be about Bitcoin, Trump, the upcoming Christmas, or TM.
How many people talk about them on Twitter per second?
How often is “Trump” mentioned compared to “Bitcoin” on Twitter every minute?
The Twitter API platform offers options for streaming real-time Tweets.
Unzip the attachment; under the root folder you can find the commands, the HiveQL script, and the source code of all applications.
You can also get the source code, with the latest updates, from GitHub at https://github.com/binhtv/twitter-streaming
- Commands (see the “How to build and run” section)
- HiveQL script (see the “How to build and run” section)
- tweet folder (tweet app)
Reads real-time tweets from Twitter using the Twitter4J library
An Apache Kafka producer takes the tweets and feeds them into a Kafka topic
- spark_streaming_eg folder: contains the source code for the streaming jobs
Uses KafkaUtils to create a direct stream and subscribe to the Kafka topic, consuming data from the Kafka producer (tweet app)
- spark_streaming_visualization folder
Contains the source code for reading data from HBase (with a Hive table on top) via Spark SQL for visualization
The data is then broadcast to the web client app for visualization
- visualization_client folder
Contains the source code for the web client app, a simple web application using Node.js, PubNub and Highcharts
PubNub subscribes to a channel to receive the data published on it
Highcharts is a JavaScript library for visualizing data with various types of charts and graphs
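To make the pipeline concrete, here is a minimal plain-Java sketch (no Spark or Kafka dependencies) of the core transformation the streaming job performs on each micro-batch: counting how many tweets in the batch mention each tracked keyword. The class and method names are illustrative only and do not come from the project's source.

```java
import java.util.*;

// Illustrative sketch of the per-batch keyword count the streaming job
// computes; in the real job this runs over an RDD of tweets from Kafka.
public class KeywordCountSketch {
    public static Map<String, Integer> countKeywords(List<String> tweets, List<String> keywords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String kw : keywords) counts.put(kw, 0);
        for (String tweet : tweets) {
            String lower = tweet.toLowerCase();
            for (String kw : keywords) {
                // Count a tweet once per keyword it mentions
                if (lower.contains(kw.toLowerCase())) {
                    counts.merge(kw, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList(
                "Bitcoin hits a new high", "Trump tweets about bitcoin", "Snow in Boston");
        System.out.println(countKeywords(batch, Arrays.asList("trump", "bitcoin", "snow")));
    }
}
```

In the actual job the same idea is expressed as Spark transformations over the direct stream, with the resulting counts written to HBase.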
Make sure the latest Kafka service is installed (you can download it at https://kafka.apache.org/) and start it manually.
You also need to install the ZooKeeper service; you can skip this step if you reuse the ZooKeeper instance from another service such as HBase or Hadoop.
Make sure Hadoop, HBase (master & region server) and Hive are installed and working properly on your machine.
Make sure the latest Node.js is installed on your machine.
$ bin/kafka-server-start.sh config/server.properties
$ bin/kafka-topics.sh --create --topic sparktest --partitions 1 --replication-factor 1 --zookeeper localhost:2181
$ sudo cp /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/
Execute the scripts in hive_tweet.sql.
After running them, you should see the following in Hue:
From the HBase shell, you can also see the table definitions with describe 'table_name'.
$ mvn clean install
$ java -cp target/tweet-0.0.1-SNAPSHOT.jar producer.tweet.App 500 trump,bitcoin,football,snow,rain,soccer,winter,iphone
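The tweet app is launched with two arguments: a number (500) and a comma-separated keyword list. The sketch below shows how such arguments could be parsed; note that the role of the first argument is an assumption here (treated as a generic numeric limit), so check producer.tweet.App for its actual meaning.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical argument parsing for the tweet app's command line; names and
// the interpretation of the numeric argument are assumptions, not the
// project's actual code.
public class ArgsSketch {
    public static List<String> parseKeywords(String csv) {
        // Split the comma-separated list into individual track terms
        return Arrays.asList(csv.split(","));
    }

    public static void main(String[] args) {
        int limit = Integer.parseInt("500"); // assumed numeric limit/rate
        List<String> keywords = parseKeywords("trump,bitcoin,football,snow");
        System.out.println(limit + " " + keywords);
    }
}
```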
$ mvn clean install
$ spark-submit --class "c523.spark_streaming_eg.SparkStreaming" --master local target/spark_streaming_eg-0.0.1-SNAPSHOT.jar localhost:9092 trump,bitcoin,football,snow,iphone
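The streaming job receives the Kafka broker address (localhost:9092) and a comma-separated list as arguments. Below is a sketch of the inputs a KafkaUtils direct stream typically needs, built from such arguments. The parameter key "metadata.broker.list" matches the Kafka 0.8-style direct-stream API; whether the comma-separated argument holds topics or keywords is an assumption to verify against the project's SparkStreaming class.

```java
import java.util.*;

// Sketch of the Kafka parameters and topic set a direct stream is created
// with; plain Java only, no Spark dependency.
public class StreamParamsSketch {
    public static Map<String, String> kafkaParams(String brokers) {
        Map<String, String> params = new HashMap<>();
        // Key used by the Kafka 0.8 direct-stream API (verify for your version)
        params.put("metadata.broker.list", brokers);
        return params;
    }

    public static Set<String> topicSet(String csv) {
        return new HashSet<>(Arrays.asList(csv.split(",")));
    }

    public static void main(String[] args) {
        System.out.println(kafkaParams("localhost:9092"));
        System.out.println(topicSet("sparktest"));
    }
}
```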
$ mvn clean install
$ spark-submit --class "c523.spark_streaming_eg.SparkSql" --master local target/spark_streaming_visualization-0.0.1-SNAPSHOT.jar
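The visualization job answers questions like “how often is a keyword mentioned per minute.” As a plain-Java illustration of that aggregation, the sketch below buckets tweet timestamps (epoch milliseconds) into minutes and counts them, producing the kind of per-minute series Highcharts can plot. The real job expresses the equivalent as a Spark SQL query over the Hive table; this class is illustrative only.

```java
import java.util.*;

// Illustrative per-minute bucketing of mention timestamps; the actual job
// computes this with Spark SQL over HBase/Hive data.
public class MinuteBucketSketch {
    public static Map<Long, Integer> mentionsPerMinute(List<Long> timestamps) {
        Map<Long, Integer> buckets = new TreeMap<>();
        for (long ts : timestamps) {
            long minute = ts / 60_000L; // truncate epoch millis to a minute bucket
            buckets.merge(minute, 1, Integer::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Long> ts = Arrays.asList(0L, 30_000L, 61_000L);
        System.out.println(mentionsPerMinute(ts)); // {0=2, 1=1}
    }
}
```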
$ node app.js
From a browser (Chrome, Firefox, Safari, or a recent version of IE), go to http://localhost:3000
You should see the following: