Spark Streaming Project with Hadoop, HBase, Spark Streaming, Spark SQL + Hive
(See “How to build and run” on the last page)
Twitter is a well-known social network. This project aims to extract insights from Twitter feeds for research purposes.
Have you ever wondered how people talk on Twitter about what you are thinking of?
Your thoughts may be about Bitcoin, Trump, the upcoming Christmas, or TM.
How many people talk about them on Twitter per second?
How often is “Trump” mentioned compared to “Bitcoin” on Twitter every minute?
The Twitter API platform offers options for streaming real-time Tweets.
Unzip the attachment; under the root folder you can find the commands, the HiveQL script, and the source code of all applications.
You can also get the source code, with the latest updates, from GitHub at https://github.com/binhtv/twitter-streaming
- Commands (see the “How to build and run” section)
- HiveQL script (see the “How to build and run” section)
- tweet folder (tweet app)
Reads real-time tweets from Twitter using the Twitter4J library
An Apache Kafka producer takes the tweets and feeds them into a Kafka topic
- spark_streaming_eg folder: contains the source code for the streaming jobs
Uses KafkaUtils to create a direct stream and subscribe to the Kafka topic, consuming data from the Kafka producer (tweet app)
- spark_streaming_visualization folder
Contains the source code for reading data from HBase (with a Hive table on top) via Spark SQL for visualization
The data is then broadcast to the web client app for visualization
- visualization_client folder
Contains the source code for the web client app, a simple web application using Node.js, PubNub and Highcharts
PubNub subscribes to a channel to receive the data published on it
Highcharts is a JavaScript library for visualizing data with various types of charts and graphs
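To make the pipeline concrete, here is a minimal plain-Java sketch (no Spark or Kafka dependencies) of the core transformation the streaming job performs on each micro-batch: counting how many tweets in the batch mention each tracked keyword. The class and method names are illustrative only and do not come from the project's source.

```java
import java.util.*;

// Illustrative sketch of the per-batch keyword count the streaming job
// computes; in the real job this runs over an RDD of tweets from Kafka.
public class KeywordCountSketch {
    public static Map<String, Integer> countKeywords(List<String> tweets, List<String> keywords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String kw : keywords) counts.put(kw, 0);
        for (String tweet : tweets) {
            String lower = tweet.toLowerCase();
            for (String kw : keywords) {
                // Count a tweet once per keyword it mentions
                if (lower.contains(kw.toLowerCase())) {
                    counts.merge(kw, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList(
                "Bitcoin hits a new high", "Trump tweets about bitcoin", "Snow in Boston");
        System.out.println(countKeywords(batch, Arrays.asList("trump", "bitcoin", "snow")));
    }
}
```

In the actual job the same idea is expressed as Spark transformations over the direct stream, with the resulting counts written to HBase.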
Make sure the latest Kafka service is installed (you can download it at https://kafka.apache.org/) and start it manually.
You also need to install the ZooKeeper service; you can skip this step if you reuse the ZooKeeper instance from another service such as HBase or Hadoop.
Make sure Hadoop, HBase (master & region server) and Hive are installed and working properly on your machine.
Make sure the latest Node.js is installed on your machine.
$ bin/kafka-server-start.sh config/server.properties
$ bin/kafka-topics.sh --create --topic sparktest --partitions 1 --replication-factor 1 --zookeeper localhost:2181
$ sudo cp /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/
Execute the scripts in hive_tweet.sql.
After running them, you should see the following in Hue:
From the HBase shell, you can also see the table definitions with describe 'table_name'.
$ mvn clean install
$ java -cp target/tweet-0.0.1-SNAPSHOT.jar producer.tweet.App 500 trump,bitcoin,football,snow,rain,soccer,winter,iphone
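The tweet app is launched with two arguments: a number (500) and a comma-separated keyword list. The sketch below shows how such arguments could be parsed; note that the role of the first argument is an assumption here (treated as a generic numeric limit), so check producer.tweet.App for its actual meaning.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical argument parsing for the tweet app's command line; names and
// the interpretation of the numeric argument are assumptions, not the
// project's actual code.
public class ArgsSketch {
    public static List<String> parseKeywords(String csv) {
        // Split the comma-separated list into individual track terms
        return Arrays.asList(csv.split(","));
    }

    public static void main(String[] args) {
        int limit = Integer.parseInt("500"); // assumed numeric limit/rate
        List<String> keywords = parseKeywords("trump,bitcoin,football,snow");
        System.out.println(limit + " " + keywords);
    }
}
```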
$ mvn clean install
$ spark-submit --class "c523.spark_streaming_eg.SparkStreaming" --master local target/spark_streaming_eg-0.0.1-SNAPSHOT.jar localhost:9092 trump,bitcoin,football,snow,iphone
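The streaming job receives the Kafka broker address (localhost:9092) and a comma-separated list as arguments. Below is a sketch of the inputs a KafkaUtils direct stream typically needs, built from such arguments. The parameter key "metadata.broker.list" matches the Kafka 0.8-style direct-stream API; whether the comma-separated argument holds topics or keywords is an assumption to verify against the project's SparkStreaming class.

```java
import java.util.*;

// Sketch of the Kafka parameters and topic set a direct stream is created
// with; plain Java only, no Spark dependency.
public class StreamParamsSketch {
    public static Map<String, String> kafkaParams(String brokers) {
        Map<String, String> params = new HashMap<>();
        // Key used by the Kafka 0.8 direct-stream API (verify for your version)
        params.put("metadata.broker.list", brokers);
        return params;
    }

    public static Set<String> topicSet(String csv) {
        return new HashSet<>(Arrays.asList(csv.split(",")));
    }

    public static void main(String[] args) {
        System.out.println(kafkaParams("localhost:9092"));
        System.out.println(topicSet("sparktest"));
    }
}
```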
$ mvn clean install
$ spark-submit --class "c523.spark_streaming_eg.SparkSql" --master local target/spark_streaming_visualization-0.0.1-SNAPSHOT.jar
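The visualization job answers questions like “how often is a keyword mentioned per minute.” As a plain-Java illustration of that aggregation, the sketch below buckets tweet timestamps (epoch milliseconds) into minutes and counts them, producing the kind of per-minute series Highcharts can plot. The real job expresses the equivalent as a Spark SQL query over the Hive table; this class is illustrative only.

```java
import java.util.*;

// Illustrative per-minute bucketing of mention timestamps; the actual job
// computes this with Spark SQL over HBase/Hive data.
public class MinuteBucketSketch {
    public static Map<Long, Integer> mentionsPerMinute(List<Long> timestamps) {
        Map<Long, Integer> buckets = new TreeMap<>();
        for (long ts : timestamps) {
            long minute = ts / 60_000L; // truncate epoch millis to a minute bucket
            buckets.merge(minute, 1, Integer::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Long> ts = Arrays.asList(0L, 30_000L, 61_000L);
        System.out.println(mentionsPerMinute(ts)); // {0=2, 1=1}
    }
}
```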
$ node app.js
From a browser (Chrome, Firefox, Safari, or a recent version of IE), go to http://localhost:3000
You should see the following: