binhtv / twitter-streaming

This project aims to extract insights from Twitter feeds for research purposes

Spark Streaming Project with Hadoop, HBase, Spark Streaming, Spark SQL + Hive

(See "How to build and run" at the end of this document)

Project idea

Twitter is a well-known social network. This project aims to extract insights from Twitter feeds for research purposes.

Do you know how people on Twitter talk about the topics you have in mind?

Those topics may be Bitcoin, Trump, the upcoming Christmas, or TM.

How many people talk about them on Twitter per second?

How often is “Trump” mentioned compared to “Bitcoin” on Twitter every minute?

Data set

The Twitter API platform offers options for streaming real-time Tweets.

Source code and technology

Source code

Unzip the attachment. Under the root folder you will find the commands, the HiveQL script, and the source code of all applications.

You can also get the latest source code from GitHub at https://github.com/binhtv/twitter-streaming

  • Command details (see the "How to build and run" section)
  • HiveQL script (see the "How to build and run" section)
  • tweet folder (tweet app)

Reads real-time tweets from Twitter using the Twitter4J library.

An Apache Kafka producer then feeds the tweets into a Kafka topic.

  • spark_streaming_eg folder: contains the source code for the streaming jobs

Uses KafkaUtils to create a direct stream that subscribes to the Kafka topic and receives the data published by the Kafka producer (tweet app).

  • spark_streaming_visualization folder

Contains the source code that uses Spark SQL to read data from HBase (with a Hive table on top) for visualization.

The data is then broadcast to the web client app for visualization.

  • visualization_client folder

Contains the source code of the web client app for visualization. This is a simple web application built with Node.js, PubNub, and Highcharts.

The PubNub client subscribes to a channel to receive the data published on it.

Highcharts is a JavaScript library for visualizing data with various types of charts and graphs.
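
To make the role of the spark_streaming_eg job more concrete, here is a minimal plain-Java sketch of the kind of per-batch aggregation such a job might perform: counting how many tweets in a batch mention each tracked keyword. It is not taken from the repository. The class name KeywordCounter, the method countMentions, and the case-insensitive substring matching are all assumptions made for illustration; the real job runs on Spark Streaming and may count differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the repository.
public class KeywordCounter {

    // Returns, for each keyword, the number of tweets in the batch that mention it.
    // Matching is a case-insensitive substring test (an assumption).
    public static Map<String, Long> countMentions(List<String> tweets, List<String> keywords) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String keyword : keywords) {
            String needle = keyword.toLowerCase(Locale.ROOT);
            long hits = 0;
            for (String tweet : tweets) {
                if (tweet.toLowerCase(Locale.ROOT).contains(needle)) {
                    hits++;
                }
            }
            // Keywords keep their input order; keywords with no match map to 0.
            counts.put(keyword, hits);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList(
                "Bitcoin hits a new high",
                "Trump comments on bitcoin",
                "Snow expected this weekend");
        List<String> keywords = Arrays.asList("trump", "bitcoin", "football");
        System.out.println(countMentions(batch, keywords));
    }
}
```

Running main prints {trump=1, bitcoin=2, football=0}. In a streaming setting the same idea would be applied to each micro-batch, and the resulting counts stored in a table such as tweet_counts.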

(Image: source code structure)

Components and flows

(Image: components and flows)

How to build and run

Make sure the latest Kafka is installed. You can download it from https://kafka.apache.org/ and start it manually.

You also need a ZooKeeper service. You can skip installing one if you already use the ZooKeeper instance from another service such as HBase or Hadoop.

Make sure Hadoop, HBase (master and region server), and Hive are installed and working properly on your machine.

Make sure the latest Node.js is installed on your machine.

Step-by-step instructions:

In the Kafka folder, start the Kafka service

$ bin/kafka-server-start.sh config/server.properties

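Create a Kafka topic (Kafka 3.0 and later removed the --zookeeper option from kafka-topics.sh; use --bootstrap-server localhost:9092 there instead)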

$ bin/kafka-topics.sh --create --topic sparktest --partitions 1 --replication-factor 1 --zookeeper localhost:2181

Copy the hive-site configuration file so that Hive and Spark can work together

$ sudo cp /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/

In the Hive shell, create the Hive tables tweet_counts and tweet_words by executing the scripts in hive_tweet.sql.

After running them, the new tables should be visible in Hue.

From the HBase shell, you can also inspect each table with describe 'table_name'.

From the root source folder, go into the tweet folder, then build and start the tweet app

$ mvn clean install

$ java -cp target/tweet-0.0.1-SNAPSHOT.jar producer.tweet.App 500 trump,bitcoin,football,snow,rain,soccer,winter,iphone

From the root source folder, go into the spark_streaming_eg folder, then build and submit the streaming job

$ mvn clean install

$ spark-submit --class "c523.spark_streaming_eg.SparkStreaming" --master local target/spark_streaming_eg-0.0.1-SNAPSHOT.jar localhost:9092 trump,bitcoin,football,snow,iphone
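
The command above passes two arguments to the job: what appears to be a Kafka broker address (localhost:9092) followed by a comma-separated keyword list. As a minimal sketch of how such arguments could be parsed, assuming that interpretation is right, consider the plain-Java class below. StreamingArgs and its fields are hypothetical names used for illustration and are not taken from the repository.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical argument holder for illustration only; not part of the repository.
public class StreamingArgs {
    public final String brokers;        // e.g. "localhost:9092" (assumed to be the Kafka broker list)
    public final List<String> keywords; // e.g. [trump, bitcoin, ...]

    public StreamingArgs(String brokers, List<String> keywords) {
        this.brokers = brokers;
        this.keywords = keywords;
    }

    // Parses "<brokers> <keyword1,keyword2,...>" and rejects shorter argument lists.
    public static StreamingArgs parse(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException("Usage: <brokers> <keyword1,keyword2,...>");
        }
        List<String> keywords = new ArrayList<>();
        for (String raw : args[1].split(",")) {
            String keyword = raw.trim();
            if (!keyword.isEmpty()) { // skip empty entries such as "a,,b"
                keywords.add(keyword);
            }
        }
        return new StreamingArgs(args[0], keywords);
    }

    public static void main(String[] args) {
        StreamingArgs parsed = parse(new String[] {
                "localhost:9092", "trump,bitcoin,football,snow,iphone"});
        System.out.println(parsed.brokers + " -> " + parsed.keywords);
    }
}
```

Running main prints localhost:9092 -> [trump, bitcoin, football, snow, iphone]. Trimming and skipping empty entries is a defensive choice in this sketch, so that a stray space or doubled comma in the keyword list does not produce a blank keyword.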

From the root source folder, go into the spark_streaming_visualization folder, then build and submit the Spark SQL job

$ mvn clean install

$ spark-submit --class "c523.spark_streaming_eg.SparkSql" --master local target/spark_streaming_visualization-0.0.1-SNAPSHOT.jar

From the root source folder, go into the visualization_client folder and start the web client app

$ node app.js

Test your work

In a browser (Chrome, Firefox, Safari, or a recent version of IE), go to http://localhost:3000

You should see the visualization charts.
