BDProject

Part 1:

Tweets were collected in JSON format using the Python code in Part1/Collection. That code comes from https://stackoverflow.com/questions/45940984/python-twitter-stream-save-to-file

Hashtags and URLs were extracted from the tweets using Apache Spark, in both Python and Scala, each on a different collection of tweets.
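As a rough illustration of the extraction step, here is a minimal plain-Python sketch (not the repository's Spark code). It assumes each collected line is one tweet in the Twitter v1.1 JSON layout, where hashtags and URLs sit under the `entities` field. The function name `extract_hashtags_and_urls` and the sample tweet are hypothetical.

```python
import json


def extract_hashtags_and_urls(line):
    """Return (hashtags, urls) for one line of tweet JSON.

    Assumes the Twitter v1.1 layout: entities.hashtags[].text and
    entities.urls[].expanded_url. Malformed lines yield empty lists.
    """
    try:
        tweet = json.loads(line)
    except json.JSONDecodeError:
        return [], []
    if not isinstance(tweet, dict):
        return [], []

    entities = tweet.get("entities") or {}
    # Lowercase hashtags so "#Spark" and "#spark" count as the same tag.
    hashtags = [
        "#" + tag["text"].lower()
        for tag in entities.get("hashtags", [])
        if tag.get("text")
    ]
    # Prefer the expanded URL; fall back to the shortened one.
    urls = [u.get("expanded_url") or u.get("url") for u in entities.get("urls", [])]
    urls = [u for u in urls if u]
    return hashtags, urls


# Hypothetical sample tweet, for illustration only.
sample = json.dumps({
    "text": "Learning #BigData with #Spark https://t.co/abc",
    "entities": {
        "hashtags": [{"text": "BigData"}, {"text": "Spark"}],
        "urls": [{"url": "https://t.co/abc",
                  "expanded_url": "https://example.com/post"}],
    },
})
```

In a Spark job, a function like this could be applied to every input line with `flatMap`, so that each tweet contributes zero or more tokens to the count.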

Hashtag/URL word counts were run with both Apache Hadoop and Apache Spark (Python) on the same data set, for which 'data' was the key term. The PySpark extraction and word count code is under SparkWordCount/python. The log files for these runs are under logs/HadoopAndSparkPython.
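The word count itself follows the usual map-then-reduce pattern: emit each token once, then sum the occurrences per token. The sketch below shows that logic in plain Python under the assumption that each record is already a list of extracted hashtags or URLs. `word_count` is a hypothetical helper, not a function from the repository.

```python
from collections import Counter


def word_count(records):
    """Count token occurrences across records, most frequent first.

    Each record is a list of tokens (hashtags or URLs). This mirrors
    the map (token -> 1) and reduce (sum per token) steps of a
    Hadoop or Spark word count.
    """
    counts = Counter()
    for tokens in records:
        for token in tokens:
            counts[token] += 1
    # most_common() sorts by count, descending.
    return counts.most_common()
```

For example, `word_count([["#data", "#spark"], ["#data"]])` ranks `#data` ahead of `#spark`.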

Hashtag/URL word counts were also run with Apache Spark in Scala on a different set of collected tweets, again with 'data' as the key term. The extraction/word count code and log files for these runs are under SparkWordCount/scala.

Output for all three runs (Hadoop, Spark with Python, and Spark with Scala) can be found in the output subdirectories.

Part 2:

Part 2 runs Apache Spark queries on Twitter data for the keyword 'beer'. The queries include:

Top brands by social media presence (tweet count, followers count, and favourites count)
Cities that tweet the most about beer
Beer tweet counts by date
Top beer brands by locality (USA vs Other)
Top favorited accounts that tweet about beer
Multimedia beer tweets
Top languages used in beer tweets besides English
Top mentioned beer brands
Beer tweet counts by hour
Distribution of tweets considered 'possibly sensitive'
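
To make one of the queries above concrete, here is a minimal plain-Python sketch of "tweet counts by hour" (not the notebook's Spark code). It assumes tweets are dictionaries in the Twitter v1.1 layout, where `created_at` looks like `Wed Oct 10 20:19:24 +0000 2018`. The helper name `tweet_counts_by_hour` and the sample records are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout of the v1.1 "created_at" field (assumed here).
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweet_counts_by_hour(tweets):
    """Return {hour: count}, ordered by hour, for tweets with a timestamp."""
    counts = Counter()
    for tweet in tweets:
        created_at = tweet.get("created_at")
        if not created_at:
            continue  # skip records without a timestamp
        hour = datetime.strptime(created_at, CREATED_AT_FORMAT).hour
        counts[hour] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample records, for illustration only.
sample_tweets = [
    {"created_at": "Wed Oct 10 20:19:24 +0000 2018"},
    {"created_at": "Wed Oct 10 20:45:00 +0000 2018"},
    {"created_at": "Thu Oct 11 03:05:10 +0000 2018"},
    {"text": "no timestamp"},
]
```

The hour is taken as written in the timestamp, which Twitter reports in UTC, so the buckets do not reflect each user's local time. A Spark version would express the same idea as a group-by on the extracted hour followed by a count.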
