Challenge Summary
This challenge implements two features:

- Clean and extract the text from the raw JSON tweets that come from the Twitter Streaming API, and track the number of tweets that contain Unicode. A tweet's text is considered "clean" once all of the escape characters (e.g. `\n`, `\"`, `\/`) are replaced and Unicode characters have been removed.
- Calculate the average degree of a vertex in a Twitter hashtag graph over the last 60 seconds, and update it each time a new tweet arrives. A Twitter hashtag graph is a graph connecting all the hashtags that have been mentioned together in a single tweet.
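The cleaning step described above can be sketched roughly as follows. This is an illustrative assumption, not the repository's actual code: the function names are hypothetical, and it treats "Unicode" as any non-ASCII character.

```python
import json

def clean_text(text):
    """Drop non-ASCII characters; return (clean_text, had_unicode).

    Hypothetical helper: 'Unicode' is assumed here to mean any
    character outside the ASCII range.
    """
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    return ascii_text, ascii_text != text

def clean_tweet(raw_line):
    """Parse one raw JSON tweet and return (clean_text, had_unicode)."""
    tweet = json.loads(raw_line)
    if "text" not in tweet:      # skip rate-limit / delete messages
        return None
    # json.loads already replaces the \n, \", \/ escape sequences
    # with the literal characters they encode
    return clean_text(tweet["text"])

line = '{"text": "Caf\\u00e9 time!\\nSee you"}'
cleaned, had_unicode = clean_tweet(line)
# cleaned == "Caf time!\nSee you", had_unicode == True
```

Keeping the per-tweet `had_unicode` flag makes the running count of Unicode-containing tweets a simple sum over the stream.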
Running the Codebase Locally
- Install requirements:
  `pip install -r requirements.txt`
- Run tests:
  `python src/tests.py`
  Test fixtures are located in `coding-challenge/src/fixtures`.
- Make the run script executable:
  `chmod +x coding-challenge/run.sh`
- Run it:
  `./run.sh`
Implementation Details
- Common functions are written in `coding-challenge/src/utils.py`.
- Tweet parsing is implemented in `coding-challenge/src/tweets_cleaned.py`.
- Average-degree calculation is implemented in `coding-challenge/src/average_degree.py`.
- Cleaned tweets are written to `coding-challenge/src/tweet_output/ft1.txt`.
- The rolling average degree is written to `coding-challenge/src/tweet_output/ft2.txt`.
TODO Improvements For Future Versions
- Proper fixtures for tests
- Improve test coverage
- Implement threading
- Run the code against the live Streaming API
- Graph creation and updating could have a better implementation
- Apache Spark could be used to ingest real-time data, with GraphX for graph processing