Challenge Summary
This challenge implements two features:

- Clean and extract the text from the raw JSON tweets that come from the Twitter Streaming API, and track the number of tweets that contain Unicode. A tweet's text is considered "clean" once all of the escape characters (e.g. `\n`, `\"`, `\/`) are replaced and Unicode characters have been removed.
- Calculate the average degree of a vertex in a Twitter hashtag graph over the last 60 seconds, and update it each time a new tweet arrives. A Twitter hashtag graph is a graph connecting all the hashtags that have been mentioned together in a single tweet.
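The cleaning step described above can be sketched roughly as follows. This is an illustrative assumption, not the repository's actual code: the function names are hypothetical, and it treats "Unicode" as any non-ASCII character.

```python
import json

def clean_text(text):
    """Drop non-ASCII characters; return (clean_text, had_unicode).

    Hypothetical helper: 'Unicode' is assumed here to mean any
    character outside the ASCII range.
    """
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    return ascii_text, ascii_text != text

def clean_tweet(raw_line):
    """Parse one raw JSON tweet and return (clean_text, had_unicode)."""
    tweet = json.loads(raw_line)
    if "text" not in tweet:      # skip rate-limit / delete messages
        return None
    # json.loads already replaces the \n, \", \/ escape sequences
    # with the literal characters they encode
    return clean_text(tweet["text"])

line = '{"text": "Caf\\u00e9 time!\\nSee you"}'
cleaned, had_unicode = clean_tweet(line)
# cleaned == "Caf time!\nSee you", had_unicode == True
```

Keeping the per-tweet `had_unicode` flag makes the running count of Unicode-containing tweets a simple sum over the stream.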
Running the Codebase Locally
- Install requirements:
  `pip install -r requirements.txt`
- Run tests:
  `python src/tests.py`
  Test fixtures are located in `coding-challenge/src/fixtures`.
- Make the run script executable:
  `chmod +x coding-challenge/run.sh`
- Run it:
  `./run.sh`
Implementation Details
- Common functions are written in `coding-challenge/src/utils.py`.
- Tweet parsing is implemented in `coding-challenge/src/tweets_cleaned.py`.
- Average-degree calculation is implemented in `coding-challenge/src/average_degree.py`.
- Cleaned tweets are written to `coding-challenge/src/tweet_output/ft1.txt`.
- The rolling average degree is written to `coding-challenge/src/tweet_output/ft2.txt`.
TODO Improvements For Future Versions
- Proper fixtures for tests
- Improve test coverage
- Implement threading
- Run the code against the live Streaming API
- Graph creation and updating could have a better implementation
- Apache Spark could be used to ingest real-time data, with GraphX for graph processing