benedeki / nba_enricher

Gathers NBA players and their stats, then scans tweets mentioning top players and adds extra info to them.

NBA Enricher

version 1.0 RC

A small (mostly) Python project that gathers NBA-related statistics, finds the best players (based on chosen criteria) and then scans Twitter for mentions of these players. The matching tweets are then enriched with additional info about the players.

Required packages and other requirements

Postgres DB server (tested with 9.5, should work with 9+)

Twitter developer account to be able to use Twitter API

Python (tested with 3.6, effort made to make it 2.7 compatible)

Python Packages

  • requests
  • tweepy
  • psycopg2

Configuration

  1. Install or choose Postgres Server
  2. Create or choose existing database (let's call it nbadb)
  3. Create user nba on the server and grant it CONNECT and CREATE privileges on the database from step #2 (nbadb)
  4. Run the script deploy.sh from the DB directory (deploy.sh --host=localhost --dbname=nbadb --username=nba --password=???); on Windows, WSL can be used
  5. Change DB_CONNECTION in src/configuration.py to reflect the database setup
  6. Add Twitter API keys and secrets into TWITTER_CONNECTION in src/configuration.py
  7. Change any other configuration in src/configuration.py according to your preferences

Run

Logical steps

  1. Get players
  2. Get players' stats
  3. Identify the top players
  4. Gather the tweets
  5. Enrich the matching tweets
  6. Output the enriched tweets
  • Execute run_01_get_players.py (can be run repeatedly)
  • Execute run_02_get_player_stats.py (can be run repeatedly)
  • Start run_03_enrich_tweets.py

Tests

  • Execute runt_tests.py

Highlights

  • Both the statistics gathering and the Tweet scanning steps are multi-threaded
  • The threads communicate via command queue(s)
  • Tweet scanning for occurrences of multiple different strings uses the Aho-Corasick algorithm, which searches the text in one pass; the replacement is then a second pass, so two passes in total (when there is at least one hit)
  • The database part is implemented as a service, not tightly coupled to the program
  • Key parts that can be expected to change or be enhanced are clearly separated to allow easy alteration (statistics gathering, best players criteria, enriching rules, output of the enriched tweets, ...)
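
The search-then-replace idea from the Aho-Corasick bullet above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names (build_automaton, find_all, enrich) and the sample replacement texts are assumptions made up for this example. Overlapping hits are resolved leftmost-longest after sorting, which matches the sorting mentioned in the TODO list.

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick trie with failure links (illustrative helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> words ending at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass fills in the failure links, shallow states first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Pass 1: return (start_index, word) for every hit, scanning text once."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


def enrich(text, replacements):
    """Pass 2: rebuild the text, replacing non-overlapping hits."""
    hits = find_all(text, build_automaton(replacements))
    # Sort by start position, longest word first, to resolve overlaps.
    hits.sort(key=lambda hit: (hit[0], -len(hit[1])))
    parts, pos = [], 0
    for start, word in hits:
        if start < pos:
            continue  # overlaps a replacement that was already made
        parts.append(text[pos:start])
        parts.append(replacements[word])
        pos = start + len(word)
    parts.append(text[pos:])
    return "".join(parts)


# Made-up replacement rules, only for illustration.
sample_rules = {
    "LeBron James": "LeBron James (top player)",
    "James": "James (top player)",
    "Curry": "Curry (top player)",
}
```

Calling enrich("LeBron James and Curry", sample_rules) replaces the longer "LeBron James" hit and skips the shorter "James" hit that overlaps it.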

Known issues and TODOs

  • Aho-Corasick - use smarter result accumulation, so no sorting is needed (use a cache of size equal to the longest searched word)
  • Some stats are back-computed from per-game stats; a better source would be more precise (points, time played, shots)
  • Increase robustness in calling the NBA API (retries, thread recreation)
  • More tests
  • Components are tied together somewhat closely, which complicates testing
  • The deploy script could be much more sophisticated
  • Tweets scanning/enriching could be multi-processed instead of multi-threaded
  • Add threads to initial players load (step 1)
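
For the retries item above, one possible shape is a small wrapper with exponential backoff. This is only a sketch of the idea, not something the project contains: call_with_retries and all of its parameters are hypothetical names chosen for this example.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call func(); on failure wait and try again, up to `attempts` calls.

    Hypothetical helper. The `sleep` parameter exists so the waiting can be
    replaced in tests. Real code would catch narrower exceptions than
    Exception (for example the HTTP library's own error types).
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts, let the caller see the error
            sleep(delay)
            delay *= backoff  # wait longer before the next attempt
```

With the requests package, a call could then look like call_with_retries(lambda: requests.get(url, timeout=10)), where url stands for whatever NBA API endpoint is being queried.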

... and Beyond

The enriching part (03) was designed for flexibility in three respects. First, the output can be changed (method _output in class TweetEnricher). Second, the actual enriching rules can be changed (the replacement dictionary returned by the players_to_enriching function in the enriching.py file). Finally, the design can be switched relatively easily to a multi-process or even multi-service design, in case the enriching becomes CPU intensive or simply for scaling reasons.
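
The first two extension points can be pictured with a simplified stand-in. The names _output and players_to_enriching come from the project, but the signatures, the input shape (name and points-per-game pairs), the class name TweetEnricherSketch and the sample numbers are assumptions made for this sketch. A regular expression stands in for the Aho-Corasick search to keep the example short.

```python
import re


def players_to_enriching(players):
    """Build the replacement dictionary (the enriching rules).

    Assumed input for this sketch: an iterable of (name, points_per_game).
    """
    return {name: "{} ({} PPG)".format(name, ppg) for name, ppg in players}


class TweetEnricherSketch(object):
    """Simplified, hypothetical stand-in for the project's TweetEnricher."""

    def __init__(self, replacements):
        self._replacements = replacements
        # Longest names first, so a full name wins over a shorter one.
        names = sorted(replacements, key=len, reverse=True)
        self._pattern = (
            re.compile("|".join(re.escape(name) for name in names))
            if names else None
        )

    def enrich(self, tweet_text):
        """Apply the rules to one tweet and hand the result to _output."""
        if self._pattern is None:
            enriched = tweet_text
        else:
            enriched = self._pattern.sub(
                lambda match: self._replacements[match.group(0)], tweet_text
            )
        self._output(enriched)

    def _output(self, text):
        # Default output; subclasses override this to change the destination.
        print(text)


class CollectingEnricher(TweetEnricherSketch):
    """Example of swapping the output: collect results in a list instead."""

    def __init__(self, replacements):
        super(CollectingEnricher, self).__init__(replacements)
        self.collected = []

    def _output(self, text):
        self.collected.append(text)
```

Changing the rules means changing only what players_to_enriching returns; changing the destination means overriding only _output.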

The database is used as just another service in the multi-service architecture of this small application, alongside the NBA and Twitter services. The Python code communicates with the database only through an API and never accesses the tables directly (no SELECTs, INSERTs, etc.). This hides the DB implementation details from the application: the actual data structures can change, and the database can scale and offer redundancy without any changes to the application. Like any other service, the database can also expand its API to offer a richer service, as long as care is taken not to break existing contracts.
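
On the Python side, talking to the database only through its API could look roughly like the sketch below. It assumes the API is exposed as database functions called through a DB-API cursor (psycopg2 cursors provide callproc). The helper name call_db_function and the function name get_top_players in the usage note are hypothetical, not taken from the project.

```python
def call_db_function(connection, function_name, *args):
    """Call a database-side function and return all rows it produces.

    Hypothetical helper: the application never builds SELECT or INSERT
    statements itself, it only names a function and passes arguments.
    """
    cursor = connection.cursor()
    try:
        cursor.callproc(function_name, args)
        rows = cursor.fetchall()
        connection.commit()  # persist any changes made by the function
        return rows
    finally:
        cursor.close()
```

With a psycopg2 connection, a call might read call_db_function(connection, "get_top_players", 10). The tables behind such a function can be reorganized freely as long as its name, arguments and result shape stay the same.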

Languages

Python 69.6%, PLpgSQL 27.6%, Shell 2.7%