This project has been realized on a 6 month period at the Illinois Institute of Technology. It has been supervised by Dr. Aron Culotta.
The goal of this project is to find a correlation between the well-being of a person and the fact that this person regularly exercises.
This project relies on machine learning techniques, it uses:
- the Twitter API to collect data on users,
- sport tracking apps (RunKeeper, Nike+, ...) to detect sportive users
This project is written in Python and uses the following packages:
- sklearn for the machine learning algorithms,
- matplotlib to draw graphs in Python,
- TwitterAPI to make requests to the Twitter API in Python,
- docopt to easily implement a command-line interface.
The installation process requires three steps:
- Install scikit-learn.
- Install matplotlib
- Clone this github repository and install as usual:
git clone https://github.com/vlandeiro/sporty-twitters.git
cd sporty-twitters
python setup.py install
This project has been developed with the idea that it could be tweaked by developers that want to adapt it to their need. To facilitate this, the sporty API has been developed so the different parts of this project can be integrated to existing code.
It is also convenient to have an easy way to run experiments from a terminal instead of having to code it every time using the API: that is the purpose of the Command-Line Interface (CLI).
The project is divided in three sub-APIs: users
, tweets
, and mood
. They are centralized in the sporty
API in order to simplify the instanciation for a developer.
The users
API centralizes all the actions related to the Twitter users. From a list of user identifiers (UID) L
, it is possible to:
- Collect the tweets of every user in
L
with a maximum of 3200; - Collect the UIDs of the friends of every user in
L
and save them on disk; - Load the user object of every user in
L
; - Load the friends (as user objects) of every user in
L
; - Get the user object of every user in
L
using the Twitter API; - Get the most similar friends of every user in
L
; - Label each user of
L
on gender.
The tweets
API focuses on the manipulation of tweet objects. The possible operations are the following:
- Load a corpus of tweets;
- Collect tweets using the Streaming API and store them on disk or on memory;
- Filter a corpus of tweets as follow: for each word in a list, it is going to keep
n
tweets that contain this word; - Label a corpus of tweets.
The mood
API is centered on the study of the users' mood. It contains all of the machine learning part of the project. Thanks to this API, one can:
- Expand a vocabulary using different methods: WordNet or Context Similarity;
- Build a features vector to pass to a classifier;
- Train a classifier;
- Classify tweets once the classifier is trained;
- Draw and save on disk the ROC curve for a classifier;
- Run a benchmark of a classifier using n-folds cross validation.
The CLI has been built using the docopt package and relies on the sporty API. It provides a lot of access to the API:
Usage: sporty-cli -h | --help
sporty-cli mood benchmark <labeled_tweets> [-bmptu] [-s SW] [-e E] [-k K]
[--min-df=M] [--n-folds=K] [--n-examples=N]
[--clf=C [--clf-options=O]] [--proba=P] [--roc=R]
[--reduce-func=R] [--features-func=F] [--liwc=L]
sporty-cli mood label <input_tweets> <labeled_tweets> [-l L]
sporty-cli mood predict_user <labeled_tweets> <users_dir> <user_ids_file>
[-bmptu] [-s SW] [-e E] [--liwc=L]
[--forbid=F] [--clf=C [--clf-options=O]]
[--proba=P] [--min-df=M] [--reduce-func=R]
[--features-func=F] [--sporty] [--poms=P]
sporty-cli mood match_users <sport_scores> <no_sport_scores> <user_match> [--rand=R]
sporty-cli tweets collect <settings_file> <output_tweets> <track_file>
[<track_file>...] [-c C]
sporty-cli tweets filter <input_tweets> <output_tweets> <track_file>
[<track_file>...] [-c C] [--each] [--no-rt]
sporty-cli users collect_tweets <settings_file> <user_ids_file>
<output_dir> [-c C]
sporty-cli users list_friends <settings_file> <user_ids_file> <output_dir>
sporty-cli users most_similar <user_ids_file> <users_dir> <friends_dir>
[--no-tweets]
sporty-cli users show <settings_file> <input_dir>
sporty-cli stream collect <settings_file> [--lang=L] [-c C]
Options:
-h, --help Show this screen.
--clf=C Classifier type to use for the task. Valid options
are 'logistic-reg', 'svm', 'decision-tree',
'naive-bayes', 'kneighbors'
--clf-options=O Options for the classifier as a string
representing a Python dictionary
--each Filter C tweets for each of the tracked words
--forbid=F Path to a file containing a list of forbidden
words. If a tweet contains any of these words,
it will not be used for the classification
task.
--lang=L Language of the tweets to collect [default: en]
--liwc=L Path to the LIWC dictionary
--min-df=M See min_df from sklearn vectorizers [default: 3]
--n-examples=N Number of wrongly classified examples to display
[default: 0]
--n-folds=K Number of folds for the cross validation
[default: 10]
--no-rt Remove retweets when filtering
--no-tweets Do not use the tweets of the users to infer their
location
--poms=P Path to the poms lexicon
--proba=P Classify a tweet as positive only if the
probability to be positive is greater than P [default: 0.5]
--sporty Flag to put when the users are expected to be exercising.
--rand=R Path to file containing the scores of the random users.
--roc=R Plot the ROC curve with R the test set size given
as a ratio (e.g. 0.2 for 20 percent of the data)
and return. Note: the benchmark is not run
-b, --binary No count of features, only using binary features
-c C, --count=C Number of tweets to collect/filter [default: 3200]
-e E, --emoticons=E Path to file containing the dictionary of emoticons
-f F, --features-func=F List of functions to execute amongst the functions
of the FeatureBuilder class. The functions of this
list will be executed in order
-k K, --k-features=K Number of features to keep during the features
selection [default: 160]
-l, --begin-line=L Line to start labeling the tweets [default: 0]
-m Keep mentions when cleaning corpus
-p Keep punctuation when cleaning corpus
-r R, --reduce-func=R Function that will be used to reduced the labels
into one general label (e.g. 'lambda x, y: x or y')
-s SW, --stopwords=SW Path to file containing the stopwords to remove
from the corpus
-t, --top-features Display the top features during the benchmark
-u Keep URLs when cleaning corpus
Note that the CLI is divided in four parts:
mood
to call functions from the Mood API,users
to call functions from the Users API,tweets
to call functions from the Tweets API,stream
to collect tweets by directly using the TwitterAPI module.