turian / simple-twitter-similarity

Didactic example of information retrieval, computing the similarity of two twitter users

twittersimilarity
-----------------

    by Joseph Turian

USAGE:
    simple-twitter-similarity.py [options] user1 user2

    Options:
      -h, --help            show this help message and exit
      -p PREPROCESSING, --preprocessing=PREPROCESSING
                            'string.split' or 'pytextpreprocess' [default:
                            string.split]

METHODOLOGY:
* repr(u1) is the representation of user u1.

    * We compute and return sim(repr(u1), repr(u2)). A standard choice of sim
    in information retrieval is cosine similarity (see the sketch at the
    end of this section):
        http://en.wikipedia.org/wiki/Cosine_distance

    * repr(u) = sum_{tweet t in user u's timeline} repr(t)
    i.e. we do an unweighted combination of each of the user's tweets
    to get the user's representation. If tweets were of significantly
    different length (instead of at most 140 characters each), it might
    make sense to downweight each tweet's repr by its length, so that
    longer tweets had the same effect on the user representation as
    shorter tweets.

    * repr(t)[w] = 0 if w is in the stoplist at the top of
    simple-twitter-similarity.py, and the word count of w in preprocess(t)
    otherwise.

    preprocess(t) either does simple tokenization (word splitting)
    or uses a more sophisticated preprocessing module, like
    pytextpreprocess, with stemming, lowercasing, and more complete
    stop-word removal.

    A better term-document representation would use the tf-idf score
    instead of the raw word count, so that rarer words are more highly
    weighted; tf-idf is also more common in the IR literature. However,
    this would require acquiring IDF scores over a large corpus,
    which I leave as an exercise to the reader. (Hint: Scraping, not
    computing the scores, is the most programming-intensive part.) One
    could also consider BM25 scores, which are widely considered better
    than tf-idf scores.
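
    To make the above concrete, here is a minimal sketch of the core
    pipeline: a stoplist-filtered bag of words per tweet, summed per
    user, compared with cosine similarity. The STOPLIST contents and
    the preprocess stub are illustrative stand-ins for what
    simple-twitter-similarity.py actually defines; only the overall
    structure follows the description above.

        import math

        STOPLIST = set(["the", "a", "an", "and", "of", "to"])  # illustrative

        def preprocess(text):
            # Simple tokenization (the string.split default); swap in
            # pytextpreprocess here for stemming, lowercasing, etc.
            return text.split()

        def tweet_repr(tweet_text):
            # repr(t)[w] = count of w in preprocess(t); stoplisted words get 0.
            counts = {}
            for w in preprocess(tweet_text):
                if w not in STOPLIST:
                    counts[w] = counts.get(w, 0) + 1
            return counts

        def user_repr(tweet_texts):
            # repr(u) = unweighted sum of repr(t) over the user's timeline.
            total = {}
            for text in tweet_texts:
                for w, c in tweet_repr(text).items():
                    total[w] = total.get(w, 0) + c
            return total

        def cosine_similarity(r1, r2):
            # sim = dot(r1, r2) / (|r1| * |r2|); in [0, 1] for count vectors.
            dot = sum(c * r2.get(w, 0) for w, c in r1.items())
            n1 = math.sqrt(sum(c * c for c in r1.values()))
            n2 = math.sqrt(sum(c * c for c in r2.values()))
            return dot / (n1 * n2) if n1 and n2 else 0.0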
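
    If you do acquire IDF scores, the reweighting itself is small. A
    sketch, assuming idf is a dict mapping each word to its inverse
    document frequency over some large corpus (idf and default_idf are
    hypothetical inputs, not part of the script):

        def tfidf_repr(count_repr, idf, default_idf=1.0):
            # Reweight raw term counts by IDF so rarer words count more.
            return dict((w, c * idf.get(w, default_idf))
                        for w, c in count_repr.items())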

REQUIREMENTS:
    * python-twitter:
        easy_install python-twitter
    http://code.google.com/p/python-twitter/
    (This package, in turn, requires simplejson)

    * pytextpreprocess [optional]:
        http://github.com/turian/pytextpreprocess
    For more sophisticated text preprocessing.

NOTES:
    * This will only work on public timelines. We don't do authentication.

    * We only use as many Twitter updates as GetUserTimeline returns
    with count=200. In the future, one might want to keep reading the
    user's timeline until it is exhausted, to get the ENTIRE timeline
    (see the paging sketch after these notes).

    * Smoothing is a technique whereby we improve recall by generalizing
    beyond the exact words observed. There are several ways we could
    smooth the data:
        * By default, we preprocess the tweets solely using
        string.split. This preprocessing will lead to poor recall,
        since it does not handle variations in inflection, case, and
        punctuation. More sophisticated preprocessing could improve
        recall, and would include lowercasing, stemming, and improved
        tokenization. If you have package pytextpreprocess installed,
        you can get sophisticated preprocessing by using program option:
            --preprocessing=pytextpreprocess

        * Every time we see a Twitter username referenced, we could
        take the representation of that referenced user and mix it in
        with the representation of the referencing user. In particular,
        the referenced user's word counts could be normalized to a
        probability distribution; referencing that user is then
        equivalent to uttering that distribution (see the mixing sketch
        after these notes).
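
    A sketch of the timeline-paging idea from the notes above, assuming
    a python-twitter version whose GetUserTimeline accepts screen_name,
    count, and max_id keyword arguments (argument names have varied
    across releases, so check yours; rate limits also apply):

        import twitter

        def full_timeline(api, user):
            # api = twitter.Api()  -- unauthenticated; public timelines only.
            # Keep requesting older pages until GetUserTimeline
            # returns nothing, to get the ENTIRE public timeline.
            statuses = []
            max_id = None
            while True:
                page = api.GetUserTimeline(screen_name=user, count=200,
                                           max_id=max_id)
                if not page:
                    break
                statuses.extend(page)
                max_id = min(s.id for s in page) - 1  # page further back
            return statuses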
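
    The username-mixing idea can also be sketched compactly: normalize
    the referenced user's representation to a probability distribution
    and blend it into the referencing user's representation. mix_weight
    is a hypothetical knob for how strongly a mention counts; nothing
    here is in the actual script.

        def mix_in_mention(author_repr, mentioned_repr, mix_weight=1.0):
            # Treat a mention of user v as if the author uttered v's
            # normalized word distribution, scaled by mix_weight.
            total = float(sum(mentioned_repr.values()))
            if total == 0:
                return author_repr
            mixed = dict(author_repr)
            for w, c in mentioned_repr.items():
                mixed[w] = mixed.get(w, 0) + mix_weight * (c / total)
            return mixed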
