A Linguistic Steganographic System. Inspired by smart people. Hacked by @etchin and @zankuda.
- KenLM, installed according to this tutorial to the directory `kenlm`
`make tweets`
The idea is to build a language model of how people write tweets. With a sufficiently large number of tweets (filtered of useless ones), we can generate this model using the KenLM package.
- Use the Twitter API (via the Technica Demo) to download tweets from a bunch of tweeters. Dump these CSV files into `data/`
- Process all of these tweet CSV files into a single flat text file:

```shell
python compile_tweets.py
```
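A minimal sketch of what `compile_tweets.py` might do, assuming each CSV has a `text` column (the actual script and column layout may differ):

```python
import csv
import glob
import os

def compile_tweets(data_dir="data", out_path="data/all_tweets.txt", column="text"):
    """Concatenate the tweet text column of every CSV in data_dir into one flat file.
    The column name 'text' is an assumption about the CSV layout."""
    with open(out_path, "w", encoding="utf-8") as out:
        for csv_path in sorted(glob.glob(os.path.join(data_dir, "*.csv"))):
            with open(csv_path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    tweet = row.get(column, "").strip()
                    if tweet:
                        out.write(tweet + "\n")
```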
- Process all of the tweets into a `.arpa` language model file:

```shell
cat data/all_tweets.txt | python data/process.py | ./kenlm/bin/lmplz -o 3 > data/tweets.arpa
```
- Convert the textual language model into a binary blob:

```shell
./kenlm/bin/build_binary data/tweets.arpa data/tweets.klm
```
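KenLM itself handles the model building and scoring; the idea of scoring a sentence against trigram counts can be illustrated with a toy pure-Python model (this is an illustration, not the KenLM API, and the smoothing here is far cruder than KenLM's):

```python
import math
from collections import Counter

def train_trigram(corpus):
    """Count trigrams and their bigram contexts from a list of tokenized sentences."""
    tri, bi = Counter(), Counter()
    for tokens in corpus:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(len(padded) - 2):
            bi[tuple(padded[i:i + 2])] += 1
            tri[tuple(padded[i:i + 3])] += 1
    return tri, bi

def log_score(tri, bi, tokens, alpha=1.0, vocab=10000):
    """Add-alpha smoothed log10 probability of a sentence; sentences resembling
    the training tweets score higher than ones that do not."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    total = 0.0
    for i in range(len(padded) - 2):
        num = tri[tuple(padded[i:i + 3])] + alpha
        den = bi[tuple(padded[i:i + 2])] + alpha * vocab
        total += math.log10(num / den)
    return total
```

A sentence seen during training scores higher than the same words in an unfamiliar order, which is exactly the property used later to rank cover tweets.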
`make parse_ppdb` and `make build_rules`
- Download a PPDB file from the PPDB website.
- Generate the PPDB database from the PPDB web files:

```shell
python2 ppdb/parser.py ppdb/ppdb-1.0-m-o2m ppdb/o2m.parse
python2 ppdb/parser.py ppdb/ppdb-1.0-m-lexical ppdb/lexical.parse
```
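PPDB 1.0 files store one paraphrase rule per line with fields separated by ` ||| `; a minimal sketch of the parsing step (the real `ppdb/parser.py` likely also processes the feature and alignment fields, which are dropped here):

```python
def parse_ppdb_line(line):
    """Split one PPDB rule line on the ' ||| ' field separator.
    PPDB 1.0 lines look like: [LHS] ||| phrase ||| paraphrase ||| features ||| alignment"""
    fields = [f.strip() for f in line.split("|||")]
    if len(fields) < 3:
        return None  # malformed line
    lhs, phrase, paraphrase = fields[0], fields[1], fields[2]
    return lhs, phrase, paraphrase
```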
- Convert the PPDB database rules to `ppdb/rules.db`:

```shell
python build_ppdb.py
```
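The exact schema of `ppdb/rules.db` is defined by `build_ppdb.py`; a plausible sketch using SQLite, with a hypothetical two-column schema mapping phrases to their paraphrases:

```python
import sqlite3

def build_rules_db(rules, db_path="ppdb/rules.db"):
    """Store (phrase, paraphrase) pairs in SQLite. The schema here is a guess,
    not necessarily what build_ppdb.py actually writes."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS rules (phrase TEXT, paraphrase TEXT)")
    conn.executemany("INSERT INTO rules VALUES (?, ?)", rules)
    conn.commit()
    return conn

def paraphrases(conn, phrase):
    """Look up all paraphrases recorded for a phrase."""
    cur = conn.execute("SELECT paraphrase FROM rules WHERE phrase = ?", (phrase,))
    return [row[0] for row in cur.fetchall()]
```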
- Parse and canonicalize the tweets of a person.
  - This is necessary to build the proper language model.
- The language model is used to score each possible cover tweet.
- The best-scoring covers are presented to the user as possible tweets.
- Each cover tweet is associated with a hash, which is used to communicate a message.
- Tweet -> Cover
  - For a given tweet a user is about to write, generate possible cover tweets.
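The steps above can be sketched as follows: hash each candidate cover and keep only those whose hash encodes the bits to transmit; the surviving covers would then be ranked by language-model score. The hash width and the use of SHA-256 here are assumptions, not the system's actual choices:

```python
import hashlib

def cover_bits(cover, n_bits=4):
    """Map a cover tweet to an n_bits-wide value via a stable hash
    (n_bits=4 is an assumed message chunk size)."""
    digest = hashlib.sha256(cover.encode("utf-8")).digest()
    return digest[0] >> (8 - n_bits)

def select_covers(candidates, message_bits, n_bits=4):
    """Keep only the candidate covers whose hash encodes the desired bits;
    in the full system these would then be ranked by language-model score."""
    return [c for c in candidates if cover_bits(c, n_bits) == message_bits]
```

Whichever selected cover the user actually posts, the recipient recovers the same `n_bits` of the message by hashing the posted tweet.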