nlputils

Utility scripts or libraries for various Natural Language Processing tasks.

List

charfreq.awk: calculate character frequency.
convcat.py: cat files with different encodings together.
csvcol.py: get specified columns of csv files.
csvsql.py: convert csv file to sql definition.
dbsort.tcl: sort SQLite tables in place.
detokenizer.py: detokenize Chinese text.
dump2db.py: make a database from leaked password dumps.
epubzhconv.py: Chinese varient conversion for epub books.
filtermd5.py: remove md5s not in known list.
findbadlines.py: find encoding errors in stdin.
gbk_pua.py: convert PUA codes in GBK to unicode.
getautodesk.py: get Moses format parallel text from Autodesk corpus.
gettxtcollection.py: merge a txt file collection to one large corpus.
haodoo: crawl and download all books from haodoo.net.
iconv.py: implements iconv.
iso639.json, iso639-to-calibre.py: get ISO639 codes from Wikipedia and convert to calibre po file.
jiebazhc: tokenize Classical Chinese using jieba.
libpinyin_bopomofo.py: Decorator to use with python-pinyin, to convert Pinyin to Bopomofo. (now useless)
ngramfreq.awk: calculate n-gram character frequency.
num2chinese.py: convert numbers to Chinese numbers.
phrasecombine.py: combine splitted words to large phrases given a dictionary.
pwdsort.js, zxcvbn.js: print out password strength according to zxcvbn.
pgexplaindot.py: output a GraphViz dot file for EXPLAIN (FORMAT JSON).
pgviewdep.tcl: output a GraphViz dot file representing view dependencies in a PostgreSQL database.
rmdup.c: remove duplicate lines without sort (compile with make, needs libxxhash-dev).
simpdump.py: try to find username, email, password and hash from leaked password dumps.
splitrecutfilter.py: reads stdin, filters non-chinese sentences and cuts sentences and words.
tatoeba: convert tatoeba dumps to a SQLite3 database.
wordfreq.awk: calculate word frequency.
WWStarClone.py: clone of WWStar, an ancient Classical Chinese translator.
zhutil.py: misc. utils for processing Chinese.
modelzh.json: model to detect Classical/Modern Chinese.

License

If not otherwise noted in file, all files are licensed under MIT License.

johnsonlu / nlputils

nlputils

List

License

About

Languages