hodgesmr / biden_nlp

Jupyter Notebook that introduces BIDEN: Binary Inference Dictionaries for Electoral NLP. It demonstrates a compression-based binary classification technique that is fast at both training and inference on common CPU hardware in Python

BIDEN: Binary Inference Dictionaries for Electoral NLP

This is a Jupyter Notebook that introduces BIDEN: Binary Inference Dictionaries for Electoral NLP. It demonstrates a compression-based binary classification technique that is fast at both training and inference on common CPU hardware in Python.

It is largely built on the strategies presented by FTCC, which, in turn, was a reaction to Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors (the gzip method). Like FTCC, BIDEN is built atop Zstandard (Zstd), which supports dictionary compression. Zstd dictionary compression seeds a compressor with sample data so that it can efficiently compress small inputs (~1 KB) of similar composition. Seeding the compressor dictionaries acts as our "training" method for the model.
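As a rough illustration of the idea above, here is a minimal sketch of classification by dictionary-seeded compression. It is not the BIDEN implementation. To stay within the Python standard library it swaps Zstd for `zlib`, whose `zdict` preset dictionary plays a similar role, and it builds each dictionary by simply concatenating samples rather than training one. The labels, sample texts, and function names are all hypothetical.

```python
import zlib

# Hypothetical toy samples; the real model is trained on a large email corpus.
SAMPLES = {
    "dem": [
        "Chip in now to protect voting rights and expand health care for working families.",
        "Help us defend reproductive freedom and fight climate change.",
    ],
    "rep": [
        "Stand with us to secure the border and defend the Second Amendment.",
        "Help us cut taxes and stop the radical socialist agenda.",
    ],
}


def train(samples_by_label):
    """'Training' here is just building one preset dictionary per class.

    Assumption: concatenated raw samples stand in for a trained Zstd dictionary.
    """
    return {
        label: " ".join(samples).encode("utf-8")
        for label, samples in samples_by_label.items()
    }


def compressed_size(text, dictionary):
    """Size in bytes of `text` compressed with a compressor seeded by `dictionary`."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return len(compressor.compress(text.encode("utf-8")) + compressor.flush())


def classify(text, dictionaries):
    """Pick the class whose dictionary compresses `text` to the fewest bytes."""
    return min(dictionaries, key=lambda label: compressed_size(text, dictionaries[label]))


dictionaries = train(SAMPLES)
for message in (
    "Chip in to protect voting rights and expand health care",
    "Stand with us to secure the border and cut taxes",
):
    print(classify(message, dictionaries), "<-", message)
```

The predicted class is whichever dictionary yields the smallest compressed output, on the assumption that text compresses best against a dictionary seeded with similar text. Note that zlib only uses roughly the last 32 KB of a preset dictionary, so this stand-in does not scale to a real corpus the way Zstd dictionaries do.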

The BIDEN model was trained on the ElectionEmails 2020 data set — a database of over 900,000 political campaign emails from the 2020 US election cycle. In compliance with the data set's terms, the training data is NOT provided with this repository. If you would like to train the BIDEN model yourself, you can request a copy of the data for free. The BIDEN model was trained on corpus_v1.0.

The notebook also demonstrates fast, successful partisan classification of tweets and of samples from the campaign email database maintained by Derek Willis.

The idea of classification by compression is not new; Russell and Norvig wrote about it in 1995 in the venerable Artificial Intelligence: A Modern Approach:

Classification by data compression

More recently, the "gzip beats BERT" paper got a lot of attention. What the BIDEN model demonstrates is that this technique is effective on modern partisan texts and likely generalizes across them.
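For context, the gzip method from that paper classifies a text by its normalized compression distance (NCD) to labeled examples, followed by a k-nearest-neighbors vote. The following is a minimal sketch of that idea, not the paper's code and not part of BIDEN. The training pairs and function names are hypothetical.

```python
import gzip
from collections import Counter

# Hypothetical labeled examples for illustration only.
training = [
    ("Chip in now to protect voting rights and expand health care.", "dem"),
    ("Help us defend reproductive freedom and fight climate change.", "dem"),
    ("Stand with us to secure the border and defend the Second Amendment.", "rep"),
    ("Help us cut taxes and stop the radical socialist agenda.", "rep"),
]


def clen(text):
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x, y):
    """Normalized compression distance: lower means the texts share more structure."""
    cx, cy = clen(x), clen(y)
    cxy = clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_classify(text, examples, k=1):
    """Majority label among the k examples with the smallest NCD to `text`."""
    nearest = sorted(examples, key=lambda pair: ncd(text, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify("Chip in to protect voting rights and expand health care", training))
```

Every prediction here compresses the input against each training example, so inference cost grows with the size of the training set. The dictionary-based approach avoids this by compressing once per class.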

License

All code is provided under the BSD 3-Clause license.

A Matt Hodges project

This project is maintained by @MattHodges.

Please use it for good, not evil.

License:BSD 3-Clause "New" or "Revised" License


Languages

Language:Jupyter Notebook 100.0%