A Python implementation of a gzip-based text classification system, following the algorithm described in "Less is More: Parameter-Free Text Classification with Gzip" by Zhiying Jiang, Matthew Y.R. Yang, Mikhail Tsirlin, Raphael Tang, and Jimmy Lin.
This project is in active development. Please don't get mad if it doesn't work.
pip install git+https://github.com/Sonictherocketman/gzip-classifier
Installation via PyPI is coming soon.
There are no dependencies! gzip-classifier uses only the Python Standard Library.
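Under the hood, the technique from the paper is simple enough to sketch in a few lines: compute the normalized compression distance (NCD) between the query and every training sample using gzip, then take a majority vote among the k nearest samples. The snippet below is an illustrative, self-contained sketch of that idea, not this package's API:

import gzip
from collections import Counter

def ncd(a, b):
    # Normalized compression distance using gzip-compressed lengths.
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress((a + ' ' + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(query, samples, labels, k=3):
    # Rank training samples by NCD to the query, then majority-vote the top k.
    ranked = sorted(zip((ncd(query, s) for s in samples), labels))
    top_k = [label for _, label in ranked[:k]]
    return Counter(top_k).most_common(1)[0][0]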
There are two steps needed to use the classifier. First, you need to train it using some pre-existing data. This data must be classified into a set of discrete labels, which are then used to classify new data once the model is trained.
Once trained, the model can either be exported and saved for later use or used immediately.
If you have any further questions, please refer to the tests for examples of using the classifier.
The training process requires two sets of data: a list of sample text snippets and a list of the labels that correspond to those snippets.
Consider this sample dataset:
label, text
-----------
science, Physicists discover new spacetime geometry
arts-culture, Famous author releases new book
world, War in <country> is now over!
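The snippets and labels need to end up as two parallel Python lists before training. Here is a small loading sketch using the standard csv module (the filename is hypothetical and the column layout is assumed from the table above):

import csv

training_data, labels = [], []
with open('training.csv') as f:  # hypothetical filename
    reader = csv.reader(f, skipinitialspace=True)
    next(reader)  # skip the 'label, text' header row
    for label, text in reader:
        labels.append(label)
        training_data.append(text)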
These are fed into the classifier as shown here:
from gzip_classifier import Classifier, ParallelClassifier

training_data = [...]
labels = [...]

classifier = Classifier()
classifier.train(training_data, labels)

# OR

with ParallelClassifier(processes=4) as classifier:
    classifier.train(training_data, labels)
Once trained, the model can be used to classify new text.
For background on the k parameter, see the Wikipedia article on k-nearest neighbors. In short, the best value depends on your data: try integer values greater than one and see what fits best. There are surely more principled ways to determine the best value, but I just played with it until the results looked right.
classifier.classify('some new text', k=20)
>> 'a label'
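One simple way to pick k (not part of this package, just a sketch reusing the train and classify methods shown above) is a small holdout search: withhold a slice of your labeled data from training and score a few candidate values against it.

# Hypothetical holdout search for k, reusing `training_data` and `labels`
# from the training example above.
split = int(len(training_data) * 0.8)
classifier = Classifier()
classifier.train(training_data[:split], labels[:split])

for k in (1, 3, 5, 10, 20):
    predictions = [classifier.classify(text, k=k) for text in training_data[split:]]
    accuracy = sum(p == y for p, y in zip(predictions, labels[split:])) / len(predictions)
    print(k, accuracy)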
You can also do bulk classification using the following method:
samples = [
    'some new text',
    'more new text',
]

for label in classifier.classify_bulk(samples, k=20):
    print(label)
>> 'a label'
>> 'another label'
The gzip classifier is highly parallelizable, so the package includes a ParallelClassifier that uses Python's built-in multiprocessing pool to delegate training and classification across cores.
The parallel classifier is best used as a context manager:
with ParallelClassifier(processes=4) as classifier:
    classifier.train(training_data, labels)

    sample = 'some text'
    classifier.classify(sample, k=10)
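Presumably bulk classification works the same way inside the context manager; this is an assumption based on the description above, so check the tests to confirm:

with ParallelClassifier(processes=4) as classifier:
    classifier.train(training_data, labels)
    # Assumption: ParallelClassifier exposes the same classify_bulk method
    # as Classifier and distributes the samples across worker processes.
    for label in classifier.classify_bulk(samples, k=10):
        print(label)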
Once trained, the model can be exported and saved for later use.
# Normal export
model_contents = classifier.model
with open('model.txt', 'w') as f:
    f.write(model_contents)

# Compressed export
model_contents = classifier.compact_model
with open('model.gz', 'wb') as f:
    f.write(model_contents)
You can easily load a model from disk like this:
# Normal import
with open('model.txt', 'r') as f:
    classifier = Classifier.using_model(f.read())

# Compressed import
with open('model.gz', 'rb') as f:
    classifier = Classifier.using_compact_model(f.read())
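Putting the pieces together, a full round trip (train, save the compact model, reload it later, and classify) looks roughly like this, using only the calls shown above:

from gzip_classifier import Classifier

classifier = Classifier()
classifier.train(training_data, labels)

# Save the compressed model to disk.
with open('model.gz', 'wb') as f:
    f.write(classifier.compact_model)

# Later: reload the model and classify new text.
with open('model.gz', 'rb') as f:
    restored = Classifier.using_compact_model(f.read())

print(restored.classify('some new text', k=20))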
Run the tests using the following command:
python -m pytest tests/
Contributions are welcome. Just don't be a jerk.
This code is released under the MIT license. Use it for whatever.