scikit-learn wrappers for Python fastText.
>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]
Dependencies:
numpy
scipy
scikit-learn
fastText Python package
pip install skift
NOTE: Installing skift will not install fasttext itself, as the official Python bindings are not currently maintained on PyPI.
To install the version of fasttext (and its official Python bindings) which skift is tested against, run:
pip install git+https://github.com/facebookresearch/fastText.git@ca8c5face7d5f3a64fff0e4dfaf58d60a691cb7c
- Adheres to the scikit-learn classifier API, including predict_proba.
- Also caters to the common use case of pandas.DataFrame inputs.
- Enables easy stacking of fastText with other types of scikit-learn-compliant classifiers.
- Pickle-able classifier objects.
- Built around the official fasttext Python bindings.
- Pure Python.
- Supports Python 3.5+.
- Fully tested.
fastText works only on text data, which means that it will use only a single column from a dataset that might contain many feature columns of different types. As such, a common use case is to have the fastText classifier use a single column as input, ignoring other columns. This is especially true when fastText is to be used as one of several classifiers in a stacking classifier, with other classifiers using non-textual features.
skift includes several scikit-learn-compatible wrappers (for the official fastText Python bindings) which cater to these use cases.
NOTICE: Any additional keyword arguments provided to the classifier constructor, besides those required, will be forwarded to the fastText.train_supervised method on every call to fit.
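The forwarding behaviour described in the notice can be pictured with a small stand-in. This is only a sketch of the pattern, not skift's actual code: KwargForwardingClassifier and fake_train_supervised are hypothetical names, and the fake trainer simply records what it receives, so no fastText installation is needed to run it.

```python
def fake_train_supervised(input, **kwargs):
    """Hypothetical stand-in for a training function; returns what it was given."""
    return {"input": input, "params": dict(kwargs)}


class KwargForwardingClassifier:
    """Hypothetical wrapper that stores extra keyword arguments at construction."""

    def __init__(self, **kwargs):
        # Keep every extra keyword argument untouched.
        self.kwargs = kwargs
        self.model = None

    def fit(self, train_path):
        # The stored arguments are passed along on every call to fit.
        self.model = fake_train_supervised(input=train_path, **self.kwargs)
        return self


clf = KwargForwardingClassifier(lr=0.3, epoch=10).fit("train.txt")
print(clf.model["params"])  # {'lr': 0.3, 'epoch': 10}
```

Storing the arguments once and unpacking them in fit means that refitting the same object always trains with the same settings.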
These wrappers make no assumptions about input beyond those commonly made by scikit-learn classifiers, e.g. that input is a 2D ndarray object.
FirstColFtClassifier - An sklearn classifier adapter for fasttext that takes the first column of input ndarray objects as input.
>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]
IdxBasedFtClassifier - An sklearn classifier adapter for fasttext that takes input by column index. This is set on object construction by providing the input_ix parameter to the constructor.
>>> import pandas
>>> from skift import IdxBasedFtClassifier
>>> df = pandas.DataFrame([[5, 'woof', 0], [83, 'meow', 1]], columns=['count', 'txt', 'lbl'])
>>> sk_clf = IdxBasedFtClassifier(input_ix=1, lr=0.4, epoch=6)
>>> sk_clf.fit(df[['count', 'txt']], df['lbl'])
>>> sk_clf.predict([[3, 'woof']])  # text must sit at index 1, as in training
[0]
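To make index-based selection concrete, the sketch below picks one column out of 2D input and renders it in the `__label__<label> <text>` line format that fastText's supervised mode reads by default. The helper name to_fasttext_lines is hypothetical, and this illustrates the idea rather than how skift implements it.

```python
def to_fasttext_lines(X, y, input_ix):
    """Build fastText-style training lines from column `input_ix` of X.

    Hypothetical helper, for illustration only.
    """
    lines = []
    for row, label in zip(X, y):
        text = str(row[input_ix])  # only the chosen column is used
        lines.append("__label__{} {}".format(label, text))
    return lines


X = [[5, "woof"], [83, "meow"]]
y = [0, 1]
print(to_fasttext_lines(X, y, input_ix=1))
# ['__label__0 woof', '__label__1 meow']
```

Note that every row must actually contain the chosen index; a one-column row would raise an IndexError with input_ix=1.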
These wrappers assume that the X parameter given to the fit, predict, and predict_proba methods is a pandas.DataFrame object:
FirstObjFtClassifier - An sklearn adapter for fasttext using the first column of dtype == object as input.
>>> import pandas
>>> from skift import FirstObjFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstObjFtClassifier(lr=0.2)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]
ColLblBasedFtClassifier - An sklearn adapter for fasttext taking input by column label. This is set on object construction by providing the input_col_lbl parameter to the constructor.
>>> import pandas
>>> from skift import ColLblBasedFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = ColLblBasedFtClassifier(input_col_lbl='txt', epoch=8)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]
The package author and current maintainer is Shay Palachy (shay.palachy@gmail.com); you are more than welcome to approach him for help. Contributions are very welcome.
Clone:
git clone git@github.com:shaypal5/skift.git
Install in development mode, including test dependencies:
cd skift
pip install -e '.[test]'
To also install fasttext, see instructions in the Installation section.
To run the tests use:
cd skift
pytest
The project is documented using the numpy docstring conventions, which were chosen because they are perhaps the most widespread conventions that are both supported by common tools such as Sphinx and result in human-readable docstrings. When documenting code you add to this project, follow these conventions.
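For contributors unfamiliar with the numpy docstring layout, here is a short illustration. The function clean_label is a hypothetical example written for this README, not part of skift's API; only the docstring structure (Parameters, Returns, Examples sections with underlined headers) is the point.

```python
def clean_label(label, prefix="__label__"):
    """Strip a label prefix from a predicted label string.

    Parameters
    ----------
    label : str
        The raw label, e.g. ``'__label__1'``.
    prefix : str, optional
        The prefix to remove. Defaults to ``'__label__'``.

    Returns
    -------
    str
        The label without the prefix; unchanged if the prefix is absent.

    Examples
    --------
    >>> clean_label('__label__1')
    '1'
    """
    if label.startswith(prefix):
        return label[len(prefix):]
    return label
```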
Additionally, if you update this README.rst file, use python setup.py checkdocs to validate that it compiles.
Created by Shay Palachy (shay.palachy@gmail.com).