Some text similarity utilities
The goal of akin is to make it easy to sort text based on numeric similarity.
You can install this tool via pip.
python -m pip install "akin @ git+https://github.com/koaning/akin.git"
The simplest way to use this tool is to sort texts.
import pandas as pd
from akin import sort_dataframe
# Let's load in a csv file that has a text column named "text".
dataf = pd.read_csv("data.csv")
# Let's sort this dataframe such that we prefer examples with texts
# that are similar to the examples in the line below.
dataf.pipe(sort_dataframe, examples=["very nice", "super positive"], text_col="text")
In this basic setting, we're really just using CountVectorizer from scikit-learn to compute the similarity between two texts based on bag-of-words counts. We could get a bit fancier though by using word embeddings from whatlies. The library supports any embedding, as long as it's implemented with the scikit-learn API in mind.
from whatlies.language import BytePairLanguage
bp_lang = BytePairLanguage("en")
dataf.pipe(sort_dataframe,
           examples=["very nice", "super positive"],
           text_col="text",
           featurizer=bp_lang)
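To make the bag-of-words idea a bit more concrete, here is a minimal sketch of sorting texts by their similarity to a few examples. This is not how akin is implemented: it swaps scikit-learn's CountVectorizer for a plain `collections.Counter`, it assumes cosine similarity as the measure, and the helper names (`bag_of_words`, `cosine_similarity`, `sort_by_similarity`) are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    # A Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sort_by_similarity(texts, examples):
    """Sort texts so that those closest to any example come first."""
    example_vecs = [bag_of_words(e) for e in examples]

    def best_score(text):
        vec = bag_of_words(text)
        return max(cosine_similarity(vec, ev) for ev in example_vecs)

    return sorted(texts, key=best_score, reverse=True)


texts = ["the weather is bad", "this is very nice", "super positive vibes"]
ranked = sort_by_similarity(texts, examples=["very nice", "super positive"])
print(ranked)
```

Texts that share no words with any example get a score of zero and end up at the bottom. That is the main limitation of word counts, and the reason embeddings can help.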
While the sorting will likely cover most active labelling use-cases, you
may also want an object that's a bit more flexible. For that you may use
the AkinClassifier.
import pandas as pd
from akin import AkinClassifier
examples = {
    "positive": ["thanks so much", "compliment", "i like this!"],
    "negative": ["this stinks", "you suck"],
}
akin = AkinClassifier(examples=examples)
df = pd.read_csv("<some>/<file>.csv")
# Calculate distances for the original dataframe
akin.assign_distances(df)
# Predict a single item
akin.predict_single(text="thanks, that's nice of you")
# Construct a generator that yields the {text, distances} dictionary for each item
g = akin.pipe(df["text"])
next(g)
I like to build in public, but I should stress that this repo was made as a utility for myself. Honestly, it was put together in a single evening. Feel free to re-use it, but don't expect maintenance or production-quality code in the long term.