koaning / akin

Some text similarity utilities

akin

The goal of akin is to make it easy to sort text based on numeric similarity.

Install

You can install this tool via pip.

python -m pip install "akin @ git+https://github.com/koaning/akin.git"

Usage

The simplest way to use this tool is to sort texts.

import pandas as pd
from akin import sort_dataframe

# Load a csv file that has a text column named "text".
dataf = pd.read_csv("data.csv")
# Sort this dataframe such that we prefer rows whose texts
# are similar to the examples passed in the line below.
dataf.pipe(sort_dataframe, examples=["very nice", "super positive"], text_col="text")

In this basic setting, we're really just using CountVectorizer from scikit-learn to compute the similarity between two texts based on bag-of-words counts. We could get a bit fancier, though, by using word embeddings from whatlies. The library supports any embedding, as long as it's implemented with the scikit-learn API in mind.

from whatlies.language import BytePairLanguage

bp_lang = BytePairLanguage("en")

dataf.pipe(sort_dataframe, 
           examples=["very nice", "super positive"], 
           text_col="text", 
           featurizer=bp_lang)
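
To make the count-based setting a bit more concrete, here is a minimal, dependency-free sketch of sorting texts by bag-of-words similarity. It is not akin's actual implementation: the helper names (bag_of_words, cosine_similarity, sort_by_similarity), the choice of cosine similarity, and the choice to score each text by its best-matching example are all assumptions made for illustration, and plain word counts stand in for scikit-learn's CountVectorizer.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def sort_by_similarity(texts, examples):
    """Sort texts so those closest to any example come first (hypothetical helper)."""
    example_bags = [bag_of_words(e) for e in examples]

    def score(text):
        bag = bag_of_words(text)
        # Assumption: a text is scored by its best-matching example.
        return max(cosine_similarity(bag, e) for e in example_bags)

    return sorted(texts, key=score, reverse=True)


texts = ["the weather is bad", "this is very nice", "super positive vibes"]
ranked = sort_by_similarity(texts, examples=["very nice", "super positive"])
print(ranked)
```

Texts that share no words with any example get a score of zero and end up at the bottom, which is the main limitation that embeddings are meant to address.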

While sorting will likely cover most active labelling use-cases, you may also want an object that's a bit more flexible. For that, you can use the AkinClassifier.

import pandas as pd
from akin import AkinClassifier

examples = {
    "positive": ["thanks so much", "compliment", "i like this!"],
    "negative": ["this stinks", "you suck"],
}
akin = AkinClassifier(examples=examples)
df = pd.read_csv("<some>/<file>.csv")

# Calculate distances for the original dataframe
akin.assign_distances(df)

# Predict a single item
akin.predict_single(text="thanks, that's nice of you")

# Construct a generator that yields the {text, distances} dictionary for each item
g = akin.pipe(df["text"])
next(g)
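
As a rough illustration of the idea behind per-label distances, the sketch below computes, for each label, the smallest distance between a text and that label's examples. It makes no claim about how AkinClassifier works internally: the function names (label_distances, iter_distances, predict_label) are hypothetical, and Jaccard distance over word sets is swapped in as a simple stand-in for whatever featurizer and metric the library uses.

```python
import re


def tokens(text):
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def label_distances(text, examples):
    """Smallest distance from the text to any example of each label."""
    toks = tokens(text)
    return {
        label: min(jaccard_distance(toks, tokens(e)) for e in label_examples)
        for label, label_examples in examples.items()
    }


def iter_distances(texts, examples):
    """Lazily yield a {text, distances} dictionary for each text."""
    for text in texts:
        yield {"text": text, "distances": label_distances(text, examples)}


def predict_label(text, examples):
    """Pick the label whose nearest example is closest to the text."""
    distances = label_distances(text, examples)
    return min(distances, key=distances.get)


examples = {
    "positive": ["thanks so much", "i like this"],
    "negative": ["this stinks", "you suck"],
}
first = next(iter_distances(["thanks, i like it"], examples))
print(first)
print(predict_label("thanks, i like it", examples))
```

Using a generator keeps memory use flat when the text column is large, since each dictionary is only built when it is requested.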

Warning

I like to build in public, but I should stress that this is a repo made for my own utility. Honestly, it was made in a quick evening. Feel free to re-use it, but don't expect maintenance or production-quality code in the long term.

License: MIT License