Add `similarity` utility.
koaning opened this issue
vincent d warmerdam commented
Something like this:
import numpy as np
from sklearn.metrics import pairwise_distances

# from embetter.utils import similarity  # proposed import path; not available yet


def calc_distances(
    inputs, anchors, pipeline, anchor_pipeline=None, metric="cosine", aggregate=np.max, n_jobs=None
):
    """
    Shortcut to compare a sequence of inputs to a set of anchors.

    The available metrics are: `cityblock`, `cosine`, `euclidean`, `haversine`, `l1`, `l2`, `manhattan` and `nan_euclidean`.
    You can read a verbose description of the metrics [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics).

    Arguments:
        - inputs: sequence of inputs to calculate scores for
        - anchors: set/list of anchors to compare against
        - pipeline: the pipeline to use to calculate the embeddings
        - anchor_pipeline: the pipeline to apply to the anchors, meant to be used if the anchors should use a different pipeline
        - metric: the distance metric to use
        - aggregate: function that reduces the distances to the different anchors down to a single score per input; numpy functions that accept axis=1, like `np.max` and `np.mean`, can be used
        - n_jobs: set to -1 to use all cores for calculation
    """
    # Embed the inputs, then the anchors (with their own pipeline if one is given).
    X_input = pipeline.transform(inputs)
    if anchor_pipeline is not None:
        X_anchors = anchor_pipeline.transform(anchors)
    else:
        X_anchors = pipeline.transform(anchors)
    # One row per input, one column per anchor.
    X_dist = pairwise_distances(X_input, X_anchors, metric=metric, n_jobs=n_jobs)
    return aggregate(X_dist, axis=1)
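
One thing to keep in mind with `aggregate`: the values are distances, so `np.max` scores an input by its farthest anchor and `np.min` by its nearest one. Below is a dependency-free sketch of that behaviour. The names `cosine_distance`, `pairwise` and `aggregate_rows` are hypothetical stand-ins for what `pairwise_distances` and the numpy aggregate do; they are not part of embetter, and the vectors are made up.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise(inputs, anchors):
    """Distance matrix with one row per input and one column per anchor."""
    return [[cosine_distance(x, a) for a in anchors] for x in inputs]


def aggregate_rows(dist, aggregate):
    """Reduce every row to one score, like aggregate(X_dist, axis=1)."""
    return [aggregate(row) for row in dist]


# Toy "embeddings" (assumed values, purely for illustration).
inputs = [[1.0, 0.0], [1.0, 1.0]]
anchors = [[1.0, 0.0], [0.0, 1.0]]

dist = pairwise(inputs, anchors)
nearest = aggregate_rows(dist, min)   # distance to the closest anchor
farthest = aggregate_rows(dist, max)  # distance to the most distant anchor
```

The first input coincides with one anchor and is orthogonal to the other, so its score depends heavily on the aggregate. The second input sits at the same angle to both anchors, so its score is the same either way.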
vincent d warmerdam commented
Then the Prodigy recipe might use something like:
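from itertools import islice
from prodigy.sorters import ExpMovingAverage, prefer_low_scores

def batched(iterable, size=32):
    # Assumed stand-in for the undefined `batched` helper; the batch size is arbitrary.
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch

def make_scored_stream(stream, anchors, pipeline):
    for batch in batched(stream):
        batch_text = [b["text"] for b in batch]
        distances = calc_distances(batch_text, anchors, pipeline)
        for score, ex in zip(distances, batch):
            yield score, ex

def sorted_stream(stream):
    return prefer_low_scores(ExpMovingAverage(stream))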
Worth rethinking though. Re-embedding the anchors on every batch feels a bit wasteful.
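
One possible way around that, as a rough sketch only: embed the anchors once up front and reuse the vectors for every batch. Everything here is hypothetical. `AnchorScorer` is not an existing embetter or Prodigy class, `CountingPipeline` is a toy stand-in for a real embedding pipeline that only exists to count `transform` calls, and plain Euclidean distance via `math.dist` replaces `pairwise_distances` to keep the example dependency-free.

```python
import math


class CountingPipeline:
    """Hypothetical stand-in for an embedding pipeline; counts transform calls."""

    def __init__(self):
        self.calls = 0

    def transform(self, texts):
        self.calls += 1
        # Toy "embedding": (text length, number of lowercase vowels).
        return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in texts]


class AnchorScorer:
    """Embeds the anchors once and reuses those vectors for every batch."""

    def __init__(self, anchors, pipeline, aggregate=max):
        self.pipeline = pipeline
        self.aggregate = aggregate
        self.anchor_vectors = pipeline.transform(anchors)  # computed a single time

    def __call__(self, batch_texts):
        vectors = self.pipeline.transform(batch_texts)
        # One aggregated Euclidean distance per input text.
        return [
            self.aggregate(math.dist(v, a) for a in self.anchor_vectors)
            for v in vectors
        ]


pipeline = CountingPipeline()
scorer = AnchorScorer(["cat", "dog"], pipeline, aggregate=min)
first = scorer(["cat", "horse"])
second = scorer(["bird"])
```

With two batches the pipeline runs three times (once for the anchors, once per batch) instead of four. The saving grows with the number of batches, at the cost of keeping the anchor vectors in memory and assuming the anchors do not change mid-stream.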