jina-ai / executor-tagshasher

TagsHasher

Inspired by the idea of FeatureHasher, TagsHasher converts arbitrary .tags into a fixed-length vector.
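To make the idea concrete, here is a minimal sketch of hashing a flat dict of tags into a fixed-length vector. It is not the code TagsHasher actually runs: the helper names `stable_hash` and `hash_tags`, the parameters `n_dim` and `max_value`, and the choice of MD5 are all assumptions made for illustration.

```python
import hashlib


def stable_hash(value, modulus):
    """Map any value to an integer in [0, modulus) via a stable digest."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % modulus


def hash_tags(tags, n_dim=256, max_value=65536):
    """Hash a flat dict of tags into a list of n_dim integers.

    Hypothetical scheme: the key picks the position, the hashed value is
    stored there. If two keys collide, the later one overwrites the earlier.
    """
    vector = [0] * n_dim
    for key, value in tags.items():
        position = stable_hash(key, n_dim)
        # +1 keeps every stored value distinct from the "empty" marker 0
        vector[position] = stable_hash(value, max_value) + 1
    return vector


doc = hash_tags({"Title": "Mannequin", "Year": "1987", "Subject": "Comedy"})
query = hash_tags({"Subject": "Comedy"})

# Both vectors hold the same number at the position chosen by "Subject"
pos = stable_hash("Subject", 256)
print(len(doc), doc[pos] == query[pos])
```

Whatever the tags contain, the output always has `n_dim` entries, so documents with different tag sets become comparable vectors.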

Note that, unlike with FeatureHasher, you should use only Jaccard or Hamming distance when searching documents embedded via TagsHasher. The numeric closeness of the values at a given position is meaningless, because each value is the output of a hash function. In FeatureHasher's example, by contrast, each value represents the term frequency of a word.

Hence, with TagsHasher the only thing that matters is whether two embedded vectors hold identical values at a given position.
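A small sketch may help show why only identity matters. The vectors below are made-up numbers, not real TagsHasher output, and `hamming_distance` is a hypothetical helper written for this illustration. A value that is off by one counts exactly the same as a value that is far away, because both simply mean "a different tag value".

```python
def hamming_distance(a, b):
    """Fraction of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Toy hashed vectors (made-up values): each position holds a hash output.
reference = [1041, 0, 77, 0]
near_miss = [1042, 0, 77, 0]  # first value off by one
far_miss = [9000, 0, 77, 0]   # first value far away
exact = [1041, 0, 77, 0]      # identical everywhere

print(hamming_distance(reference, near_miss))  # 0.25
print(hamming_distance(reference, far_miss))   # 0.25
print(hamming_distance(reference, exact))      # 0.0
```

A Euclidean or cosine metric would rank `near_miss` far ahead of `far_miss`, which is exactly the misleading behaviour to avoid here.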

This demo requires jina>=2.2.5. If you encounter an error, try the latest Jina master.

Example

I will keep everything out of the Flow to make it clear:

import io

from jina import Document, DocumentArray, Executor
from jina.types.document.generators import from_csv

# Load some online CSV file dataset
src = Document(
    uri='https://perso.telecom-paristech.fr/eagan/class/igr204/data/film.csv'
).convert_uri_to_text('iso8859')
da = DocumentArray(from_csv(io.StringIO(src.text), dialect='auto'))

# use TagsHasher to encode data
th = Executor.from_hub('jinahub://TagsHasher')
th.encode(da)

# build some filters
filters = [
    {"Subject": "Comedy"},
    {"Year": 1987},
    {"Subject": "Comedy", "Year": 1987}
]

# build Documents from filters
qa = DocumentArray([Document(tags=f) for f in filters])

# and then use TagsHasher to encode them
th.encode(qa)

# match and show the top 5; note the Jaccard metric here
# it requires scipy, as Jaccard is not natively supported by Jina
qa.match(da, limit=5, exclude_self=True, metric='jaccard', use_scipy=True)

# print each filter with its matches; input() waits for Enter before the next filter
for d in qa:
    print('my filter is:', d.tags.json())
    for m in d.matches:
        print(m.tags.json())
    input()

my filter is: {
  "Subject": "Comedy"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Chase, Chevy",
  "Actress": "",
  "Awards": "No",
  "Director": "",
  "Length": "",
  "Popularity": "82",
  "Subject": "Comedy",
  "Title": "Valkenvania",
  "Year": "1990"
}
{
  "*Image": "paulNewman.png",
  "Actor": "Newman, Paul",
  "Actress": "",
  "Awards": "No",
  "Director": "",
  "Length": "",
  "Popularity": "28",
  "Subject": "Comedy",
  "Title": "Secret War of Harry Frigg, The",
  "Year": "1968"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Murphy, Eddie",
  "Actress": "",
  "Awards": "No",
  "Director": "",
  "Length": "",
  "Popularity": "56",
  "Subject": "Comedy",
  "Title": "Best of Eddie Murphy, Saturday Night Live, The",
  "Year": "1989"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Mastroianni, Marcello",
  "Actress": "",
  "Awards": "No",
  "Director": "Fellini, Federico",
  "Length": "",
  "Popularity": "29",
  "Subject": "Comedy",
  "Title": "Ginger & Fred",
  "Year": "1993"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Piscopo, Joe",
  "Actress": "",
  "Awards": "No",
  "Director": "",
  "Length": "60",
  "Popularity": "14",
  "Subject": "Comedy",
  "Title": "Joe Piscopo New Jersey Special",
  "Year": "1987"
}


my filter is: {
  "Year": 1987.0
}
{
  "*Image": "NicholasCage.png",
  "Actor": "",
  "Actress": "Madonna",
  "Awards": "No",
  "Director": "",
  "Length": "50",
  "Popularity": "75",
  "Subject": "Music",
  "Title": "Madonna Live, The Virgin Tour",
  "Year": "1987"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Piscopo, Joe",
  "Actress": "",
  "Awards": "No",
  "Director": "",
  "Length": "60",
  "Popularity": "14",
  "Subject": "Comedy",
  "Title": "Joe Piscopo New Jersey Special",
  "Year": "1987"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Everett, Rupert",
  "Actress": "",
  "Awards": "No",
  "Director": "",
  "Length": "95",
  "Popularity": "25",
  "Subject": "Drama",
  "Title": "Hearts of Fire",
  "Year": "1987"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Lambert, Christopher",
  "Actress": "Sukowa, Barbara",
  "Awards": "No",
  "Director": "Cimino, Michael",
  "Length": "",
  "Popularity": "41",
  "Subject": "Drama",
  "Title": "Sicilian, The",
  "Year": "1987"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Hubley, Whip",
  "Actress": "",
  "Awards": "No",
  "Director": "Rosenthal, Rick",
  "Length": "98",
  "Popularity": "87",
  "Subject": "Action",
  "Title": "Russkies",
  "Year": "1987"
}


my filter is: {
  "Subject": "Comedy",
  "Year": 1987.0
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Piscopo, Joe",
  "Actress": "",
  "Awards": "No",
  "Director": "",
  "Length": "60",
  "Popularity": "14",
  "Subject": "Comedy",
  "Title": "Joe Piscopo New Jersey Special",
  "Year": "1987"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Murphy, Eddie",
  "Actress": "",
  "Awards": "No",
  "Director": "Murphy, Eddie",
  "Length": "90",
  "Popularity": "51",
  "Subject": "Comedy",
  "Title": "Eddie Murphy Raw",
  "Year": "1987"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "McCarthy, Andrew",
  "Actress": "Cattrall, Kim",
  "Awards": "No",
  "Director": "Gottlieb, Michael",
  "Length": "",
  "Popularity": "23",
  "Subject": "Comedy",
  "Title": "Mannequin",
  "Year": "1987"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Williams, Robin",
  "Actress": "",
  "Awards": "No",
  "Director": "Levinson, Barry",
  "Length": "120",
  "Popularity": "37",
  "Subject": "Comedy",
  "Title": "Good Morning, Vietnam",
  "Year": "1987"
}
{
  "*Image": "NicholasCage.png",
  "Actor": "Boys, The Fat",
  "Actress": "",
  "Awards": "No",
  "Director": "Schultz, Michael",
  "Length": "86",
  "Popularity": "69",
  "Subject": "Comedy",
  "Title": "Disorderlies",
  "Year": "1987"
}
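The results above show a filter with one or two tags matching documents that carry ten. The sketch below suggests why that can work with a Jaccard-style distance. It assumes the generalised definition "among positions where at least one vector is non-zero, the fraction that differ"; this is my assumption for illustration and not a statement about what scipy computes internally. The vectors are made-up numbers and `jaccard_distance` is a hypothetical helper.

```python
def jaccard_distance(a, b):
    """Among positions non-zero in either vector, the fraction that differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    occupied = [(x, y) for x, y in zip(a, b) if x != 0 or y != 0]
    if not occupied:
        return 0.0  # two empty vectors are treated as identical
    return sum(x != y for x, y in occupied) / len(occupied)


# Toy hashed vectors (made-up values)
query = [0, 512, 0, 0, 0, 0]     # a filter with a single tag
hit = [31, 512, 0, 877, 0, 45]   # same value at the filter's position
miss = [31, 999, 0, 877, 0, 45]  # different value at the filter's position

print(jaccard_distance(query, hit))   # 0.75
print(jaccard_distance(query, miss))  # 1.0
```

The extra tags on a document raise the distance for every candidate alike, so documents that agree with the filter still come out ahead of those that do not.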


License: Apache License 2.0
