HMNI

Fuzzy name matching with machine learning. Perform common fuzzy name matching tasks including similarity scoring, record linkage, deduplication and normalization.

HMNI is trained on an internationally transliterated Latin first-name dataset, with priority given to precision over recall.

Model	Accuracy	Precision	Recall	F1-Score
HMNI-Latin	0.9393	0.9255	0.7548	0.8315

For an introduction to the methodology and research behind HMNI, please refer to my blog post.

Requirements

Python 3.5–3.8

tensorflow
scikit-learn
fuzzywuzzy
abydos
unidecode

QUICK USAGE GUIDE

Installation

Using PIP via PyPI

pip install hmni

Initialize a Matcher Object

import hmni
matcher = hmni.Matcher(model='latin')

Single Pair Similarity

matcher.similarity('Alan', 'Al')
# 0.6838303319889133

matcher.similarity('Alan', 'Al', prob=False)
# 1

matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133

Record Linkage

import pandas as pd

df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold']})
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})

merged = matcher.fuzzymerge(df1, df2, how='left', on='name')

Name Deduplication and Normalization

names_list = ['Alan', 'Al', 'Al', 'James']

matcher.dedupe(names_list, keep='longest')
# ['Alan', 'James']

matcher.dedupe(names_list, keep='frequent')
# ['Al', 'James']

matcher.dedupe(names_list, keep='longest', replace=True)
# ['Alan', 'Alan', 'Alan', 'James']

Matcher Parameters

hmni.Matcher(model='latin', prefilter=True, allow_alt_surname=True, allow_initials=True, allow_missing_components=True)

model (str) -- HMNI statistical model (latin by default)
prefilter (bool) -- Whether the matcher should prefilter unlikely candidates (True by default)
allow_alt_surname (bool) -- Whether the matcher should consider phonetically matching surnames, e.g. Smith and Schmidt (True by default)
allow_initials (bool) -- Whether the matcher should consider names with initials (True by default)
allow_missing_components (bool) -- Whether the matcher should consider names with missing components (True by default)

Matcher Methods

similarity(name_a, name_b, prob=True, threshold=0.5, surname_first=False)

name_a (str) -- First name for comparison
name_b (str) -- Second name for comparison
prob (bool) -- If True, return the predicted probability; otherwise return a binary class label (True by default)
threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
surname_first (bool) -- If name strings start with surname (False by default)
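
The relationship between prob and threshold can be illustrated with a minimal sketch. The helper `to_label` is hypothetical and is not part of HMNI; it assumes that a probability at or above the threshold counts as a match, which the parameter description does not state explicitly.

```python
def to_label(probability: float, threshold: float = 0.5) -> int:
    """Convert a match probability into a binary class label (hypothetical helper)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    # Assumption: a probability at or above the threshold counts as a match.
    return 1 if probability >= threshold else 0


score = 0.6838303319889133  # probability reported for ('Alan', 'Al') in the usage guide
print(to_label(score))       # label under the default threshold of 0.5
print(to_label(score, 0.8))  # label under a stricter threshold
```

Raising the threshold trades recall for precision: fewer pairs are labelled as matches, but those that are labelled carry a higher predicted probability.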

fuzzymerge(df1, df2, how='inner', on=None, left_on=None, right_on=None, indicator=False, limit=1, threshold=0.5, allow_exact_matches=True, surname_first=False)

df1 (pandas DataFrame or named Series) -- First/Left object to merge with
df2 (pandas DataFrame or named Series) -- Second/Right object to merge with
how (str) -- Type of merge to be performed
- inner (default): Use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
- left: Use only keys from left frame, similar to a SQL left outer join; preserve key order
- right: Use only keys from right frame, similar to a SQL right outer join; preserve key order
- outer: Use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
on (label or list) -- Column or index level names to join on. These must be found in both DataFrames
left_on (label or list) -- Column or index level names to join on in the left DataFrame
right_on (label or list) -- Column or index level names to join on in the right DataFrame
indicator (bool) -- If True, adds a column to output DataFrame called “_merge” with information on the source of each row (False by default)
limit (int) -- Maximum number of top-scoring name matches to consider (1 by default)
threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
allow_exact_matches (bool) -- If True allow merging on exact name matches, else do not consider exact matches (True by default)
surname_first (bool) -- If name strings start with surname (False by default)
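
How limit, threshold and allow_exact_matches might interact when choosing candidates for a single name can be sketched as follows. This is an illustration only: `select_matches` is a hypothetical helper, the scores are made up rather than produced by HMNI, and the inclusive threshold is an assumption.

```python
from typing import List, Tuple


def select_matches(
    scored: List[Tuple[str, float]],
    query: str,
    limit: int = 1,
    threshold: float = 0.5,
    allow_exact_matches: bool = True,
) -> List[Tuple[str, float]]:
    """Pick the best-scoring candidates for one query name (hypothetical helper)."""
    kept = []
    for name, score in scored:
        if not allow_exact_matches and name == query:
            continue  # skip identical strings when exact matches are disallowed
        if score >= threshold:  # assumption: the threshold is inclusive
            kept.append((name, score))
    # Highest score first, then keep at most `limit` candidates.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Illustrative, made-up scores for the query 'Al'; not produced by HMNI.
candidates = [("Mark", 0.02), ("Alan", 0.68), ("Al", 1.0), ("Harold", 0.11)]
print(select_matches(candidates, "Al"))
print(select_matches(candidates, "Al", allow_exact_matches=False))
print(select_matches(candidates, "Al", limit=2))
```

With limit=1 each left-hand name joins to at most one right-hand name; a larger limit can produce several merged rows per name.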

dedupe(names, threshold=0.5, keep='longest', reverse=True, limit=3, replace=False, surname_first=False)

names (list) -- List of names to dedupe
threshold (float) -- Prediction probability threshold for positive match (0.5 by default)
keep (str) -- Specifies method for keeping one of multiple alternative names
- longest (default): Keeps longest name
- frequent: Keeps most frequent name in names list
reverse (bool) -- If True, sort matches in descending order, else ascending (True by default)
limit (int) -- Maximum number of top-scoring name matches to consider (3 by default)
replace (bool) -- If True return normalized name list, else return deduplicated name list (False by default)
surname_first (bool) -- If name strings start with surname (False by default)
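
The effect of keep and replace can be sketched in plain Python once matching names have been grouped. The sketch assumes the grouping is already known (in HMNI it would come from the model's similarity predictions); `pick_canonical` and `dedupe_clusters` are hypothetical helpers, not library functions.

```python
from collections import Counter
from typing import List


def pick_canonical(cluster: List[str], keep: str = "longest") -> str:
    """Choose one representative name from a group of matching names."""
    if keep == "longest":
        return max(cluster, key=len)  # first longest name wins ties
    if keep == "frequent":
        return Counter(cluster).most_common(1)[0][0]
    raise ValueError("keep must be 'longest' or 'frequent'")


def dedupe_clusters(
    clusters: List[List[str]], keep: str = "longest", replace: bool = False
) -> List[str]:
    """Return one name per group, or a normalized list when replace=True."""
    result = []
    for cluster in clusters:
        canonical = pick_canonical(cluster, keep)
        if replace:
            # Normalization: every member is replaced by the representative.
            result.extend([canonical] * len(cluster))
        else:
            # Deduplication: only the representative is kept.
            result.append(canonical)
    return result


# Assumed grouping for ['Alan', 'Al', 'Al', 'James'].
clusters = [["Alan", "Al", "Al"], ["James"]]
print(dedupe_clusters(clusters, keep="longest"))
print(dedupe_clusters(clusters, keep="frequent"))
print(dedupe_clusters(clusters, keep="longest", replace=True))
```

The sketch emits names group by group, so it reproduces the input order only when matching names are adjacent, as they are in this example.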

assign_similarity(name_a, name_b, score)

name_a (str) -- First name for similarity score assignment
name_b (str) -- Second name for similarity score assignment
score (float) -- Assigned similarity score for pair of names

Contributing

Pull requests are welcome. For developers wishing to build a model for Latin or non-Latin writing systems (Chinese, Cyrillic, Arabic), Jupyter notebooks for building models with similar methods are shared in the dev folder.

License

MIT
