name-probability

This repo implements the disambiguation methodology outlined in "How Unique and Traceable are Usernames?", which the paper uses to link users across platforms. While the paper focuses on usernames, I've typically used the method as an additional feature in record linkage tasks -- for example, linking campaign contributions to employment data.

Usage

>>> from NameProbability import NameMatcher
>>> name_list_src = '#LOCATION OF NAME LIST FILE' # or use sample_names.csv in the data directory
>>> # a custom name list should be a text file with one person's name per row
>>> # so far this has only been tested with the "first last" and "last, first" name formats
>>> nameprob = NameMatcher(name_list_location=name_list_src, last_comma_first=True)
>>> nameprob.probSamePerson('john smith', 'john r smith')
0.008288431595531668
>>> nameprob.probSamePerson('zubin jelveh', 'zubin r jelveh')
0.999999999999234634

Installation

python setup.py install

Edit Operation Probability

To compute P(u_1 | u_2) -- the probability that a person uses name u_1 given that the same person uses name u_2 -- we have to compute the probability of each edit operation that takes us from u_1 to u_2. The current implementation estimates these probabilities empirically by taking a sample of 50,000 names and counting the occurrences of each type of edit operation. There is room for improvement here.
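
As a rough illustration of the counting idea, here is a minimal sketch that is not the code in this repo. It assumes you already have pairs of name variants to compare (how the 50,000-name sample is paired up is not described here), and it uses the standard library's difflib to find the edits. The function names edit_operations and estimate_edit_probabilities are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_operations(source, target):
    """Yield (operation, source_chars, target_chars) for each non-matching span.

    Hypothetical helper: relies on difflib's alignment, which is not
    guaranteed to be a minimal edit script.
    """
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield tag, source[i1:i2], target[j1:j2]


def estimate_edit_probabilities(name_pairs):
    """Return the relative frequency of each edit operation seen in name_pairs."""
    counts = Counter()
    for source, target in name_pairs:
        counts.update(edit_operations(source, target))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {operation: n / total for operation, n in counts.items()}


if __name__ == "__main__":
    # Toy pairs for illustration only; a real run would use a large sample.
    sample_pairs = [("john smith", "john r smith"), ("jon smith", "john smith")]
    for operation, prob in estimate_edit_probabilities(sample_pairs).items():
        print(operation, prob)
```

Treating each contiguous edited span as one operation is a simplification; a character-level edit distance alignment would give finer-grained counts.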

About

A tool for assessing the uniqueness of names (full names or usernames). It calculates the probability that a name will appear in a population, or that two different names belong to the same person.
