theriley106 / Patient-Matching

LAHacks Patient Matching Challenge | Winner at the UCLA Hackathon in March 2020

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Patient-Matching

LAHacks Patient Matching Challenge

Online Patient Matching Dashboard

Explanation

We initially intended to take an ML approach to solving the problem, but as we started exploring the dataset we realized that an approach using a combination of soundex values and a weighted levenshtein distance could give us near perfect results.

We went through each column type and manually ranked values that would likely be misheard in a hospital setting. We then created a heuristic technique to match patients to each other based on the algorithm defined below. We were able to achieve 99.5% high accuracy on the given dataset (after changing some weights we are now unclear about the success rate, but it's consistently >92.5%)

We are using these weights:

{
    "PATIENT_ACCT_#": 1.00,
    "First": .930,
    "CURRENT_ZIP_CODE": .555,
    "CURRENT_STATE": .511,
    "CURRENT_STREET_1": .509,
    "DATE_OF_BIRTH": .503,
    "LAST_NAME": .491,
    "FIRST_NAME": .400,
    "MI": .310,
    "CURRENT_CITY": .303,
    "SEX": .100,
    "CURRENT_STREET_2": .100,
    "PREVIOUS_FIRST_NAME": .100,
    "PREVIOUS_MI": .100,
    "PREVIOUS_LAST_NAME": .100,
    "PREVIOUS_STREET_1": .100,
    "PREVIOUS_STREET_2": .100,
    "PREVIOUS_CITY": .100,
    "PREVIOUS_STATE": .100,
    "PREVIOUS_ZIP_CODE": .100,
}

and this psuedocode to compare rows:

Go over every row in the dataset and remove null values
calculate soundex score for the name of each patient

# This example is comparing row1 to row2

iterate over each column value each row
split each value on a space character and calculate the sum of levenshtein distances for each piece
divide the sum of levenshtein distances by the length of the word -> set as distanceScore

if distanceScore is within a certain threshold add the column weight to a combined score value
# our program has a default threshold of 0.2 

the relationship between the rows is now defined numerically
the highest value score is the most similar row

How to test

python calc_accuracy.py {csvfile.csv}

About

LAHacks Patient Matching Challenge | Winner at the UCLA Hackathon in March 2020


Languages

Language:CSS 63.0%Language:Jupyter Notebook 23.5%Language:HTML 7.7%Language:Python 2.9%Language:JavaScript 2.9%