data-science deduplication python dedupe

Data-deduplication

About

This is a model for a data deduplication challenge. It identifies unique patients in a dataset by applying machine learning techniques such as clustering and logistic regression, using the Python library dedupe. The library learns from human-labeled training data and derives the best matching rules for your dataset, so it can find similar records quickly and automatically, even in very large databases.
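The general idea (score candidate record pairs with a logistic model, then cluster the pairs that match) can be illustrated with a small standard-library sketch. It does not use dedupe and is not this repository's code: the field names `name` and `dob`, the weights, the bias, and the 0.5 threshold are all assumptions made up for illustration, whereas dedupe learns its weights from labeled training pairs.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical hand-picked weights; a trained model would learn these.
WEIGHTS = {"name": 6.0, "dob": 6.0}
BIAS = -9.0


def match_probability(a, b):
    """Logistic score built from per-field string similarities in [0, 1]."""
    z = BIAS
    for field, weight in WEIGHTS.items():
        sim = SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
        z += weight * sim
    return 1.0 / (1.0 + math.exp(-z))


def cluster_records(records, threshold=0.5):
    """Return one cluster ID per record, linking pairs scored above threshold."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if match_probability(records[i], records[j]) > threshold:
            parent[find(j)] = find(i)

    # Relabel roots as consecutive IDs in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(records))]


records = [
    {"name": "John Smith", "dob": "1980-01-02"},
    {"name": "Jon Smith", "dob": "1980-01-02"},
    {"name": "Mary Jones", "dob": "1975-06-30"},
    {"name": "Mary Jonse", "dob": "1975-06-30"},
    {"name": "Alan Turing", "dob": "1912-06-23"},
]
cluster_ids = cluster_records(records)
print(cluster_ids)
```

Comparing every pair is quadratic in the number of records, which is why libraries built for large datasets first restrict the comparison to likely candidate pairs. The sketch skips that step to stay short.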

Installation

Install Python and pip for your system, following the guide available here, then clone the repository and install its dependencies:
git clone https://github.com/agarwalgaurav811/Data-deduplication && cd Data-deduplication
pip install -r requirements.txt
pip install -e .

Running Instructions

python main.py

A file named "Deduplication output.csv" is created in the data directory. It contains a new column, 'Cluster ID', that indicates which records refer to the same patient: records sharing a Cluster ID are considered duplicates of each other.
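One way to inspect the result is to group the rows by 'Cluster ID' and keep the groups with more than one record. The following is a minimal sketch using only the standard library. The sample rows and the `name` and `dob` columns are hypothetical stand-ins, since the real columns depend on the input dataset.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(csv_file):
    """Map each 'Cluster ID' value to the list of rows that carry it."""
    clusters = defaultdict(list)
    for row in csv.DictReader(csv_file):
        clusters[row["Cluster ID"]].append(row)
    return dict(clusters)


# Hypothetical sample standing in for the generated output file.
SAMPLE = """Cluster ID,name,dob
0,John Smith,1980-01-02
0,Jon Smith,1980-01-02
1,Mary Jones,1975-06-30
"""

clusters = group_by_cluster(io.StringIO(SAMPLE))

# Clusters with more than one row are groups of suspected duplicates.
duplicates = {cid: rows for cid, rows in clusters.items() if len(rows) > 1}
for cid, rows in duplicates.items():
    print(cid, [row["name"] for row in rows])
```

To run this against the real output, pass an opened file instead of the sample, for example `open("data/Deduplication output.csv", newline="")`, assuming the data directory is named `data` and sits in the working directory.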

