agarwalgaurav811 / Data-deduplication

Model for data deduplication assignment.

Data-deduplication

About

This is the model for the data deduplication challenge. It identifies unique patients in a dataset by applying machine learning techniques, such as clustering and logistic regression, with the help of the Python library dedupe. dedupe takes in human-labeled training data and learns the best rules for your dataset, so it can quickly and automatically find similar records, even in very large databases.
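To make the idea of grouping similar records concrete, here is a toy sketch that uses only the Python standard library. It is not how dedupe works internally: dedupe learns field weights and blocking rules from labeled pairs, whereas this sketch swaps in a fixed `difflib` similarity threshold and a simple union-find merge. The field names (`name`, `dob`), the sample patients, and the `0.85` threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: dict, b: dict, fields: list[str]) -> float:
    """Average case-insensitive string similarity (0..1) across the given fields."""
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields
    ]
    return sum(scores) / len(scores)


def cluster_records(
    records: list[dict], fields: list[str], threshold: float = 0.85
) -> list[int]:
    """Return one cluster ID per record; records linked by a similar pair share an ID."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair (quadratic; fine for a toy example, too slow for big data).
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j], fields) >= threshold:
            parent[find(j)] = find(i)

    # Relabel roots as consecutive cluster IDs in order of first appearance.
    labels: dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(records))]


# Hypothetical patient records, not taken from the project's dataset.
patients = [
    {"name": "John Smith", "dob": "1980-01-02"},
    {"name": "Jon Smith", "dob": "1980-01-02"},
    {"name": "Mary Jones", "dob": "1975-06-30"},
]
cluster_ids = cluster_records(patients, ["name", "dob"])
```

Here the first two records end up in the same cluster and the third stays on its own. A learned model replaces the hand-picked threshold with weights fitted to the labeled examples, which is the part dedupe automates.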

Installation

  • Install Python and pip for your system
  • git clone https://github.com/agarwalgaurav811/Data-deduplication && cd Data-deduplication
  • pip install -r requirements.txt
  • pip install -e .

Running Instructions

python main.py

A file named "Deduplication output.csv" will be created in the data directory. It contains a new column called 'Cluster ID'; records that share a Cluster ID are considered to refer to the same patient.
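One way to inspect the result is to group the output rows by their 'Cluster ID' value. The sketch below assumes only that the column is named 'Cluster ID' as described above; the `name` column and the sample rows are hypothetical stand-ins for the real output, and `group_by_cluster` is an illustrative helper, not part of this project.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(csv_file) -> dict[str, list[dict]]:
    """Map each 'Cluster ID' value to the list of rows that carry it."""
    clusters = defaultdict(list)
    for row in csv.DictReader(csv_file):
        clusters[row["Cluster ID"]].append(row)
    return dict(clusters)


# In-memory stand-in for the output file; the 'name' column is hypothetical.
sample = io.StringIO(
    "Cluster ID,name\n"
    "0,John Smith\n"
    "0,Jon Smith\n"
    "1,Mary Jones\n"
)
clusters = group_by_cluster(sample)

# Keep only clusters with more than one row, i.e. the detected duplicates.
duplicates = {cid: rows for cid, rows in clusters.items() if len(rows) > 1}
```

To run this against the real output, pass an opened file instead of `sample`, for example `open("data/Deduplication output.csv", newline="")`, assuming the data directory sits next to main.py.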
