This is the model for the data deduplication challenge. It identifies unique patients in a dataset using machine learning techniques, namely clustering and logistic regression, with the help of the Python library dedupe. dedupe learns from human-labeled training data and derives the best rules for your dataset, so that similar records can be found quickly and automatically, even in very large databases.
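To give a feel for the idea, here is a minimal standard-library sketch of pairwise scoring followed by clustering. It is not how dedupe or this project is implemented. The records, field names, weights, bias, and threshold are all hypothetical and hand-picked for illustration, whereas dedupe learns its rules from the labeled training data.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records; field names are illustrative, not the project's schema.
records = {
    0: {"name": "John Smith", "dob": "1980-01-02"},
    1: {"name": "Jon Smith", "dob": "1980-01-02"},
    2: {"name": "Mary Jones", "dob": "1975-06-30"},
}

# Hand-picked logistic-regression parameters (assumption: a real model
# would learn these from labeled pairs).
WEIGHTS = {"name": 6.0, "dob": 6.0}
BIAS = -9.0


def similarity(a, b):
    """String similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()


def match_probability(r1, r2):
    """Logistic function applied to a weighted sum of field similarities."""
    z = BIAS + sum(w * similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def cluster(recs, threshold=0.5):
    """Link pairs scoring above the threshold and return a cluster id per record."""
    parent = {key: key for key in recs}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(recs, 2):
        if match_probability(recs[a], recs[b]) >= threshold:
            parent[find(a)] = find(b)
    return {key: find(key) for key in recs}


cluster_ids = cluster(records)
print(cluster_ids)
```

Records 0 and 1 end up with the same cluster id because their fields are nearly identical, while record 2 stays on its own. Comparing every pair is quadratic in the number of records, so this sketch only suits small inputs.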
- Install Python and pip for your system by following the guide available here
git clone https://github.com/agarwalgaurav811/Data-deduplication && cd Data-deduplication
pip install -r requirements.txt
pip install -e .
python main.py
A file named "Deduplication output.csv" will be created in the data directory. It contains a new column called 'Cluster ID'; records that share the same Cluster ID refer to the same patient.
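One way to inspect the result is to group rows by their 'Cluster ID'. The sketch below is only an illustration: the helper `group_by_cluster` is hypothetical, and the in-memory sample with its `name` column stands in for the real file, whose other columns are not documented here.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(rows, cluster_column="Cluster ID"):
    """Collect rows (dicts) into lists keyed by their cluster id."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row[cluster_column]].append(row)
    return dict(clusters)


# Hypothetical stand-in for the output file. To read the real file, replace
# this with: open("data/Deduplication output.csv", newline="")
sample_csv = io.StringIO(
    "Cluster ID,name\n"
    "0,John Smith\n"
    "0,Jon Smith\n"
    "1,Mary Jones\n"
)

clusters = group_by_cluster(csv.DictReader(sample_csv))

# Clusters with more than one row are the suspected duplicates.
duplicates = {cid: rows for cid, rows in clusters.items() if len(rows) > 1}

for cid, rows in duplicates.items():
    print(cid, [row["name"] for row in rows])
```

Note that `csv.DictReader` yields every value as a string, so the cluster ids here are `"0"` and `"1"` rather than integers.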