A fun little library for detecting duplicate contacts. There are both normal and advanced datasets for you to play with in the /data directory.
Prerequisites
- Ensure you have the current version of Node.js or the latest LTS version. This was tested on version 10.11.
Simply clone or fork the repository, then run:
npm install
in the root of the repository.
To run it:
npm start
NOTE: main.js in the root directory is the entry point. This is so that ES6 modules work properly in Node (thank you, esm!)
To run the tests:
npm test
'nuff said.
- This algorithm is NOT suited for non-business contact de-duplication. It assumes, based on the data structure, that the contacts are not people living at the same address with different first names who share a phone number and email address (e.g. older married couples, or families with landlines).
- Further work could be done on data cleaning, testing, validation, and edge-case checking.
- Performance will start to degrade once you have more than 3k-4k nameAddress keys and many obscure, somewhat similar names. This is mainly due to the distance function. Optimizing the search around alphabetical order may help resolve that.
- The nameAddress keys should likely be generated up front at parsing time by the contact itself. That would be a good, simple refactor that would make the code easier to read and faster to execute.
- Metaphone 3 is likely a better/faster way of handling the names; however, it's ~$260 for a license right now.
- The duplicateContactProvider could use some additional CQS (command-query separation) work to help clean up the logic.
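
As a rough illustration of two of the refactors mentioned in the notes above (building the nameAddress key at parse time, and narrowing the distance search), here is a minimal sketch. The Contact fields (name, address), the buildKey and findCandidatePairs helpers, the use of plain Levenshtein distance, and the first-character bucketing are all assumptions made for illustration; none of them are taken from this repository's code.

```javascript
// Hypothetical Contact shape: the field names are assumptions, not the repo's schema.
class Contact {
  constructor({ name, address }) {
    this.name = name;
    this.address = address;
    // Build the key once, at parse time, instead of on every comparison.
    this.nameAddressKey = Contact.buildKey(name, address);
  }

  // Normalize each part (trim, lowercase, collapse whitespace) and join with "|".
  static buildKey(...parts) {
    return parts
      .map((p) => String(p == null ? "" : p).trim().toLowerCase().replace(/\s+/g, " "))
      .join("|");
  }
}

// Plain Levenshtein edit distance, used here as a stand-in distance function.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Group contacts by the first character of their key, then only compare
// contacts inside the same group. This avoids comparing every key to every other key.
function findCandidatePairs(contacts, maxDistance) {
  const buckets = new Map();
  for (const contact of contacts) {
    const initial = contact.nameAddressKey.charAt(0);
    if (!buckets.has(initial)) buckets.set(initial, []);
    buckets.get(initial).push(contact);
  }

  const pairs = [];
  for (const bucket of buckets.values()) {
    for (let i = 0; i < bucket.length; i++) {
      for (let j = i + 1; j < bucket.length; j++) {
        const d = levenshtein(bucket[i].nameAddressKey, bucket[j].nameAddressKey);
        if (d <= maxDistance) pairs.push([bucket[i], bucket[j]]);
      }
    }
  }
  return pairs;
}

// Example with made-up data.
const sampleContacts = [
  new Contact({ name: "Acme Corp", address: "1 Main St" }),
  new Contact({ name: " ACME  Corp ", address: "1 Main St." }),
  new Contact({ name: "Zenith LLC", address: "9 Elm Rd" }),
];
const samplePairs = findCandidatePairs(sampleContacts, 2);
console.log(samplePairs.map((pair) => pair.map((c) => c.nameAddressKey)));
```

Bucketing by first character is a crude stand-in for an alphabetical-order optimization: it cuts the number of comparisons, but it will miss duplicates whose keys differ in the first character, so treat it as a starting point rather than a drop-in fix.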