schardong / music_lsh_duplicates

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

MinHash lyric deduplication

Music lyrics deduplication using MinHash Locality Sensitive Hashing

This project implements a deduplication technique using Locality Sensitive Hashing.

Requirements

This project requires python 3, plus the additional packages listed below:

  • package dataskecth
  • package editdistance
  • package pickle

These packages may be installed using your OS package manager or pip3

pip3 install datasketch editdistance pickle -U

Order of the scripts

First the crawler scripts must be run in order to obtain some data to process.

References

Crawler

Python MinHash LSH

About

License:MIT License


Languages

Language:Python 100.0%