Shirlly / Min_hash_Incremental_Clustering

Min_hash_Incremental_Clustering

Clusters text data using a combination of min-hash clustering and incremental clustering. Min-hash clustering allows near-duplicate texts to be identified efficiently.
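To make the min-hash idea concrete, here is a minimal sketch of how a signature can be computed and compared. It is not taken from this repository: the class name `MinHashSketch`, the word-level shingling, the hash family `(a*x + b) mod p`, and all parameters are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration of min-hash signatures; not the repository's code.
public class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final long[] a; // multipliers of the hash functions
    private final long[] b; // offsets of the hash functions

    public MinHashSketch(int numHashes, long seed) {
        Random rng = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rng.nextInt(Integer.MAX_VALUE - 1); // in 1..PRIME-1
            b[i] = rng.nextInt(Integer.MAX_VALUE);         // in 0..PRIME-1
        }
    }

    // Turn a text into a set of hashed word k-shingles (assumed tokenization).
    static Set<Integer> shingles(String text, int k) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        Set<Integer> out = new HashSet<>();
        for (int i = 0; i + k <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + k)).hashCode());
        }
        if (out.isEmpty() && !text.trim().isEmpty()) {
            out.add(text.trim().toLowerCase().hashCode()); // text shorter than k words
        }
        return out;
    }

    // For each hash function, keep the minimum hash value over all shingles.
    long[] signature(Set<Integer> shingleSet) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int s : shingleSet) {
            long x = Math.floorMod((long) s, PRIME);
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // The fraction of matching positions estimates the Jaccard similarity.
    static double similarity(long[] s1, long[] s2) {
        if (s1.length != s2.length) {
            throw new IllegalArgumentException("signatures must have equal length");
        }
        int equal = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                equal++;
            }
        }
        return (double) equal / s1.length;
    }
}
```

Two documents whose signatures agree in most positions are likely near duplicates. Comparing short fixed-length signatures instead of full texts is what makes the check cheap.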

Input data:

  • Each line in the input file is treated as one document to be clustered.
  • Format: A &#& B &#& Text &#& D (fields separated by the delimiter &#&; Text is the text field)
  • The input format and delimiter can be changed as needed.
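As an illustration of the line format above, the following sketch extracts the text field from one input line. The class name `InputLineParser`, the constants, and the choice to trim surrounding whitespace are assumptions, not the repository's actual parsing code.

```java
import java.util.regex.Pattern;

// Hypothetical parser for lines of the form "A &#& B &#& Text &#& D".
public class InputLineParser {
    static final String DELIMITER = "&#&"; // change to match your data
    static final int TEXT_FIELD = 2;       // zero-based index of the Text field

    static String extractText(String line) {
        // Pattern.quote treats the delimiter literally rather than as a regex.
        String[] fields = line.split(Pattern.quote(DELIMITER), -1);
        if (fields.length <= TEXT_FIELD) {
            throw new IllegalArgumentException("too few fields: " + line);
        }
        return fields[TEXT_FIELD].trim();
    }
}
```

Keeping the delimiter and field index as constants means a different input layout only requires changing those two values.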

Output data:

  • Documents appear in the same order as in the input, each with its cluster label.
  • The elements of each cluster can optionally be saved as well.
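The two outputs described above can be sketched with a simple leader-style incremental pass: each document joins the first existing cluster whose representative is similar enough, otherwise it starts a new cluster. This is only an assumed outline, not the repository's algorithm. The class `IncrementalLabeler`, the threshold parameter, and the use of exact Jaccard similarity on word sets (standing in for a min-hash estimate) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical leader-style incremental clustering; not the repository's code.
public class IncrementalLabeler {
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    // Exact Jaccard similarity, used here in place of a min-hash estimate.
    static double jaccard(Set<String> x, Set<String> y) {
        if (x.isEmpty() && y.isEmpty()) {
            return 1.0;
        }
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        int union = x.size() + y.size() - inter.size();
        return (double) inter.size() / union;
    }

    // Returns one cluster label per document, in input order.
    static int[] assignLabels(List<String> docs, double threshold) {
        List<Set<String>> leaders = new ArrayList<>(); // first member of each cluster
        int[] labels = new int[docs.size()];
        for (int i = 0; i < docs.size(); i++) {
            Set<String> t = tokens(docs.get(i));
            int label = -1;
            for (int c = 0; c < leaders.size(); c++) {
                if (jaccard(t, leaders.get(c)) >= threshold) {
                    label = c;
                    break;
                }
            }
            if (label < 0) { // no cluster is close enough: open a new one
                leaders.add(t);
                label = leaders.size() - 1;
            }
            labels[i] = label;
        }
        return labels;
    }

    // Groups document indices by cluster label (the "cluster elements").
    static Map<Integer, List<Integer>> members(int[] labels) {
        Map<Integer, List<Integer>> out = new TreeMap<>();
        for (int i = 0; i < labels.length; i++) {
            out.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);
        }
        return out;
    }
}
```

The label array corresponds to the first output (one label per input line, in order), and the map corresponds to the optional per-cluster listing.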

Languages

Language:Java 100.0%