lucmski / sentences-similarity-cluster

Calculate similarity of sentences & Cluster the result.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

sentences-similarity-cluster

sensim_cluster calculates the similarity of text data(from file) using Levenshtein distance and clusters(hierarchical clustering) the result. Clustering results are displayed with dendrogram.

Usage

  1. Prepare your data file
  2. Run this program below
# -*- coding: utf-8 -*-
import sys
from sensim_cluster.sensim_cluster import SensimCluster
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram

cluster = SensimCluster('YOUR_DATAFILE_PATH')
ids = cluster.get_ids()
result = cluster.ward()
mod_ids = [id[-6:] for id in ids]
r = dendrogram(result, p=100, truncate_mode='lastp', labels=mod_ids, leaf_rotation=90)
print(r['leaves'])
print(r['ivl'])
plt.ylim(ymin=-10.0)
plt.show()

Docker-Compose

# build from docker-compose.yml
docker-compose build

# run container "app"
docker-compose run app

# kill container
docker-compose kill

# delete container
docker-compose rm

sentences-similarity-cluster (Old Version)

sim_cluster.py calculates the similarity of text data(from file) using Levenshtein distance and clusters(hierarchical clustering) the result. Clustering results are displayed with dendrogram.

Usage

1. Prepare your data file
2. Execute
python sim_cluster.py your_file

Example

1. Prepare the data file

./data/dummydata.csv

A,helloworld
B,hallawerld
C,helldwoody
D,hallowarld
E,galloworld
F,herroworld

2. Execute

python sim_cluster.py ./data/dummydata.csv

3. Result

result

[['A', 'helloworld'], ['B', 'hallawerld'], ['C', 'helldwoody'], ['D', 'hallowarld'], ['E', 'galloworld'], ['F', 'herroworld']]
['A', 'B', 'C', 'D', 'E', 'F']
['helloworld', 'hallawerld', 'helldwoody', 'hallowarld', 'galloworld', 'herroworld']
n0, n0 : 0
n0, n1 : 3
n0, n2 : 4
n0, n3 : 2
n0, n4 : 2
n0, n5 : 2
n1, n0 : 3
n1, n1 : 0
n1, n2 : 6
n1, n3 : 2
n1, n4 : 3
n1, n5 : 5
n2, n0 : 4
n2, n1 : 6
n2, n2 : 0
n2, n3 : 6
n2, n4 : 6
n2, n5 : 6
n3, n0 : 2
n3, n1 : 2
n3, n2 : 6
n3, n3 : 0
n3, n4 : 2
n3, n5 : 4
n4, n0 : 2
n4, n1 : 3
n4, n2 : 6
n4, n3 : 2
n4, n4 : 0
n4, n5 : 4
n5, n0 : 2
n5, n1 : 5
n5, n2 : 6
n5, n3 : 4
n5, n4 : 4
n5, n5 : 0
-------------------------
matrix: [[0, 3, 4, 2, 2, 2], [3, 0, 6, 2, 3, 5], [4, 6, 0, 6, 6, 6], [2, 2, 6, 0, 2, 4], [2, 3, 6, 2, 0, 4], [2, 5, 6, 4, 4, 0]]
-------------------------
[[  3.           4.           3.           2.        ]
 [  1.           6.           4.2031734    3.        ]
 [  0.           5.           4.89897949   2.        ]
 [  7.           8.           7.57187779   5.        ]
 [  2.           9.          12.05542755   6.        ]]

About

Calculate similarity of sentences & Cluster the result.

License:MIT License


Languages

Language:Python 89.8%Language:Dockerfile 10.2%