kooku0 / Text-Similarity-Analysis

text similarity analysis by konlpy

Text-Similarity-Analysis

Text similarity analysis using the TF-IDF algorithm, with the results visualized as a seaborn heatmap.

TF-IDF Algorithm

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. (Wikipedia)
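To make the definition concrete, here is a minimal pure-Python sketch of one common textbook variant: term frequency (count divided by document length) multiplied by inverse document frequency (log of the number of documents over the number of documents containing the term). The three toy documents and the `tf_idf` helper are illustrative assumptions, not part of this repository. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf = relative frequency in this document, idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-tokenized toy documents
docs = [
    "bitcoin price rises".split(),
    "bitcoin price falls".split(),
    "cold weather continues".split(),
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

In this toy corpus, "bitcoin" appears in two of the three documents, so it gets a lower weight than "rises", which appears in only one.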

Requirements

  • Python 3.x.x
  • konlpy
  • JPype

Getting Started

You need to install JPype to use konlpy's morpheme analyzer.

JPype is an effort to allow Python programs full access to Java class libraries.
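Before running the test, it can help to confirm that both packages are visible to your interpreter. The sketch below assumes the usual import names, `jpype` (installed from PyPI as `JPype1`) and `konlpy`; the `missing_packages` helper is hypothetical and only checks that the modules can be found, not that a Java runtime is installed.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# "jpype" is the import name of JPype; "konlpy" provides the morpheme analyzers.
missing = missing_packages(["jpype", "konlpy"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("JPype and konlpy are importable.")
```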

Test

The test compares five news articles, downloaded with newspaper3k.

pip install newspaper3k

1st article: http://v.media.daum.net/v/20171215130602344 (bitcoin related)

2nd article: http://v.media.daum.net/v/20171215130312300 (bitcoin related)

3rd article: http://v.media.daum.net/v/20171215111203921 (bitcoin related)

4th article: http://v.media.daum.net/v/20171216002700566 (weather related)

5th article: http://v.media.daum.net/v/20171215214505350 (weather related)

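from konlpy.tag import Okt
from newspaper import Article
from sklearn.feature_extraction.text import TfidfVectorizer

url_list = ['http://v.media.daum.net/v/20171215130602344',
            'http://v.media.daum.net/v/20171215130312300',
            'http://v.media.daum.net/v/20171215111203921',
            'http://v.media.daum.net/v/20171216002700566',
            'http://v.media.daum.net/v/20171215214505350']

okt = Okt()          # konlpy morpheme analyzer (needs JPype and Java)
mydoclist_okt = []   # one space-joined noun string per article

for url in url_list:
    # Download and parse each Korean article
    article = Article(url, language='ko')
    article.download()
    article.parse()

    # Keep only the nouns so the vocabulary reflects topic words
    okt_nouns = ' '.join(okt.nouns(article.text))
    mydoclist_okt.append(okt_nouns)

# TfidfVectorizer L2-normalizes each row by default, so multiplying the matrix
# by its transpose gives pairwise cosine similarities (1.0 = identical),
# despite the "distances" in the variable name.
tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_matrix_okt = tfidf_vectorizer.fit_transform(mydoclist_okt)
document_distances_okt = (tfidf_matrix_okt * tfidf_matrix_okt.T)
print(document_distances_okt.toarray())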
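The last step multiplies the TF-IDF matrix by its transpose. When every row has unit length (the default `norm='l2'` in `TfidfVectorizer`), each entry of that product is the cosine similarity between two documents. The sketch below illustrates this with plain Python lists; the three row vectors and the helper names are made up for the example and are not TF-IDF values from the articles.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical unnormalized document vectors
rows = [[0.5, 1.2, 0.0], [0.4, 1.0, 0.1], [0.0, 0.0, 2.0]]

# Normalize each row, then take all pairwise dot products (matrix times its transpose)
unit_rows = [l2_normalize(r) for r in rows]
similarity = [[dot(a, b) for b in unit_rows] for a in unit_rows]

for row in similarity:
    print([round(value, 4) for value in row])
```

The diagonal is 1 because each document is identical to itself, and documents that share no terms score 0.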

Result

[[1.         0.98024875 0.95977812 0.         0.00650447]
 [0.98024875 1.         0.9691473  0.         0.00647298]
 [0.95977812 0.9691473  1.         0.         0.006262  ]
 [0.         0.         0.         1.         0.3313002 ]
 [0.00650447 0.00647298 0.006262   0.3313002  1.        ]]

Visualization using a seaborn heatmap.
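The plotting code is not shown here, so the following is only a sketch of how such a heatmap could be drawn with `seaborn.heatmap`. It assumes seaborn and matplotlib are installed; the `plot_similarity_heatmap` helper, the axis labels, the colormap, and the output filename are all illustrative choices, and the matrix is copied from the Result section.

```python
# Similarity matrix copied from the Result section
similarity = [
    [1.0, 0.98024875, 0.95977812, 0.0, 0.00650447],
    [0.98024875, 1.0, 0.9691473, 0.0, 0.00647298],
    [0.95977812, 0.9691473, 1.0, 0.0, 0.006262],
    [0.0, 0.0, 0.0, 1.0, 0.3313002],
    [0.00650447, 0.00647298, 0.006262, 0.3313002, 1.0],
]
# Hypothetical labels based on the article topics listed in the Test section
labels = ["bitcoin 1", "bitcoin 2", "bitcoin 3", "weather 1", "weather 2"]


def plot_similarity_heatmap(matrix, labels, path="similarity_heatmap.png"):
    """Draw an annotated heatmap of the matrix and save it to `path`."""
    # Imported here so the data above can be inspected without plotting libraries.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(matrix, annot=True, fmt=".2f", cmap="Blues",
                     vmin=0.0, vmax=1.0,
                     xticklabels=labels, yticklabels=labels)
    ax.set_title("TF-IDF cosine similarity")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
    return path
```

Calling `plot_similarity_heatmap(similarity, labels)` writes the image to `similarity_heatmap.png`. The three bitcoin articles should appear as a dark block in the top-left corner, clearly separated from the two weather articles.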
