Jason3900 / NLP_proj1


NLP_proj1


Input: an article (with or without a title)

Output: a condensed version of the article that captures its most significant parts.

Time consumption: about 10 s

Sources of data: 1. Chinese Wikipedia 2. News dataset

Description of the process:

  1. Use jieba to segment the datasets into words.
  2. Train a word2vec model on the segmented data (300 dimensions); steps 1-2 are sketched in the first code block after this list.
  3. Use the SIF technique to generate sentence vectors (second sketch below):
    1. Weight each word vector by a / (a + p_w), where a is a smoothing parameter and p_w is the word's frequency.
    2. Apply SVD to the sentence vectors (in SIF this removes the shared first principal component).
  4. Calculate the similarity of each sentence vector with the article vector.
  5. Smooth the similarity scores using convolution and KNN (steps 4-7 are sketched in the final code block below).
  6. Choose the top 20% of sentences after smoothing.
  7. Sort the chosen sentences by their original order.
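A minimal sketch of steps 1-2, assuming a line-per-document plain-text corpus at corpus.txt (a hypothetical path) and gensim >= 4; the repo's actual training script may differ:

```python
import jieba
from gensim.models import Word2Vec

# Step 1: segment each line of the corpus into words with jieba.
with open("corpus.txt", encoding="utf-8") as f:   # hypothetical corpus file
    sentences = [list(jieba.cut(line.strip())) for line in f if line.strip()]

# Step 2: train a 300-dimensional word2vec model on the segmented text.
# `vector_size` is the gensim >= 4 parameter name (it was `size` in gensim 3).
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
model.save("w2v_300d.model")                      # hypothetical output path
```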
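A sketch of step 3 following the standard SIF recipe (Arora et al.): weight each word vector by a / (a + p_w), average per sentence, then use SVD to remove the shared first principal component. All names and data structures here are illustrative, not the repo's API:

```python
import numpy as np

def sif_embeddings(token_lists, word_vecs, word_prob, a=1e-3, dim=300):
    """token_lists: list of token lists; word_vecs: dict word -> dim-d vector;
    word_prob: dict word -> unigram probability p_w; a: SIF smoothing parameter."""
    rows = []
    for tokens in token_lists:
        # Weight each in-vocabulary word vector by a / (a + p_w) and average.
        vecs = [word_vecs[w] * (a / (a + word_prob.get(w, 0.0)))
                for w in tokens if w in word_vecs]
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    X = np.vstack(rows)
    # SVD step: subtract each row's projection onto the first singular
    # vector -- the "common component removal" from the SIF paper.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    return X - np.outer(X @ u, u)
```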
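Steps 4-7 as a hedged sketch: the article vector could be the SIF embedding of the full article, and a neighbour-averaging kernel stands in for the convolution/KNN smoothing, whose exact form is not shown in this page:

```python
import numpy as np

def pick_summary(sent_vecs, article_vec, sentences, keep=0.2, k=2):
    # Step 4: cosine similarity of each sentence vector with the article vector.
    sims = sent_vecs @ article_vec / (
        np.linalg.norm(sent_vecs, axis=1) * np.linalg.norm(article_vec) + 1e-8)
    # Step 5: smooth each score with its k neighbours on either side via a
    # moving-average convolution (one plausible reading of the convolution/KNN
    # smoothing; kernel shape and width are assumptions).
    kernel = np.ones(2 * k + 1) / (2 * k + 1)
    smoothed = np.convolve(sims, kernel, mode="same")
    # Step 6: keep the top 20% of sentences by smoothed score.
    n_keep = max(1, int(len(sentences) * keep))
    top = np.argsort(smoothed)[-n_keep:]
    # Step 7: emit the kept sentences in their original order.
    return [sentences[i] for i in sorted(top)]
```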



Languages

Jupyter Notebook 94.2%
Python 3.8%
HTML 1.9%
JavaScript 0.0%