url:
Input: an article (with or without a title)
Output: a concise version of the article that captures its most significant content
Time_Consumption: about 10 s
Source of data: 1. Chinese Wikipedia 2. News dataset
Description of Process:
- Use jieba to segment the datasets into words
- Train a word2vec model on the segmented data (300 dimensions)
- Use the SIF technique to generate sentence vectors
  - each word vector is weighted by a / (a + p_w), where a is a smoothing parameter and p_w is the word's frequency
- Use SVD to remove the common component (the projection onto the first singular vector) from the sentence vectors
- Calculate the similarity of each sentence vector with the article vector
- Apply convolution and KNN smoothing to the similarity scores
- Choose the top 20% of sentences after smoothing
- Sort the chosen sentences by their original order
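The SIF embedding steps above (frequency-based weighting followed by the SVD common-component removal) can be sketched as follows. This is a minimal illustration with NumPy only: in the real pipeline the `sentences` would come from jieba segmentation and `word_vecs` from the trained 300-dimension word2vec model; the function and parameter names here are hypothetical.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    """SIF sentence vectors: average word vectors weighted by a / (a + p_w),
    then subtract the projection onto the first singular vector (SVD step)."""
    dim = len(next(iter(word_vecs.values())))
    rows = []
    for words in sentences:
        v = np.zeros(dim)
        n = 0
        for w in words:
            if w in word_vecs:
                # Weight rare words up, frequent words down: a / (a + p_w).
                v += (a / (a + word_freq.get(w, 0.0))) * word_vecs[w]
                n += 1
        rows.append(v / max(n, 1))
    X = np.vstack(rows)
    # The first right-singular vector captures the component shared by all
    # sentences; removing its projection is the SVD step of SIF.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    return X - np.outer(X @ u, u)

# Toy usage with 2-dimensional stand-in word vectors.
wv = {"cat": np.array([1.0, 0.0]),
      "sat": np.array([0.0, 1.0]),
      "mat": np.array([1.0, 1.0])}
freq = {"cat": 0.01, "sat": 0.02, "mat": 0.005}
emb = sif_embeddings([["cat", "sat"], ["cat", "mat"]], wv, freq)
```

The weighting constant `a` is conventionally around 1e-3; the original SIF paper treats it as a tunable smoothing parameter.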
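The scoring and selection steps (similarity, convolution and KNN smoothing, top-20% selection, re-sorting) could look roughly like the sketch below. The exact kernel size, `k`, and the precise form of "KNN smoothing" are not specified in this description, so the choices here are assumptions; KNN smoothing is read as averaging each sentence's score with those of its nearest neighbours in embedding space.

```python
import numpy as np

def rank_sentences(sent_vecs, doc_vec, kernel_size=3, k=2, keep=0.2):
    """Score sentences against the article vector, smooth the scores,
    and return the kept sentence indices in their original order."""
    # Cosine similarity of each sentence vector with the article vector.
    sims = (sent_vecs @ doc_vec) / (
        np.linalg.norm(sent_vecs, axis=1) * np.linalg.norm(doc_vec) + 1e-12)
    # Convolution smoothing: moving average over neighbouring positions.
    kernel = np.ones(kernel_size) / kernel_size
    sims = np.convolve(sims, kernel, mode="same")
    # KNN smoothing (assumed variant): average each score with those of
    # its k nearest neighbours in embedding space (self included).
    dists = np.linalg.norm(sent_vecs[:, None, :] - sent_vecs[None, :, :], axis=-1)
    smoothed = np.array([sims[np.argsort(dists[i])[:k + 1]].mean()
                         for i in range(len(sims))])
    # Keep the top 20% of sentences, then restore original order.
    n_keep = max(1, int(np.ceil(keep * len(smoothed))))
    top = np.argsort(smoothed)[-n_keep:]
    return sorted(top.tolist())

# Toy usage: 5 sentence vectors against one article vector.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]])
doc = np.array([1.0, 0.2])
kept = rank_sentences(vecs, doc)
```

Returning sorted indices implements the final step: the summary presents the selected sentences in the order they appear in the article, not by score.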