Chinese-Document-Clustering-using-K-means-and-word2vec
Document clustering is a method for finding structure in a collection of documents so that similar documents can be grouped together. It is an unsupervised grouping of text documents into meaningful groups, which usually represent the topics in the collection.
Development tools and techniques
- Python
- PyCharm
- FoxPro
Difficulties
- Unstructured data: a document is just a sequence of words
- How to capture the semantics of a document
- How to encode a document so that a computer can process it
- Chinese differs from English: the smallest semantic unit is often two or three characters joined together, and words are not separated by spaces
Experiment Flow chart
Raw data preprocessing
- Use FoxPro to store the documents
- Split documents into sentences
- Remove letters and digits, keeping only Chinese characters
- Use jieba to segment each sentence into words
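The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess`, the set of sentence delimiters, and the choice of the basic CJK range U+4E00 to U+9FFF are all assumptions. To keep the sketch dependency-free, the default segmenter splits into single characters; passing `jieba.lcut` as `segment` would give word-level segmentation.

```python
import re

# Sentence-ending punctuation (full-width and ASCII) and line breaks.
# This delimiter set is an assumption, not taken from the project.
SENTENCE_END = re.compile(r"[。！？!?\n]+")
# Anything outside the basic CJK Unified Ideographs block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(document, segment=list):
    """Split a document into sentences, keep only Chinese characters,
    and segment each sentence into tokens.

    `segment` is any callable that maps a string to a list of tokens.
    The default (`list`) yields single characters; pass `jieba.lcut`
    to segment into words instead.
    """
    sentences = []
    for raw in SENTENCE_END.split(document):
        cleaned = NON_CHINESE.sub("", raw)  # drops letters, digits, punctuation
        if cleaned:  # skip sentences with no Chinese left
            sentences.append(segment(cleaned))
    return sentences
```

The result is a list of token lists, one per sentence, which is the input shape that word2vec trainers such as gensim expect.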
Word2vec (gensim, Python 3.6)
- Use the skip-gram model: a word is used to predict its surrounding context words
- The model is a neural network with a single hidden layer
- The input is a one-hot word vector
- The output is the words surrounding the input word, as determined by the window size
- The first trained weight matrix (input layer to hidden layer) is the dictionary we want: every word gets a vector that represents it
- Dimension: 250
- Window size: 5
- Result: a word dictionary of 23,767 vectors, trained on 2,999,179 tokens
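To make the skip-gram setup concrete, the sketch below lists the (input word, context word) training pairs that a fixed window produces for one segmented sentence. The function name `skipgram_pairs` is hypothetical, and the sketch only shows how the pairs are formed; it does not train a network.

```python
def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs for one segmented sentence.

    Each word is paired with every other word at most `window`
    positions to its left or right.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs
```

With gensim, the settings listed above would presumably correspond to `sg=1`, `window=5`, and a vector dimension of 250 passed to `Word2Vec` (the dimension parameter is named `size` in gensim 3.x and `vector_size` in 4.x).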
Document vector
- I compute the document vector by averaging all the word vectors in the document
- Use PCA to reduce the dimensionality
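The averaging step can be sketched as below. The function name `document_vector` and the plain-dict representation of the word vectors are assumptions made for illustration. How the project treats words that are missing from the dictionary is not stated; this sketch skips them and returns a zero vector when no word is known.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the vectors of all in-vocabulary tokens in a document.

    `word_vectors` maps a word to a list of `dim` floats.
    """
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are ignored
        for k in range(dim):
            total[k] += vec[k]
        count += 1
    if count == 0:
        return total  # assumption: zero vector for documents with no known word
    return [x / count for x in total]
```

The resulting vectors, one per document, could then be stacked into a matrix and passed to a PCA implementation such as scikit-learn's `PCA`; the target dimension is not given in this README.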
Clustering
- K-means
- Use silhouette analysis to choose the number of clusters
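The sketch below shows the idea behind choosing the number of clusters with silhouette analysis: run K-means for several candidate values of k and keep the one with the highest mean silhouette coefficient. It is a dependency-free illustration, not the project's code. The names `kmeans`, `mean_silhouette`, and `best_k` are hypothetical, and the farthest-first initialisation is an assumption chosen to make the result deterministic.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Farthest-first initialisation (an assumption, chosen for determinism).
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two non-empty clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other points of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != lab)
        m = max(a, b)
        scores.append((b - a) / m if m > 0 else 0.0)
    return sum(scores) / len(scores)


def best_k(points, candidates):
    """Pick the candidate k with the highest mean silhouette."""
    return max(candidates, key=lambda k: mean_silhouette(points, kmeans(points, k)))
```

In practice the same procedure is usually written with scikit-learn's `KMeans` and `sklearn.metrics.silhouette_score`, which are considerably faster on a collection of this size.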