Text Classification with Word Vectors
- cd into the repository
- wget http://nlp.stanford.edu/data/glove.6B.zip
- unzip glove.6B.zip
- pip install pandas scikit-learn gensim tensorflow keras bs4 nltk
- python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
- python reduction_algo.py [embedding_file] [reduced_dimensions]
- e.g. python reduction_algo.py glove.300d.txt 150 --> the reduced embeddings will be saved in reduced_embeddings_150.txt
- svc.py creates the document vectors and reports the results
- e.g. python svc.py glove.300d.txt 300
- svc_reuters.py creates and evaluates the document vectors
- e.g. python svc_reuters.py glove.300d.txt 300
- Run Word2VecModel_on_Newsgroup.py and Word2VecModel_on_Reuters.py
- Each run creates an embedding file; use it just like the pre-trained vectors for evaluation
- e.g. python Word2VecModel_on_Newsgroup.py 400 --> creates embedding_on_newsgroup_400.txt
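
The reduction step above takes an embedding file and a target dimension and writes a smaller embedding file. The README does not say which algorithm reduction_algo.py uses, so the following is only a minimal sketch that assumes plain PCA on GloVe-style text files (one word per line followed by its vector). The function names `load_embeddings`, `reduce_with_pca` and `save_embeddings` are hypothetical and are not taken from the repository.

```python
import numpy as np


def load_embeddings(path):
    """Read a GloVe-style text file: each line is a word followed by its vector."""
    words, rows = [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip().split(" ")
            if len(parts) < 3:
                continue  # skip blank lines or a word2vec-style "count dim" header
            words.append(parts[0])
            rows.append(np.asarray(parts[1:], dtype=np.float32))
    return words, np.vstack(rows)


def reduce_with_pca(vectors, n_components):
    """Project the vectors onto their top principal components (assumed method)."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def save_embeddings(path, words, vectors):
    """Write the vectors back in the same one-word-per-line text format."""
    with open(path, "w", encoding="utf-8") as fh:
        for word, vec in zip(words, vectors):
            fh.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


# Toy data standing in for a real embedding matrix (20 words, 8 dimensions).
rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 8)).astype(np.float32)
reduced = reduce_with_pca(toy, 4)
print(reduced.shape)  # (20, 4)
```

With a real file, the three functions would be chained: load the 300-dimensional vectors, reduce them to 150, and save them under a name such as reduced_embeddings_150.txt.
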
Embedding | 20Newsgroup | Reuters |
---|---|---|
Glove-300D | 60 | |
Glove-200D | 53 | |
Glove-100D | 50 | |
Glove-Reduced-150D | 51 | |
Glove-Reduced-100D | 42 | |
Glove-Reduced-50D | 36 | |
Fasttext-300D | | |
Fasttext-Reduced-150D | | |
Word2Vec-300D | | |
Word2Vec-Reduced-150D | | |
W2V-Newsgroup-300D | 73 (0.7379182156133829) | x |
W2V-Newsgroup-200D | 67 (0.6736590546999469) | x |
W2V-Newsgroup-400D | 71 (0.7124269782262347) | x |
W2V-Newsgroup-Reduced-150D | 60 (0.6023632501327668) | x |
W2V-Newsgroup-Reduced-100D | x | |
W2V-Newsgroup-Reduced-200D | 64 (0.6427243759957515) | x |
W2V-Reuters-300D | x | 41 (0.4121083377588954) |
W2V-Reuters-200D | x | |
W2V-Reuters-100D | x | |
W2V-Reuters-Reduced-150D | x | 32 (0.3252788104089219) |
W2V-Reuters-Reduced-100D | x | |
W2V-Reuters-Reduced-50D | x | |
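
The scores in the table come from svc.py and svc_reuters.py, which build document vectors from the word embeddings and evaluate them. How the scripts do this is not spelled out here, so the sketch below only illustrates one common approach under two assumptions: a document vector is the mean of its word vectors, and the reported score is plain accuracy. The helper names `document_vector` and `accuracy` and the toy embeddings are made up for illustration.

```python
import numpy as np


def document_vector(tokens, embeddings, dim):
    """Average the vectors of all tokens that have an embedding (assumed scheme)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim, dtype=np.float32)  # no known word: fall back to zeros
    return np.mean(known, axis=0)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())


# Toy 2-dimensional embeddings, invented for this example only.
toy_embeddings = {
    "goal": np.array([1.0, 0.0]),
    "match": np.array([0.8, 0.2]),
    "stock": np.array([0.0, 1.0]),
}

# Words without an embedding ("unknownword") are simply ignored.
doc = document_vector(["goal", "match", "unknownword"], toy_embeddings, dim=2)
score = accuracy([0, 1, 1, 2], [0, 1, 2, 2])
print(doc, score)
```

A matrix of such document vectors could then be passed to a classifier, for example scikit-learn's `LinearSVC`, and the resulting fraction multiplied by 100 to get a percentage like those in the table.
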