MediaUncovered / NewsAnalysis

use word embeddings to uncover bias in newspapers

Compare and decide on a word embedding implementation

Tilana opened this issue

Based on the literature research about general word embeddings (#2), WordRank and word2vec are interesting to investigate and compare. Depending on which implementation we choose, the way of storing and reading the data (#4) might differ...

Gensim Word2Vec: http://radimrehurek.com/gensim/models/word2vec.html

Loading data: Gensim only requires that the input provides sentences sequentially when iterated over. There is no need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…
https://rare-technologies.com/word2vec-tutorial/

Also data streaming in python: https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
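
To make the streaming idea concrete, here is a minimal sketch of a restartable iterable that reads a corpus line by line. The class name `SentenceStream`, the directory layout, and the "one sentence per line, split on whitespace" tokenization are assumptions for illustration only, not something from the tutorials above.

```python
import os
from typing import Iterator, List


class SentenceStream:
    """Iterable that yields one tokenized sentence at a time.

    Assumption (hypothetical layout): `directory` contains plain-text
    files with one sentence per line. Real articles would likely need
    proper sentence splitting and tokenization.
    """

    def __init__(self, directory: str) -> None:
        self.directory = directory

    def __iter__(self) -> Iterator[List[str]]:
        # Sort file names so every pass sees the same order.
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8") as handle:
                # Only the current line is held in memory.
                for line in handle:
                    tokens = line.lower().split()
                    if tokens:  # skip empty lines
                        yield tokens
```

Because `__iter__` starts over on each call, the object can be iterated several times, which matters since Gensim makes more than one pass over the data. Assuming Gensim is installed, such an object could be passed as `Word2Vec(sentences=SentenceStream("corpus/"))`, where `corpus/` is a placeholder path.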

Tensorflow Word2Vec: https://www.tensorflow.org/tutorials/word2vec
no data streaming possible?

DeepLearning4j Word2Vec: https://deeplearning4j.org/word2vec#just
Implementation for Java...
SentenceIterator/DocumentIterator: Used to iterate over a dataset. A SentenceIterator returns strings, while a DocumentIterator works with input streams.