The article: https://medium.com/@adriensieg/text-similarities-da019229c894
Methods studied in the article:
- Jaccard Similarity (see the sketch after this list)
- Different embeddings + K-means
- Different embeddings + Cosine Similarity
- Word2Vec + Smooth Inverse Frequency + Cosine Similarity
- Different embeddings + LSI + Cosine Similarity
- Different embeddings + LDA + Jensen-Shannon distance
- Different embeddings + Word Mover Distance
- Different embeddings + Variational Auto Encoder (VAE)
- Different embeddings + Universal Sentence Encoder
- Different embeddings + Siamese Manhattan LSTM
- Knowledge-based Measures
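The first two families above fit in a few lines. A minimal sketch, assuming a hypothetical `embeddings` dict mapping words to numpy vectors (e.g. loaded from GloVe):

```python
import numpy as np

def jaccard_similarity(a: str, b: str) -> float:
    # |intersection| / |union| of the two token sets
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def sentence_vector(text: str, embeddings: dict) -> np.ndarray:
    # average the vectors of the in-vocabulary tokens
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```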
SOURCES
- [Incredible !!!] : https://github.com/nlptown/nlp-notebooks
- An Introduction to Word Embeddings
- Data exploration with sentence similarity
- Discovering and Visualizing Topics in Texts with LDA (in French!)
- Keras sentiment analysis with Elmo Embeddings
- Multilingual Embeddings - 1. Introduction
- Multilingual Embeddings - 2. Cross-lingual Sentence Similarity
- Multilingual Embeddings - 3. Transfer Learning
- NLP with pretrained models - spaCy and StanfordNLP
- Named Entity Recognition with Conditional Random Fields
- Sequence Labelling with a BiLSTM in PyTorch [sequence labelling tasks such as part-of-speech tagging or named entity recognition]
- Simple Sentence Similarity [Word Mover’s Distance + Smooth Inverse Frequency + InferSent + Google Sentence Encoder + Pearson correlation]
- Text classification with BERT in PyTorch
- Text classification with a CNN in PyTorch
- Traditional text classification with Scikit-learn [ELI5]
- Updating spaCy’s Named Entity Recognition System
- [Tutorial WMD in jupyter notebook] : https://github.com/makcedward/nlp/blob/master/sample/nlp-word_mover_distance.ipynb
- [Word Mover Distance] : https://www.kaggle.com/ankitswarnkar/word-embedding-using-glove-vector
- [lstm-gru-sentiment-analysis] : https://github.com/javaidnabi31/Word-Embeddding-Sentiment-Classification
- [ELMo Contextual Language Embedding] : https://towardsdatascience.com/elmo-contextual-language-embedding-335de2268604
- [Learning Word Embedding (Mathematics)] : https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html
- [A Beginner’s Aha Moments for Word2Vec] : https://yidatao.github.io/2017-08-03/word2vec-aha/
- [GloVe, Word2Vec, fastText classes] : https://github.com/makcedward/nlp/blob/master/sample/nlp-word_embedding.ipynb
- [!!! Very nice tutorial about how word2vec works] : https://towardsdatascience.com/word2vec-made-easy-139a31a4b8ae
- [!!! An implementation guide to Word2Vec using NumPy and Google Sheets] : https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281
- [WordRank embedding: “crowned” is most similar to “king”, not word2vec’s “Canute”] : https://rare-technologies.com/wordrank-embedding-crowned-is-most-similar-to-king-not-word2vecs-canute/
- [Nice !!!] : https://github.com/JacopoMangiavacchi/SwiftNLC/tree/master/ModelNotebooks
- Create Model With GloVe Embedding Bidirectional With Attention
- Create Model With FastText Embedding
- Create Model With GloVe Embedding
- Create Model With GloVe Embedding Bidirectional
- Create Model With NLTK Embedding
- Create Model With NS Linguistic Tagger Embedding
- [Tutorial from ENSAE] : http://www.xavierdupre.fr/app/papierstat/helpsphinx/notebooks/text_sentiment_wordvec.html#les-donnees
- [CoreML with GloVe Word Embedding and Recursive Neural Network - nice tutorial] : https://medium.com/@JMangia/coreml-with-glove-word-embedding-and-recursive-neural-network-part-2-ab238ca90970
- [Big Benchmark] : http://nlp.town/blog/sentence-similarity/
- Average W2V
- Average W2V + Stopwords
- Average W2V + TFIDF
- Average W2V + TFIDF + Stopwords
- Average GloVe
- Average GloVe + Stopwords
- Average GloVe + TFIDF
- Average GloVe + TFIDF + Stopwords
- W2V + WMD
- W2V + Stopwords + WMD
- GloVe + WMD
- GloVe + Stopwords + WMD
- Smooth Inverse Frequency + W2V
- Smooth Inverse Frequency + GloVe
- InferSent (INF)
- GSE (Google Sentence Encoder)
InferSent (INF) = a pre-trained sentence encoder developed by Facebook Research. It is a BiLSTM with max pooling, trained on the SNLI dataset: 570k English sentence pairs, each labelled with one of three categories: entailment, contradiction or neutral.
GSE (Google Sentence Encoder) = Google’s answer to Facebook’s InferSent. It comes in two forms:
- an advanced model that takes the element-wise sum of the context-aware word representations produced by the encoding subgraph of a Transformer model
- a simpler Deep Averaging Network (DAN), where input embeddings for words and bigrams are averaged together and passed through a feed-forward deep neural network
-> the benchmark scores every method by its Pearson correlation with human similarity judgements.
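For reference, a hedged sketch of calling GSE (published as the Universal Sentence Encoder on TensorFlow Hub); the module URL and the TF2-style `hub.load` call are assumptions based on the public tfhub.dev listing:

```python
import numpy as np
import tensorflow_hub as hub

# assumed module URL; check tfhub.dev for the current version
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
u, v = encoder(["How old are you?", "What is your age?"]).numpy()

# scoring a pair boils down to cosine similarity between
# the two 512-d sentence vectors
print(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```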
- [How to predict Quora Question Pairs using Siamese Manhattan LSTM] : https://medium.com/mlreview/implementing-malstm-on-kaggles-quora-question-pairs-competition-8b31b0b16a07
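The “Manhattan” head from that article is tiny: the twin LSTMs share weights, and their final hidden states are compared with exp(-L1). A minimal sketch of just that similarity function:

```python
import numpy as np

def malstm_similarity(h_left: np.ndarray, h_right: np.ndarray) -> float:
    # exp(-||h_left - h_right||_1), which maps any pair into (0, 1]
    return float(np.exp(-np.abs(h_left - h_right).sum()))
```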
- [Latent Semantic Indexing (LSI) - An Example with mathematics] : http://www1.se.cuhk.edu.hk/~seem5680/lecture/LSI-Eg.pdf
- [Finding similar documents with Word2Vec and WMD] : https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html
- [Cosine Similarity] : https://www.machinelearningplus.com/nlp/cosine-similarity/
- [Tutorial on LSI] : http://poloclub.gatech.edu/cse6242/2018spring/slides/CSE6242-820-TextAlgorithms.pdf
- [Earth Mover's Distance notes] : http://robotics.stanford.edu/~scohen/research/emdg/emdg.html#flow_eqw_notopt
- [Earth Mover's Distance slides] : http://robotics.stanford.edu/~rubner/slides/sld014.htm
- [Document Similarity With Word Mover's Distance] : http://jxieeducation.com/2016-06-13/Document-Similarity-With-Word-Movers-Distance/
- [Beyond Cosine -> Jensen-Shannon + Hypothesis Test] : http://stefansavev.com/blog/beyond-cosine-similarity/
- [Great resources with MANY MANY notebooks] : https://www.renom.jp/index.html?c=tutorial
- [Optimal transport, a Swiss Army knife for data science] : https://weave.eu/le-transport-optimal-un-couteau-suisse-pour-la-data-science/
- [BEST TUTORIAL variational-autoencoders] : https://www.jeremyjordan.me/variational-autoencoders/
- [BEST TUTORIAL: Earthmover Distance !!!!!] : https://jeremykun.com/2018/03/05/earthmover-distance/
Problem: compute the distance between point sets whose locations are uncertain (given by samples, differing observations, or clusters).
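A one-liner makes the idea concrete: in 1-D, scipy ships the Wasserstein-1 (earthmover) distance between two empirical samples; for the full WMD over word vectors, gensim's `KeyedVectors.wmdistance` solves the transport problem for you.

```python
from scipy.stats import wasserstein_distance

# two empirical samples ("points with uncertain locations"):
# every unit of mass has to travel 5, so the distance is 5.0
print(wasserstein_distance([0.0, 1.0, 3.0], [5.0, 6.0, 8.0]))
```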
- [Introduction to Wasserstein metric (earth mover’s distance) -> Mathematics] : https://yoo2080.wordpress.com/2015/04/09/introduction-to-wasserstein-metric-earth-movers-distance/
- [Word Mover’s distance calculation between word pairs of two documents] : https://stats.stackexchange.com/questions/303050/word-movers-distance-calculation-between-word-pairs-of-two-documents
- [WMD + Word2Vec] : https://github.com/stephenhky/PyWMD/blob/master/WordMoverDistanceDemo.ipynb
- [Books about Optimal Transport] : https://optimaltransport.github.io/pdf/ComputationalOT.pdf
- [NICE !!!!!! How Autoencoders work - Understanding the math and implementation] : https://www.kaggle.com/shivamb/how-autoencoders-work-intro-and-usecases
- [Word2Vec to convert each question into a semantic vector, then stack a Siamese network to detect if the pair is duplicate] : http://www.erogol.com/duplicate-question-detection-deep-learning/
- [Amazing !!!] : https://github.com/makcedward/nlp
Distance Measurement:
- Euclidean Distance, Cosine Similarity and Jaccard Similarity
- Edit Distance + Levenshtein Distance (see the sketch after this list)
- Word Mover's Distance (WMD)
- Supervised Word Mover's Distance (S-WMD)
- Manhattan LSTM
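A minimal sketch of the edit-distance entry above: the classic dynamic-programming Levenshtein distance, with unit cost for insert, delete and substitute:

```python
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))            # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```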
Text Representation:
1. Traditional Method
- Bag-of-words (BoW) (see the sketch after this list)
- Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA)
2. Character Level
- Character Embedding
3. Word Level
- Negative Sampling and Hierarchical Softmax
- Word2Vec, GloVe, fastText
- Contextualized Word Vectors (CoVe)
- Embeddings from Language Models (ELMo)
- Generative Pre-Training (GPT)
- Contextual String Embeddings
- Self-Governing Neural Networks (SGNN)
- Multi-Task Deep Neural Networks (MT-DNN)
- Generative Pre-Training-2 (GPT-2)
- Universal Language Model Fine-tuning (ULMFiT)
4. Sentence Level
- Skip-thoughts
- InferSent
- Quick-Thoughts
- General Purpose Sentence (GenSen)
- Bidirectional Encoder Representations from Transformers (BERT)
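A minimal sketch of the traditional end of this taxonomy: a bag-of-words document-term matrix with scikit-learn, the baseline that the contextual models above improve on:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)       # sparse document-term counts
print(vectorizer.get_feature_names_out())  # vocabulary order
print(bow.toarray())
```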