A curated list of resources, such as papers, tutorials, code, etc., on the topic of document similarity measures.
Please feel free to create a pull request if you would like to add other awesome papers.
The goal of this repository is to provide a comprehensive overview for students and reseachers.
Document similarity measures are basis the several downstream applications in the area of natural language processing (NLP) and information retrieval (IR).
Among the most common applications are clustering, duplicate or plagirism detection and content-based recommender systems.
We selected the following content while having primarily the recommender systems application in mind.
In particular, we focus on literature recommender systems that need to assess the similarity of long-form and rich content documents. "Long-form" refers to the amount of document content from +100 sentences, whereas rich content means that documents contain aside from text also images, mathematical equations and citations/links.
Documents might be declared as similar, when they, e.g., cover the same topic, use a common set of words or are written in the same font. In IR the dimension of similarity defines the understanding of similarity. We distinguish between the following dimensions: lexical, structural and semantic document similarity.
Moreover, similarity is not a binary decision. In many cases declaring two things as similar or not, is not suitable. Instead, the degree of similarity is measured. Similarity measures express therefore document similarity as normalised scalar score, which is within an interval of zero to one. The highest degree of similarity is measured as one. When two objects are dissimilar, the degree of similarity is zero.
- D. Lin, “An Information-Theoretic Definition of Similarity,” Proc. ICML, pp. 296–304, 1998.
The lexical document similarity of two documents depends on the words, which occur in the document text. A total overlap between vocabularies would result in a lexical similarity of 1, whereas 0 means both documents share no words. This dimension of similarity can be calculated by a simple word-to-word comparison. Methods like stemming or stop word removal increase the effectiveness of lexical similarity.
- J. B. Lovins, “Development of a stemming algorithm,” Mech. Transl. Comput. Linguist., vol. 11, no. June, pp. 22–31, 1968.
- C. Fox, “A stop list for general text,” ACM SIGIR Forum, vol. 24, no. 1–2. pp. 19–21, 1989.
Whereas lexical similarity focuses on the vocabulary, structural similarity describes the conceptual composition of two documents. Structural document similarity stretches from graphical components, like text layout, over similarities in the composition of text segments, e.g., paragraphs or sentences that are lexical similar, to the arrangement of citations and hyperlinks.
Structural similarity is mainly used for semi-structured document formats, such as XML or HTML. A common, yet expensive in computing time, approach is to calculate the minimum cost edit distance between any two document structures. The cost edit distance is the number of actions that are required to change a document so it is identical to another document
- C. Shin and D. Doermann, “Document image retrieval based on layout structural similarity,” Int. Conf. Image Process. Comput. Vision, Pattern Recognit., pp. 606–612, 2006.
- D. Buttler, “A short survey of document structure similarity algorithms,” Int. Conf. Internet Comput., pp. 3–9, 2004.
Two documents are considered as semantically similar when they cover related topics or have the same semantic meaning. A proper determination of the semantic similarity is essential for many IR systems, since in many use cases the users information need is rather about the semantic meaning than the vocabulary or structure of a document. However, measuring topical relatedness is a complex task. Therefore, lexical or structural similarity is often used to approximate semantic similarity. The example below shows that approximating semantic by lexical similarity is not always suitable.
- Our earth is round.
- The world is a globe.
Even if both sentences are lexically different, because they have only one out of five words in common, the semantic meaning of the sentences is synonymous.
Text similarity
- Bär, D., Zesch, T., & Gurevych, I. (2011). A reflective view on text similarity. International Conference Recent Advances in Natural Language Processing, RANLP, (September), 515–520.
- Bär, D., Zesch, T., & Gurevych, I. (2015). Composing Measures for Computing Text Similarity. Technical Report TUD-CS-2015-0017, 1–30.
Psychological
- Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327.
- Medin, D. L., Goldstone, R. L., & Gentner, D. (1993). Respects for Similarity. Psychological Review, 100(2), 254–278.
Narratives
In order to compute the similarity of two documents, a numeric representation of the documents must be derived.
These representations are usually vectors of real numbers, whereby the vectors can be either sparse or dense. The terms "sparse" and "dense" refer to the number of zero vs. non-zero elements in the vectors. A sparse vector is one that contains mostly zeros and few non-zero entries. A dense vector contains mostly non-zeros.
In addition, we distinguish document representations based on the type of data that they rely on, e.g., text, topics, citations.
In the context of machine learning approaches that produce dense vector representations, the process is often refered to as learning of document features.
- Bag-of-Words
- VSM
- TF-IDF (IDF variants, BM25,)
- Contextualized Word Vectors (CoVe). Paper, Code
- Embeddings from Language Models (ELMo). Paper
- Contextual String Embeddings (Zalando Flair). Paper
- Average
- Weighted Average
- Smooth Inverse Frequency. A Simple but Tough-to-Beat Baseline for Sentence Embeddings
We find that using a similarity based on angular distance performs better on average than raw cosine similarity.
InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.
-
ERCNN: Enhanced Recurrent Convolutional Neural Networks for Learning Sentence Similarity Paper. Code Code II
-
DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations Paper Code
- BERT Paper
- GPT Paper
- Generative Pre-Training-2 (GPT-2) Paper
- Universal Language Model Fine-tuning (ULMFiT) Paper
- XLNet
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Paper Code
- Sentence Transformers
- Li, B., Zhou, H., He, J., Wang, M., Yang, Y., & Li, L. (2020). On the Sentence Embeddings from Pre-trained Language Models. EMNLP 2020. (Code)
BERT pooling strategies:
Bert is amazing at encoding texts, just pool the contextualized embeddings. If not pre-training, use the 2nd to last layer as it usually gets better results (see the research in the bert-as-a-service github repo). Reddit
Better BERT embeddings (fix anisotropic problem):
- SimCSE: Simple Contrastive Learning of Sentence Embeddings
- WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach
Overcoming BERT's 512 token limit:
The Transformer self-attention mechanism imposes a quadratic cost with respect to sequence length, limiting its wider application, especially for long text.
-
See ICLR 2020 reviews:
-
Easy-to-use interface to fine-tuned BERT models for computing semantic similarity
-
Ye, Z. et al. 2019. BP-Transformer: Modelling Long-Range Context via Binary Partitioning. (2019). Paper Code
- LSA/LSI
- LDA
- LDA2Vec
Classical
- Bibliographic coupling. Martyn, J (1964)
- Co-Citation. Small, Henry (1973)
- Co-Citation Proximity Analysis. Gipp, Bela; Beel, Joeran (2006)
- Evaluating the CC-IDF citation-weighting scheme: How effectively can ‘Inverse Document Frequency’ (IDF) be applied to references?
-
Cite2vec: Citation-Driven Document Exploration via Word Embeddings. Paper
-
hyperdoc2vec: Distributed Representations of Hypertext Documents. Paper
-
Graph Embedding for Citation Recommendation. Paper
General graph embedding methods:
- DeepWalk: Online Learning of Social Representations. Paper
- LINE: Large-scale Information Network Embedding. Paper. Code
- node2vec. Paper
- Poincare Embeddings. Nickel and Kiela (2017) Gensim Implementation
Various implementations:
- GraphVite - graph embedding at high speed and large scale
- Karate Club is an unsupervised machine learning extension library for NetworkX.
- Concatenate Text + Citation Embeddings: Evaluating Document Representations for Content-based Legal Literature Recommendations
- Citation Prediction as Training Objective: SPECTER: Document-level Representation Learning using Citation-informed Transformers
Nearest neighbours in embedding space are considered to be similar.
-
Euclidean Distance
-
Jaccard Similarity
-
Jensen-Shannon distance
-
Cosine similarity: Cosine-similarity treats all dimensions equally.
-
Soft cosine
-
Manhatten distance = L1 norm (see also Manhattan LSTM)
-
Edit distance
-
Levenshtein Distance
-
Supervised Word Moving Distance (S-WMD)
Implementations
Siamese networks (Bromley, Jane, et al. "Signature verification using a siamese time delay neural network". Advances in neural information processing systems. 1994.) are neural networks containing two or more identical subnetwork components.
It is important that not only the architecture of the subnetworks is identical, but the weights have to be shared among them as well for the network to be called "siamese". The main idea behind siamese networks is that they can learn useful data descriptors that can be further used to compare between the inputs of the respective subnetworks. Hereby, inputs can be anything from numerical data (in this case the subnetworks are usually formed by fully-connected layers), image data (with CNNs as subnetworks) or even sequential data such as sentences or time signals (with RNNs as subnetworks).
-
Siamese LSTM
-
Liu, B. et al. 2018. Matching Article Pairs with Graphical Decomposition and Convolutions. (Feb. 2018). (Code)
-
Siamese Neural Networks built upon multihead attention mechanism for text semantic similarity task
Binary classifcation
Multi-label classification
- (Binary) Cross Entropy
- Mean Square Error
- ...
The two Siamese subnetworks encode input data into u and v. Various approaches exist to then concatenate the encoded input.
- Reimers, N. and Gurevych, I. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. The 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019) (2019).
Variations
Paper | Concatenation | Comment |
---|---|---|
InferSent | (u;v;|u-v|;u*v) |
no evaluation |
Sentence-BERT | (u;v;|u-v|) |
The most important component is the element-wise difference |u−v| ... The element-wise difference measures the distance between the dimensions of the two sentence embeddings, ensuring that similar pairs are closer and dissimilar pairs are. |
Universal Sentence Encoder | ||
Pairwise Document Classification | (u;v;|u-v|;u*v) |
Performaning concatenation (Paper) |
- https://www.reddit.com/r/MachineLearning/comments/e525c6/d_what_beats_concatenation/
- FiLM
- Feature-wise transformations (Distill)
A matching network with multi-layer perceptron (MLP) is a standard way to mix multi-dimensions of information (Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks)
- Mueller, J. and Thyagarajan, A. 2014. Siamese Recurrent Architectures for Learning Sentence Similarity. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16). 2012 (2014), 1386–1393. DOI:https://doi.org/10.1109/CVPR.2014.180.
- Large expert-curated database for benchmarking document similarity detection in biomedical literature search. Paper, Data
- SciDocs - The Dataset Evaluation Suite for SPECTER (for classification, citation prediction, user activity, recommendation)
- CSFCube -- A Test Collection of Computer Science Research Articles for Faceted Query by Example
- STSbenchmark
- A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications
- Chen, Mingang, and Pan Liu. "Performance Evaluation of Recommender Systems." International Journal of Performability Engineering 13.8 (2017).
- Bobadilla, Jesus, et al. "Reliability quality measures for recommender systems." Information Sciences 442 (2018): 145-157.
- Text Similarities: Estimate the degree of similarity between two texts Repo
- Michael J. Pazzani, Daniel Billsus. Content-Based Recommendation Systems (PDF)
- Charu C. Aggarwal. Content-Based Recommender Systems
- Sentence Similarity Calculator (ELMo, BERT and Universal Sentence Encoder, and different similarity measures)
Awesome: