Using 3.7.1 global pyenv. Why do we weight TF-IDF with GloVe? We currently use two versions of TF-IDF - the scikit-learn version and the gensim version. Why? Why not just use the gensim TF-IDF?
- add tfidf scores
- label existing links and shared tags
- easier to read output
- ignore list for output
- add USE +
- do combined ranking
- figure out how to handle tags
- connect to obsidian and add light randomization
- pairing ignores
- document ignores
- add keyword weighting
- clean up real documents
- pull more example graphs
- try training and learning
- add topological features
- try deep learning
- add link sequence features?
- add viewing pattern features?
- pull in online information features?
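The "do combined ranking" item above could work by min-max normalising each ranker's scores and taking a weighted sum. A minimal sketch (function name, weights and scores are all illustrative, not from the repo):

```python
def combine_rankings(score_lists, weights):
    """Weighted sum of per-ranker scores, each min-max normalised to [0, 1]."""
    combined = [0.0] * len(score_lists[0])
    for scores, w in zip(score_lists, weights):
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores are equal
        for i, s in enumerate(scores):
            combined[i] += w * (s - lo) / span
    return combined

# e.g. TF-IDF scores and embedding scores for three documents
tfidf_scores = [0.2, 0.8, 0.5]
glove_scores = [0.9, 0.1, 0.5]
print(combine_rankings([tfidf_scores, glove_scores], [0.5, 0.5]))
```

Equal weights would let the two rankers cancel each other where they disagree, so the weights probably want tuning against real pairings.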
https://groups.google.com/g/gensim/c/8wCZp4Ievk4?pli=1 "You can combine similarity matrices by (weighted) averaging:
combined_similarity_matrix = SparseTermSimilarityMatrix(0.1 * first_similarity_matrix.matrix + 0.9 * second_similarity_matrix.matrix)
You could also view the similarity matrices as sparse directed graphs between words and apply e.g. power iteration to compute a denser closure, where we infer the similarities of previously unconnected words by taking e.g. the harmonic mean of the shortest path between them."
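The weighted-averaging idea from the quote, sketched with plain NumPy arrays standing in for the `.matrix` attribute of each gensim `SparseTermSimilarityMatrix` (toy 3-term matrices and 0.1/0.9 weights are illustrative):

```python
import numpy as np

# Two toy term-similarity matrices over the same 3-term vocabulary, standing in
# for first_similarity_matrix.matrix and second_similarity_matrix.matrix
tfidf_sim = np.eye(3)                      # TF-IDF treats distinct terms as unrelated
glove_sim = np.array([[1.0, 0.6, 0.1],
                      [0.6, 1.0, 0.2],
                      [0.1, 0.2, 1.0]])    # embeddings capture term-term similarity

# Weighted average, as in the mailing-list snippet
combined = 0.1 * tfidf_sim + 0.9 * glove_sim
print(combined[0, 1])  # off-diagonal similarity comes entirely from the embeddings
```

In gensim the combined array would then be wrapped back into a `SparseTermSimilarityMatrix`, as the quote shows.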
https://github.com/4OH4/doc-similarity
Find and rank relevant content in Python using NLP, TF-IDF and GloVe.
This repository includes two methods of ranking text content by similarity:
- Term Frequency - Inverse Document Frequency (TF-IDF)
- Semantic similarity, using GloVe word embeddings
Given a search query (text string) and a document corpus, these methods calculate a similarity metric for each document vs. the query. Both methods exist as standalone modules, with explanation and demonstration code inside examples.ipynb.
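As a rough illustration of the TF-IDF route, here is a from-scratch sketch of ranking documents against a query by cosine similarity of TF-IDF vectors (this is not the repository's implementation, which uses scikit-learn; function name and corpus are made up):

```python
import math
from collections import Counter

def tfidf_rank(query, documents):
    """Rank documents against a query by cosine similarity of TF-IDF vectors."""
    docs = [d.lower().split() for d in documents]
    vocab = {t for d in docs for t in d}
    n = len(docs)
    # Smoothed IDF, roughly as scikit-learn computes it
    idf = {t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1 for t in vocab}

    def vec(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(query.lower().split())
    return [cosine(q, vec(d)) for d in docs]

docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell today"]
scores = tfidf_rank("cat on a mat", docs)
print(scores)  # highest score for the first document; no lexical overlap elsewhere
```

Note the limitation this exposes: "cat" and "cats" score zero overlap, which is exactly the gap the GloVe-based semantic similarity method is meant to fill.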
There is an associated blog post that explains the contents of this repository in more detail.
The code in this repository utilises, is derived from and extends the excellent Scikit-Learn, Gensim and NLTK packages.
Python 3 (v3.7 tested) and the following packages (all available via pip):
pip install scikit-learn~=0.22
pip install gensim~=3.8
pip install nltk~=3.4
Or install via the requirements.txt file:
pip install -r requirements.txt
After installing the requirements (if necessary), open and run examples.ipynb using Jupyter Lab.
This module is a wrapper around the Scikit-Learn TfidfVectorizer, with some additional functionality from nltk to handle stopwords, lemmatization and cosine similarity calculation. To run:
from tfidf import rank_documents
document_scores = rank_documents(search_terms, documents)
There is a self-contained class - DocSim - for running semantic similarity queries. This can be imported as a module and used without additional code:
from docsim import DocSim
docsim = DocSim(verbose=True)
similarities = docsim.similarity_query(query_string, documents)
By default, a GloVe word embedding model is loaded (glove-wiki-gigaword-50), although a custom model can also be used.
The word embedding models can be quite large and slow to load, although subsequent operations are faster. The multi-threaded version of the class loads the model in the background, to avoid locking the main thread for a significant period of time. It is used in a similar way, although it will raise an exception if the model is still loading, so the status of the model_ready property should be checked first. The only difference is the import:
from docsim import DocSim_threaded
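The background-loading pattern that DocSim_threaded describes can be sketched in plain Python - a simplified illustration of the pattern, not the repository's actual code (class name, the stand-in "model" and the polling loop are all made up for this sketch):

```python
import threading
import time

class BackgroundLoader:
    """Load an expensive resource in a background thread, exposing model_ready."""
    def __init__(self, load_fn):
        self._model = None
        self.model_ready = False
        self._thread = threading.Thread(target=self._load, args=(load_fn,), daemon=True)
        self._thread.start()  # returns immediately; main thread is not blocked

    def _load(self, load_fn):
        self._model = load_fn()  # the slow part, e.g. loading word embeddings
        self.model_ready = True

    def query(self, text):
        if not self.model_ready:
            raise RuntimeError("Model is still loading; check model_ready first")
        return self._model(text)

loader = BackgroundLoader(lambda: str.upper)  # stand-in for a slow model load
while not loader.model_ready:                 # caller polls instead of blocking
    time.sleep(0.01)
print(loader.query("ready"))  # READY
```

This mirrors the behaviour described above: queries raise while loading is in progress, so callers should check model_ready before querying.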
To install the package requirements to run the unit tests:
pip install -r requirements_unit_test.txt
To run all test cases, from the repository root:
pytest
Comments and feedback welcome! Please raise an issue if you find any errors or omissions.