suliveevil / org-similarity

Emacs package that helps org-mode users discover similar documents via TF-IDF and cosine similarity

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

org-similarity

org-similarity is a package to help Emacs org-mode users discover similar or related files. Under the hood, it uses Python and scikit-learn for text feature extraction, and nltk for text pre-processing. More specifically, this package provides a function to recursively scan a given directory for org files, clean their content by stripping the front matter and some undesired characters, tokenize them, replace each token with its respective linguistic stem, generate a TF-IDF sparse matrix, and calculate the cosine similarity between these documents and the buffer you are currently working on.

./assets/example.gif

Installation

Clone the repository:

git clone https://github.com/brunoarine/org-similarity

Then load the package by adding it to your load-path in your init.el:

(add-to-list 'load-path "/path/to/org-similarity")
(require 'org-similarity)

Or, with use-package:

(use-package org-similarity
  :load-path  "path/to/org-similarity")

Configuration

There are a few variables that can be set to customize how org-similarity operates and generates the list of similar documents:

;; Directory to scan for possibly similar documents.
;; org-roam users might want to change it to `org-roam-directory'.
(setq org-similarity-directory org-directory)

;; The language passed to the Snowball stemmer in the `nltk' package.  The
;; following languages are supported: Arabic, Danish, Dutch, English, Finnish,
;; French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian,
;; Spanish and Swedish.
(setq org-similarity-language "english")

;; How many similar entries to list at the end of the buffer.
(setq org-similarity-number-of-documents 10)

;; Whether to prepend the list entries with similarity scores.
(setq org-similarity-show-scores nil)

;; Whether the resulting list of similar documents will point to ID property or
;; filename. Default it nil.
;; However, I recommend setting it to `t' if you use `org-roam' v2.
(setq org-similarity-use-id-links nil)

;; Scan for files inside `org-similarity-directory' recursively.
(setq org-similarity-recursive-search nil)

Usage

In a buffer with an open org-mode file, type M-x org-similarity-insert-list RET.

If this is the first time you run that command, it will ask you to install the necessary Python dependencies. By pressing y to continue, it will create a virtual environment under the hood and download the requirements automatically. Once finished, you can close the installation log buffer, and run the command again.

Changelog

2022-12-26 - v0.2

  • Automated installation of Python dependencies (using virtual environments).
  • Better org-roam v2 compatibility.
  • orgparse to parse org-mode files.
  • Refactored and optimized Python code.

2020-12-05 - v0.1-alpha

  • Alpha release of the package.
  • Tested with org-roam v1.

About

Emacs package that helps org-mode users discover similar documents via TF-IDF and cosine similarity

License:GNU General Public License v3.0


Languages

Language:Python 59.6%Language:Emacs Lisp 40.4%