riveSunder / mybrary

private library and smart notes

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

mybrary: CLI semantic search for individual PDFs

mybrary is a simple tool I use to search pdfs using vector similarity metrics on tranformer language model embeddings.

It uses pdfminer to parse pdf text, and a sentence transformer model (Hugging Face) to generate vector embeddings. Currently, pdf_search uses L2 distance and cosine similarity.

Getting started

I use virtualenv to manage dependencies in a virtual environment:

git clone git@github.com:riveSunder/mybrary.git
# git clone https://github.com/riveSunder/mybrary.git

cd mybrary

virtualenv mybrary_env --python=python3.8
source mybrary_env/bin/activate

pip install pdfminer.six==20221105 torch transformers
# or 
# pip install -r requirements.txt

Usage

With the brary/pdf_search.py you can make a semantic search query of a single pdf from the command line. You can check the available input arguments by calling help with `python -m brary.pdf_search -h

usage: pdf_search.py [-h] [-q QUERY [QUERY ...]] [-d DIRECTORY] [-i INPUT] [-k K]

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY [QUERY ...], --query QUERY [QUERY ...]
                        query or queries to search for in pdf text(separate by space and use single quote marks)
  -d DIRECTORY, --directory DIRECTORY
  -i INPUT, --input INPUT
                        filename for pdf of interest
  -k K, --k K           k for top-k matches to report

You can make multiple queries enclosed in single or double quotation marks to distinguish each string. Passing an integer value of 2 or greater to -k will return the top-k results, and the tool returns results for cosine similarity and L2 distance separately by default.

As an example, we can query the book Conway's Game of Life: Mathematics and Construction by Nathaniel Johnston and Dave Greene with a statement about gliders and universal computation (Thanks to the authors for making a free pdf available, -> the hardcover is available for purchase).

The query:

python -m brary.pdf_search -q "Gliders, like the reflex glider discovered by Richard Guy in the 60s, and other mobile patterns provide the basic functions for building universal computation: moving, transforming (in their collisions), and storing (via collision synthesis/destruction) information." -i conway_life_book.pdf -k 1

yields a related snippet:

loading pre-computed vectors from data/conway_life_book.pt

0th best match for query Gliders, like the reflex glider discovered by Richard Guy in the 60s, and other mobile patterns provide the basic functions for building universal computation: moving, transforming (in their collisions), and storing (via collision synthesis/destruction) information.
	 with cosine similarity 0.721

 Remarks

The importance of glider synthesis was known essentially as soon as the glider itself was found in
1970, with common folklore being that we could send gliders as signals throughout the Life plane
and collide those gliders in different ways to simulate arbitrary computations. This basic idea has
been refined and made more precise repeatedly over the past 50 years, to the point that there are now
explicit patterns that do exactly this—they collide gliders so as to perform arbitrary computations and
build almost any pattern of our choosing (we will delve deeply into the specifics of how these patterns
work in
0th best match for query Gliders, like the reflex glider discovered by Richard Guy in the 60s, and other mobile patterns provide the basic functions for building universal computation: moving, transforming (in their collisions), and storing (via collision synthesis/destruction) information.
	 with l2 distance 4.438

 Remarks

The importance of glider synthesis was known essentially as soon as the glider itself was found in
1970, with common folklore being that we could send gliders as signals throughout the Life plane
and collide those gliders in different ways to simulate arbitrary computations. This basic idea has
been refined and made more precise repeatedly over the past 50 years, to the point that there are now
explicit patterns that do exactly this—they collide gliders so as to perform arbitrary computations and
build almost any pattern of our choosing (we will delve deeply into the specifics of how these patterns
work in
enter another query (0 to end)

In the above case cosine similarity and L2 distance return the same passage, but these sometimes differ.

About

private library and smart notes

License:Apache License 2.0


Languages

Language:Python 100.0%