mathlib-semantic-search
Semantic search for Lean mathlib using OpenAI's embeddings API and Faiss.
Tested with python 3.10.4. Packages in requirements.txt
.
Running demo
Save an openai api key to the environment variable OPENAI_API_KEY
. Then cd
into src
and run demo_search.py
. The mathlib embeddings are pre-computed, but do note embedding a query will cost you $0.0004 / 1K tokens
.
Recomputing mathlib embeddings
This step is only documented for reproducibility purposes. It is not necessary to run the demo.
Pull leanprover-community/doc-gen and run the gen_docs
script. When generating the docs, we used mathlib commit hash 06d0adfa76594f304b4650d098273d4366ede61b
. Move the generated export.json
to src/parse_docgen/docgen_export.json
.
Then, in the src/parse_docgen
directory run python parse.py
. Then, cd
into the src/embed_mathlib
directory and run python embed_mathlib.py
and embeddings_to_numpy.py
.
demo_search.py
works
How At a high-level, the script embeds queries using the embeddings api, then uses Faiss to do a fast kNN search against precomputed embeddings of mathlib declarations.
The search indexes into the file src/parse_docgen/docgen_export_with_formal_statement.jsonl
, which is loaded into the docs
variable in src/demo_search.py
. The entries of docs
are in exactly the same format as the entries of the export.json
generated by gen_docs
, so in theory we should be use to reuse the doc-gen
code that maps export.json
entries to html.
Each entry of docs
actually contains one additional field compared to export.json
: the formal_statement
field, which is a deparsed string representation of the theorem statement.