Retrieval Framework

This tool converts scientific PDFs into plain text for your LLM-related needs. It works in two steps:

Convert the PDF to LaTeX using the Mathpix API, which is tailored to scientific papers.
Extract images and tables from the LaTeX and replace them with text generated by a multimodal LLM.
- The prompts are designed to extract all values and relationships represented in each table or graph, minimizing information loss.

Using with LlamaIndex 🦙

See hierarchical_retrieval.ipynb for an example LlamaIndex workflow.

The notebook uses hierarchical retrieval: the text descriptions generated by GPT are used to retrieve the original tables and images.
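
The idea of retrieving an original through its description can be illustrated without any framework. The following is a toy sketch only, not the notebook's code and not a LlamaIndex API: the `Asset` class and `retrieve_original` function are hypothetical names, and plain word overlap stands in for the embedding similarity a real retriever would use.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    """An original table or image plus the text description generated for it."""
    asset_id: str
    description: str  # text produced by the multimodal model
    original: str     # e.g. LaTeX table source or an image path


def _tokens(text):
    """Lowercased whitespace-separated words of the text, as a set."""
    return set(text.lower().split())


def retrieve_original(query, assets):
    """Match the query against descriptions, then return the linked original.

    Returns None when there are no assets or no description shares a word
    with the query.
    """
    query_tokens = _tokens(query)
    best = max(
        assets,
        key=lambda a: len(query_tokens & _tokens(a.description)),
        default=None,
    )
    if best is None or not query_tokens & _tokens(best.description):
        return None
    return best.original
```

The query never touches the original table or image directly; it is scored against the description, and the description's link is followed back to the original.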

Basic usage

Set MATHPIX_APP_ID and MATHPIX_APP_KEY in your environment. We suggest using a .env file.

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(".env"))  # read local .env file
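
A missing credential may otherwise only surface once a conversion is attempted, so it can help to check the environment right after loading it. A minimal sketch, assuming only the two variable names above; the helper `missing_mathpix_credentials` is hypothetical and not part of this tool:

```python
import os

REQUIRED_VARS = ("MATHPIX_APP_ID", "MATHPIX_APP_KEY")


def missing_mathpix_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_mathpix_credentials()` with no argument inspects the current process environment; an empty list means both variables are set.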

Instantiate a text and a vision model. This tool uses LlamaIndex abstractions to interface with LLMs.

from llama_index.llms import OpenAI
from llama_index.multi_modal_llms import OpenAIMultiModal

text_model = OpenAI()
vision_model = OpenAIMultiModal(max_new_tokens=4096)

Next, pass those models to the converter.

converter = MathpixPdfConverter(text_model=text_model, vision_model=vision_model)

Convert the PDF and write the result to a file.

from pathlib import Path

pdf_path = Path("path/to/file.pdf")

pdf_result = converter.convert(pdf_path)

# Write the converted plain text to disk
with Path("output.txt").open("w") as f:
    f.write(pdf_result.content)

Custom workflow

To persist intermediate results or run processing in parallel, use MathpixProcessor and MathpixResultParser directly.

processor = MathpixProcessor()
parser = MathpixResultParser(text_model=text_model, vision_model=vision_model)

mathpix_result = processor.submit_pdf(pdf_path)
mathpix_result = processor.await_result(mathpix_result)
pdf_result = parser.parse_result(mathpix_result)
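
The three calls above can be run for several PDFs at once. A minimal sketch using a thread pool, under the assumption (not confirmed here) that the processor and parser can safely be shared across threads; `convert_many` is a hypothetical helper that takes the three steps as plain callables, so it does not depend on this tool's internals:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def convert_many(pdf_paths, submit, await_result, parse, max_workers=4):
    """Run submit -> await -> parse for each PDF, several PDFs at a time.

    Returns a dict mapping each input path to its parsed result.
    """
    def run_one(path):
        pending = submit(path)            # e.g. processor.submit_pdf
        finished = await_result(pending)  # e.g. processor.await_result
        return parse(finished)            # e.g. parser.parse_result

    paths = [Path(p) for p in pdf_paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result.
        return dict(zip(paths, pool.map(run_one, paths)))
```

With the objects defined above, a call would look like `convert_many(paths, processor.submit_pdf, processor.await_result, parser.parse_result)`. Threads fit here because most of the time is spent waiting on remote services rather than computing locally.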

About

A tool that converts scientific PDFs into plain text for your LLM-related needs, such as building RAGs or agents for academic knowledge. It was developed in collaboration with the LlamaIndex team.
