tensorsense / Retrieval-Framework

A tool that converts scientific PDFs into plain text for your LLM-related needs, such as building RAGs or agents for academic knowledge. It was developed in collaboration with the LlamaIndex team.

Retrieval Framework

This is a tool that converts scientific PDFs into plain text for your LLM-related needs.

  • Convert PDF to LaTeX using the Mathpix API, which is tailored to scientific papers.
  • Extract images and tables from the LaTeX and replace them with text descriptions generated by a multimodal LLM.
    • The prompts are designed to capture all values and relationships represented in each table or graph, minimizing information loss.
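The replacement step can be pictured with a minimal sketch. This is not the tool's actual implementation: the regular expression, `replace_floats`, and `placeholder_describe` are hypothetical names introduced for illustration, and the placeholder stands in for a real multimodal LLM call.

```python
import re
from typing import Callable

# Matches table and figure environments. This is a simplification that
# ignores nesting and starred variants such as table*.
FLOAT_PATTERN = re.compile(
    r"\\begin\{(table|figure)\}.*?\\end\{\1\}",
    flags=re.DOTALL,
)


def replace_floats(latex: str, describe: Callable[[str, str], str]) -> str:
    """Replace each table/figure environment with describe(kind, source)."""

    def substitute(match: re.Match) -> str:
        kind = match.group(1)
        return describe(kind, match.group(0))

    return FLOAT_PATTERN.sub(substitute, latex)


def placeholder_describe(kind: str, source: str) -> str:
    # Hypothetical stand-in for a multimodal LLM call.
    return f"[{kind} description, {len(source)} chars of LaTeX replaced]"
```

Passing the describing function in as a parameter keeps the LaTeX handling separate from the model call, so the substitution logic can be tested without any API access.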

Using with LlamaIndex 🦙

See hierarchical_retrieval.ipynb for an example LlamaIndex workflow.

It uses hierarchical retrieval so that the text descriptions generated by GPT can be used to retrieve the original tables and images.
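As a library-free illustration of the idea (not the notebook's code), each generated description can be stored next to the object it describes; a query is matched against the descriptions, and the originals are returned. `SummaryNode` and `retrieve_originals` are hypothetical names, and plain word overlap is swapped in here for the embedding similarity a real retriever would use.

```python
from dataclasses import dataclass


@dataclass
class SummaryNode:
    summary: str   # text description, e.g. generated by an LLM
    original: str  # the table or image the description stands for


def _tokens(text: str) -> set[str]:
    # Lowercase words with surrounding punctuation stripped.
    return {word.strip(".,;:()").lower() for word in text.split()} - {""}


def retrieve_originals(
    query: str, nodes: list[SummaryNode], top_k: int = 1
) -> list[str]:
    """Rank nodes by word overlap between query and summary, return originals."""
    query_tokens = _tokens(query)
    ranked = sorted(
        nodes,
        key=lambda node: len(query_tokens & _tokens(node.summary)),
        reverse=True,
    )
    return [node.original for node in ranked[:top_k]]
```

The search runs over the summaries only, but the caller receives the original table or image, which is the property the hierarchical setup relies on.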

Basic usage

  1. Set MATHPIX_APP_ID and MATHPIX_APP_KEY in your environment. We suggest using a .env file.
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(".env"))  # read local .env file
  2. Instantiate a text and a vision model. This tool uses LlamaIndex abstractions to interface with LLMs.
from llama_index.llms import OpenAI
from llama_index.multi_modal_llms import OpenAIMultiModal

text_model = OpenAI()
vision_model = OpenAIMultiModal(max_new_tokens=4096)

Next, pass those models to the converter.

converter = MathpixPdfConverter(text_model=text_model, vision_model=vision_model)
  3. Convert PDF and extract the result.
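from pathlib import Path

pdf_path = Path("path/to/file.pdf")

# Run the conversion on the PDF.
pdf_result = converter.convert(pdf_path)

# Write the plain-text result to disk.
with Path("output.txt").open("w", encoding="utf-8") as f:
    f.write(pdf_result.content)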
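Because step 1 depends on two environment variables, it can help to check for them before starting a conversion. The helper below is a small sketch; `missing_mathpix_vars` and `REQUIRED_VARS` are hypothetical names and are not part of this tool.

```python
import os
from typing import Mapping

# Variable names taken from step 1 above.
REQUIRED_VARS = ("MATHPIX_APP_ID", "MATHPIX_APP_KEY")


def missing_mathpix_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

One way to use it is to call it right after `load_dotenv(...)` and raise an error if the returned list is not empty.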

Custom workflow

In order to persist intermediate results or run processing in parallel, you can use MathpixProcessor and MathpixResultParser directly.

processor = MathpixProcessor()
parser = MathpixResultParser(text_model=text_model, vision_model=vision_model)

# Submit the PDF to Mathpix, then wait for the conversion result.
mathpix_result = processor.submit_pdf(pdf_path)
mathpix_result = processor.await_result(mathpix_result)
# Parse the Mathpix output using the text and vision models.
pdf_result = parser.parse_result(mathpix_result)
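For several PDFs, the same three calls could be run in parallel along the following lines. This is a sketch under assumptions the README does not confirm: that `submit_pdf` returns without waiting for the conversion to finish, and that the processor and parser are safe to share across threads. `convert_many` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_many(processor, parser, pdf_paths, max_workers=4):
    """Submit all PDFs first, then await and parse the results in threads."""
    # Submitting everything up front lets the remote conversions overlap
    # (assumes submit_pdf does not block until the conversion is done).
    submitted = [processor.submit_pdf(path) for path in pdf_paths]

    def finish(pending):
        # Wait for one conversion, then run the parsing step on it.
        return parser.parse_result(processor.await_result(pending))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(finish, submitted))
```

It would be called as `convert_many(processor, parser, [pdf_path, other_pdf_path])`, where `other_pdf_path` is a placeholder for a second file. Keeping the limit on worker threads small is a cautious default, since each parse step calls an LLM.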
