The Pipe

Feed PDFs, word docs, slides, web pages and more into Vision-LLMs with one line of code ⚡

The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best for LLM and RAG applications that require a deep understanding of tricky data sources. The Pipe is available as a hosted API at thepi.pe, or it can be set up locally.

Features 🌟

Extracts text and visuals from files or web pages 📚
Outputs chunks optimized for multimodal LLMs 🖼️
Interpret complex PDFs, web pages, slides, CSVs, and more 🧠
Auto-compress prompts exceeding your chosen token limit 📦
Works even with missing file extensions, in-memory data streams 💾
Works with codebases, git repos, and custom integrations 🌐
Multi-threaded ⚡️

Getting Started 🚀

First, install The Pipe.

pip install thepipe_api

Ensure the THEPIPE_API_KEY environment variable is set. Don't have an API key yet? Get one here. Alternatively, see the local installation section for the more advanced local setup.

Now you can extract comprehensive text and visuals from any file:

from thepipe_api import thepipe
chunks = thepipe.extract("example.pdf")

Or any website:

chunks = thepipe.extract("https://example.com")

Then feed it into GPT-4-Vision:

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages = chunks,
)

The Pipe's output is a list of sensible "chunks", and thus can be used either for storage in a vector database or for direct use as a prompt. Extra features such as data table extraction, bar chart extraction, custom web authentications and more are available in the API documentation. LiteLLM can be used to easily integrate The Pipe with any LLM provider.

You can also use The Pipe from the command line. Here's how to recursively extract from a directory, matching only a specific file type:

thepipe path/to/folder --match *jsx

Supported File Types 📚

Source Type	Input types	Token Compression 🗜️	Image Extraction 👁️	Notes 📌
Directory	Any `/path/to/directory`	✔️	✔️	Extracts from all files in directory, supports match and ignore patterns
Code	`.py`, `.tsx`, `.js`, `.html`, `.css`, `.cpp`, etc	✔️ (varies)	❌	Combines all code files. `.c`, `.cpp`, `.py` are compressible with ctags, others are not
Plaintext	`.txt`, `.md`, `.rtf`, etc	✔️	❌	Regular text files
PDF	`.pdf`	✔️	✔️	Extracts text and images of each page; can use AI for extraction of table data and images within pages
Image	`.jpg`, `.jpeg`, `.png`	❌	✔️	Extracts images, uses OCR if text_only
Data Table	`.csv`, `.xls`, `.xlsx`	✔️	❌	Extracts data from spreadsheets; converts to text representation. For very large datasets, will only extract column names and types
Jupyter Notebook	`.ipynb`	❌	✔️	Extracts code, markdown, and images from Jupyter notebooks
Microsoft Word Document	`.docx`	✔️	✔️	Extracts text and images from Word documents
Microsoft PowerPoint Presentation	`.pptx`	✔️	✔️	Extracts text and images from PowerPoint presentations
Website	URLs (inputs containing `http`, `https`, `ftp`)	✔️	✔️	Extracts text from web page along with image (or images if scrollable); text-only extraction available
GitHub Repository	GitHub repo URLs	✔️	✔️	Extracts from GitHub repositories; supports branch specification
ZIP File	`.zip`	✔️	✔️	Extracts contents of ZIP files; supports nested directory extraction

How it works 🛠️

The pipe is accessible from the command line or from Python. The input source is either a file path, a URL, or a directory. The pipe will extract information from the source and process it for downstream use with language models, vision transformers, or vision-language models. The output from the pipe is a sensible text-based (or multimodal) representation of the extracted information, carefully crafted to fit within context windows for any models from gemma-7b to GPT-4. It uses a variety of heuristics for optimal performance with vision-language models, including AI filetype detection with filetype detection, AI PDF extraction, efficient token compression, automatic image encoding, reranking for lost-in-the-middle effects, and more, all pre-built to work out-of-the-box.

Local Installation 🛠️

To use The Pipe locally, you will need playwright, ctags, pytesseract, and the local python requirements, which differ from the more lightweight API requirements. You will also need to use the local version of the requirements file:

git clone https://github.com/emcf/thepipe
pip install -r requirements_local.txt

Tip for windows users: you may need to install the python-libmagic binaries with pip install python-magic-bin. You may also need to ensure the tesseract-ocr binaries and the ctags binaries are in your PATH.

Now you can use The Pipe with Python:

from thepipe_api import thepipe
chunks = thepipe.extract("example.pdf", local=True)

or from the command line:

thepipe path/to/folder --local

Arguments are:

source (required): can be a file path, a URL, or a directory path.
local (optional): Use the local version of The Pipe instead of the hosted API.
match (optional): Regex pattern to match files in the directory.
ignore (optional): Regex pattern to ignore files in the directory.
limit (optional): The token limit for the output prompt, defaults to 100K. Prompts exceeding the limit will be compressed.
ai_extraction (optional): Extract tables, figures, and math from PDFs using our extractor. Incurs extra costs.
text_only (optional): Do not extract images from documents or websites. Additionally, image files will be represented with OCR instead of as images.

gfranxman / thepipe