tensorlakeai / indexify

A realtime indexing and structured extraction engine for unstructured data to build Generative AI applications

Home Page: https://getindexify.ai

Add Data Transformers to Data Repository

diptanu opened this issue

Content is extracted when a developer binds an extractor to a data repository. As new content lands, the extractors are applied to it and the derived information is written to indexes.
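A minimal in-memory sketch of that flow, purely for illustration - the class and method names here are assumptions, not the actual Indexify API:

```python
# Minimal in-memory sketch of the flow described above; the class and
# method names are illustrative, not the actual Indexify API.
from typing import Callable, Dict, List


class Repository:
    """A data repository that applies bound extractors to new content."""

    def __init__(self) -> None:
        self._extractors: List[Callable[[str], Dict]] = []
        self.index: List[Dict] = []

    def bind_extractor(self, extractor: Callable[[str], Dict]) -> None:
        # Binding registers the extractor; it runs on all content that lands later.
        self._extractors.append(extractor)

    def add_content(self, content: str) -> None:
        # As new content lands, every bound extractor is applied and the
        # derived information is written to the index.
        for extract in self._extractors:
            self.index.append(extract(content))


repo = Repository()
repo.bind_extractor(lambda text: {"length": len(text)})  # placeholder extractor
repo.add_content("new document landing in the repository")
```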

Extractors are responsible for chunking content, for example splitting text in a document before it is embedded. Certain extractors, such as NER and embedding extractors, could share the same chunked content since the context length of their underlying models is limited. Currently these extractors duplicate the text-splitting work.
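To make the duplication concrete, a small sketch assuming two extractors that each re-implement the same splitting step before running their (placeholder) models:

```python
# Sketch of the current duplication: each extractor splits the content
# itself before running its model, so the chunking work happens twice.
# Chunk size and extractor bodies are placeholders, not real models.
from typing import List


def split_text(text: str, chunk_size: int = 512) -> List[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embedding_extractor(text: str) -> List[List[float]]:
    chunks = split_text(text)                 # splitting done here...
    return [[float(len(c))] for c in chunks]  # placeholder embedding per chunk


def ner_extractor(text: str) -> List[List[str]]:
    chunks = split_text(text)                 # ...and duplicated here
    return [c.split()[:3] for c in chunks]    # placeholder entity list per chunk
```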

The solution would be to introduce a high-level transformer concept which can apply algorithms to content and store the intermediate representation - for example, splitting text into smaller chunks, extracting log-mel features from audio files (as most speech models use log-mel features), or applying filters to images. The intermediate/processed content will live in buffers - a logical storage abstraction that triggers the extractors when data lands in them.

So it will look something like -
Content -> Transformers -> Buffer -> Extractors -> Index (continuously)
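A rough sketch of that pipeline, assuming a transformer that chunks once and a buffer that stores the intermediate representation and triggers extractors when data lands in it - all names are illustrative, not a decided design:

```python
# Sketch of Content -> Transformers -> Buffer -> Extractors -> Index.
# The transformer chunks once; the buffer stores the intermediate
# representation and triggers extractors as data lands in it.
from typing import Callable, List


class Buffer:
    """Logical storage for intermediate content; fans out to extractors on write."""

    def __init__(self) -> None:
        self._extractors: List[Callable[[str], object]] = []
        self.index: List[object] = []

    def attach_extractor(self, extractor: Callable[[str], object]) -> None:
        self._extractors.append(extractor)

    def write(self, chunk: str) -> None:
        # Data landing in the buffer triggers every attached extractor, and
        # the derived information is written to the index.
        for extract in self._extractors:
            self.index.append(extract(chunk))


def chunking_transformer(content: str, buffer: Buffer, chunk_size: int = 512) -> None:
    # The transformer does the splitting once and stores the chunks.
    for i in range(0, len(content), chunk_size):
        buffer.write(content[i:i + chunk_size])


buf = Buffer()
buf.attach_extractor(lambda chunk: {"embedding": [float(len(chunk))]})  # placeholder embedder
buf.attach_extractor(lambda chunk: {"entities": chunk.split()[:3]})     # placeholder NER
chunking_transformer("new content flowing through the pipeline", buf)
```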

Could buffers be a persistent queue like Kafka or Redis, i.e. serialized through protobuf? Or were you thinking of something more structured?
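For illustration of the persistent-queue reading, a sketch where buffer entries are serialized records on a queue and a consumer fans them out to extractors; JSON stands in for a protobuf-encoded message and nothing here reflects a decided design:

```python
# Sketch of a buffer modeled as a persistent queue: producers append
# serialized entries, a consumer deserializes them and fans out to
# extractors. JSON stands in for a protobuf-encoded message; the queue
# stands in for Kafka/Redis.
import json
from queue import Queue
from typing import Callable, List

buffer_queue: "Queue[bytes]" = Queue()


def enqueue_chunk(source_id: str, chunk: str) -> None:
    record = {"source_id": source_id, "chunk": chunk}
    buffer_queue.put(json.dumps(record).encode("utf-8"))


def drain(extractors: List[Callable[[str], object]], index: List[object]) -> None:
    while not buffer_queue.empty():
        record = json.loads(buffer_queue.get())
        for extract in extractors:
            index.append(extract(record["chunk"]))


index: List[object] = []
enqueue_chunk("doc-1", "intermediate chunk produced by a transformer")
drain([lambda c: {"embedding": [float(len(c))]}], index)  # placeholder extractor
```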