activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

[FEATURE] Option to disable auto commit after data ingestion

HyunggyuJang opened this issue

Description

Currently, a new version is created on every data ingestion by the following call:

self.dataset.commit(allow_empty=True)

It seems that every commit captures the full current state of the dataset as its own version. So if the user commits often, the storage consumed by the versions grows rapidly.

This becomes a problem when the user ingests small amounts of data incrementally: consecutive versions are almost identical, so the space is used inefficiently.

The canonical solution would be to capture only the diff for each version, but since I'm not familiar with the codebase, I don't know whether that is feasible.
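To make the concern concrete, here is a toy calculation comparing the two storage strategies. It only illustrates the arithmetic and says nothing about how deeplake actually stores versions; the function names `snapshot_cost` and `diff_cost` are made up for this sketch.

```python
def snapshot_cost(batch_sizes):
    """Items stored in total if every version keeps a full copy of the dataset."""
    total = 0
    current = 0
    for n in batch_sizes:
        current += n      # dataset grows by one batch
        total += current  # the new version stores everything seen so far
    return total


def diff_cost(batch_sizes):
    """Items stored in total if every version keeps only the newly added items."""
    return sum(batch_sizes)


# Hypothetical workload: 100 small ingestions of 10 items each.
batches = [10] * 100
print(snapshot_cost(batches), diff_cost(batches))
```

Under the full-snapshot assumption, storage grows quadratically with the number of commits, while the diff-based variant grows linearly.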

So instead, I suggest offering an option that lets users choose whether to auto-commit when ingesting data.
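A minimal sketch of what such an option could look like, using a stand-in class rather than the real library. `ToyStore`, its `add` method, and the `auto_commit` parameter are all hypothetical names chosen for illustration; they are not the deeplake API.

```python
class ToyStore:
    """Stand-in for a versioned store; not the real deeplake implementation."""

    def __init__(self):
        self.items = []
        self.versions = []  # dataset length recorded at each commit

    def commit(self):
        # Record a version marker for the current state.
        self.versions.append(len(self.items))

    def add(self, new_items, auto_commit=True):
        # Ingest data; commit afterwards only if the caller asked for it.
        self.items.extend(new_items)
        if auto_commit:
            self.commit()


# Default behaviour: one version per ingestion.
eager = ToyStore()
for batch in (["a"], ["b"], ["c"]):
    eager.add(batch)

# Opt-out: ingest many small batches, then commit once explicitly.
lazy = ToyStore()
for batch in (["a"], ["b"], ["c"]):
    lazy.add(batch, auto_commit=False)
lazy.commit()

print(len(eager.versions), len(lazy.versions))
```

Keeping `auto_commit=True` as the default would preserve the current behaviour for existing users, while incremental ingestion pipelines could opt out and commit at their own pace.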

Use Cases

No response

Hey @HyunggyuJang, thanks a lot for raising the issue. We're already working on this, and I'll be sure to let you know when the updates are released.