apache / iceberg-python

Apache PyIceberg

Home Page:https://py.iceberg.apache.org/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Implement caching of manifest-files

Fokko opened this issue · comments

Feature Request / Improvement

We currently loop over the manifests of a snapshot often just once. But now when we're compounding the operations (DELETE+APPEND), there is a fair chance that read a manifest more than once. The spec states that manifest files are immutable, this means that we can cache is locally using a method annotated with the lru_cache.

I am trying to working on this, is it possible to assign it to me?

@swapdewalkar Thanks for picking this up! I've just assigned it to you

Hi @swapdewalkar I wanted to check in and see if you have any updates on this task. If you need any assistance or if there are any obstacles, please let me know—I will be happy to help!

Hi, can we increase the scope of this issue to cache/store all_manifests, data_manifests & delete_manifests? Or do I create a new issue for this? This feature would be useful for tasks like Incremental Scans (Append, Changelog, etc) where we frequently access manifest files. I imagine we would like this feature to be similar to the java implementation.

Also, since @swapdewalkar hasn't responded yet and if they do not have the time/bandwidth for the issue, I'm happy to give this a shot! :)

@chinmay-bhat I think we can generalize this quite easily, since from the spec:

Once written, data and metadata files are immutable until they are deleted.

I think we could go as easy to have a lru-cache based on the path to the metadata to cache it :)

Thanks @Fokko for the quick response! Really appreciate it!

based on the path to the metadata to cache it

I'm not clear on this. Are you saying we can simply add lru_cache to def manifests(self, io: FileIO) in class Snapshot? And then whenever we need data manifests or delete manifests, we iterate over the cached manifests? Wouldn't it be better to cache those too, since as you said, the files are immutable?

For ex:

@lru_cache
def manifests(self, io: FileIO):
......


@lru_cache
def data_manifests(self):
   return [manifest_file for manifest_file in self.manifests(self.io) if manifest_file.content == ManifestContent.DATA]

@chinmay-bhat I don't think it is as easy as that. We should ensure that the manifest_list path is part of the cache. We could share the cache between calls, since if you do subsequent queries, and the snapshot hasn't been updated, this would speed up the call quite a bit.

We could also make the FileIO part of the caching key. I don't think that's stricktly required, but if something changed in the FileIO we might want to invalidate the cache, but I'm open to arguments here.

Thank you for clarifying! Here's how I imagine manifests() would look like :)

@lru_cache()
def manifests(self, io: FileIO, manifest_list_location: str) -> List[ManifestFile]:
    if manifest_list_location is not None:
        file = io.new_input(manifest_list_location)
        return list(read_manifest_list(file))
    return []

When we call snapshot.manifests(self.io, snapshot.manifest_list), if manifest_list is the same, we simply query the cached files. But if the snapshot is updated, manifest_list is also updated, and calling manifests() triggers a re-read of manifest files.
Is this similar to what you have in mind?