psf / cachecontrol

The httplib2 caching algorithms packaged up for use with requests.

Caching partial content

shreyasminocha opened this issue

    data = self.api.get(
        file_url,
        headers={'range': f'bytes={start}-{end}'},
        stream=True
    )
[cachecontrol.controller] Status code 206 not in (200, 203, 300, 301)
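The log line suggests that the controller only treats a fixed set of status codes as cacheable. A minimal sketch of such a check follows; the tuple is copied from the log message, while the names `CACHEABLE_STATUSES` and `is_cacheable_status` are hypothetical and not taken from cachecontrol's source.

```python
# Status codes listed in the log message above.
# The constant and function names are hypothetical, for illustration only.
CACHEABLE_STATUSES = (200, 203, 300, 301)


def is_cacheable_status(status_code: int) -> bool:
    """Return True if a response with this status code would be considered for caching."""
    return status_code in CACHEABLE_STATUSES


# A range request normally yields 206 Partial Content, which is not in the tuple.
partial_ok = is_cacheable_status(206)
full_ok = is_cacheable_status(200)
```

Under this assumption, every response to a `range` request is skipped before any storage logic runs.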

Would love to be able to cache partial content.

I guess it is that time of the year: I was just referred to cachecontrol in my quest for a caching HTTP proxy with range request support. I got inspired by seeing how rclone does it, seemingly (I didn't look inside) with sparse files that hold the already fetched parts.
Our use case: a FUSE file system on top of git/git-annex (datalad) repositories, where we know the HTTP URLs for the files' content but do not want to fetch entire files (which could be TBs) just to access small portions of a file (e.g. metadata). See datalad/datalad#4003 (comment)
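The sparse-file idea could be sketched roughly as below. This is only an illustration under stated assumptions, not how rclone or cachecontrol work internally: `SparseRangeStore` and its methods are hypothetical names, ranges are half-open `[start, end)` (unlike the inclusive `bytes=start-end` header), and the range index lives in memory only.

```python
import os
import tempfile
from typing import List, Optional, Tuple


class SparseRangeStore:
    """Hypothetical helper: keeps fetched byte ranges of one remote file on disk."""

    def __init__(self, path: str) -> None:
        self.path = path
        # Sorted, merged, half-open [start, end) ranges that have been written.
        self.ranges: List[Tuple[int, int]] = []
        open(path, "ab").close()  # create the backing file if it is missing

    def write(self, start: int, data: bytes) -> None:
        """Store data at byte offset start and record the covered range."""
        if not data:
            return
        with open(self.path, "r+b") as f:
            # Seeking past EOF leaves a hole on filesystems with sparse file support.
            f.seek(start)
            f.write(data)
        self._add_range(start, start + len(data))

    def read(self, start: int, end: int) -> Optional[bytes]:
        """Return bytes [start, end) if fully cached, otherwise None."""
        for lo, hi in self.ranges:
            if lo <= start and end <= hi:
                with open(self.path, "rb") as f:
                    f.seek(start)
                    return f.read(end - start)
        return None

    def _add_range(self, start: int, end: int) -> None:
        """Insert a range and merge it with overlapping or adjacent ones."""
        merged: List[Tuple[int, int]] = []
        for lo, hi in sorted(self.ranges + [(start, end)]):
            if merged and lo <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
            else:
                merged.append((lo, hi))
        self.ranges = merged


# Usage: cache two adjacent chunks, then read across their boundary.
store = SparseRangeStore(os.path.join(tempfile.mkdtemp(), "big.bin"))
store.write(100, b"hello")
store.write(105, b"world")
cached = store.read(102, 108)  # fully covered by the merged range
missing = store.read(0, 5)     # never fetched, so None
```

A real implementation would also need to persist the range index, handle concurrent access, and honor cache validators such as `ETag`, none of which this sketch attempts.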

Caching big files is not one of cachecontrol's strong suits at the moment. See #238. I'm working towards improving the situation (#240), but progress is slow: I want to branch by abstraction, but my latest PR (#247) is stuck in the pipe.

At the moment, the API that abstracts out the storage in cachecontrol (on master) looks like this:

class Cache:
    def get(self, key: str) -> bytes:
        ...

    def set(self, key: str, value: bytes, expires=None) -> None:
        ...

    def delete(self, key: str) -> None:
        ...

    def close(self) -> None:
        ...

The key is derived from the URL, and there is only one key per cached request. As you can see, caching big files will require changing the API: you can't store the entire contents of a file in a single bytes instance because it would take too much RAM. Caching partial content will require changes to this API too.
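To make the one-key-per-request limitation concrete, here is a minimal in-memory sketch of the interface above. `InMemoryCache` is a hypothetical name, the URL is used directly as the key for simplicity, and returning `None` on a miss is an assumption of this sketch rather than something the interface shown here specifies.

```python
from typing import Dict, Optional


class InMemoryCache:
    """Hypothetical dict-backed implementation of the get/set/delete/close interface."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        # Assumption: a miss is reported as None.
        return self._data.get(key)

    def set(self, key: str, value: bytes, expires=None) -> None:
        # The whole value is held in memory; expires is ignored in this sketch.
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def close(self) -> None:
        pass


# Two different byte ranges of the same URL map to the same key,
# so the second write silently replaces the first.
cache = InMemoryCache()
url = "https://example.com/big.bin"  # placeholder URL
cache.set(url, b"<bytes 0-3>")
cache.set(url, b"<bytes 4-7>")
stored = cache.get(url)
```

With a single `bytes` value per key, there is nowhere to record which part of the resource a stored body covers, which is why partial content needs more than a new cacheable status code.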

Would you like to join forces and discuss the possible solutions to both problems? Keep in mind that the current holdup is @ionrock rather than the shortage of my time.

If you have a practical problem that you want to solve ASAP, I suggest dropping ionrock/cachecontrol in favor of a caching proxy that already implements partial content and large file support.

> I suggest dropping ionrock/cachecontrol in favor of a caching proxy that already implements partial content and large file support.

@hexagonrecursion any suggestions? There don't seem to be a lotta options.

@shreyasminocha Sorry. I have no clue either.