jaemk / cached

Rust cache structures and easy function memoization

DiskCache blobs aren't cleaned on overwrite

RustyNova016 opened this issue

I am using a DiskCache whose values contain a large inner vector, which produces files in the blobs folder of the cache. However, those blob files aren't removed or overwritten when I overwrite a cache entry. This leads to inflated storage usage that I'd rather not have, since those structs are heavy.

Here's an example program to reproduce the bug:

use cached::{DiskCache, IOCached};
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct BigStruct {
    pub id: String,
    pub data: Vec<Vec<u128>>
}

fn main() {
    let cache: DiskCache<String, BigStruct> = DiskCache::new("test").build().unwrap();

    // Fill the cache with lots of data
    for i in 0..100 {
        let big_struct = BigStruct {
            id: i.to_string(),
            data: (0..100).map(|j| {
                (0..100).map(|k| {
                    i + j + k
                }).collect()
            }).collect()
        };

        // cache_remove doesn't do anything either!
        cache.cache_remove(&big_struct.id).unwrap();
        cache.cache_set(big_struct.id.clone(), big_struct).unwrap();
    }
}
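
To make the accumulation visible, something like this can sum the size of the blobs folder between runs (a minimal sketch; blobs_dir must point at the blobs folder of whatever cache directory DiskCache is actually using, which depends on the builder defaults or set_disk_directory):

use std::fs;
use std::path::Path;

// Sum the size of every file directly under the cache's blobs/ folder.
// `blobs_dir` is an assumed path for illustration.
fn blobs_size(blobs_dir: &Path) -> std::io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(blobs_dir)? {
        total += entry?.metadata()?.len();
    }
    Ok(total)
}

If the bug is present, this total keeps growing across runs even though the same 100 keys are overwritten every time.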

Update:
DiskCacheBuilder::set_sync_to_disk_on_cache_change doesn't make any difference.
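
For reference, a minimal sketch of how that option is applied (assuming the builder method takes a bool):

// Assumed usage: ask the cache to sync to disk on every change.
// Stale blob files still accumulate even with this set.
let cache: DiskCache<String, BigStruct> = DiskCache::new("test")
    .set_sync_to_disk_on_cache_change(true)
    .build()
    .unwrap();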

Running this test multiple times seems to reproduce the problem of accumulating files in the blobs folder of the cache:

use cached::proc_macro::io_cached;

// Minimal error type so the snippet compiles on its own; the original test
// suite defines its own TestError, so treat this as a stand-in.
#[derive(Debug, PartialEq)]
enum TestError {
    DiskError(String),
}

#[io_cached(
    disk = true,
    time = 1000000,
    map_error = r##"|e| TestError::DiskError(format!("{:?}", e))"##,
    key = "u32",
    convert = r"{ n }",
)]
async fn cached_disk(n: u32) -> Result<Vec<u32>, TestError> {
    Ok((0..n).collect::<Vec<_>>())
}

#[tokio::test]
async fn test_cached_disk() {
    let n = 1_000_000;
    let expected = Ok((0..n).collect::<Vec<_>>());
    assert_eq!(cached_disk(n).await, expected);
    assert_eq!(cached_disk(n).await, expected);
}

Unfortunately, this issue makes DiskCache useless.

Thanks for pointing this out. I'm able to reproduce it as well. It looks like this is due to either a bug in, or the intended behavior of, the underlying sled::Db. It also looks like development has slowed down on that project. It would probably be best to change the backing store to something more reliable, like sqlite.

There is a potential (ugly) workaround until the storage backend is swapped. DiskCacheBuilder lets you set a specific disk directory with set_disk_directory, where each function's file cache will be stored. It looks like sled keeps creating new copies of the blob data, with each file named by an incrementing integer. You could scan this directory on startup: for each sub-directory (the function name in all caps plus _v1), scan its blobs/ dir, sort the files, and delete all but the latest one (either keep the file with the largest integer name or check the file metadata for the most recent creation time).
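
Roughly, that cleanup could look like the following (a sketch only; the function name and the directory layout assumptions come from the description above, not from a documented API):

use std::fs;
use std::path::Path;

// On startup, keep only the newest blob file in each function's blobs/ folder.
fn clean_stale_blobs(cache_dir: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(cache_dir)? {
        let blobs_dir = entry?.path().join("blobs");
        if !blobs_dir.is_dir() {
            continue;
        }
        // Collect (integer name, path) pairs and sort by the integer name.
        let mut blobs: Vec<(u64, _)> = fs::read_dir(&blobs_dir)?
            .filter_map(|e| e.ok())
            .filter_map(|e| {
                let n = e.file_name().to_string_lossy().parse::<u64>().ok()?;
                Some((n, e.path()))
            })
            .collect();
        blobs.sort_by_key(|(n, _)| *n);
        // Drop the newest blob from the list and delete everything older.
        blobs.pop();
        for (_, stale) in blobs {
            fs::remove_file(stale)?;
        }
    }
    Ok(())
}

This only trims old copies; it doesn't stop sled from writing new ones, so it would need to run on every startup.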

> Thanks for pointing this out. I'm able to reproduce it as well. It looks like this is due to either a bug in, or the intended behavior of, the underlying sled::Db. It also looks like development has slowed down on that project. It would probably be best to change the backing store to something more reliable, like sqlite.

Thanks for the prompt reply.

It seems to me that sled::Db is working fine in the sense that it does speed up the function look-ups, yet it duplicates the "blobs" as you mentioned. Regarding sqlite: it wouldn't be my first choice, since its feature set is overkill for this purpose and it isn't a cargo-friendly dependency.

> There is a potential (ugly) workaround until the storage backend is swapped. DiskCacheBuilder lets you set a specific disk directory with set_disk_directory, where each function's file cache will be stored. It looks like sled keeps creating new copies of the blob data, with each file named by an incrementing integer. You could scan this directory on startup: for each sub-directory (the function name in all caps plus _v1), scan its blobs/ dir, sort the files, and delete all but the latest one (either keep the file with the largest integer name or check the file metadata for the most recent creation time).

I thought about something similar myself, but it really is an ugly workaround :-) I'll give it a try.

> It would probably be best to change the backing store to something more reliable, like sqlite.

RocksDB may be a better choice.

Personally, at home I have switched away from the sled-backed DiskCache for now and use a combination of cacache for the storage and rmp-serde to convert the data to binary.
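
Roughly what that looks like, as a sketch (the helper names and parameters are mine, error handling is collapsed into anyhow for brevity, and cacache's sync API and rmp-serde are used as I understand them):

use serde::{de::DeserializeOwned, Serialize};

// Illustrative helpers: cacache stores the bytes on disk under `key`,
// rmp-serde handles the binary (MessagePack) encoding.
fn store<T: Serialize>(cache_dir: &str, key: &str, value: &T) -> anyhow::Result<()> {
    let bytes = rmp_serde::to_vec(value)?;
    cacache::write_sync(cache_dir, key, bytes)?;
    Ok(())
}

fn load<T: DeserializeOwned>(cache_dir: &str, key: &str) -> anyhow::Result<T> {
    let bytes = cacache::read_sync(cache_dir, key)?;
    Ok(rmp_serde::from_slice(&bytes)?)
}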

Not sure if it's the best way, though. But once this issue is sorted out, I'm definitely switching back.