jaemk / cached

Rust cache structures and easy function memoization

DiskCache blobs aren't cleaned on overwrite

RustyNova016 opened this issue

I am using a DiskCache whose values contain a large inner vector, which produces files in the blobs folder of the cache. However, those blob files aren't removed or overwritten when I overwrite a cache entry. This leads to inflated storage usage that I'd rather not have, since those structs are heavy.

Here's an example program to reproduce the bug:

use cached::{DiskCache, IOCached};
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct BigStruct {
    pub id: String,
    pub data: Vec<Vec<u128>>
}

fn main() {
    let cache: DiskCache<String, BigStruct> = DiskCache::new("test").build().unwrap();

    // Fill the cache with lots of data
    for i in 0..100 {
        let big_struct = BigStruct {
            id: i.to_string(),
            data: (0..100).map(|j| {
                (0..100).map(|k| {
                    i + j + k
                }).collect()
            }).collect()
        };

        // cache_remove doesn't do anything either!
        cache.cache_remove(&big_struct.id).unwrap();
        cache.cache_set(big_struct.id.clone(), big_struct).unwrap();
    }
}
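
To make the accumulation visible, something like this can sum the size of the blobs folder between runs (a minimal sketch; blobs_dir must point at the blobs folder of whatever cache directory DiskCache is actually using, which depends on the builder defaults or set_disk_directory):

use std::fs;
use std::path::Path;

// Sum the size of every file directly under the cache's blobs/ folder.
// `blobs_dir` is an assumed path for illustration.
fn blobs_size(blobs_dir: &Path) -> std::io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(blobs_dir)? {
        total += entry?.metadata()?.len();
    }
    Ok(total)
}

If the bug is present, this total keeps growing across runs even though the same 100 keys are overwritten every time.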

Update:
DiskCacheBuilder::set_sync_to_disk_on_cache_change doesn't make any difference.
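
For reference, a minimal sketch of how that option is applied (assuming the builder method takes a bool):

// Assumed usage: ask the cache to sync to disk on every change.
// Stale blob files still accumulate even with this set.
let cache: DiskCache<String, BigStruct> = DiskCache::new("test")
    .set_sync_to_disk_on_cache_change(true)
    .build()
    .unwrap();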

Running this test multiple times seems to reproduce the problem of accumulating files in the blobs folder of the cache:

use cached::proc_macro::io_cached;

// Minimal error type so the snippet compiles on its own; the original test
// suite defines its own TestError, so treat this as a stand-in.
#[derive(Debug, PartialEq)]
enum TestError {
    DiskError(String),
}

#[io_cached(
    disk = true,
    time = 1000000,
    map_error = r##"|e| TestError::DiskError(format!("{:?}", e))"##,
    key = "u32",
    convert = r"{ n }",
)]
async fn cached_disk(n: u32) -> Result<Vec<u32>, TestError> {
    Ok((0..n).collect::<Vec<_>>())
}

#[tokio::test]
async fn test_cached_disk() {
    let n = 1_000_000;
    let expected = Ok((0..n).collect::<Vec<_>>());
    assert_eq!(cached_disk(n).await, expected);
    assert_eq!(cached_disk(n).await, expected);
}

Unfortunately, this issue makes DiskCache useless.

Thanks for pointing this out. I'm able to reproduce it as well. It looks like this is due to either a bug in, or the intended behavior of, the underlying sled::Db. It also looks like development has slowed down on that project. It would probably be best to change the backing store to something more reliable, like sqlite.

There is a potential (ugly) workaround until the storage backend is swapped. DiskCacheBuilder lets you set a specific disk directory with set_disk_directory, where each function's file cache will be stored. It looks like sled keeps creating new copies of the blob data, with each file named by an incrementing integer. You could scan this directory on startup: for each sub-directory (the function name in all caps plus _v1), scan its blobs/ dir, sort the files, and delete all but the latest one (either keep the file with the largest integer name or check the file metadata for the most recent creation time).
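
Roughly, that cleanup could look like the following (a sketch only; the function name and the directory layout assumptions come from the description above, not from a documented API):

use std::fs;
use std::path::Path;

// On startup, keep only the newest blob file in each function's blobs/ folder.
fn clean_stale_blobs(cache_dir: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(cache_dir)? {
        let blobs_dir = entry?.path().join("blobs");
        if !blobs_dir.is_dir() {
            continue;
        }
        // Collect (integer name, path) pairs and sort by the integer name.
        let mut blobs: Vec<(u64, _)> = fs::read_dir(&blobs_dir)?
            .filter_map(|e| e.ok())
            .filter_map(|e| {
                let n = e.file_name().to_string_lossy().parse::<u64>().ok()?;
                Some((n, e.path()))
            })
            .collect();
        blobs.sort_by_key(|(n, _)| *n);
        // Drop the newest blob from the list and delete everything older.
        blobs.pop();
        for (_, stale) in blobs {
            fs::remove_file(stale)?;
        }
    }
    Ok(())
}

This only trims old copies; it doesn't stop sled from writing new ones, so it would need to run on every startup.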

> Thanks for pointing this out. I'm able to reproduce it as well. It looks like this is due to either a bug in, or the intended behavior of, the underlying sled::Db. It also looks like development has slowed down on that project. It would probably be best to change the backing store to something more reliable, like sqlite.

Thanks for the prompt reply.

It seems to me that sled::Db is working fine in the sense that it does speed up the function look-ups, yet it duplicates the "blobs" as you mentioned. Regarding sqlite: it wouldn't be my first choice, since its feature set is overkill for this purpose and it isn't a cargo-friendly dependency.

> There is a potential (ugly) workaround until the storage backend is swapped. DiskCacheBuilder lets you set a specific disk directory with set_disk_directory, where each function's file cache will be stored. It looks like sled keeps creating new copies of the blob data, with each file named by an incrementing integer. You could scan this directory on startup: for each sub-directory (the function name in all caps plus _v1), scan its blobs/ dir, sort the files, and delete all but the latest one (either keep the file with the largest integer name or check the file metadata for the most recent creation time).

I thought about something similar myself, but it really is an ugly workaround :-) I'll give it a try.

> It would probably be best to change the backing store to something more reliable, like sqlite.

RocksDB may be a better choice.

Personally, at home I have switched away from the sled-backed DiskCache for now and use a combination of cacache for the storage and rmp-serde to convert the data to binary.
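
Roughly what that looks like, as a sketch (the helper names and parameters are mine, error handling is collapsed into anyhow for brevity, and cacache's sync API and rmp-serde are used as I understand them):

use serde::{de::DeserializeOwned, Serialize};

// Illustrative helpers: cacache stores the bytes on disk under `key`,
// rmp-serde handles the binary (MessagePack) encoding.
fn store<T: Serialize>(cache_dir: &str, key: &str, value: &T) -> anyhow::Result<()> {
    let bytes = rmp_serde::to_vec(value)?;
    cacache::write_sync(cache_dir, key, bytes)?;
    Ok(())
}

fn load<T: DeserializeOwned>(cache_dir: &str, key: &str) -> anyhow::Result<T> {
    let bytes = cacache::read_sync(cache_dir, key)?;
    Ok(rmp_serde::from_slice(&bytes)?)
}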

Not sure if it's the best way, though. But once this issue is sorted out, I'm definitely switching back.