bitfaster / BitFaster.Caching

High performance, thread-safe in-memory caching primitives for .NET

Entry left in cache configured with WithAtomicGetOrAdd when value factory throws

provegard opened this issue

Thank you for a great caching library!

I have a value factory that may throw an exception. I noticed that if this happens, an entry remains in the cache. This happens when I use WithAtomicGetOrAdd.

Example code:

using BitFaster.Caching.Lfu;

var cache = new ConcurrentLfuBuilder<string, bool>().WithAtomicGetOrAdd().Build();

try
{
    _ = cache.GetOrAdd("foo", s => throw new Exception(s));
}
catch
{
    // ignore
}

Console.WriteLine(cache.Count);                   // prints: 1
Console.WriteLine(string.Join(", ", cache.Keys)); // prints: foo

It happens for an LRU cache as well.

The main problem, as I see it, is that the leftover entry causes eviction of real items from the cache. Consider this longer example:

using BitFaster.Caching.Lfu;
using BitFaster.Caching.Scheduler;

var cache = new ConcurrentLfuBuilder<string, bool>()
   .WithAtomicGetOrAdd()
   .WithScheduler(new ForegroundScheduler())
   .WithCapacity(3)
   .Build();

try
{
    _ = cache.GetOrAdd("aa", _ => true);
    _ = cache.GetOrAdd("bb", _ => true);
    _ = cache.GetOrAdd("bb", _ => true);
    _ = cache.GetOrAdd("cc", _ => true);
    _ = cache.GetOrAdd("cc", _ => true);
    _ = cache.GetOrAdd("foo", s => throw new Exception(s));
}
catch
{
    // ignore
}

Console.WriteLine(cache.Count);                   // prints: 3
Console.WriteLine(string.Join(", ", cache.Keys)); // prints: cc, bb, foo

Here, aa is unexpectedly (in my opinion) evicted from the cache.

Thanks for reporting this. I agree it's not intuitive and the count is wrong. Both LRU and LFU use a common wrapper for atomic GetOrAdd, so the behavior is the same.

Underneath, a value wrapper similar to Lazy<T> is added to the cache to implement the atomic add. During GetOrAdd, the wrapper is first added atomically (so there is exactly one wrapper per key in the cache), and then the factory method is invoked against the wrapper within a lock, similar to Lazy<T> (so the factory method invocation is atomic). When the factory method throws, the empty wrapper is already stored in the cache (hence the count is now incorrect, and "aa" has been evicted to make room for an empty wrapper for "foo").
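
For illustration, here is a minimal sketch of how such a wrapper might look (the name AtomicValue and the exact double-checked locking shape are assumptions for this example, not the library's actual implementation):

using System;

// Simplified illustration only - not the library's actual type. The cache stores
// one wrapper per key; the value inside the wrapper is created lazily, similar to Lazy<T>.
public class AtomicValue<K, V>
{
    private readonly object sync = new object();
    private volatile bool created;
    private V value;

    public bool IsCreated => created;

    public V GetValue(K key, Func<K, V> valueFactory)
    {
        // Fast path when the value already exists.
        if (created)
        {
            return value;
        }

        lock (sync)
        {
            if (!created)
            {
                // If valueFactory throws here, the exception propagates to the caller,
                // but the enclosing GetOrAdd has already inserted this (still empty)
                // wrapper into the cache - which is the behavior described above.
                value = valueFactory(key);
                created = true;
            }

            return value;
        }
    }
}

The enclosing atomic GetOrAdd is then roughly cache.GetOrAdd(key, _ => new AtomicValue<K, V>()).GetValue(key, valueFactory): the wrapper insert is atomic and the factory call is serialized per key, but a throwing factory leaves the uninitialized wrapper behind in the cache.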

After the exception, there are a few scenarios to consider:

  • List cache keys/values - "aa" has been evicted, and "foo" exists but has no value (which will appear as false in your case, since bool is a value type whose default is false).
  • Call cache.Count - this is wrong, "foo" is not retrievable from the cache, but it is counted.
  • Call cache.TryGet("foo", out var value). This returns false, which is correct.
  • Call cache.GetOrAdd("foo", ...) a second time. If the factory method now succeeds, the value will be added to the cache. If there is another exception, the cache remains in the same state (empty wrapper stored).
  • Call cache.TryUpdate("foo", ...). This will incorrectly return true.
  • Call cache.TryRemove("foo", ...). This will incorrectly return true.
  • Update and Evicted events, if enabled. These will fire when the empty wrapper is updated/evicted.

To fix all this, I can imagine a few approaches with different pros and cons:

  1. Use striped locks, e.g. have n locks and hash each key to a lock, then hold the lock during creation (similar to the internals of ConcurrentDictionary). After the value is created, add it to the cache and release the lock.
    • All of the above scenarios would work as expected.
    • Keys mapped to the same bucket will queue on the lock as each value is created, reducing concurrent throughput. E.g. if keys "aa", "bb", and "cc" all map to the same lock bucket and 3 threads attempt to create values at once, "cc" must wait for the "aa" and "bb" factory methods to complete.
    • Similarly, a blocked factory method for a single key can block all callers mapped to the same bucket, reducing concurrent throughput.
    • The key will be hashed twice without deeper refactoring, and it would probably need 2 lookups (double-checked lock), so there is greater lookup overhead for both hits and misses, though not by much.
    • It would need a way to configure or dynamically increase the number of locks based on contention. Even so, it is not possible to totally avoid contention.
  2. Store wrappers in a 'creating' buffer to create wrappers up front atomically, and only add them to the cache after the value factory succeeds.
    • All of the above scenarios work as expected.
    • Effectively you end up with two caches to manage. I tried coding up a simple version of this based on SingletonCache, and it increased the number of dictionary operations from 2 to 6. It is probably possible to do better than 6, but it will always be more work and quite a bit more overhead for a cache miss.
    • When I benchmarked my experiment, even cache hit latency was > 50% slower. So, this approach also degrades hot path performance.
  3. Eagerly create the wrapper and store it in the cache, then synchronize creation on the wrapper (as today). Delete the wrapper if an exception is thrown.
    • "aa" would still be needlessly evicted; all other scenarios work as expected.
    • If there are a large number of lookups with exceptions, they can evict useful items from the cache.
    • Adding exception handling to GetOrAdd has a cost. I haven't measured this yet, but perf would be marginally worse.
    • There is a potential race if a failing call overlaps with a succeeding one: the successful call could add an item which is then immediately deleted by the failing call's cleanup. This would be hard to debug.
    • There is a potential race if an add overlaps with an update: if the add fails, its cleanup could remove the value inserted by the successful update. This could be mitigated if ICache had a TryRemove(key, value) overload, so the value could be checked within the delete call, but that is a breaking change.
  4. Eagerly create the wrapper and store it in the cache, then synchronize creation on the wrapper (as today). If there is an exception, do nothing and leave the empty wrapper stored, as today; modify Count etc. to ignore empty wrappers so they are not observable from outside.
    • "aa" would still be needlessly evicted; all other scenarios work as expected.
    • If there are a large number of lookups with exceptions, they can evict useful items from the cache.
    • Lookup calls remain fast, with no latency penalty. Count is theoretically slower due to checking for empty values, but not measurably so in a quick test (since Count already iterates the collection to avoid locking the dictionary).

Approaches 3 and 4 are the easiest to extend to IScopedCache, IAsyncCache and IAsyncScopedCache and would not penalize execution speed for cache hits or misses (including exceptions), so I am inclined to go with one of those options. The downside is that empty wrappers can pollute the cache when exceptions are thrown, evicting items.
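
As a rough illustration of option 3's shape (hypothetical class and field names, reusing the AtomicValue sketch from above; the real change may differ in detail), the atomic GetOrAdd path would add exception cleanup around the factory invocation:

using System;
using BitFaster.Caching;

// Hypothetical sketch of option 3 (not the actual draft PR): the wrapper is still
// eagerly inserted, but it is removed again if the value factory throws, so no
// empty entry remains observable afterwards.
public class AtomicCacheSketch<K, V>
{
    private readonly ICache<K, AtomicValue<K, V>> innerCache;

    public AtomicCacheSketch(ICache<K, AtomicValue<K, V>> innerCache)
    {
        this.innerCache = innerCache;
    }

    public V GetOrAdd(K key, Func<K, V> valueFactory)
    {
        // Insert the empty wrapper atomically (exactly one wrapper per key), as today.
        var wrapper = this.innerCache.GetOrAdd(key, _ => new AtomicValue<K, V>());

        try
        {
            // The factory call is serialized per key inside the wrapper, like Lazy<T>.
            return wrapper.GetValue(key, valueFactory);
        }
        catch
        {
            // Option 3: clean up the empty wrapper so Count, Keys, TryUpdate and
            // TryRemove stay consistent. Note the races described above: a concurrent
            // successful call may have populated this entry, and this remove could
            // delete its value.
            this.innerCache.TryRemove(key);
            throw;
        }
    }
}

Option 4 would instead leave the empty wrapper stored and filter uncreated wrappers out of Count, Keys and the other read paths.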

Better documentation is also needed for this in the wiki to explain the caveats.

I made a draft PR as a POC which results in this behavior, without any speed penalty:

using BitFaster.Caching.Lfu;

var cache = new ConcurrentLfuBuilder<string, bool>().WithAtomicGetOrAdd().Build();

try
{
    _ = cache.GetOrAdd("foo", s => throw new Exception(s));
}
catch
{
    // ignore
}

Console.WriteLine(cache.Count);                   // prints: 0
Console.WriteLine(string.Join(", ", cache.Keys)); // prints: ""

And the longer example from above now behaves like this:

using BitFaster.Caching.Lfu;
using BitFaster.Caching.Scheduler;

var cache = new ConcurrentLfuBuilder<string, bool>()
   .WithAtomicGetOrAdd()
   .WithScheduler(new ForegroundScheduler())
   .WithCapacity(3)
   .Build();

try
{
    _ = cache.GetOrAdd("aa", _ => true);
    _ = cache.GetOrAdd("bb", _ => true);
    _ = cache.GetOrAdd("bb", _ => true);
    _ = cache.GetOrAdd("cc", _ => true);
    _ = cache.GetOrAdd("cc", _ => true);
    _ = cache.GetOrAdd("foo", s => throw new Exception(s));
}
catch
{
    // ignore
}

Console.WriteLine(cache.TryGet("foo", out var _));       // prints: false
Console.WriteLine(cache.Count);                          // prints: 2
Console.WriteLine(string.Join(", ", cache.Keys));        // prints: cc, bb

This doesn't solve the eviction of "aa", but inspecting the cache state now gives consistent results: there is no phantom "foo" entry in a weird state.

The wrapper objects used to guarantee atomic add are pre-added to the cache to improve performance. I know pushing out entries isn't ideal, but other approaches are slower, so I would prefer to go with this approach.

Thank you for the detailed explanation!

It makes sense that you want to solve it in a way that doesn't affect performance. However, for my own purposes, performance is not that important, but avoiding unnecessary cache eviction is. Thus, it's perfectly fine for me not to use atomic GetOrAdd and instead introduce a separate (striped) lock to guard insertion; unless you change your mind, I think that is the approach I will take.
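
For reference, a caller-side version of that could look roughly like this (illustrative class and parameter names, not the library's API; it wraps a cache built without WithAtomicGetOrAdd and assumes BitFaster.Caching's ICache<K, V> interface):

using System;
using BitFaster.Caching;

// Illustrative caller-side striped lock around GetOrAdd: the value factory runs under
// a per-stripe lock, and nothing is added to the cache if it throws.
public class StripedGetOrAdd<K, V>
{
    private readonly ICache<K, V> cache;
    private readonly object[] locks;

    public StripedGetOrAdd(ICache<K, V> cache, int stripes = 16)
    {
        this.cache = cache;
        this.locks = new object[stripes];

        for (int i = 0; i < stripes; i++)
        {
            this.locks[i] = new object();
        }
    }

    public V GetOrAdd(K key, Func<K, V> valueFactory)
    {
        // Fast path: no lock needed when the value already exists.
        if (this.cache.TryGet(key, out var existing))
        {
            return existing;
        }

        var stripe = this.locks[(key.GetHashCode() & int.MaxValue) % this.locks.Length];

        lock (stripe)
        {
            // Double-checked: another thread may have created the value while we waited.
            if (this.cache.TryGet(key, out existing))
            {
                return existing;
            }

            // If the factory throws, the cache is left untouched.
            var value = valueFactory(key);
            this.cache.AddOrUpdate(key, value);
            return value;
        }
    }
}

The factory runs under a per-stripe lock before anything is added, so a throwing factory leaves the cache untouched, at the cost of the contention and double-lookup issues discussed for option 1 above.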

Keep in mind that repeated exceptions for infrequent keys cannot starve the cache: for LFU they will never make it past the Window segment, which by default is 1% of the total cache size (I wrote a brief description of the LFU internals here; the Caffeine docs and TinyLFU paper have more details).

For example, if the LFU cache size is 100, the window segment has size 1, so the empty exception items can only ever occupy 1 cache slot, leaving the other 99 items untouched in the main segment of the cache. Using your example with my fix, this is what happens if there are repeated exceptions when the cache is size 3 (window segment is size 1):

using BitFaster.Caching.Lfu;
using BitFaster.Caching.Scheduler;

var cache = new ConcurrentLfuBuilder<string, bool>()
   .WithAtomicGetOrAdd()
   .WithScheduler(new ForegroundScheduler())
   .WithCapacity(3)
   .Build();

_ = cache.GetOrAdd("aa", _ => true);
_ = cache.GetOrAdd("bb", _ => true);
_ = cache.GetOrAdd("bb", _ => true);
_ = cache.GetOrAdd("cc", _ => true);
_ = cache.GetOrAdd("cc", _ => true);

try
{
    _ = cache.GetOrAdd("foo", s => throw new Exception(s));
}
catch { /* ignore */ }

try
{
    _ = cache.GetOrAdd("bar", s => throw new Exception(s));
}
catch { /* ignore */ }

try
{
    _ = cache.GetOrAdd("baz", s => throw new Exception(s));
}
catch { /* ignore */ }

// only the Window item has been evicted
Console.WriteLine(cache.Count);                          // prints: 2
Console.WriteLine(string.Join(", ", cache.Keys));        // prints: cc, bb

This is basically a sequential scan of one-off lookups (foo, bar, baz); the LFU defends against caching infrequent items anyway and keeps the more frequent bb and cc.

If foo is accessed more frequently than bb or cc, the empty foo item can push out the valid items, since it will be promoted into the main cache segment. But if the other items are infrequently accessed, they are of little value anyway. In practice, the frequency sketch will do a good job of keeping the hottest items alive. If there are hot exception items, subsequent lookups have a chance to initialize the value and recover.
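
As a conceptual sketch of that admission behavior (based on the TinyLFU paper referenced above, not BitFaster's internal code), the decision at the window/main boundary compares estimated access frequencies:

using System.Collections.Generic;

// Conceptual sketch of TinyLFU-style admission. When a window candidate meets the
// main segment's eviction victim, the entry with the higher estimated access
// frequency wins, so a rarely accessed empty wrapper is unlikely to displace
// frequently accessed items.
public class AdmissionSketch<K>
{
    // A real implementation uses a compact count-min style sketch with decay;
    // a dictionary of counts is enough to show the admission decision.
    private readonly Dictionary<K, int> counts = new Dictionary<K, int>();

    public void RecordAccess(K key)
    {
        counts.TryGetValue(key, out var c);
        counts[key] = c + 1;
    }

    public bool Admit(K candidate, K victim)
    {
        counts.TryGetValue(candidate, out var candidateFreq);
        counts.TryGetValue(victim, out var victimFreq);
        return candidateFreq > victimFreq;
    }
}

In the example above, "bb" and "cc" have each been looked up twice while "foo" has only been seen once, so "foo" loses the frequency comparison and stays confined to the window slot.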

For me, a small amount of cache pollution in the exception case is an OK tradeoff to preserve maximum throughput/lowest latency.