basho / riak

Riak is a decentralized datastore from Basho Technologies.

Home Page: http://docs.basho.com


any progress on bucket-level expiry

macmarcd opened this issue · comments

Hi,

We are using Riak 3.0.4 in a multi-cluster deployment. We currently have three buckets for various functional use cases, and would like to maintain custom expiration for these buckets rather than at the global level.

As per the doc:
https://riak.com/posts/technical/riak-kv-2-2-release-highlights/index.html?p=13017.html

We have started development on bucket-level expiry which is expected to be available in a future release of Riak KV.

Any progress on this?

The original idea for how to implement this hit some issues when trying to expire concurrently from the AAE stores as well as the backend stores. Essentially, it was hard to prevent a performance spike after AAE caches were rebuilt. The work then got shelved.

I have an alternative approach in mind now, to explicitly exclude buckets from AAE caches where those buckets have an expiry. This will mean no AAE on these buckets, as we consider them to be temporary. This work is not currently prioritised though.
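As a rough illustration of that alternative approach, the sketch below (plain Python with invented names; nothing here is Riak's actual code) models skipping AAE cache inserts for any bucket that carries an expiry property, so expiring data is simply never tracked by AAE:

```python
# Hypothetical sketch of "exclude expiring buckets from AAE caches".
# Bucket names, the expiry_secs property, and the cache shape are all
# invented for illustration.

bucket_props = {
    "session_cache": {"expiry_secs": 7 * 24 * 3600},  # bucket with an expiry
    "user_profiles": {},                              # ordinary bucket
}

aae_cache = {}  # stand-in for a per-vnode AAE hashtree/cache

def aae_insert(bucket, key, obj_hash):
    """Track a key's hash in the AAE cache unless its bucket has an expiry."""
    if "expiry_secs" in bucket_props.get(bucket, {}):
        return False  # expiring bucket: excluded from AAE entirely
    aae_cache[(bucket, key)] = obj_hash
    return True
```

The trade-off is exactly the one stated above: keys in expiring buckets never appear in the AAE trees, so those buckets get no anti-entropy protection at all.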

Can you explain a bit more about your cluster: node count, backend, object counts, etc.? Would not having AAE for expiring buckets be an issue for you? If you use AAE at the moment, is it the original hashtree-based AAE or the new tictactree-based version? What sort of expiry times are you looking for (i.e. hours, days, months, or years)?

If I know there's a willing user, and I think there's a relatively easy-to-implement solution, then I'm happy to consider re-prioritising this.

Some other questions:

  • what is the rate at which objects would expire? (for objects with a relatively slow/steady rate of expiry there may be an alternative AAE-friendly route using riak_kv_eraser).
  • is it preferred that the expiry time be absolute from insert, or relative to the last time modified?
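For the eraser-style alternative hinted at in the first question, here is a toy model (Python, with invented names and parameters; this is not riak_kv_eraser's real API) of a slow, steady background pass that deletes objects whose insert time is older than a TTL:

```python
# Illustrative only: a batch-limited sweep deleting entries older than a
# TTL. Store entries are (value, inserted_at) pairs; the TTL and batch
# limit are made-up numbers.

TTL_SECS = 7 * 24 * 3600  # e.g. a 7-day retention period

def erase_expired(store, now, batch_limit=100):
    """Delete up to batch_limit expired entries per pass (slow, steady rate)."""
    expired = [key for key, (_value, inserted) in store.items()
               if now - inserted > TTL_SECS]
    for key in expired[:batch_limit]:
        del store[key]  # a real delete, so AAE stores see it like any other
    return len(expired[:batch_limit])
```

Because each pass issues ordinary deletes, the AAE stores stay consistent with the backend, at the cost of expiry happening gradually rather than instantly at the deadline.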

Thanks for the quick response.
Our use case is pretty simple: the bucket is not a CRDT bucket and our retention period is 7 days max. We do have CRDT buckets as well, but their retention period is months.

It's OK to consider the insert time as the point of reference.
A slow rate of expiry would also serve our purpose.

What backend are you using - bitcask, memory, leveldb, leveled? Do you currently enable anti-entropy e.g.

anti_entropy = active

or

tictacaae_active = active

Bitcask with anti_entropy = active

@martincox - is there anything you have done with per-bucket expiry in a bitcask/aae setup?

So we had added per-key expiry into bitcask, but it is currently only utilised as part of the delete path. The absolute expiration timestamp is encoded as part of the key; it is determined by an arbitrary value, defined in seconds, that is added to the insert timestamp of the tombstone. At the point of deletion, the keys are removed from the AAE store as with a normal delete. During a bitcask merge, the expiry timestamps are inspected and entries are merged out where the expiry is less than the current time.

Within bitcask, there are options (not_found_expiring and not_found_expired) to control the visibility of KVs that are pending deletion via expiry - which prevents an AAE rebuild from re-inserting a bunch of deleted keys. These options are also used in fallback vnodes to preserve expired objects until handing off to the owner.
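To make that mechanism concrete, here is a toy model (Python; not bitcask's actual Erlang implementation, and the delay value and function names are invented) of the scheme described above: an arbitrary delay is added to the insert time to give an absolute expiry, merges drop entries whose expiry has passed, and lookups can hide already-expired entries in the spirit of the not_found_expired option, so a rebuild fold does not re-insert keys a merge is about to remove:

```python
# Toy model of bitcask-style per-key expiry. In the real backend the
# expiry is encoded into the on-disk key; a dict stands in for that here.

EXPIRY_DELAY_SECS = 3600  # arbitrary per-tombstone delay, in seconds

def put_with_expiry(store, key, value, now):
    """Record an absolute expiry (insert time + delay) alongside the entry."""
    store[key] = {"value": value, "expiry": now + EXPIRY_DELAY_SECS}

def merge(store, now):
    """Merge analogue: keep only entries whose expiry is still in the future."""
    return {k: v for k, v in store.items() if v["expiry"] > now}

def get(store, key, now, not_found_expired=True):
    """Hide expired-but-unmerged entries when not_found_expired is set."""
    entry = store.get(key)
    if entry is not None and not_found_expired and entry["expiry"] <= now:
        return None  # pending deletion: invisible to reads and rebuild folds
    return entry
```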

We talked about extending this to bucket-level properties, but didn't have the use-case for it, so work was never picked up. Although, I think it'd be fairly trivial to plumb it in.

Not sure I understand how it is working.

Is this more about reaping tombstones than actually expiring keys? So you're running in keep mode, but add a special hidden timestamp to the tombstone so that eventually it is removed (via bitcask merge), rather than staying there forever?

The AAE issue I struggled with is rebuilds. We don't want rebuilds to be co-ordinated, but if one vnode in a preflist rebuilds, and the rebuild fold no longer includes a lot of entries that were lost during merges, doesn't that lead to a lot of AAE mismatches at the next exchange? Do you accept this, and change the read repair process so that it reaps the expired tombstones rather than re-replicating them?

Or have I got the wrong end of the stick?