grafana / phlare

🔥 horizontally-scalable, highly-available, multi-tenant continuous profiling aggregation system

Home Page: https://grafana.com/oss/phlare/

Cleanup blocks has unexpected behaviour when using multiple tenants

simonswine opened this issue

As we currently do not query from the object store at all, we rely on the local disk to serve block queries. To keep disks from filling up, we implemented a cleanup method that takes effect when disk utilization is high:

https://github.com/simonswine/phlare/blob/4b3c7f639b61c51cf9c846a6c4dea5913deea758/pkg/phlaredb/phlaredb.go#LL181C6-L181C6

Because this method runs once per tenant, it does not handle the following cases correctly:

  • An old tenant that no longer receives traffic (and hence has no active instance of PhlareDB cleaning up old blocks) will never get cleaned up.
  • With multiple tenants, each instance deletes the oldest blocks of its own tenant, which means that for a tenant that was only just onboarded it may even delete a very recent block.

Ideas

I am not too opinionated about how we solve this, but we could:

  • Implement a configurable retention period, so that we only ever delete blocks older than it.
  • Move the cleanup loop outside of PhlareDB and run it once for all tenants. Only clean up the oldest blocks across tenants until the disk is no longer in HighDiskUtilization.