grafana / phlare

🔥 horizontally-scalable, highly-available, multi-tenant continuous profiling aggregation system

Home Page: https://grafana.com/oss/phlare/

Sort Rows by Series when flushing to disk

cyriltovena opened this issue

Currently, when we flush profiles to disk, we don't sort rows across row groups.

I believe we should be able to stream all row groups from

func (s *profileStore) writeRowGroups(path string, rowGroups []parquet.RowGroup) (n uint64, numRowGroups uint64, err error) {

and reorder them by SeriesID, then timestamp.
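
To make the idea concrete, here is a minimal sketch of such a streaming reorder as a k-way merge. It uses only the standard library, and the `row`, `cursor`, and `mergeRowGroups` names are hypothetical stand-ins, not Phlare's types or the Parquet library's API. It assumes each input row group is already sorted by SeriesID then timestamp, which matches the per-row-group ordering described in this issue.

```go
package main

import (
	"container/heap"
	"fmt"
)

// row is a hypothetical, simplified stand-in for one profile row.
type row struct {
	SeriesID  uint32
	Timestamp int64
}

// less orders rows by SeriesID first, then by Timestamp.
func less(a, b row) bool {
	if a.SeriesID != b.SeriesID {
		return a.SeriesID < b.SeriesID
	}
	return a.Timestamp < b.Timestamp
}

// cursor tracks the read position within one already-sorted row group.
type cursor struct {
	rows []row
	pos  int
}

// cursorHeap is a min-heap of cursors keyed by each cursor's current row.
type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	return less(h[i].rows[h[i].pos], h[j].rows[h[j].pos])
}
func (h cursorHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeRowGroups streams the rows of all groups in global
// (SeriesID, Timestamp) order and calls emit once per row.
// Only one cursor per group is kept on the heap; no group is copied.
func mergeRowGroups(groups [][]row, emit func(row)) {
	h := make(cursorHeap, 0, len(groups))
	for _, g := range groups {
		if len(g) > 0 {
			h = append(h, &cursor{rows: g})
		}
	}
	heap.Init(&h)
	for h.Len() > 0 {
		c := h[0] // cursor holding the smallest current row
		emit(c.rows[c.pos])
		c.pos++
		if c.pos == len(c.rows) {
			heap.Pop(&h) // this group is exhausted
		} else {
			heap.Fix(&h, 0) // restore heap order after advancing
		}
	}
}

func main() {
	// Three row groups, each sorted internally, loosely ordered by time.
	groups := [][]row{
		{{1, 10}, {2, 10}, {3, 10}},
		{{1, 20}, {2, 20}},
		{{2, 30}, {3, 30}},
	}
	mergeRowGroups(groups, func(r row) {
		fmt.Printf("series=%d ts=%d\n", r.SeriesID, r.Timestamp)
	})
}
```

In the real code path the cursors would read from Parquet row readers rather than slices, and the Parquet library in use may already offer a merge facility that does this; the sketch only shows the ordering logic.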

This will improve data locality a ton, but I am a bit unsure how this will impact querying, as the order of querying will be:

  • Timestamp first then SeriesID

And blocks will be strictly stored in

  • SeriesID first then Timestamp

Currently the sorting is more like:

  • Within a single row group strictly: Series first then Timestamp
  • Across row groups, loosely timestamp ordered

If we query a time range that covers only part of the ~3 hours within a block, we can currently get away with reading only the pages that fall within that time range (based on each page's timestamp min/max). With this change, we can only narrow down pages by their SeriesID min/max. I don't expect a major issue, I am just wondering.
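
The trade-off can be illustrated with a small sketch of min/max page pruning. Everything here is hypothetical: `pageStats` is an invented summary of a page, and the numbers are made up to contrast the two layouts. It is not how Phlare or the Parquet library represent page statistics.

```go
package main

import "fmt"

// pageStats is a hypothetical per-page summary: the min/max of the
// timestamp column and of the SeriesID column for the rows in the page.
type pageStats struct {
	MinTS, MaxTS         int64
	MinSeries, MaxSeries uint32
}

// overlapsTime reports whether the page may hold rows in [from, through].
func overlapsTime(p pageStats, from, through int64) bool {
	return p.MaxTS >= from && p.MinTS <= through
}

// overlapsSeries reports whether the page may hold rows for the series.
func overlapsSeries(p pageStats, series uint32) bool {
	return p.MinSeries <= series && series <= p.MaxSeries
}

// countPages returns how many pages cannot be skipped under keep.
func countPages(pages []pageStats, keep func(pageStats) bool) int {
	n := 0
	for _, p := range pages {
		if keep(p) {
			n++
		}
	}
	return n
}

func main() {
	// Loosely timestamp-ordered layout: each page spans a narrow time
	// window but contains every series.
	byTime := []pageStats{
		{0, 59, 1, 3}, {60, 119, 1, 3}, {120, 179, 1, 3},
	}
	// SeriesID-first layout: each page holds one series but spans the
	// whole time range of the block.
	bySeries := []pageStats{
		{0, 179, 1, 1}, {0, 179, 2, 2}, {0, 179, 3, 3},
	}

	var from, through int64 = 60, 100
	inRange := func(p pageStats) bool { return overlapsTime(p, from, through) }
	forSeries2 := func(p pageStats) bool { return overlapsSeries(p, 2) }

	fmt.Println("time filter, time layout:    ", countPages(byTime, inRange))
	fmt.Println("time filter, series layout:  ", countPages(bySeries, inRange))
	fmt.Println("series filter, time layout:  ", countPages(byTime, forSeries2))
	fmt.Println("series filter, series layout:", countPages(bySeries, forSeries2))
}
```

With these made-up pages, a time-only filter skips pages in the time-ordered layout but none in the SeriesID-first layout, and a series filter behaves the other way around. Real blocks will sit somewhere in between, depending on how many series and how much time each page spans.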

Fairly relevant to the change in #799

Maybe we also need to change our query behaviour so that it follows the order used in the blocks.