grafana/pyroscope

Continuous Profiling Platform. Debug performance issues down to a single line of code

Home Page: https://grafana.com/oss/pyroscope/

Repository from GitHub: https://github.com/grafana/pyroscope

panic: runtime error: index out of range in pyroscope/pkg/util/loser.(*Tree[...]).Winner(...)

Mrucznik opened this issue

Describe the bug

pyroscope_1 | panic: runtime error: index out of range [-1]
pyroscope_1 |
pyroscope_1 | goroutine 2007 [running]:
pyroscope_1 | github.com/grafana/pyroscope/pkg/util/loser.(*Tree[...]).Winner(...)
pyroscope_1 | github.com/grafana/pyroscope/pkg/util/loser/tree.go:99
pyroscope_1 | github.com/grafana/pyroscope/pkg/iter.(*TreeIterator[...]).Err(0x10?)
pyroscope_1 | github.com/grafana/pyroscope/pkg/iter/tree.go:25 +0x4d
pyroscope_1 | github.com/grafana/pyroscope/pkg/parquet.(*IteratorRowReader).ReadRows(0xc128eefbe0, {0xc024c19c08, 0x40, 0xc0029dd440?})
pyroscope_1 | github.com/grafana/pyroscope/pkg/parquet/row_reader.go:86 +0x172
pyroscope_1 | github.com/grafana/pyroscope/pkg/parquet.CopyAsRowGroups({0x3878b88, 0xc07db0bb60}, {0x386f920, 0xc128eefbe0}, 0x186a0)
pyroscope_1 | github.com/grafana/pyroscope/pkg/parquet/row_writer.go:32 +0xea
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb.(*profileStore).writeRowGroups(0xc031236b40, {0xc10592edc0?, 0x4f31140?}, {0xc0029dd350, 0x3, 0x484302d8?})
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb/profile_store.go:383 +0x21a
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb.(*profileStore).Flush(0xc031236b40, {0x3888f10, 0x4f31140})
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb/profile_store.go:186 +0x317
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb.(*Head).flush(0xc084d24140, {0x3888f10, 0x4f31140})
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb/head.go:563 +0x234
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb.(*Head).Flush(0xc084d24140, {0x3888f10, 0x4f31140})
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb/head.go:540 +0xd9
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb.(*PhlareDB).Flush.func1(0xc084d24140, 0x7f8509a673c8?)
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb/phlaredb.go:240 +0xb4
pyroscope_1 | github.com/samber/lo.Filter[...]({0xc10def6b00, 0x1, 0x7}, 0xc018bcbde0?)
pyroscope_1 | github.com/samber/lo@v1.38.1/slice.go:15 +0x9f
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb.(*PhlareDB).Flush(0xc003ddeb40, {0x3888f10, 0x4f31140}, 0x0, {0x28a5490, 0xc})
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb/phlaredb.go:238 +0x932
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb.(*PhlareDB).loop(0xc003ddeb40)
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb/phlaredb.go:188 +0x28f
pyroscope_1 | created by github.com/grafana/pyroscope/pkg/phlaredb.New in goroutine 1883
pyroscope_1 | github.com/grafana/pyroscope/pkg/phlaredb/phlaredb.go:137 +0x476

To Reproduce

Steps to reproduce the behavior:

Don't know, didn't investigate that much.

Expected behavior

No panic.

Environment

  • Infrastructure: bare-metal, docker on Linux 3.10.0-1062.9.1.el7.x86_64 CentOS Linux 7
  • Deployment tool: docker-compose

Additional Context

Pyroscope version is grafana/pyroscope:1.12.0, run with -target=all. Mostly standard configuration (besides limits), using S3 as the data store.

Hi @Mrucznik,

Thank you for reporting the issue! Could you please clarify whether the problem occurs persistently, or whether it happened only once and is not reproducible?

I suspect there might be a silent filesystem failure that we do not handle correctly on our end. Could you please specify how the volume is mounted, which filesystem is being used, and whether there have been any issues, such as running out of space or similar problems (one indication of this could be the "cleaned files after high disk utilization" message in the log)?

Pyroscope had been working without problems for about a month; the panics started yesterday and have happened 3 times so far. The filesystem is XFS.

This is my docker-compose configuration:

version: '3.3'

services:
  pyroscope:
    image: "grafana/pyroscope:1.12.0"
    ports:
      - "7070:4040" # http
      - "7071:9095" # grpc
    volumes:
      - ./config.yaml:/etc/pyroscope/config.yaml:ro
      - ./data:/var/lib/pyroscope
    command: ["-target=all", "-config.file=/etc/pyroscope/config.yaml"]

I didn't find "cleaned files after high disk utilization" in the logs.
Are you suggesting that it could happen due to running out of space on the disk? Currently I don't see a problem with that, but maybe the files get deleted when the Docker container gets killed. I will take a look at that.
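
For reference, one quick way to take that look, assuming the ./data volume mounted at /var/lib/pyroscope shown in the compose file above, would be a small standalone Go snippet along these lines (illustrative only, not part of Pyroscope, and Linux-specific since it uses syscall.Statfs):

package main

import (
    "fmt"
    "syscall"
)

func main() {
    // Mount point of the Pyroscope data volume from the docker-compose file above.
    const dataDir = "/var/lib/pyroscope"

    var st syscall.Statfs_t
    if err := syscall.Statfs(dataDir, &st); err != nil {
        panic(err)
    }
    free := st.Bavail * uint64(st.Bsize)  // bytes available to unprivileged processes
    total := st.Blocks * uint64(st.Bsize) // total filesystem size in bytes
    fmt.Printf("%s: %d of %d bytes free (%.1f%%)\n",
        dataDir, free, total, 100*float64(free)/float64(total))
}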

Today I found very high RAM usage by Pyroscope, along with the following logs:

[Image: inuse space profile]

pyroscope_1  | ts=2025-02-21T10:27:43.162488147Z caller=compactor.go:646 level=error component=compactor component=compactor msg="failed to compact user blocks" tenant=anonymous err="compaction: group 0@17241709254077376921-merge--1740038400000-1740042000000: compact blocks [data-compactor/compact/0@17241709254077376921-merge--1740038400000-1740042000000/01JMH5XT1EWPPKJ5A2TVBVH2J7 data-compactor/compact/0@17241709254077376921-merge--1740038400000-1740042000000/01JMHADR4M3EE9B7EPVKZRR2J9]: compact blocks [data-compactor/compact/0@17241709254077376921-merge--1740038400000-1740042000000/01JMH5XT1EWPPKJ5A2TVBVH2J7 data-compactor/compact/0@17241709254077376921-merge--1740038400000-1740042000000/01JMHADR4M3EE9B7EPVKZRR2J9]: decoding page 19 of column \"Samples.list.element.StacktraceID\": decoding definition levels of data page v2: unexpected EOF"

Thanks for the details @Mrucznik!

The log message may indicate a file system error (I'd expect to see a CRC error in this case, however).

Could you please post the configuration you use and tell us more about the setup? If you could also specify what profilers you're using and the ingestion rate (the size and the number of profiles sent to pyroscope), that would be very helpful.

The profile fragment may suggest that overly large blocks are being compacted, or that too many stack traces are stored in blocks. In the meantime, without the full profile, I can't conclude anything – as far as I understand, this is a heap alloc_space profile collected over a period of time: the sum of all allocations, including freed ones. You probably want to check the inuse_space profile collected over a very short period – ideally, just a single profile.
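
For intuition, the distinction above maps onto Go's own heap accounting: alloc_space counts cumulative bytes ever allocated (including objects already freed), while inuse_space counts bytes still live. Below is a minimal standalone Go sketch (unrelated to the Pyroscope codebase; the output file name is just an example) that prints the two counters and writes a single heap profile, which carries both sample types:

package main

import (
    "fmt"
    "os"
    "runtime"
    "runtime/pprof"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    // TotalAlloc: cumulative bytes allocated for heap objects, including
    // objects that have already been freed (the alloc_space idea).
    fmt.Printf("cumulative allocations (alloc_space-like): %d bytes\n", m.TotalAlloc)

    // HeapAlloc: bytes of heap objects currently in use (the inuse_space idea).
    fmt.Printf("live heap (inuse_space-like): %d bytes\n", m.HeapAlloc)

    // A single heap profile already contains alloc_objects/alloc_space and
    // inuse_objects/inuse_space sample types, so one snapshot is enough.
    f, err := os.Create("heap.pprof")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    if err := pprof.WriteHeapProfile(f); err != nil {
        panic(err)
    }
}

Viewed with go tool pprof -sample_index=inuse_space heap.pprof, such a profile shows only the memory still in use, which is what the comment above suggests collecting.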