containers / prometheus-podman-exporter

Prometheus exporter for podman environments exposing containers, pods, images, volumes and networks information.

SIGSEGV on startup until user runs "podman system reset"

ruipin opened this issue

I have a few Proxmox machines running a multitude of VMs, each with multiple users running various podman containers.

The VMs themselves are running Debian bookworm.

Each user runs prometheus-podman-exporter under a systemd service:

[Unit]
Description=Prometheus Podman Exporter
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers

StartLimitIntervalSec=300
StartLimitBurst=3

[Service]
Restart=on-failure
RestartSec=30

ExecStart=/bin/bash -c ' \
    exec /opt/prometheus/podman-exporter/prometheus-podman-exporter \
    --web.listen-address "127.0.0.1:$((40000 + %U))" \
    --collector.enable-all \
'

[Install]
WantedBy=default.target
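
For completeness, enabling such a unit for a user looks roughly like this (the unit file name podman-exporter.service is just a placeholder here):

# as root: allow the user's systemd instance to keep running without an active login session
loginctl enable-linger <username>

# as the podman user: pick up the unit (assumed to live under ~/.config/systemd/user/) and start it
systemctl --user daemon-reload
systemctl --user enable --now podman-exporter.service

With the %U UID specifier in ExecStart, a user with UID 1000 ends up listening on 127.0.0.1:41000 (40000 + 1000).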

This service is run by the various podman user accounts via systemctl --user. This works perfectly on many of my machines, but on one of them the exporter segfaults whenever I start the service, on any of its VMs and for any user:

ts=2024-04-25T18:46:14.302Z caller=exporter.go:68 level=info msg="Starting podman-prometheus-exporter" version="(version=1.11.0, branch=, revision=1)"
ts=2024-04-25T18:46:14.302Z caller=exporter.go:69 level=info msg=metrics enhanced=false
ts=2024-04-25T18:46:14.302Z caller=handler.go:94 level=info msg="enabled collectors"
ts=2024-04-25T18:46:14.302Z caller=handler.go:105 level=info collector=container
ts=2024-04-25T18:46:14.302Z caller=handler.go:105 level=info collector=image
ts=2024-04-25T18:46:14.302Z caller=handler.go:105 level=info collector=network
ts=2024-04-25T18:46:14.302Z caller=handler.go:105 level=info collector=pod
ts=2024-04-25T18:46:14.302Z caller=handler.go:105 level=info collector=system
ts=2024-04-25T18:46:14.302Z caller=handler.go:105 level=info collector=volume
ts=2024-04-25T18:46:14.306Z caller=events.go:17 level=debug msg="starting podman event streamer"
ts=2024-04-25T18:46:14.306Z caller=events.go:20 level=debug msg="update images"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xfaf217]

goroutine 1 [running]:
github.com/containers/common/libimage.(*Runtime).ListImages(0x0, {0x1c9a700, 0xc0003cbbf0}, {0x0?, 0xc00022d740?, 0x4176e5?}, 0x30?)
        /opt/prometheus/podman-exporter/v1.11.0/vendor/github.com/containers/common/libimage/runtime.go:587 +0x97
github.com/containers/podman/v5/pkg/domain/infra/abi.(*ImageEngine).List(0xc000178000, {0x1c9a700, 0xc0003cbbf0}, {0x0?, {0x0?, 0x763b80?, 0x27d8680?}})
        /opt/prometheus/podman-exporter/v1.11.0/vendor/github.com/containers/podman/v5/pkg/domain/infra/abi/images_list.go:25 +0x174
github.com/containers/prometheus-podman-exporter/pdcs.updateImages()
        /opt/prometheus/podman-exporter/v1.11.0/pdcs/image.go:48 +0x7c
github.com/containers/prometheus-podman-exporter/pdcs.StartEventStreamer({0x1c8b720?, 0xc00007f740}, 0x1)
        /opt/prometheus/podman-exporter/v1.11.0/pdcs/events.go:21 +0x1d3
github.com/containers/prometheus-podman-exporter/exporter.Start(0x0?, {0x0?, 0x0?, 0x0?})
        /opt/prometheus/podman-exporter/v1.11.0/exporter/exporter.go:93 +0x5fb
github.com/containers/prometheus-podman-exporter/cmd.run(0x27ebd40?, {0xc00007f480?, 0x4?, 0x19c8839?})
        /opt/prometheus/podman-exporter/v1.11.0/cmd/root.go:53 +0x1c
github.com/spf13/cobra.(*Command).execute(0x27ebd40, {0xc0000400b0, 0x4, 0x4})
        /opt/prometheus/podman-exporter/v1.11.0/vendor/github.com/spf13/cobra/command.go:987 +0xaa3
github.com/spf13/cobra.(*Command).ExecuteC(0x27ebd40)
        /opt/prometheus/podman-exporter/v1.11.0/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
        /opt/prometheus/podman-exporter/v1.11.0/vendor/github.com/spf13/cobra/command.go:1039
github.com/containers/prometheus-podman-exporter/cmd.Execute()
        /opt/prometheus/podman-exporter/v1.11.0/cmd/root.go:61 +0x1e
main.main()
        /opt/prometheus/podman-exporter/v1.11.0/main.go:8 +0xf

The VMs are running podman 4.3.1.

I have tried both prometheus-podman-exporter v1.10.1 and v1.11.0. The binaries were built with make clean && make binary, using the Debian golang package provided by bookworm-backports (Go 1.21.8).
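
Concretely, the build was something like this (the exact apt invocation may differ):

sudo apt install -t bookworm-backports golang     # Go 1.21.8 toolchain from bookworm-backports
git clone https://github.com/containers/prometheus-podman-exporter.git
cd prometheus-podman-exporter
git checkout v1.11.0                              # also tried v1.10.1
make clean && make binary                         # build the exporter binary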

This happens regardless of how many or which containers are running, and even if no containers are running at all. The segfault also occurs independently of the user (even for users who have never run a podman container), except for root, for whom everything works fine.

The Proxmox machines, VMs and containers are for the most part provisioned through Ansible, so the VMs should be virtually identical, just running different containers.

The issue goes away once each user has run podman system reset.

Apologies if this isn't enough information, but I have spent quite a bit of time trying to figure out what is wrong, and am wondering if you might have any ideas. I am happy to provide any extra information.

Funnily enough, immediately after writing this I realised that running podman system prune for each affected VM/user seems to fix the issue. Rubber ducky debugging in action 😅

I suspect that at some point something corrupted internal podman state that this exporter relies on (the trace above shows libimage's (*Runtime).ListImages being called on a nil runtime). Whatever it was affected all users and VMs provisioned on the same day, so it was presumably related to my Ansible script.

Leaving this issue open so that you are aware there might be a problem here, but feel free to close it if there isn't enough information to reproduce.

Just tried to provision a new user/service, and encountered this again.

It seems this is reproducible for any newly created user on this specific machine, at least until I run podman system reset once.
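
In other words, for a fresh user on this machine the sequence is roughly (binary path as in the unit above; the flags don't seem to matter):

# as a freshly created user who has never touched podman:
/opt/prometheus/podman-exporter/prometheus-podman-exporter --collector.enable-all
# -> panics with the SIGSEGV shown above

podman system reset     # wipe this user's podman state

/opt/prometheus/podman-exporter/prometheus-podman-exporter --collector.enable-all
# -> now starts and serves metrics normally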

Hi @ruipin

Have you tried building from the master branch?
As far as I know, podman v4 is not officially supported on Debian.

Can you also attach the output of podman system info?

I'll give the master branch a try (v1.12.0-dev).
Podman v4 is provided for Debian bookworm in the official Debian repositories, so I don't think it is unsupported? I've never had any issues with it, but maybe I'm missing something.

I've attached my podman system info output for one of the users:
podman-system-info.txt

Hi @ruipin
I've tried podman 4.3.1 under CentOS 8 Stream but am not facing your issue.
Going to try on Debian as well and will update you soon.

Can you run the prometheus-podman-exporter binary directly, instead of via the systemd service, for a new user and see if you still face the issue?

Regards

Yes, I see the same crash even if I run the binary directly; it does not matter whether I run it through systemd, which address/port I use, or whether I enable debug logging.

It is unfortunately not an easy one to reproduce, sorry.

It always happens when I create a new user until I do podman system reset, but I've only had it happen once on an existing user (that was previously working).

Hi @ruipin

I've installed Debian 12 (bookworm) on libvirt KVM with podman 4.3.1 and compiled the exporter (main branch) using golang 1.21.20, but I still cannot reproduce your issue, even after creating a new user.

Are you using a specific containers.conf for the new users, or is it the default one?

I am using a custom /etc/containers/containers.conf:

[containers]
# Use the k8s-file driver
# The default journald driver isn't a good fit for rootless containers and is discouraged by RedHat https://access.redhat.com/solutions/7009652
# With it, we can't use rootless 'podman logs' command
#log_driver = "k8s-file"

# Use the journald driver, so that we can see stdout/err when a container fails on start
# by running e.g. "sudo journalctl _SYSTEMD_USER_UNIT=<pod>:<container>.service"
log_driver = "journald"

# We can use the journald driver for events, as those are targeted at root
events_logger = "journald"

[networks]
# Change the default network range here
default_network = "podman"
default_subnet = "10.255.255.0/24"

I also use a custom /etc/containers/storage.conf:

[storage]
# Define the storage driver. We probably want 'overlay' or 'zfs'
driver = "overlay"

# Default paths, need to be explicitly set
runroot = "/run/containers/storage"
graphroot = "/var/lib/containers/storage"

I have tried creating a new user and performing the individual steps of podman system reset manually, one at a time, until it worked.

It seems stopping all containers, deleting /run/user/990/libpod, and then restarting the containers fixes the issue.
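
Roughly this, run as the affected user (990 is that user's UID; a sketch rather than an exact transcript):

podman stop --all                   # stop every container owned by this user
rm -rf /run/user/$(id -u)/libpod    # remove the stale per-user libpod runtime state
podman start --all                  # bring the containers back up (or restart their systemd units)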

A friendly reminder that this issue had no activity for 30 days.