ovh / beamium

Prometheus to Warp10 metrics forwarder

Race condition on file move/removal

babolivier opened this issue · comments

When running beamium to forward metrics from matrix-org/synapse to OVH Metrics, I get errors in beamium's logs. Debug logs lead me to think there's a race condition between two threads trying to move/remove the same file at the same time:

Oct 05 15:24:48.918 DEBG rotate tmp file to "sources/synapse-1507217088106757.metrics", scraper: synapse
Oct 05 15:24:48.923 DEBG rotate tmp file to "sources/synapse-1507217088106036.metrics", scraper: synapse
Oct 05 15:24:48.924 INFO fetch success, scraper: synapse
Oct 05 15:24:48.924 ERRO fetch fail: No such file or directory (os error 2), scraper: synapse
Oct 05 15:24:48.932 DEBG load file sinks/ovh-metrics-1507216980#634240071-1.metrics, sink: ovh-metrics
Oct 05 15:24:48.936 DEBG rotate tmp sink file to "sinks/ovh-metrics-1507217088#56024268-1.metrics"
Oct 05 15:24:48.939 DEBG rotate tmp sink file to "sinks/ovh-metrics-1507217088#56024268-1.metrics"
Oct 05 15:24:48.939 ERRO route fail: No such file or directory (os error 2)

The scrapes and posts seem to work fine, though (since moving and removing these files occurs after these operations).
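To illustrate the kind of race I suspect, here is a minimal sketch (not beamium's actual code). It assumes two workers end up rotating the same temporary file: on Linux, `rename` is atomic, so exactly one rename succeeds and the other fails with "No such file or directory (os error 2)". The function name `rotate_twice` and the file names are hypothetical and only mimic the log lines above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;

/// Hypothetical reproduction: two workers try to rotate the same temporary
/// file into two different destinations inside `dir`.
/// Returns the outcome of each rename attempt.
fn rotate_twice(dir: &Path) -> Vec<io::Result<()>> {
    // The shared temporary file both workers believe they own.
    let tmp: PathBuf = dir.join("synapse.tmp");
    fs::write(&tmp, b"metric 1\n").expect("create tmp file");

    // Spawn two workers that each rename the same source file.
    let handles: Vec<_> = (0..2)
        .map(|i| {
            let src = tmp.clone();
            let dst = dir.join(format!("synapse-{}.metrics", i));
            thread::spawn(move || fs::rename(&src, &dst))
        })
        .collect();

    // Whichever worker loses the race finds the source already gone.
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

Calling `rotate_twice` on an empty scratch directory should yield one `Ok(())` and one error of kind `NotFound`, whichever thread runs first. That would match the pattern in the logs: one "fetch success" immediately followed by one "fetch fail".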

This was first observed with the rustc nightly from October 2nd, and the logs above were obtained after compiling beamium with the latest nightly as of today. Not sure if this is relevant, but the distro is Arch Linux.

Just a quick update with stuff I observed today: I tried setting up beamium to monitor a server with the metrics from Prometheus's node-exporter. Since I didn't want to build beamium on the server, I downloaded a binary from the latest GitHub release and wrote a quick config file. To test it, I ran beamium in a screen (which I terminated afterwards) and it worked like a charm, without the slightest sign of a race condition.

Then I moved everything to /etc/beamium and created a dedicated user and a systemd service. Since then, whenever I start beamium (both with and without systemd), I see the errors again. I have no idea how this happened, since literally nothing else changed between my initial test and the second one. Of course, if someone can think of any info I could give to help understand why this happens, please ask.

Again, a quick update, but this time I think I found why the race condition was happening: I moved beamium from /etc/beamium to /etc/beamiumx and the errors stopped. I identified where the problem was and will open a PR later today.