HSF / prmon

Standalone monitor for process resource consumption

Overflowing network metrics

amete opened this issue

It's reported that under certain circumstances, network metrics seem to be overflowing:

Time	[...]	rx_bytes	rx_packets	tx_bytes	tx_packets
[...]
1607450732	[...]	182018452734	148438951	4801778623	13500137
1607450793	[...]	18446743984687580690	18446744073623414305	18446744072751791668	18446744073703405569
[...]

The problem is reported for the following ATLAS PanDA task: https://bigpanda.cern.ch/job?pandaid=4914743947

The full prmon text file result can be found here and is also attached at [1].

[1] memory_monitor_output.txt.tar.gz

Hard to know for sure, but one possibility might be an issue reading/parsing [1] the network_stats, coupled with the logic at:

text_stats[if_param] =
network_stats[if_param] - network_stats_start[if_param];

under the condition network_stats[if_param] < network_stats_start[if_param] but that's just a guess.

Edit: [1] The device-level metrics might also be resetting in the middle of the job, causing this issue. If this is indeed the case, then we can reset the reference whenever the above condition is met, or something to that effect.
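
For illustration, here is a minimal standalone snippet (not prmon code, values made up) showing how the unsigned subtraction wraps around when the current reading drops below the reference taken at the start of the job:

#include <iostream>

int main() {
  // Reference taken at job start and a later reading after the device
  // counters have been reset (hypothetical numbers).
  unsigned long long network_stats_start = 4801778623ULL;
  unsigned long long network_stats = 3843854ULL;

  // Unsigned arithmetic wraps modulo 2^64 instead of going negative,
  // so the difference ends up just below 18446744073709551615,
  // which is the order of magnitude seen in the report above.
  unsigned long long diff = network_stats - network_stats_start;
  std::cout << diff << std::endl;
  return 0;
}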

Hmmm, this is all really weird - network_stats itself is unsigned long long, so it should not care about the sign. Also, we read directly into an unsigned long long, so I don't see how we could have hit any overflow:

unsigned long long value_read{};
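
For context, a rough sketch of what such a read might look like, assuming the per-interface counters under /sys/class/net/<dev>/statistics/ (the actual prmon implementation may differ in detail):

#include <fstream>
#include <iostream>
#include <string>

// Read a single network counter (e.g. rx_bytes) for a given interface.
// The kernel exposes these as u64, so the value itself is never negative.
unsigned long long read_counter(const std::string& iface,
                                const std::string& counter) {
  unsigned long long value_read{};
  std::ifstream stat_file("/sys/class/net/" + iface + "/statistics/" + counter);
  if (stat_file) stat_file >> value_read;
  return value_read;
}

int main() {
  std::cout << "eth0 rx_bytes: " << read_counter("eth0", "rx_bytes") << std::endl;
  return 0;
}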

Can it be that everything is positive definite, but the metrics get reset at some point, so that the current measurements become smaller than the ones at the beginning of the job and the right-hand side of the above calculation goes negative?

I checked the kernel and I think the counters are also implemented as u64, so equivalent to unsigned long long. The max value is 18446744073709551615, quite close to what we see in the job. So, yes, it could be that this machine had been up for so long and passed so much network traffic that the counters overflowed. We need some strategy then for coping with this.

OK, I have realised that there is something I really don't get - all of the values go bananas at the same time, even though the values for tx/rx and packets/bytes are all different and would not all overflow at the same time.

Could this have been caused by the system taking a network interface down and then bringing it back up (causing a counter reset) during the job?

Just to note that this is useful information on the kernel's stats structure:

https://www.kernel.org/doc/html/latest/networking/statistics.html

However, I'm struggling to find documentation on what happens if/when there's an overflow or under what circumstances the counters could be reset (note that a simple ifconfig DEV down/up does not reset the counters).

I just wanted to write a little here, as I haven't forgotten about this issue. However, let me say that it's really hard to deal with in general, as we don't understand the conditions under which the kernel's network device counters get messed up.

I am also wary of putting in a fix for what could be one-off events where the underlying system went bonkers, as that really could have unexpected side effects.

First off, let me say that a back-of-the-envelope calculation indicates that a 64-bit counter is almost never going to overflow. A machine with an uptime of a year could pass traffic at more than 500 GB/s and still not overflow the bytes counter, so I think that's highly unlikely.
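
Just to spell out that estimate (rounded numbers, only the order of magnitude matters):

#include <iostream>

int main() {
  // Sustained byte rate needed to wrap a 64-bit counter within one year of uptime.
  const double counter_max = 18446744073709551615.0;  // 2^64 - 1
  const double seconds_per_year = 365.0 * 24.0 * 3600.0;
  std::cout << counter_max / seconds_per_year / 1e9
            << " GB/s" << std::endl;  // roughly 585 GB/s
  return 0;
}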

The difficulty is knowing when a measurement is reasonable and when it's not. The only reasonable thing I can think of is an algorithm like this:

  1. Measure network stats at startup
  2. Store stats as last-value
  3. Each cycle, measure network stats as current-value
  4. Is current-value - last-value sane? e.g. (current-value > last-value) && (current-value - last-value < sanity-value)
  • If yes, increment the prmon stat by this value
  • If no, print a warning and do not increment the stats
  5. Store current-value as last-value
  6. Repeat till the end of the job (see the sketch after this list)
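
A minimal sketch of what that per-counter logic could look like; the names and the sanity threshold are placeholders, not an actual prmon implementation:

#include <iostream>

// Accumulate a monotonic counter while guarding against resets or garbage.
struct GuardedCounter {
  explicit GuardedCounter(unsigned long long start_value)
      : last_value(start_value) {}   // steps 1-2: the startup reading is the reference

  unsigned long long last_value;     // reading from the previous cycle
  unsigned long long accumulated{};  // value prmon would actually report
  unsigned long long error_count{};  // number of rejected readings

  // Placeholder cap on a single-cycle increment (here ~1 TB).
  static constexpr unsigned long long sanity_value = 1ULL << 40;

  void update(unsigned long long current_value) {  // steps 3-5, run each cycle
    if (current_value >= last_value &&
        current_value - last_value < sanity_value) {
      accumulated += current_value - last_value;
    } else {
      std::cerr << "suspicious network counter reading, not accumulated" << std::endl;
      ++error_count;
    }
    last_value = current_value;  // step 5: always move the reference forward
  }
};

Rejected readings are simply dropped rather than corrected, so one bad cycle costs at most one sampling interval of traffic.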

What do you think?

Something along the lines of this was my initial idea, so it makes perfect sense to me. I'd actually take it a step further and say the sanity-value should be 0 in item 4, at least naively. So, it would suffice to do (current-value >= last-value) and anything else should be considered suspicious.

I wonder if we should have some error counter, which we could store in the JSON file. It would make it easier to spot when this happens and would help with the post-processing (e.g. how many times it happened).
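
If it helps, a hypothetical sketch of what that could look like in the output, assuming a JSON library along the lines of nlohmann::json; the field name network_read_errors is made up for illustration:

#include <iostream>
#include <nlohmann/json.hpp>

int main() {
  // Hypothetical counter, bumped each time a network reading is rejected.
  unsigned long long network_read_errors = 1;

  nlohmann::json summary;
  summary["rx_bytes"] = 182018452734ULL;                 // normally accumulated metric
  summary["network_read_errors"] = network_read_errors;  // proposed extra field
  std::cout << summary.dump(2) << std::endl;
  return 0;
}

Post-processing could then simply check whether the counter is non-zero to flag affected jobs.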