Overflowing network metrics
amete opened this issue
It's reported that under certain circumstances, network metrics seem to be overflowing:
```
Time       [...] rx_bytes             rx_packets           tx_bytes             tx_packets
[...]
1607450732 [...] 182018452734         148438951            4801778623           13500137
1607450793 [...] 18446743984687580690 18446744073623414305 18446744072751791668 18446744073703405569
[...]
```
The problem is reported for the following ATLAS PanDA task: https://bigpanda.cern.ch/job?pandaid=4914743947
The full prmon text file result can be found here and is also attached at [1].
Hard to know for sure, but one possibility might be an issue reading/parsing [1] the `network_stats`, coupled with the logic at:

Lines 92 to 93 in 3cb5c22

under the condition `network_stats[if_param] < network_stats_start[if_param]`, but that's just a guess.
Edit: [1] The device-level metrics might also be resetting in the middle of the job, causing this issue. If this is indeed the case, then we can reset the reference if the above condition is met or something to that extent.
Hmmm, all really weird - `network_stats` itself is `unsigned long long`, so it should not care. Also, we read directly into an `unsigned long long` (Line 78 in 3cb5c22), so it's hard to see how we could have hit any overflow.
Can it be that everything is positive definite but the metrics get reset at some point so the current measurements become smaller than the ones at the beginning of the job and the right-hand side of the above calculation goes negative?
I checked the kernel and I think the counters are also implemented as `u64`, so equivalent to `unsigned long long`. The max value is 18446744073709551615, quite close to what we see in the job. So, yes, it could be that this machine had been up for so long and passed so much network traffic that the counters overflowed. We need some strategy then for coping with this.
OK, I have realised that there is something I really don't get - all of the values go bananas at the same time, even though the values for tx/rx and packets/bytes are all different and would not all overflow at the same time.
Could this have been caused by the system taking a network interface down and then bringing it back up (causing a counter reset) during the job?
Just to note that this is useful information on the kernel's stats structure:
https://www.kernel.org/doc/html/latest/networking/statistics.html
However, I'm struggling to find documentation on what happens if/when there's an overflow, or under what circumstances the counters could be reset (note that a simple `ifconfig DEV down/up` does not reset the counters).
I just wanted to write a little here, as I didn't forget about this issue. However, let me say that it's really hard to deal with in general, as we don't understand the conditions under which the kernel's network device counters get messed up.
I am also wary of putting in a fix for what could be one-off events where the underlying system went bonkers, as that really could have unexpected side-effects.
First off, let me say that a back-of-the-envelope calculation indicates that a 64-bit counter is almost never going to overflow. A machine that has an uptime of a year could pass traffic at more than 500GB/s and still not overflow the `bytes` counter, so I think that's highly unlikely.
The difficulty is knowing when a measurement is reasonable and when it's not. The only reasonable thing I can think of is an algorithm like this:
1. Measure network stats at startup
2. Store stats as `last-value`
3. Each cycle, measure network stats as `current-value`
4. Is `current-value - last-value` sane? e.g. `(current-value > last-value) && (current-value - last-value < sanity-value)`
5. If yes, increment the prmon stat by this value
6. If no, print a warning and do not increment the stats
7. Store `current-value` as `last-value`
8. Repeat till end of job
What do you think?
Something along the lines of this was my initial idea, so it makes perfect sense to me. I'd actually take it a step further and say the `sanity-value` should be 0 in item 4, at least naively. So, it would suffice to do `(current-value >= last-value)`, and anything else should be considered suspicious.
I wonder if we should have some error counter, which we can store in the JSON file. It would make it easier to spot if this happens and would help with post-processing (how many times this happened, etc.).
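Purely for illustration, the JSON summary could carry a field along these lines (the field name is invented here, not prmon's actual schema):

```json
{
  "Max": { "rx_bytes": 182018452734, "...": "..." },
  "netdev_suspicious_readings": 1
}
```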