mum4k / tc_reader

A persistent SNMP script that exports TC Queue and Class statistics for graphing (for example to Cacti).

Strange values after value out of range bug correction

Roxyrob opened this issue

Hi mum4k,
I upgraded to the last commit you made to correct the "value out of range" bug, and now I see some strange behavior. Before upgrading to the new tc_reader, I checked that a parent class's current total bytes was the exact sum of all its leaf classes' current bytes, and that each leaf class's current bytes matched its corresponding leaf qdisc.

After upgrading to the last commit, those values appear unrelated. Could it be the 64-bit value correction you made in the last commit?

Hi Roxyrob,

It is certainly possible that by fixing one bug I created a new one. I have done that before 😀

Would you be able to collect the data needed to reproduce the behavior? I.e., the corresponding "tc qdisc" and "tc class" outputs, together with a description of the expected vs. the actual parsed information?

That way we can easily catch it in a test case and fix it.

Jakub

Ok. I have 2 screenshots, not synchronized in their values (as that is very difficult :D), but you can see 2 different representations of the data: one from tc_reader (graphed by Cacti) and one taken directly from the system with the tc command:

[screenshot: tc_aggregated_output (tc command)]

You can verify that QoS works well on the system by looking at the "tc_aggregated_output.jpg" file:

  • 1:1 is the sum of all, as expected
  • every inner class (1:10, 1:20, 1:30, 1:40, 1:50) is the sum of its leaf classes in both bytes and packets, as expected (I use 1:1x as an inner class with 1:10x as its leaves, 1:2x with 1:20x leaves, and so on)

[screenshot: cacti_aggregate_lan (Cacti aggregated graph)]
in "cacti_aggregate_lan.jpg" you can see that data is not consistent as:

  • 1:1 is the root class, under the 1:0 qdisc, so it should count all packets, but it is not the sum of all
  • 1:10 is the parent of 1:100 and 1:101, so it should be the sum of these 2 classes, and it is not
  • the same for 1:20 with 1:200... and so on
  • also (not shown in this graph, but I can make a new graph for it if you need) I checked that, e.g., for 1:200 and 200: (a leaf class and its sfq qdisc) the data differ, and they should not

If I'm not reading these data wrong, I hope this helps you understand the issue.

Thanks for providing the screenshots. To confirm that I am reading this right: class 1:10 shows a current rate of 13.23k, while it should be the sum of classes 1:100 at 15.13M and 1:101 at 24.38k?

That is a large difference, so it should be relatively easy to identify the issue. I have a couple of ideas, but to confirm which one it is, I would ask you to provide a few more data points / perform some more experiments:

  1. Could you verify that this was indeed caused by the last change? Can you compile against the previous release and confirm that the problem disappears?
  2. Can we try to remove Cacti from the picture, to ease the troubleshooting? Would you be able to run tc_reader directly, identify the problem in its outputs and paste them here?
  3. Can you also provide the full output of "tc qdisc" and "tc class" from roughly the same time as you collect the other inputs? Doesn't have to show the same values.

Two things come to mind that could cause something like this:
a) we could be dealing with an invalid Cacti configuration where Cacti reads the values into 32-bit ints. To confirm, we would need to look at the counter values - are they under or over the 32-bit int range when we read them?
b) tc_reader runs two commands, "tc qdisc" followed by "tc class"; these happen one after the other, so some difference between the two values read is possible. However, what you pasted seems to show differences entirely within the output of "tc class".

Please see if you can work on and provide (1), (2) and (3) from above.

Jakub

To start, I'm sending you 2 files for tc and snmpwalk, run at practically the same time:

"tc.out.txt" = output of "tc -s qdisc show dev lan" and "tc -s class show dev lan"
"snmp.query.txt" = output of "snmpwalk -v2c -c comm hostname .1.3.6.1.4.1.2021.255"

Now that you point me to 64-bit counters, I see snmpwalk returning Counter32. Could it be that tc_reader has to define the OID counter as Counter64 instead of Counter32, so that SNMP can identify it correctly? Cacti is out of the picture for now; I'll look for where to specify this in Cacti after this first low-level check.

tc.out.txt
snmp.query.txt

I think you've got it right, Roxyrob. Just looking at the first output, for the root qdisc (1:0) the tc command reports:
Sent 3301093859366

While the SNMP Counter32 only holds what I assume to be the overflowed 32-bit value:
Counter32: 418011198

I need to do some background study to see if / how we can use a Counter64 instead.
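
To illustrate the wraparound with the numbers from your files (a quick sketch; the two samples were taken at slightly different moments, so the figures don't line up exactly):

    package main

    import "fmt"

    func main() {
        // 64-bit byte counter reported by `tc -s qdisc` in tc.out.txt.
        sent := uint64(3301093859366)

        // A Counter32 can only hold the low 32 bits, i.e. sent mod 2^32.
        fmt.Println(uint32(sent)) // 2558976038
    }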

uint64 instead of int64?

snmp.go
sentBytes uint64

parser.go
var sentBytes uint64

etc
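
Something along these lines, as a rough sketch (parseSentBytes is a hypothetical name, not the actual tc_reader function):

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // parseSentBytes extracts the byte counter from a `tc -s` statistics
    // line. Parsing with bitSize 64 keeps the full range; a 32-bit parse
    // would fail with "value out of range" past 4294967295.
    func parseSentBytes(line string) (uint64, error) {
        fields := strings.Fields(line)
        if len(fields) < 2 || fields[0] != "Sent" {
            return 0, fmt.Errorf("unexpected line: %q", line)
        }
        return strconv.ParseUint(fields[1], 10, 64)
    }

    func main() {
        // The byte value is real (from tc.out.txt); the packet count is
        // made up for the example.
        b, err := parseSentBytes("Sent 3301093859366 bytes 12345 pkt (dropped 0, overlimits 0 requeues 0)")
        fmt.Println(b, err) // 3301093859366 <nil>
    }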

Looking at the NET-SNMP documentation for persistent scripts:

http://net-snmp.sourceforge.net/docs/man/snmpd.conf.html#lbAZ

Specifically the "MIB-Specific Extension Commands":
"Note, The SMIv2 type counter64 and SNMPv2 noSuchObject exception are not supported."

So I think we have two options going forward:

  1. Look into enhancing the NET-SNMP code to also support the counter64 type from a persistent script.
  2. Update tc_reader - let it parse 64-bit values but consistently wrap them into 32-bit ints for NET-SNMP (see the sketch below this list). This would limit us to graphing bandwidth up to roughly 100 Mbps: a 32-bit byte counter wraps every 2^32 bytes, which at 100 Mbps is about every 344 seconds, barely longer than a typical 5-minute polling interval. It would start behaving strangely above that, but it could cover some use cases.
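
Roughly, the wrapping in option 2 would boil down to this (a sketch, not actual tc_reader code):

    package main

    import "fmt"

    // wrapToCounter32 keeps the low 32 bits of a 64-bit counter, matching
    // how a real Counter32 would have wrapped on its own (v mod 2^32), so
    // RRD tools can treat the wrap as a normal Counter32 rollover.
    func wrapToCounter32(v uint64) uint32 {
        return uint32(v)
    }

    func main() {
        fmt.Println(wrapToCounter32(3301093859366)) // 2558976038
    }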

What do you think?

Correction, this doesn't seem to be true - NET-SNMP already supports counter64 and integer64; it was added quite some time ago:

haad/net-snmp@dca6c16

If I make this change in a development branch - would you be willing to compile and run this to verify that it actually works?
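
In pass_persist terms the change should be small: when snmpd issues a GET, the script replies with three lines (OID, type, value), and the type line becomes "counter64" instead of the 32-bit "counter". A stripped-down sketch of such a reply (not the actual tc_reader code):

    package main

    import (
        "fmt"
        "io"
        "os"
    )

    // replyGet writes a pass_persist GET response: three lines holding
    // the OID, the type keyword, and the value. "counter64" is the type
    // keyword enabled by the NET-SNMP change linked above.
    func replyGet(w io.Writer, oid string, sentBytes uint64) {
        fmt.Fprintf(w, "%s\ncounter64\n%d\n", oid, sentBytes)
    }

    func main() {
        // OID from this thread's snmpwalk; byte value from tc.out.txt.
        replyGet(os.Stdout, ".1.3.6.1.4.1.2021.255.4.1", 3301093859366)
    }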

Could the doc in your link be outdated?

I found these:
https://sourceforge.net/p/net-snmp/patches/737/
https://gist.github.com/FransUrbo/a2bfee606ffda0b7b81e

Probably we can try something like uint64, which I found to be the Counter64 equivalent.
I'll give it a try as soon as possible.

I've just read your last comment... Yes, I can give it a try tomorrow.

Perfect, I will make the change in a separate branch and post its name here once done.

Please take a look at #9 and test it locally if you have the time.

I will merge it in once you confirm that this still works.

Hi mum4k,
snmpwalk now seems ok:
tc.out_#9.txt

The Cacti graphs are still wrong, though. Do you know if it's necessary to change something for Counter64?

Thank you for confirming that.

As far as I remember, Cacti supports Counter64 natively. However, since these used to be Counter32s, maybe the data sources need to be recreated.

Can you try deleting the graph and all related data sources and creating them again? Let's see if that helps.

Ok, deleted and re-created.
Classes and their leaf sfq qdiscs now seem in sync (equal), but some inner classes and the root class are wrong.
As you can see in the screenshot below, 1:10 is not the sum of 1:100 and 1:101, and 1:1 does not even match the sum of 1:10 (+ 1:20, etc.); the values are wrong.

[screenshot]

Below is what I see in snmpwalk. Note that lan:1:0 is not present on the graph. Can this shift the classes/qdisc values across data sources / graphs? Why is there no Data Source for lan:1:0?

UCD-SNMP-MIB::ucdavis.255.3 = STRING: "tcNameLeaf" UCD-SNMP-MIB::ucdavis.255.4 = STRING: "sentBytesLeaf"
UCD-SNMP-MIB::ucdavis.255.3.1 = STRING: "lan:1:0" UCD-SNMP-MIB::ucdavis.255.4.1 = Counter64: 166444281299
UCD-SNMP-MIB::ucdavis.255.3.25 = STRING: "lan:1:1" UCD-SNMP-MIB::ucdavis.255.4.25 = Counter64: 166444363653
UCD-SNMP-MIB::ucdavis.255.3.15 = STRING: "lan:1:10" UCD-SNMP-MIB::ucdavis.255.4.15 = Counter64: 130681617593
UCD-SNMP-MIB::ucdavis.255.3.14 = STRING: "lan:1:101" UCD-SNMP-MIB::ucdavis.255.4.14 = Counter64: 8091208
UCD-SNMP-MIB::ucdavis.255.3.16 = STRING: "lan:1:100" UCD-SNMP-MIB::ucdavis.255.4.16 = Counter64: 130673526385
UCD-SNMP-MIB::ucdavis.255.3.2 = STRING: "lan:100:0" UCD-SNMP-MIB::ucdavis.255.4.2 = Counter64: 130673453473
UCD-SNMP-MIB::ucdavis.255.3.3 = STRING: "lan:101:0" UCD-SNMP-MIB::ucdavis.255.4.3 = Counter64: 8091208
...
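
Incidentally, a quick check of the values pasted above shows the SNMP side is internally consistent (lan:1:100 + lan:1:101 equals lan:1:10 exactly), which suggests the remaining mismatch is on the Cacti side:

    package main

    import "fmt"

    func main() {
        // Counter64 byte values from the snmpwalk output above.
        c100 := uint64(130673526385) // lan:1:100
        c101 := uint64(8091208)      // lan:1:101
        c10 := uint64(130681617593)  // lan:1:10, their parent class

        // The parent equals the sum of its leaves, so the SNMP data
        // itself honors the expected invariant.
        fmt.Println(c100+c101 == c10) // true
    }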

I am sorry Roxyrob, but since I don't have a working Cacti setup I have limited options.

If there is anything else you can think of - please try to do your best and debug this in Cacti. You can turn on debug logging and see what value is actually read from the SNMP server and what is recorded into the RRDs. Feel free to post the logs here.

Also please try to investigate if there is a need to set something specifically to support 64 bit counters. Look at the existing templates and data sources that support 64 bit counters and come packaged with Cacti. I remember there is one for interface traffic. Try to cross-compare with our templates.

If all else fails I can try to set this up at home, once I get some free cycles.

One more thing occurred to me.

I think there is one more location where Cacti caches SNMP data, which might explain some of the inconsistencies you have reported.

In your Cacti console, go to Devices, then to the monitored device, then scroll down to "Associated Data Queries" and click the "Verbose Query" link.

In the output shown after clicking this, you can also verify whether all the classes are visible / detected.

After this, I would again try deleting and recreating the graph and its data sources.

Please let me know if this helped and / or if you have found anything in Cacti logs or config.

Refreshed with Verbose Query. Now I can see 1:0, and all classes and qdiscs seem ok, including the sums and byte values.

Thank you mum4k !

Great news! Thank you for confirming this.

I will work on merging #9 in.