xrmx / bootchart

merge of bootchart-collector and pybootchartgui

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Does not produce a correct bootchart when kernel is built with CONFIG_TASKSTATS=n

Max-E opened this issue · comments

The README claims that bootchart2 will work with CONFIG_TASKSTATS=n, just not as well. However, when I try to generate charts on systems with CONFIG_TASKSTATS=n, I get a variety of wrong behaviors, depending on the version of bootchart2.

This is relevant when trying to instrument embedded Linux images generated by Yocto. See Yocto bug 5893:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=5893
Yocto can be configured to enable CONFIG_TASKSTATS, but it's off by default, so ideally we'd like bootchart2 to at least work partially without CONFIG_TASKSTATS.

The bug seems to go back a very long way. On versions starting with 0.14.2 all the way up through current master, the bootchart is completely blank. Example: (very tall image, just scroll past it to read the rest)
bootchart_master_notask

On versions from 0.12.3 up through 0.14.0, the bootchart shows only the first five seconds, which is not correct. Example: (very tall image, just scroll past it to read the rest)
bootchart_0 12 3_notask

In versions prior to 0.12.3, a very small (~20 KB) bootchart.tgz is created, and pybootchartgui complains that it does not contain a valid bootchart.

Meanwhile, in some versions, there are other bugs that seem to have come and gone but which stop me from getting proper test results.

For reference, a correct boot chart for the Yocto image I'm testing looks like this:
bootchart_0 14 0_withtask

I am able to get correct boot charts for all versions tested if CONFIG_TASKSTATS is enabled.

Complete results:

bootchart2 version CONFIG_TASKSTATS=y CONFIG_TASKSTATS=n
0.0.4 Not tried No bootchart.tgz created
0.1.0 Not tried Parse error: empty state: '/home/maxtothemax/bootchart_0.1.0_notask.tgz' does not contain a valid bootchart
0.11.0 Not tried Parse error: empty state: '/home/maxtothemax/bootchart_0.11.0_notask.tgz' does not contain a valid bootchart
0.11.4 Correct bootchart Parse error: empty state: '/home/maxtothemax/bootchart_0.11.4_notask.tgz' does not contain a valid bootchart
0.12.0 Not tried Won't compile due to missing common.h
0.12.1 Not tried Parse error: empty state: '/home/maxtothemax/bootchart_0.12.1_notask.tgz' does not contain a valid bootchart
0.12.2 Correct bootchart Parse error: empty state: '/home/maxtothemax/bootchart_0.12.2_notask.tgz' does not contain a valid bootchart
0.12.3 Not tried Only first five seconds shown
0.12.6 Correct bootchart Only first five seconds shown
0.14.0 Not tried Only first five seconds shown
0.14.1 Not tried Blank bootchart
0.14.2 Not tried Blank bootchart
0.14.3 Not tried "Can't extract boot chart from collector"
0.14.4 Correct bootchart Blank bootchart
0.14.5 Correct bootchart Blank bootchart
Current master Correct bootchart Blank bootchart

So as you can see, I could not find a single version of bootchart2 which actually produced a correct bootchart with CONFIG_TASKSTATS=n, out of any of the ones I tested.

Fixing this bug could be as simple as rewriting the README to indicate that CONFIG_TASKSTATS is mandatory, if you do not intend to support kernels without it.

I can get you bootchart.tgz and bootchart.png files for all of these tests if you want them.

One thing that would be useful to check is if the latest pybootchartgui is able to parse older than 0.14.0 tarballs to see if we have the first five seconds issue there too or so we regressed in the meantime. Could you please upload all these notask tarball somewhere please?

Sure, here's everything I have:
https://drive.google.com/file/d/0B_VvBSJusoCTcV9DZ3JuQ1VFZkE/edit?usp=sharing

EDIT: amended the first post, I'd accidentally used the wrong images.

So here's what i found:

  • at least in 0.14.4 not task tgz bootchart thinks that taskstat is available when it's not so taskstats data is collected but the end result is an empty file, that explains the blank bootchart.
  • the parse error empty state error of the early 0.12.x is because we have just one sample in a few file, don't know the root cause but it's ok that pybootchartgui fails if the data is incomplete
  • Haven't looked at the 5 seconds chart yet

So if you build bootchart-collector with this patch applied, what does calling it from a shell print with a kernel without taskstat?

diff --git a/collector/collector.c b/collector/collector.c
index 9c4a16a..855f6e8 100644
--- a/collector/collector.c
+++ b/collector/collector.c
@@ -464,6 +464,7 @@ static int get_family_id(int sd)
        if (na->nla_type == CTRL_ATTR_FAMILY_ID) {
                id = *(__u16 *) NLA_DATA(na);
        }
+       printf("family id %d\n", id);
        return id;
 }

@@ -829,6 +830,9 @@ int main (int argc, char *argv[])
                return 0;
        }

+       printf("use taskstat: %d\n", init_taskstat());
+       return;
+
        if (enter_environment (args.console_debug))
                return 1;

The result is one line: use taskstat: 1

So init_taskstat is returning 1, and get_family_id is returning prior to the last line.

The reason for this problem is that the global variable netlink_taskstats_id is being declared as __u16, an unsigned type. So even if get_family_id returns a negative number to indicate an error, the check for netlink_taskstats_id > 0 on line 488 won't catch it.

I have written a patch for this bug and will submit it shortly. It seems to resolve all remaining issues.

I do get valid bootcharts without TASKSTATS now:
bootchart_patched
In fact, this chart seems more complete, because it includes the first five seconds or so. However, I am seeing a lot of extra pidof and bootchart-collector processes in this chart, compared to the ones with CONFIG_TASKSTATS=y. Is that by design?

In fact, this chart seems more complete, because it includes the first five seconds or so. However, I
am seeing a lot of extra pidof and bootchart-collector processes in this chart, compared to the ones
with CONFIG_TASKSTATS=y. Is that by design?

Nope, it's a bug :) That process should not appear in the graph, see elide_bootchart in pybootchartgui/draw.py

I don't think I'm really able to troubleshoot that at this time, due to time constraints. But you should be able to force get_family_id in collector.c to return 0 to simulate no TASKSTATS, so you can reproduce the problem yourself and troubleshoot it.

Sure i have the data i can reproduce and hopefully fix, i'd just like to make clear it's a bug :)

Moving specific issues with collector processes not merged to #56