google / grr

GRR Rapid Response: remote live forensics for incident response

Home Page: https://grr-doc.readthedocs.io/

GRR Client Crashes panel (in AdminUI dashboard) does not work as expected

tsehori opened this issue

As mentioned in #787, the GRR Client Crashes panel, which is part of the AdminUI dashboard, should behave as follows: when GRR clients are not crashing, the graph shows a flat line. The problem is that even when GRR clients do crash, the graph remains a flat line, where we would expect a "spike" indicating that something went wrong.
[Screenshot from 2020-06-24 07-10-39: the GRR Client Crashes panel in the AdminUI dashboard]
I suspect that the issue is not with the query implemented in #787, but that client crashes are not being recorded correctly in the Prometheus stats collector.
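
One way to check whether the counter is moving at all, independently of the AdminUI panel, is to read it straight from the server's Prometheus metrics endpoint. A minimal sketch, assuming metrics are exported over HTTP; the port below is only an example - use whatever Monitoring.http_port is set to in your config:

```python
# Sketch only: poll the server's /metrics endpoint and print the raw value of
# grr_client_crashes before and after forcing a client crash.
import requests

METRICS_URL = "http://localhost:44451/metrics"  # example port, adjust to your setup


def read_counter(name):
    """Returns the current value of a Prometheus counter, or None if absent."""
    text = requests.get(METRICS_URL, timeout=5).text
    for line in text.splitlines():
        if line.startswith(name + " ") or line.startswith(name + "{"):
            return float(line.split()[-1])
    return None


print("grr_client_crashes =", read_counter("grr_client_crashes"))
```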

I managed to trigger the metric grr_client_crashes after:

  • Scheduling Timeline at '/'
  • Crashing the Fleetspeak GRR client
  • Restarting the Fleetspeak client. This gives the following error: E0702 10:08:10.269989 1695972 system_service.go:250] Unable to get revoked certificate list: unable to retrieve file, last attempt failed with: failed with http response code: 404

The crash was triggered (both in the AdminUI and in the metric) only when I tried to start the Fleetspeak client again, not before.
@mbushkov what do you think? Is it by design and the issue can be closed, or should we further investigate?

The revoked certificate error is harmless - it is logged as an error because it really shouldn't happen on a prod system: there should at least be an empty revocation file in the database. But it causes no loss of functionality.

Come to think of it, we should maybe adjust the db schema code to add such a file on table creation, just to prevent this sort of confusion.
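
Something along these lines, sketched against a throwaway SQLite schema - the real Fleetspeak datastore and its table and file names will differ, so everything below is hypothetical; the point is only that seeding an empty revocation file at schema-creation time turns the 404 into an empty (and silent) fetch:

```python
# Hypothetical sketch: seed an empty revoked-certificate file when the schema is
# created, so a fresh install never 404s on the revocation-list fetch. Table and
# file names are made up and do not match the real Fleetspeak datastore.
import sqlite3

conn = sqlite3.connect("fleetspeak_demo.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS server_files ("
    "  service TEXT NOT NULL,"
    "  name TEXT NOT NULL,"
    "  data BLOB NOT NULL,"
    "  PRIMARY KEY (service, name))"
)
# Seed an empty revocation list so clients asking for it get empty data, not a 404.
conn.execute(
    "INSERT OR IGNORE INTO server_files (service, name, data) VALUES (?, ?, ?)",
    ("system", "RevokedCertificates", b""),
)
conn.commit()
```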

@bgalehouse you're right, this error persists whether or not the client has crashed, so it's not related to this issue. And that sounds like a good idea - surely I'm not the only one to get confused by it 😄

Regarding the problem in question, I'm actually quite surprised that restarting FS has an effect. In normal operation, after GRR dies, FS should restart it. The new GRR client then decides whether it should report a client crash - and it might only do so if the old client was in the middle of some client action when it died.

So I suggest not restarting FS: just kill GRR during an operation and give FS a little time to notice and restart it. If this does not work, it would be interesting to see the resulting FS client log file.

The way I restarted FS is by killing the GRR FS client and then starting it again. The FS server was running the entire time. Is this what you mean?

There are two processes on the endpoint.

  • fleetspeak(d), which is run by init.d, is written in Go, is part of google/fleetspeak, and is what I usually think of as the fleetspeak client process
  • grr, which is run by fleetspeak, is written in Python, is part of google/grr, and is what I usually think of as the grr client process, or the fleetspeak-enabled grr client.

From the error message, you restarted the first and were looking at its log file. My claim is that killing the second and letting the first restart it should work - if it doesn't, that is something to debug.

Killing the first and letting init.d restart it should also work - the newly restarted fleetspeak process will start a new grr process, and the old grr process will die. But there are more things that can go wrong, so I'd look at the first sequence first.

(Also, the first sequence is more common in practice - GRR is more likely to crash on its own than FS)
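
For reference, the first sequence can be scripted roughly like this. This is only a sketch: it assumes psutil is installed on the endpoint and that the process name fragments below match the actual binaries - adjust as needed:

```python
# Rough sketch: kill only the grr client process and check that the running
# fleetspeak daemon respawns it. Process name fragments are assumptions.
import time

import psutil


def find_proc(name_fragment):
    """Returns the first running process whose name contains name_fragment."""
    for proc in psutil.process_iter(["name"]):
        if name_fragment in (proc.info["name"] or ""):
            return proc
    return None


fleetspeak = find_proc("fleetspeak")   # the daemon started by init.d
grr = find_proc("grr")                 # the python client started by fleetspeak
assert fleetspeak and grr, "could not find both processes"
assert grr.ppid() == fleetspeak.pid, "expected grr to be a child of fleetspeak"

old_pid = grr.pid
grr.kill()                             # kill only grr, leave fleetspeak running

time.sleep(30)                         # give fleetspeak time to notice and respawn it
new_grr = find_proc("grr")
print("grr restarted:", new_grr is not None and new_grr.pid != old_pid)
```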

Ok, I understand. All my previous comments referred to killing the fleetspeak(d) process; when I kill it, it does not restart the second process.

When I kill the GRR process that is run by FS, it is indeed immediately restarted by FS and the AdminUI registers a client crash, so there is nothing to debug there. In all my previous attempts I killed the FS process alone; this is probably why I didn't manage to trigger the metric.

There is only one thing I'm still wondering about: the metric grr_client_crashes is only triggered when the GRR client crashes during the execution of a flow. That is, when I crash the GRR client while no flows are running, the metric is not triggered.
Is this intended by design, or something to look at?
Thank you for all the help @bgalehouse !

Regarding design intent, I think we limit reports to situations in which GRR was doing something, in order to reduce false positives. If GRR is just waiting for input and dies, it is more likely to be an ongoing system shutdown or similar than a bug. (This heuristic obviously isn't perfect, but it might be the best we have.)

Regarding what happens when you kill FS: it should restart GRR fairly quickly, and a log file from a case where it does not would be interesting. In that scenario, the new GRR should notice the breadcrumb left by the previous GRR process indicating that it was busy doing X, and therefore report that GRR crashed while doing X. However, this might take longer and could be a more complex sequence to debug (e.g., there might be race conditions if the old GRR runs for a bit after the FS process dies).
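
To make the breadcrumb idea concrete, here is a rough sketch of the mechanism as described above - illustrative only, not GRR's actual transaction-log code, and the file path and format are made up:

```python
# Illustrative only: a hand-rolled version of the breadcrumb mechanism described
# above, not GRR's real implementation. Path and format are hypothetical.
import json
import os

BREADCRUMB = "/var/lib/grr_client/active_action.json"  # hypothetical path


def start_action(action_name):
    """Records that the client is busy, just before running a client action."""
    with open(BREADCRUMB, "w") as f:
        json.dump({"action": action_name}, f)


def finish_action():
    """Clears the breadcrumb once the action completes cleanly."""
    if os.path.exists(BREADCRUMB):
        os.remove(BREADCRUMB)


def report_crash_if_needed(send_crash_report):
    """Called on startup: a leftover breadcrumb means the old client died mid-action."""
    if not os.path.exists(BREADCRUMB):
        # The old client was idle when it died, so no crash is reported -
        # matching the behaviour observed earlier in this thread.
        return
    with open(BREADCRUMB) as f:
        crashed_action = json.load(f).get("action")
    send_crash_report(crashed_action)  # this is what would bump grr_client_crashes
    os.remove(BREADCRUMB)
```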