Data is missing from time to time

AlfaJackal opened this issue 4 years ago

As you can see in the image, there is no data from time to time. Are you experiencing the same?

If yes: Do you know how to fix it? For normal graph visualizations I can fill those gaps. Unfortunately, I don’t know how to close the gaps on a single stat panel.

Robert Jacob · Answer 1 · Fri Jun 19 2020 23:21:42 GMT+0800 (China Standard Time)

Hi @AlfaJackal , thanks for the message.

I have noticed this issue coming up more often now as well, but I have not looked into it yet. I'm planning to update the exporter soon and will probably look into this issue while I'm at it.

Robert Jacob · Answer 2 · Sun Jun 21 2020 22:44:09 GMT+0800 (China Standard Time)

I've added some more logging to the new version; I don't know why I omitted the error log previously. You can run the new version as well if you want to have a look at the error.

Unfortunately (?) I have not hit an error yet with the logging in place, so I don't have any information on why it fails.

AlfaJackal · Answer 3 · Sun Jun 21 2020 23:54:13 GMT+0800 (China Standard Time)

I am receiving these error messages now with the newest version! Sorry for posting a screenshot! Where do I find the log in the docker container?

Robert Jacob · Answer 4 · Mon Jun 22 2020 00:10:03 GMT+0800 (China Standard Time)

Sorry, apparently I forgot to wire up the last change correctly and did not test with an up-to-date build. This new issue should be fixed now.

The log is not written to any file, so you cannot download it as a file from the container.

The screenshot seems to be from the Docker front-end of a Synology. There's an "export" button to the top-left of the log.

AlfaJackal · Answer 5 · Mon Jun 22 2020 01:15:57 GMT+0800 (China Standard Time)

First time I've noticed that button. 😜
It is up and running again.

AlfaJackal · Answer 6 · Tue Jun 23 2020 20:41:12 GMT+0800 (China Standard Time)

I have been running it since 21 June and have had five errors so far with
ERRO Error getting data: Bad HTTP return code 500

Anything else I can provide you with?

Maybe this is related to Netatmo servers? It seems that they are down very often, but I cannot validate the times.

Robert Jacob · Answer 7 · Wed Jun 24 2020 08:28:15 GMT+0800 (China Standard Time)

I see the same error message. It seems to be followed by the NetAtmo API returning old data (this triggers the "stale data" logs) until it fixes itself again.
I'll soon release another version which will also show the error message returned by NetAtmo, but I'm pretty sure the issue is on their end, as we seem to be getting wrong data. I also had an issue yesterday where the API did not return any data for my stations for a few hours.

-- edit: I have already been working on changing the behaviour of the exporter a bit so that it caches the data internally for a while. This reduces the load on the NetAtmo API if the query interval of Prometheus is not set to an extended value. It will probably also reduce the impact of this error.

Robert Jacob · Answer 8 · Sun Jun 28 2020 00:15:18 GMT+0800 (China Standard Time)

Unfortunately, the responses from the Netatmo API only contained a JSON body with the internal service error message encoded in it and no further information. As the errors seem to be more frequent around midnight (UTC+2), my guess is that something is producing additional load on the API during that time.

I've just merged the caching code into master; if you like, you can also test this version. This will not "fix" the issue, as the cause is on the side of the Netatmo API itself, but it should make it less pronounced in the metrics, because the exporter will no longer try to fetch new data all the time and will instead use the old data it already has (until the data is old enough to be considered "stale").
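As a rough illustration of the caching idea described above (this is not the exporter's actual code), the following Python sketch serves the last successful result and only contacts the API again once a refresh interval has passed. The class name `CachedFetcher`, the `refresh_interval` parameter, the injected `clock`, and the `up` and `updated_time` attributes are all assumptions made for this example; the attributes only loosely mirror the netatmo_up and netatmo_cache_updated_time metrics.

```python
import time
from typing import Callable, Optional


class CachedFetcher:
    """Serve cached data and refetch only after refresh_interval seconds (hypothetical helper)."""

    def __init__(self, fetch: Callable[[], dict], refresh_interval: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch
        self._refresh_interval = refresh_interval
        self._clock = clock
        self._data: Optional[dict] = None
        self.updated_time: Optional[float] = None  # time of the last successful refresh
        self.up = False  # whether the most recent fetch attempt succeeded

    def get(self) -> Optional[dict]:
        now = self._clock()
        due = (self.updated_time is None
               or now - self.updated_time >= self._refresh_interval)
        if due:
            try:
                self._data = self._fetch()
                self.updated_time = now
                self.up = True
            except Exception:
                # A failed fetch (for example an HTTP 500) keeps the old data in place;
                # only the "up" flag records the failure.
                self.up = False
        return self._data
```

With this shape, a single failed request does not create a gap: callers keep receiving the previous data, and the next call retries because `updated_time` was not advanced. Deciding when that old data is too old to use is a separate check.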

You can still track whether the updates work using the netatmo_up metric. The new netatmo_cache_updated_time should periodically increase showing when the data is actually updated.

AlfaJackal · Answer 9 · Sun Jun 28 2020 02:15:50 GMT+0800 (China Standard Time)

Awesome, will give it a shot! First thing I created in Grafana was a netatmo_up graph. 😉 And a netatmo_cache_updated_time stat card.

Robert Jacob · Answer 10 · Fri Jul 03 2020 23:01:11 GMT+0800 (China Standard Time)

There seems to be a large "drift" in the age of the sensor data provided by the Netatmo API during the night. netatmo_sensor_updated reads the timestamp of the data as provided from the API and as far as I understand this should ideally always be below 10min, as the data in the API is updated about every 10min (per their documentation).

I'm measuring >50min age of the sensor data during the night (GMT+2), though. I'm using this query to identify the drift between the time the cache was updated and the time of the sensor data (result in minutes):

avg(scalar(netatmo_cache_updated_time) - netatmo_sensor_updated) / 60

For me this produces a graph which increases every 12h with the peaks around 10:00 GMT and 23:00 GMT.
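To make the arithmetic of the query above concrete, here is a small Python sketch of the same calculation: subtract each sensor's update timestamp from the cache update timestamp, average the differences, and convert seconds to minutes. The function name and the timestamp values are made up for illustration and are not taken from real data.

```python
from typing import List


def average_drift_minutes(cache_updated_time: float,
                          sensor_updated_times: List[float]) -> float:
    """Mean age of the sensor data relative to the cache update, in minutes."""
    ages = [cache_updated_time - ts for ts in sensor_updated_times]
    return sum(ages) / len(ages) / 60


# Hypothetical Unix timestamps: three sensors updated 5, 10 and 15 minutes
# before the cache was refreshed.
cache_updated = 1593800000.0
sensors = [cache_updated - 300, cache_updated - 600, cache_updated - 900]
drift = average_drift_minutes(cache_updated, sensors)  # 10.0 minutes
```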

The exporter ignores old sensor data by discarding information where the age is larger than the "stale duration". I had previously set the default for this to 30min, which seemed to work over the past years. I've now increased the default to 60min to account for the drift observed this week.
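The effect of the stale duration can be sketched in Python as follows. This is only an illustration under assumed names, not the exporter's implementation: `drop_stale`, the reading dictionaries, and the module names are hypothetical. A reading is kept only if its age does not exceed the stale duration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def drop_stale(readings: List[Dict], now: datetime,
               stale_duration: timedelta = timedelta(minutes=60)) -> List[Dict]:
    """Keep only readings whose age does not exceed stale_duration."""
    return [r for r in readings if now - r["updated"] <= stale_duration]


# Hypothetical readings with ages of 8, 52 and 75 minutes.
now = datetime(2020, 7, 3, 23, 0, tzinfo=timezone.utc)
readings = [
    {"module": "indoor", "updated": now - timedelta(minutes=8)},
    {"module": "outdoor", "updated": now - timedelta(minutes=52)},
    {"module": "rain", "updated": now - timedelta(minutes=75)},
]
fresh_60 = [r["module"] for r in drop_stale(readings, now)]
fresh_30 = [r["module"] for r in drop_stale(readings, now, timedelta(minutes=30))]
```

In this example a reading that is 52 minutes old survives the 60min limit but would have been discarded under a 30min limit, which is the kind of gap a >50min drift would cause.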

What bothers me is that this "age drift" should also be visible in the data itself (sensor values displaying old data), but I have not seen this yet, so my assumption is that there is some kind of caching bug in the Netatmo API itself. This would also be an explanation for the HTTP 500 results that are returned sometimes.

Can you tell me if you also have a similar age graph in your data and if the increase in the stale duration fixes the display issues? The stale duration is also a configuration option if the new default is still not enough.

AlfaJackal · Answer 11 · Sun Jul 05 2020 04:38:38 GMT+0800 (China Standard Time)

It seems that I have a very similar graph. Please have a look at mine, which is based on your query:

Robert Jacob · Answer 12 · Tue Jul 07 2020 00:31:31 GMT+0800 (China Standard Time)

For me the maximums of the graph are much higher, at least for the previous weeks. Over the last few days the reported times have stayed much closer to the interval I would have expected (which is what I am seeing in your graph as well; everything below 15min).

I had another big spike in the time drift on 2020-07-05, starting at ~10:00 GMT and ending at ~15:00.

I don't know whether the exporter can do anything if the data it gets is bad. I wonder why the API reports such old updated times even though the data seems to be newer.

Do you have any other idea or is the current version working "good enough" for you?

AlfaJackal · Answer 13 · Fri Jul 17 2020 04:05:46 GMT+0800 (China Standard Time)

I think this version is good enough! I monitored it over the last few days and it looks pretty good so far in comparison to my other Netatmo Exporter for InfluxDB. Thank you!

One last question, a bit off-topic: How do you calculate netatmo_sensor_rain_amount_mm for 24h? It seems that the amount in mm is not comparable to the one shown in the Netatmo app.

Robert Jacob · Answer 14 · Sun Aug 02 2020 19:05:31 GMT+0800 (China Standard Time)

Sorry, I forgot about the question regarding "rain". The collector does not do any calculation on the value returned from the Netatmo API (see here). Unfortunately for me, the Netatmo documentation is also not very detailed on what number is stored in that property (see here). I don't have a rain (or wind) sensor myself, so unfortunately I cannot check this.

If you can provide me with some examples, I can maybe come up with some improvements; perhaps it's simply the wrong value to use. I'd say this would be a discussion for a new issue, though.