Server metrics and monitoring

Question

steinbro opened this issue 9 months ago · comments

Daniel W. Steinbrook commented 9 months ago

In addition to serving tiles, the backend exposes two more API endpoints:

/metrics that reports some performance and usage stats in Prometheus format
/probe/alive that can be pinged to monitor uptime

We're not utilizing either of these, but we should be. @RDMurray, do you have any favorite dashboard or monitoring tools? Not talking about full-blown error monitoring like Sentry, just something that can render the metrics histograms and something else that can fire off an email when the heartbeat isn't responding.

Actually, now that I look at the metrics, it does report that the alive endpoint has been pinged a few thousand times since it was spun up a few weeks ago -- is this being polled by something?
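For anyone who wants to check a count like that without a dashboard, here is a minimal sketch of reading it from the Prometheus text format in Python. The metric name `http_requests_total`, its `path` label, and the sample numbers are assumptions for illustration; the names the backend actually exposes at /metrics may differ, and the parser only covers the simple `name{labels} value` lines, not the full exposition format.

```python
import re

# Hypothetical sample of Prometheus text output. The real metric names,
# labels, and values served by the backend's /metrics endpoint may differ.
SAMPLE_METRICS = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{path="/probe/alive",status="200"} 4312
http_requests_total{path="/metrics",status="200"} 17
"""

# A sample line looks like: metric_name{label="value",...} 123
_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from Prometheus text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def total_for_path(samples, metric, path):
    """Sum every sample of `metric` whose `path` label equals `path`."""
    return sum(
        value
        for name, labels, value in samples
        if name == metric and labels.get("path") == path
    )


if __name__ == "__main__":
    samples = parse_metrics(SAMPLE_METRICS)
    print(total_for_path(samples, "http_requests_total", "/probe/alive"))
```

Because the output is plain numbers rather than a chart, something along these lines would also work with a screen reader.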

RDMurray · Answer 1 · Fri Nov 03 2023 22:13:32 GMT+0800 (China Standard Time)

I don't have any favourite monitoring tools, but I'll look into it. Sentry does have a generous free plan for open source organizations, so that might be worth looking into. Having said that, I know very little about Sentry apart from the fact that it is popular. I have no sight at all, so I can't really give an opinion on software that renders histograms or helps to visualise data.

I'm currently monitoring the alive endpoint with Uptime Robot, which sends an email if it is down. It also has a mobile app with push notifications.
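As a rough illustration of what this kind of heartbeat check involves, here is a minimal Python sketch. It is not how Uptime Robot is implemented; the function names, the timeout, and the "three failures in a row" rule are all assumptions chosen for the example.

```python
import urllib.request


def is_alive(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a GET on `url` answers with a 2xx status in time.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx), and timeouts are all OSError subclasses.
        return False


def should_alert(results, threshold=3):
    """Alert only after `threshold` consecutive failures (an assumed policy).

    `results` is the history of check outcomes, oldest first.
    """
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A scheduler (cron, for example) would call `is_alive` on the /probe/alive URL every few minutes, append the result to a history list, and send the email once `should_alert` returns True. Requiring several consecutive failures avoids a notification for a single dropped request.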

Daniel W. Steinbrook · Answer 2 · Sat Nov 04 2023 23:56:25 GMT+0800 (China Standard Time)

Thanks. For Uptime Robot, is there a way to add more users to the account, or is it easiest to just create my own monitor if I want notifications? It also looks like you can create a basic uptime page for free as well; should we do so, at least for our own maintenance purposes?

I did try tinkering with Grafana Cloud, which also provides some customizable alerts, but the documentation states "Grafana Cloud won’t accept a public URL that is not protected by authentication," ostensibly for some security reason. I'd rather keep our metrics page open, though I suppose we could have a second password-protected URL if we really wanted to make Grafana Cloud happy. But I suppose visual dashboards wouldn't be super useful for this crowd, anyway.

RDMurray · Answer 3 · Sun Nov 05 2023 23:22:01 GMT+0800 (China Standard Time)

The free tier of Uptime Robot doesn't allow adding users. Hopefully it will allow you to create a monitor for the same site. Even if they don't allow that, it might work anyway because I am still monitoring newprod0.openscape.io.

There is an open source uptime monitoring service, Upptime, which uses GitHub Actions and Issues. It can poll every 5 minutes. I find the idea of spinning up a VM every 5 minutes just to do an HTTP request kind of horrible, but presumably GitHub is okay with it, so we could possibly use that.

Daniel W. Steinbrook · Answer 4 · Sat Nov 18 2023 20:31:39 GMT+0800 (China Standard Time)

@RDMurray What are your thoughts on Glitchtip? It uses the Sentry API but has a much simpler UI. It also has a hosted free tier.

RDMurray · Answer 5 · Tue Nov 28 2023 23:06:57 GMT+0800 (China Standard Time)

I have played with Glitchtip a bit, making a test project and sending some events and metrics. It certainly is a much simpler UI. Issues are logged in detail.

The performance monitoring seems to be very simplistic, though. I can only see the number of events and the average duration. That is with a screen reader; I don't know if there is a graph.

The free tier is only 1000 events per month and I can't see any additional offers for open source projects.

I think the Sentry open source free tier is too good to pass up, provided there are no showstoppers with the UI.

RDMurray · Answer 6 · Wed Nov 29 2023 00:54:54 GMT+0800 (China Standard Time)

I also created a test project on sentry.io. It is much more comprehensive, and looks very accessible so far. Once we have a dashboard or two and some alerts set up, it should be quite usable.

RDMurray · Answer 7 · Thu Nov 30 2023 21:10:16 GMT+0800 (China Standard Time)

Related to this issue, I set up an uptime monitoring service, Uptime Kuma, at uptime.mur.org.uk, which can currently be accessed by the team @soundscape-community/backend. There is a public status page at soundscape-status.mur.org.uk.

It is self-hosted, but simple enough to manage and not a critical service.

Let me know what you think.

Daniel W. Steinbrook · Answer 8 · Fri Dec 01 2023 21:10:05 GMT+0800 (China Standard Time)

Nice! Together with the Slack integration, I'm satisfied with this level of monitoring for the tile service. I'd say the next priority is monitoring the ingest service (#71), and since you suggested here that we use Sentry for that, I'll leave this issue open.