Prefect cloud is sending false-positive alerts on queue being unhealthy

jayhack opened this issue a year ago

First check

I added a descriptive title to this issue.
I used the GitHub search to find a similar issue and didn't find it.
I searched the Prefect documentation for this issue.
I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

As users of Prefect Cloud v2, we have an automation set up to alert us when a queue stays unhealthy for over 2 hours. Yesterday we suddenly started receiving false-positive alerts from Cloud, and they are still arriving at the time of writing.

Only a single queue is affected
The queue is polled by a single agent
The same agent that polls the affected queue also polls other queues that report no problems
The agent reports no errors of any kind
While the automation reports the queue as "unhealthy" for over 2 hours
... the queue shows as healthy in the Prefect UI
... the agent reports no errors of any kind in its stdout
... we have hourly jobs scheduled in the queue that is reported as unhealthy for over 2 hours, and those flows run without any issues

To us, it looks like Cloud is misfiring alerts and there are actually no issues with the agent's connectivity.

The automation is configured to send an appropriate Slack message.

Details

Account ID: c1397d5f-b9f3-49e8-abb6-bce7d7b1412e
Workspace ID: 32dfe242-315b-4405-b06d-8b6308d6b631
"Affected" queue ID: 09ce5e48-5dc7-47c6-9b2c-f07659024110

Example Slack alert from automation

❗️ What is also curious is that when the notification arrives, the last polled time in the message is the same as the time the message was received! There is no 2-hour buffer as per the configuration of the automation (see above screenshot). The message was received at 19:18 CET (UTC+1), while the notification states the last polled time was 18:18 UTC, which is the same instant once we correct for the timezone.
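
The timestamp comparison above can be checked with a short sketch. It assumes the alert in question is the one from 2023-03-20 (the first entry under Additional context) and that the local clock was at UTC+1 on that date, since that is the offset under which 19:18 local equals 18:18 UTC. The variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the local offset on the alert date was UTC+1 (illustrative).
LOCAL = timezone(timedelta(hours=1))

# Time the Slack message arrived, in local time (from the report).
received_local = datetime(2023, 3, 20, 19, 18, tzinfo=LOCAL)
# "Last polled" time shown in the notification, in UTC (from the report).
last_polled_utc = datetime(2023, 3, 20, 18, 18, tzinfo=timezone.utc)

# Gap between the last poll and the alert; the automation is set to 2 hours.
gap = received_local.astimezone(timezone.utc) - last_polled_utc
configured_window = timedelta(hours=2)

print(f"gap since last poll: {gap}")
print(f"shorter than configured window: {gap < configured_window}")
```

Under these assumptions the gap is zero, so the alert fired at the very minute the queue was last polled.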

There has been absolutely no change on our side for some days, so this is not configuration related.

Reproduction

Reproducing this may be difficult, as only a single queue is reporting issues and it seems to happen sporadically.

# In Prefect Cloud, automation trigger configured as such
Trigger Type: Work queue health
Work Queues: <queue 1>, <queue 2>
Work Queue: Stays in Unhealthy
For: 2 Hours
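
For reference, the behaviour we expect from this trigger can be sketched as below. This is a hypothetical model of the "stays in Unhealthy for 2 hours" rule as we understand it, not Prefect Cloud's actual implementation; the function name `should_alert` and the `unhealthy_since` parameter are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

def should_alert(unhealthy_since, now, window=timedelta(hours=2)):
    """Return True only if the queue has stayed unhealthy for the whole window.

    Hypothetical model of the expected behaviour, not Prefect Cloud's code.
    `unhealthy_since` is None while the queue is healthy.
    """
    if unhealthy_since is None:
        return False
    return now - unhealthy_since >= window

now = datetime(2023, 3, 20, 18, 18, tzinfo=timezone.utc)

# A queue that was polled just now is healthy, so no alert is expected.
print(should_alert(None, now))
# Unhealthy for only 48 minutes: still inside the 2-hour window.
print(should_alert(now - timedelta(minutes=48), now))
# Unhealthy for 2 hours 18 minutes: an alert would be legitimate.
print(should_alert(now - timedelta(hours=2, minutes=18), now))
```

Under this model, an alert whose last polled time equals the alert time should never be sent.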

# Prefect agent
- Runs on an EC2 at AWS
- Ubuntu 22.04.2 LTS
- Prefect agent run via systemd

Error

There is no stack trace or any errors emitted into stdout of the prefect-agent.

Versions

$ prefect --version
2.7.11

$ python --version
Python 3.10.6

$ prefect diagnostics
Usage: prefect [OPTIONS] COMMAND [ARGS]...
Try 'prefect --help' for help.
╭─ Error ──────────────────────────╮
│ No such command 'diagnostics'.   │
╰──────────────────────────────────╯

Additional context

So far we have received the following false-positive messages (times in UTC):

2023-03-20 18:18
2023-03-20 19:19
...
2023-03-20 22:23
...
2023-03-21 01:26
...
2023-03-21 04:29
...
2023-03-21 11:37
2023-03-21 12:38
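
As a rough check on the cadence, the sketch below measures the gap between alerts that are listed back to back above, that is, with no "..." between them. The elided entries are left out on purpose, and the helper name `minutes_between` is only illustrative.

```python
from datetime import datetime

# Pairs of alerts listed back to back in the report (no "..." between them).
adjacent_pairs = [
    ("2023-03-20 18:18", "2023-03-20 19:19"),
    ("2023-03-21 11:37", "2023-03-21 12:38"),
]

def minutes_between(start, end, fmt="%Y-%m-%d %H:%M"):
    """Whole minutes from start to end, both given as UTC strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

gaps = [minutes_between(a, b) for a, b in adjacent_pairs]
print(gaps)
```

Both pairs are about an hour apart, which is shorter than the 2-hour window configured in the automation.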

For context, the initial problem description is in a Slack thread in the Prefect community workspace.