Prefect cloud is sending false-positive alerts on queue being unhealthy
jayhack opened this issue · comments
First check
- I added a descriptive title to this issue.
- I used the GitHub search to find a similar issue and didn't find it.
- I searched the Prefect documentation for this issue.
- I checked that this issue is related to Prefect and not one of its dependencies.
Bug summary
As users of Prefect Cloud v2, we have an automation set up to alert us, when a queue stays unhealthy for over 2 hours. Suddenly yesterday we have started to receive false-positive alerts from cloud that continues to happen to this time.
- Only single queue is affected
- The queue is polled by single agent
- The same agent polling queue reported to having issues also polls other queues that are not reporting problems
- There are no errors of any kind reported by the agent
- While the queue is reported to be "unhealthy" for over 2 hours by the automation
- ... the queue shows as healthy in Prefect UI
- ... agent reports no errors in its stdout of any kind
- ... we have hourly jobs scheduled in the queue reported to be unhealthy for over 2 hours. The flows are run without any issues
To us, it looks like Cloud is misfiring alerts and there actually are no issues with agent's connectivity
The automation is configured to send a appropriate slack message
Details
Account ID: c1397d5f-b9f3-49e8-abb6-bce7d7b1412e
Workspace ID: 32dfe242-315b-4405-b06d-8b6308d6b631
"Affected" queue ID: 09ce5e48-5dc7-47c6-9b2c-f07659024110
Example Slack alert from automation
❗️What is also curious, is that when notification arrives, the message's last polled
time the time of the message received! There is no 2 hour buffer as per configuration of the automation. (see above screenshot) - message was received at 19:18 CEST, while the notification states the last polled time was 18:18 UTC, which is the same time, if we correct for the timezone.
There have been absolutely no change on our side for some days, hence this is not a configuration related.
Reproduction
Reproduction of this may be difficult as only single queue is reporting issues and it seems to happen sporadically.
# In Prefect Cloud, automation trigger configured as such
Trigger Type: Work queue health
Work Queues: <queue 1>, <queue 2>
Work Queue: Stays in Unhealthy
For: 2 Hours
# Prefect agent
- Runs on an EC2 at AWS
- Ubuntu 22.04.2 LTS
- Prefect agent run via systemd
Error
There is no stack trace or any errors emitted into stdout of the prefect-agent.
Versions
$ prefect --version
2.7.11
$ python --version
Python 3.10.6
$ prefect diagnostics
Usage: prefect [OPTIONS] COMMAND [ARGS]...
Try 'prefect --help' for help.
╭─ Error ──────────────────────────╮
│ No such command 'diagnostics'. │
╰──────────────────────────────────╯
Additional context
So far we have received following messages as false positives (Time in UTC)
- 2023-03-20 18:18
- 2023-03-20 19:19
- ...
- 2023-03-20 22:23
- ...
- 2023-03-21 01:26
- ...
- 2023-03-21 04:29
- ...
- 2023-03-21 11:37
- 2023-03-21 12:38
For the sake of context, here is the initial problem description in Slack thread in Prefect community workspace.