UWIT-UE / am2alertapi

Prometheus alertmanager to UW alertAPI



Switch am2alertapi from Flask/gunicorn to Quart/hypercorn

EricHorst opened this issue · comments

The whole world, am2alertapi included, has been stuck with a Flask/eventlet/gunicorn bug for a couple of years. The fix has been merged, but the gunicorn maintainer has not made a release in two years.

This issue is to switch from Flask with gunicorn to Quart with hypercorn.

am2alertapi isn't under high load, but when alertmanager refreshes it does so all at once, so am2alertapi needs to be responsive to bursts. In a not too severe but incident-heavy situation there may be 10-20 active alerts that need to be updated at once. Since am2alertapi uses a sleep to ensure consistency on the remote side, it benefits from concurrency and async: a sleep behaves like an IO-bound wait. Quart should be a good replacement for serving an IO-bound workload.
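To illustrate why a sleep-per-alert handler benefits from async, here is a minimal sketch using only the standard library. It is not the am2alertapi code: `update_alert`, `handle_burst`, and the `SETTLE_SECONDS` value are hypothetical stand-ins for the real update plus its consistency sleep.

```python
import asyncio
import time

# Hypothetical stand-in for the consistency sleep; the real value is not stated here.
SETTLE_SECONDS = 0.05


async def update_alert(alert_id: int) -> int:
    """Pretend to update one alert, then wait for the remote side to settle."""
    # asyncio.sleep yields to the event loop, so other alerts proceed meanwhile.
    await asyncio.sleep(SETTLE_SECONDS)
    return alert_id


async def handle_burst(n_alerts: int) -> tuple[list[int], float]:
    """Process a burst of alerts concurrently and report the wall-clock time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(update_alert(i) for i in range(n_alerts)))
    return list(results), time.perf_counter() - start


def run_burst(n_alerts: int = 20) -> tuple[list[int], float]:
    return asyncio.run(handle_burst(n_alerts))


if __name__ == "__main__":
    ids, elapsed = run_burst(20)
    sequential = 20 * SETTLE_SECONDS
    print(f"{len(ids)} alerts in {elapsed:.3f}s (sequential would be about {sequential:.2f}s)")
```

Because the sleeps overlap, the burst should finish in roughly one sleep interval rather than twenty, which is the behavior a blocking `time.sleep` in a single sync worker cannot give.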

Tasks:

  • Switch am2alertapi from Flask to Quart using the migration documentation
  • Validate that am2alertapi still works
  • Contrive a load test and test am2alertapi to the extent possible. (am2alertapi-test on MCI talks to alerts-test and might be a good candidate)
  • Roll out am2alertapi slowly and ensure it works. prom01/02 and then MCI.
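For the load test task, a burst harness could look roughly like the sketch below. It is only a starting point under assumptions: `fire_burst` and `fake_send` are hypothetical names, and the actual HTTP POST to the test instance is left as an injected coroutine so that no endpoint or payload is assumed here.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def fire_burst(send: Callable[[int], Awaitable[None]], n: int) -> dict:
    """Start n sends at the same time and record failures and latency."""

    async def timed(i: int) -> tuple[bool, float]:
        start = time.perf_counter()
        try:
            await send(i)
            ok = True
        except Exception:
            # Count the failure instead of aborting the rest of the burst.
            ok = False
        return ok, time.perf_counter() - start

    results = await asyncio.gather(*(timed(i) for i in range(n)))
    return {
        "sent": n,
        "failed": sum(1 for ok, _ in results if not ok),
        "max_latency": max((lat for _, lat in results), default=0.0),
    }


async def fake_send(i: int) -> None:
    """Hypothetical sender; replace with a real async POST to the test instance."""
    await asyncio.sleep(0.01)
    if i == 3:
        raise RuntimeError("simulated failure")


if __name__ == "__main__":
    print(asyncio.run(fire_burst(fake_send, 10)))
```

Swapping `fake_send` for a real client call and setting `n` to 10-20 would approximate the refresh burst described above.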

Eric's research summary about why to use Quart:

I read about async in Flask 2.0. Summary: it isn't really async. It's async under sync since it's still WSGI.

I read about FastAPI vs Flask. FastAPI looked sorta interesting. Then I read this, which turned me off due to accusations of a backlog of unmerged fixes and improvements to FastAPI.

Then I looked at Quart again and realized that in the last year it became part of Pallets, which is the community-driven organization that also oversees Flask. Thus Flask and Quart are now supported by the same community.

Quart is an ASGI reimplementation of the Flask API, so it would be a natural, low-friction progression to a modern, fast async solution. Quart uses Hypercorn by default but can use other ASGI servers. Hypercorn has a lot of good features.

Based on this I'm feeling pretty good about using Quart to replace Flask and gunicorn to modernize my app and get out from under the gunicorn/eventlet mess.

Testing did not show any problems with the new am2alertapi, but when run as a production test on prom02 it would not send alerts. DNS resolution is apparently not happening, or something along those lines. Investigation required.

Jul 10 16:39:06 prom02 conmon: [2023-07-10 23:39:06 +0000] [8] [ERROR] Error in ASGI Framework
Jul 10 16:39:06 prom02 conmon: Traceback (most recent call last):
Jul 10 16:39:06 prom02 conmon: File "/venv/lib/python3.10/site-packages/anyio/_core/_sockets.py", line 189, in connect_tcp
Jul 10 16:39:06 prom02 conmon: addr_obj = ip_address(remote_host)
Jul 10 16:39:06 prom02 conmon: File "/usr/local/lib/python3.10/ipaddress.py", line 54, in ip_address
Jul 10 16:39:06 prom02 conmon: raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
Jul 10 16:39:06 prom02 conmon: ValueError: 'api.alerts.s.uw.edu' does not appear to be an IPv4 or IPv6 address

Arielle reported having similar errors with httpx under eventlet and gunicorn, which points to something in httpx. Others have reported that it works fine until containerized, so it may be a problem with DNS resolution. See encode/httpx#2167
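One way to narrow this down is to check resolution from inside the container with nothing but the standard library. The sketch below is a diagnostic under assumptions, not a fix: `is_ip_literal` and `resolve` are hypothetical helper names, and port 443 is assumed. Note that `ipaddress.ip_address` raises exactly this `ValueError` for any hostname, so the error in the log may be only the first exception in a chain, with the actual resolution failure further down in the part of the traceback not captured here.

```python
import ipaddress
import socket


def is_ip_literal(host: str) -> bool:
    """Return True if host is an IPv4/IPv6 literal rather than a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        # Same ValueError as in the log: raised for any hostname.
        return False


def resolve(host: str, port: int = 443) -> list[str]:
    """Return the addresses host resolves to; raises socket.gaierror on DNS failure."""
    if is_ip_literal(host):
        return [host]
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    name = "api.alerts.s.uw.edu"  # hostname taken from the log above
    try:
        print(name, "->", resolve(name))
    except socket.gaierror as exc:
        print("DNS resolution failed:", exc)
```

If this fails inside the container but works on the host, the problem is in the container's resolver setup rather than in httpx or Quart.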

I was trying to bring more certainty to am2alertapi, but it's not there yet. It is broken as is.