Build process may fail when downloading OS packages
ggiguash opened this issue · comments
The download problems are intermittent, so we might want to attempt retrying the operation.
See this code as an example.
A typical stack trace for such a failure would be the following:
Traceback (most recent call last):
File "/usr/bin/osbuild", line 33, in <module>
sys.exit(load_entry_point('osbuild==118', 'console_scripts', 'osbuild')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/osbuild/main_cli.py", line 179, in osbuild_cli
manifest.download(object_store, monitor, args.libdir)
File "/usr/lib/python3.12/site-packages/osbuild/pipeline.py", line 418, in download
source.download(mgr, store, libdir)
File "/usr/lib/python3.12/site-packages/osbuild/sources.py", line 44, in download
reply = client.call_with_fds("download", args, fds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/osbuild/host.py", line 384, in call_with_fds
raise error
osbuild.host.RemoteError: RuntimeError: curl: error downloading http://mirror.siena.edu/centos-stream/9-stream/BaseOS/x86_64/os/Packages/systemd-252-33.el9.x86_64.rpm: error code 22
File "/usr/lib/python3.12/site-packages/osbuild/host.py", line 268, in serve
reply, reply_fds = self._handle_message(msg, fds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/osbuild/host.py", line 301, in _handle_message
ret, fds = self.dispatch(name, args, fds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/osbuild/sources.py", line 123, in dispatch
self.fetch_all(SourceService.load_items(fds))
File "/usr/lib/osbuild/sources/org.osbuild.curl", line 143, in fetch_all
for _ in executor.map(self.fetch_one, *zip(*amended)):
File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib64/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/osbuild/sources/org.osbuild.curl", line 190, in fetch_one
raise RuntimeError(f"curl: error downloading {url}: error code {return_code}")
We do retry, and we have tests for this: https://github.com/osbuild/osbuild/blob/main/sources/test/test_curl_source.py#L117 - maybe we need to improve our retry logic. One way would be to add a (dynamic) backoff delay between attempts; right now the retries are fairly aggressive, afaict.
Note that exit code 22 from curl means that the remote server gave a 400 status (bad request). It would be weird for a server to respond with that transiently; perhaps rate limiting gone wrong?
Note the original report on this came from a CI job, which runs multiple bib processes in parallel. So, even if we limit the rate of a single bib process, it may not help in total.
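One way to keep parallel processes from retrying in lock-step is randomized ("full jitter") backoff, where each client sleeps a uniformly random amount up to an exponentially growing cap. A minimal sketch, not osbuild code - the function name and parameters are illustrative:

```python
import random


def jittered_delays(base=0.1, factor=2.0, cap=6.0, attempts=10):
    """Yield one randomized sleep duration per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * factor**n)],
    so several processes retrying the same mirror are unlikely to
    hammer it at the same instant.
    """
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * factor ** n))
```

Because every process draws its own random delays, this helps even when many bib processes share one mirror, though it cannot substitute for an actual server-side or cross-process rate limit.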
Maybe something like the following (the values are a bit of a strawman; the max delay will be around 6s, with a bit of randomness because of the multi-process nature):
diff --git a/sources/org.osbuild.curl b/sources/org.osbuild.curl
index 6a8293f2..fd586815 100755
--- a/sources/org.osbuild.curl
+++ b/sources/org.osbuild.curl
@@ -21,9 +21,11 @@ up the download.
import concurrent.futures
import os
import platform
+import random
import subprocess
import sys
import tempfile
+import time
import urllib.parse
from typing import Dict
@@ -155,6 +157,7 @@ class CurlSource(sources.SourceService):
# redirected to a different, working, one on retry.
return_code = 0
arch = platform.machine()
+ backoff_sleep = 0.1
for _ in range(10):
curl_command = [
"curl",
@@ -186,6 +189,9 @@
return_code = curl.returncode
if return_code == 0:
break
+ # give the source a bit of time
+ time.sleep(backoff_sleep)
+ backoff_sleep *= 1.5 + random.uniform(-0.1, 0.1)
else:
raise RuntimeError(f"curl: error downloading {url}: error code {return_code}")
would help in general.
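The retry loop from the diff can be extracted into a self-contained sketch; `run_curl` here is a hypothetical stand-in for the real curl invocation in `fetch_one`:

```python
import random
import time


def fetch_with_backoff(run_curl, url, attempts=10, backoff_sleep=0.1):
    """Retry run_curl(url) with multiplicative backoff plus jitter.

    run_curl must return a process-style exit code (0 == success).
    Raises RuntimeError with the last exit code if all attempts fail.
    """
    return_code = 0
    for _ in range(attempts):
        return_code = run_curl(url)
        if return_code == 0:
            return
        # give the source a bit of time before retrying; the multiplier
        # jitters around 1.5 so parallel processes drift apart
        time.sleep(backoff_sleep)
        backoff_sleep *= 1.5 + random.uniform(-0.1, 0.1)
    raise RuntimeError(f"curl: error downloading {url}: error code {return_code}")
```

With the strawman values above, the cumulative sleep across ten failed attempts lands in the ten-second range rather than the near-instant hammering of the current loop.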
In this particular case it may not help, of course: if many bib processes hammer a single server and it starts rate limiting, we would need a way to externally control the number of connections that we allow for curl.
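Within a single process, capping concurrency is just the executor's worker count; a sketch assuming a `fetch_one`-style callable (osbuild currently exposes no config knob for this, and limiting across processes would need an external mechanism such as a system-wide semaphore):

```python
import concurrent.futures


def fetch_all(fetch_one, urls, max_workers=4):
    """Download all urls, with at most max_workers hitting the mirror at once."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as ex:
        # consume the iterator so any exception from a worker propagates
        list(ex.map(fetch_one, urls))
```

Lowering `max_workers` trades throughput for politeness toward the mirror, which may matter more than backoff tuning if the 400s really are rate limiting.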