osbuild / osbuild

Build-Pipelines for Operating System Artifacts

Home Page: https://www.osbuild.org


Build process may fail when downloading OS packages

ggiguash opened this issue

The download problems are intermittent, so we might want to attempt retrying the operation.
See this code as an example.
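A minimal sketch of what such a retry wrapper could look like, under the assumption that a failed download raises an exception (none of the names or parameters below are osbuild API; they are made up for illustration):

```python
import random
import time


def retry(func, attempts=5, base_delay=0.5, factor=2.0):
    """Call func(); on failure, retry with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # out of attempts, re-raise the last error
            delay = base_delay * factor ** attempt
            time.sleep(delay + random.uniform(0, delay / 4))
```

The jitter matters when several processes fail at the same moment, so their retries do not land on the server in lockstep.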

A typical stack trace for such a failure would be the following:

Traceback (most recent call last):
  File "/usr/bin/osbuild", line 33, in <module>
    sys.exit(load_entry_point('osbuild==118', 'console_scripts', 'osbuild')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/osbuild/main_cli.py", line 179, in osbuild_cli
    manifest.download(object_store, monitor, args.libdir)
  File "/usr/lib/python3.12/site-packages/osbuild/pipeline.py", line 418, in download
    source.download(mgr, store, libdir)
  File "/usr/lib/python3.12/site-packages/osbuild/sources.py", line 44, in download
    reply = client.call_with_fds("download", args, fds)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/osbuild/host.py", line 384, in call_with_fds
    raise error
osbuild.host.RemoteError: RuntimeError: curl: error downloading http://mirror.siena.edu/centos-stream/9-stream/BaseOS/x86_64/os/Packages/systemd-252-33.el9.x86_64.rpm: error code 22
   File "/usr/lib/python3.12/site-packages/osbuild/host.py", line 268, in serve
    reply, reply_fds = self._handle_message(msg, fds)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/osbuild/host.py", line 301, in _handle_message
    ret, fds = self.dispatch(name, args, fds)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/osbuild/sources.py", line 123, in dispatch
    self.fetch_all(SourceService.load_items(fds))
  File "/usr/lib/osbuild/sources/org.osbuild.curl", line 143, in fetch_all
    for _ in executor.map(self.fetch_one, *zip(*amended)):
  File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib64/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/osbuild/sources/org.osbuild.curl", line 190, in fetch_one
    raise RuntimeError(f"curl: error downloading {url}: error code {return_code}")

We do retry, and we have tests for this: https://github.com/osbuild/osbuild/blob/main/sources/test/test_curl_source.py#L117. Maybe we need to improve our retry logic; one way would be to add a (dynamic) backoff delay between attempts. Right now the retries are fairly aggressive, AFAICT.
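To make the "dynamic backoff" idea concrete, here is a hypothetical schedule using a small base delay and a 1.5x growth factor (illustrative constants, not anything osbuild does today):

```python
def backoff_schedule(attempts=10, base=0.1, factor=1.5):
    """Delay to sleep before each retry; no sleep after the final attempt."""
    delays = []
    delay = base
    for _ in range(attempts - 1):
        delays.append(round(delay, 3))
        delay *= factor
    return delays
```

With 10 attempts this yields nine delays from 0.1s up to roughly 2.6s, about 7.5s of total waiting, which is far gentler on a struggling mirror than immediate re-requests.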

Note that exit code 22 from curl (run with --fail) means the remote server returned an HTTP error status of 400 or above; here presumably 400: bad request.

It would be weird for a server to respond with that only transiently; perhaps rate limiting gone wrong?
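For context, a rough classification of which curl exit codes look worth retrying at all. The mapping below is an illustrative assumption based on curl's documented exit codes, not an osbuild policy:

```python
# Transient-looking curl exit codes (per the curl man page) that a retry
# might reasonably cover; other codes usually indicate permanent failures.
RETRYABLE = {
    6,   # could not resolve host
    7,   # failed to connect
    18,  # partial file (transfer cut short)
    28,  # operation timed out
    52,  # empty reply from server
    56,  # failure receiving network data
}


def should_retry(exit_code: int) -> bool:
    # 22 means the server answered with an HTTP error (>= 400 under --fail).
    # A 4xx is usually permanent, but rate limiting is not, so treating 22
    # as retryable is a judgment call.
    return exit_code in RETRYABLE or exit_code == 22
```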

> Note that exit code 22 from curl means that the remote server gave a 400 status: bad request.
>
> It would be weird for a server to respond with that only transiently; perhaps rate limiting gone wrong?

Note the original report on this came from a CI job, which runs multiple bib processes in parallel. So, even if we limit the rate of a single bib process, it may not help in total.

> We do retry, and we have tests for this: https://github.com/osbuild/osbuild/blob/main/sources/test/test_curl_source.py#L117. Maybe we need to improve our retry logic; one way would be to add a (dynamic) backoff delay between attempts. Right now the retries are fairly aggressive, AFAICT.

Maybe something like this (the values are a bit of a strawman; the max delay will be around 6s, plus a bit of randomness because of the multi-process nature):

 diff --git a/sources/org.osbuild.curl b/sources/org.osbuild.curl
index 6a8293f2..fd586815 100755
--- a/sources/org.osbuild.curl
+++ b/sources/org.osbuild.curl
@@ -21,9 +21,11 @@ up the download.
 import concurrent.futures
 import os
 import platform
+import random
 import subprocess
 import sys
 import tempfile
+import time
 import urllib.parse
 from typing import Dict
 
@@ -155,6 +157,7 @@ class CurlSource(sources.SourceService):
             # redirected to a different, working, one on retry.
             return_code = 0
             arch = platform.machine()
+            backoff_sleep = 0.1
             for _ in range(10):
                 curl_command = [
                     "curl",
@@ -186,6 +189,9 @@
                 return_code = curl.returncode
                 if return_code == 0:
                     break
+                # give the source a bit of time
+                time.sleep(backoff_sleep)
+                backoff_sleep *= 1.5 + random.uniform(-0.1, 0.1)
             else:
                 raise RuntimeError(f"curl: error downloading {url}: error code {return_code}")
 
diff --git a/stages/org.osbuild.kickstart b/stages/org.osbuild.kickstart
index 21d03ace..3a84c4b5 100755
--- a/stages/org.osbuild.kickstart
+++ b/stages/org.osbuild.kickstart
@@ -62,7 +62,7 @@ def make_users(users: Dict) -> List[str]:
 
         key = opts.get("key")
         if key:
-            res.append(f'sshkey --username {name} "{key}"')
+            res.append(f'sshkey --username {name} "{key.strip()}"')
 
     return res
 

would help in general.

In this particular case it may not help, of course: if many bib processes hammer a single server and it starts to rate limit, we would need a way to externally control the number of connections that we allow for curl.