ExaBGP restart & reload race condition

Question

koef opened this issue a year ago

Hello ExaBGP Team,
Firstly, I'd like to express my appreciation for your exceptional product.

We utilize Ansible for installing and configuring ExaBGP in our setups. Below is our Ansible 'exabgp' role:

---
- name: Install packages
  ansible.builtin.apt:
    name:
      - exabgp
    state: present

- name: Configure ExaBGP
  ansible.builtin.template:
    src: exabgp.conf.j2
    dest: /etc/exabgp/exabgp.conf
    mode: 0644
  notify: Reload ExaBGP

- name: Enable and start ExaBGP
  ansible.builtin.systemd:
    name: exabgp
    enabled: true
    state: started

And here's the handler 'Reload ExaBGP':

---
- name: Reload ExaBGP
  ansible.builtin.systemd:
    name: exabgp
    state: reloaded
    enabled: true

Unfortunately, we've noticed an issue. Our monitoring system detected that 'exabgp.service' was restarted: "Systemd's exabgp.service restarted 1 times on node1."

This issue can be reproduced using the 'systemctl restart exabgp && systemctl reload exabgp' command. On my Ubuntu 22.04, the result is as follows:

# systemctl restart exabgp && systemctl reload exabgp
Job for exabgp.service failed because a fatal signal was delivered to the control process.
See "systemctl status exabgp.service" and "journalctl -xeu exabgp.service" for details.

The journal log provides this information:

Aug 07 13:57:38 node1 systemd[1]: Starting ExaBGP...
Aug 07 13:57:38 node1 systemd[1]: Started ExaBGP.
Aug 07 13:57:38 node1 systemd[1]: Reloading ExaBGP...
Aug 07 13:57:38 node1 systemd[1]: Reloaded ExaBGP.
Aug 07 13:57:38 node1 systemd[1]: exabgp.service: Main process exited, code=killed, status=10/USR1
Aug 07 13:57:38 node1 systemd[1]: exabgp.service: Failed with result 'signal'.
Aug 07 13:57:38 node1 systemd[1]: exabgp.service: Scheduled restart job, restart counter is at 1.
Aug 07 13:57:38 node1 systemd[1]: Stopped ExaBGP.
Aug 07 13:57:39 node1 systemd[1]: Starting ExaBGP...
Aug 07 13:57:39 node1 systemd[1]: Started ExaBGP.
...

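From the logs, it appears that ExaBGP does not have enough time to finish starting before systemd sends SIGUSR1. The signal presumably arrives before ExaBGP has installed its handler, so the default action for SIGUSR1 (terminate the process) applies, which matches the "code=killed, status=10/USR1" line above.

To address this, we applied a workaround by overriding the default exabgp unit: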
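The same kind of race can be reproduced outside systemd with a small, self-contained sketch. This is not ExaBGP's code: the child process below is a hypothetical stand-in for a daemon that needs a moment to start before it installs a SIGUSR1 handler, and the delay values are arbitrary. It assumes a Linux (POSIX) host.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a daemon (NOT ExaBGP itself): it does some
# startup work, then installs a SIGUSR1 "reload" handler, then runs.
CHILD = r"""
import signal, sys, time
time.sleep(float(sys.argv[1]))                       # simulated startup work
signal.signal(signal.SIGUSR1, lambda signum, frame: None)
print("ready", flush=True)                           # handler is now installed
time.sleep(0.2)                                      # simulated main loop
"""


def run(startup_delay, wait_for_ready):
    """Start the child, send SIGUSR1, and return the child's exit status.

    A negative return value -N means the child was killed by signal N.
    """
    cmd = [sys.executable, "-c", CHILD, str(startup_delay)]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        if wait_for_ready:
            proc.stdout.readline()  # block until the child prints "ready"
        proc.send_signal(signal.SIGUSR1)
        return proc.wait()


if __name__ == "__main__":
    # Signal sent during startup: no handler yet, default action kills the child.
    print("early signal:", run(0.3, wait_for_ready=False))
    # Signal sent after the handler is installed: the child survives.
    print("after ready: ", run(0.0, wait_for_ready=True))
```

In the first call the child is killed by SIGUSR1, mirroring the journal entry; in the second it exits normally. A fixed sleep only approximates the "wait until ready" step, since it guesses how long startup takes.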

cat /etc/systemd/system/exabgp.service.d/override.conf
[Service]
ExecStartPost=/bin/sleep 2

We'd appreciate you letting us know if there's a better solution.

To Reproduce

Steps to reproduce the behavior:
systemctl restart exabgp && systemctl reload exabgp

Expected behavior

A reload command sent immediately after a restart should not cause the service to fail and be restarted a second time.

Environment:

OS: Ubuntu 22.04.3 LTS
ExaBGP version: 4.2.17

Thomas Mangin · Answer 1 · Mon Sep 11 2023 22:12:21 GMT+0800 (China Standard Time)

Thank you for reporting this issue, I will try to look into it soon.

Thomas Mangin · Answer 2 · Mon Sep 18 2023 22:56:30 GMT+0800 (China Standard Time)

I am not sure why you are referencing SIGUSR1. Is SIGHUP not the signal sent when reload is used?

Systemd may send a SIGHUP for reload which may lead to a stop of ExaBGP.

I believe I ended up not using SIGHUP for reload as it can be sent by terminals to indicate that it was closed. We are going back to 2009 when I made these decisions: systemd did not exist yet and standards were not as clear as now.

Systemd should issue a SIGTERM on "stop" (and "restart"), and then restart the program once it has terminated.

Perhaps the systemd file should be changed?

ExecReload=/bin/kill -SIGUSR1 $MAINPID