axi-spi-engine driver watchdog timer still calls back after being canceled

Question

axi-spi-engine driver watchdog timer still calls back after being canceled

dlech opened this issue a year ago · comments

          Unrelated, but I noticed another odd thing when everything is working properly (no timeout): the watchdog timer doesn't seem to be canceled on successful completion and still fires 5 seconds afterwards.

        iio_attr-1170    [000] d.h2.   129.119232: spi_engine_read_rx_fifo <-spi_engine_irq
        iio_attr-1170    [000] d.h2.   129.119233: spi_engine_read_buff <-spi_engine_read_rx_fifo
        iio_attr-1170    [000] d.h2.   129.119235: spi_engine_rx_next <-spi_engine_read_rx_fifo
        iio_attr-1170    [000] d.h2.   129.119236: spi_engine_xfer_next <-spi_engine_rx_next
        iio_attr-1170    [000] d.h2.   129.119238: spi_engine_complete_message <-spi_engine_irq
          <idle>-0       [000] ..s2.   134.152771: spi_engine_timeout <-call_timer_fn

del_timer() is called immediately before spi_engine_complete_message() so I would expect it to be canceled and not call the callback.

Originally posted by @dlech in #2287 (comment)

Nuno Sá · Answer 1 · Wed Jan 24 2024 22:56:09 GMT+0800 (China Standard Time)

Sooo, I had to deal with this thing and saw the same behavior. As it turns out the problem is not as crazy or in core code as we were thinking... Note the below log (additional printk placed by me):

[  110.207544] Transfer one spi message... c1c9bda8
[  110.207578] Del watchdog timer (0)
[  110.207585] complete spi message c1c9bda8
[  110.207620] Modify Timer base...
root@analog:~# dmesg -c
[  115.280465] SPI ENGINE TIMEOUT

From the above it's clear that we're being faster in processing the spi transfer than calling mod_timer(). That's also visible by the return code from del_timer() which is 0 - > deactivated timer. So, if we move the mod_timer() call before activating the interrupts, we get:

[  382.201474] Transfer one spi message... c6919da8
[  382.201503] Modify Timer base...
[  382.201520] Del watchdog timer (1)
[  382.201532] complete spi message c6919da8

and no spurious soft interrupt. This was actually causing a kernel panic in a usecase where we play with overlays and a loop with load + unloading the overlay would eventually panic as we would end up (at some point) in the timer callback with the spi controller unbound...

So, a simple patch for this would be:

diff --git a/drivers/spi/spi-axi-spi-engine.c b/drivers/spi/spi-axi-spi-engine.c
index 050096874afe..44cbc40abde3 100644
--- a/drivers/spi/spi-axi-spi-engine.c
+++ b/drivers/spi/spi-axi-spi-engine.c
@@ -651,12 +651,13 @@ static int spi_engine_transfer_one_message(struct spi_master *master,

        int_enable |= SPI_ENGINE_INT_SYNC;

+       mod_timer(&spi_engine->watchdog_timer, jiffies + 5 * HZ);
+
        writel_relaxed(int_enable,
                spi_engine->base + SPI_ENGINE_REG_INT_ENABLE);
        spi_engine->int_enable = int_enable;
        spin_unlock_irqrestore(&spi_engine->lock, flags);

-       mod_timer(&spi_engine->watchdog_timer, jiffies + 5*HZ);

        return 0;
 }

Now we are calling mod_timer() with the spin lock held but AFAICT, it should not be an issue... Do you see any issue with it @dlech?

I just checked and this should not be an issue in the upstream version of the driver but we will probably only sync the upstream version of the controller in the main branch and not in 2022_R2 where we also need a fix and that is why I'm asking if this looks sane to you :).

I can also just call mod_timer() right before acquiring the spin...

David Lechner · Answer 2 · Wed Jan 24 2024 23:15:19 GMT+0800 (China Standard Time)

I think it is fine in the spinlock since the timer should have the TIMER_IRQSAFE flag. But if you want to be consistent with the upstream driver, I just put it before the spinlock since it isn't possible for two calls to spi_engine_transfer_one_message() to happen at the same time.

Nuno Sá · Answer 3 · Wed Jan 24 2024 23:17:30 GMT+0800 (China Standard Time)

Alright,

I'll do this for the 2022_R2 branch and assume for main we well get it in sync with upstream + offload stuff