Zephyr: k_msleep() doesn't work

Question

m42uko opened this issue 6 months ago

Hi,

I modified the default hello_world sample program to print the Hello World! message in a loop with a delay as follows:

diff --git a/samples/hello_world/src/main.c b/samples/hello_world/src/main.c
index 2758d75d3f..a0646c789d 100644
--- a/samples/hello_world/src/main.c
+++ b/samples/hello_world/src/main.c
@@ -5,9 +5,13 @@
  */
 
 #include <stdio.h>
+#include <zephyr/kernel.h>
 
 int main(void)
 {
-       printf("Hello World! %s\n", CONFIG_BOARD);
+       while (1) {
+               printf("Hello World! %s\n", CONFIG_BOARD);
+               k_msleep(1);
+       }
        return 0;
 }

However, when I run this, I only get a single Hello World printout (this happens for both Zephyr v3.5.0 and v2.4.0):


[markus@571aaf79e9b4 workspace]$ fusesoc run --target=verilator_tb servant --uart_baudrate=57600 --firmware=hello.hex
INFO: Preparing ::serv:1.2.1
INFO: Preparing ::servant:1.2.1
make: Nothing to be done for 'all'.
Loading RAM from /home/markus/workspace/hello.hex
*** Booting Zephyr OS build zephyr-v3.5.0 ***
Hello World! service
^C
Caught ctrl-c

INFO: ****************************
INFO: ****   FuseSoC aborted  ****
INFO: ****************************

Can you reproduce this / is this a known limitation of servant/service?

Thanks!
Markus

Markus · Answer 1 · Mon Dec 04 2023 01:43:21 GMT+0800 (China Standard Time)

Quoting your reply on #111 (to keep the msleep discussion in one thread):

I haven't had time to look at your work yet, but regarding the msleep issue, there's one thing I would check first. At least in Zephyr 2.4.0, there's a bug that causes mul operations to be emitted in the library code even if the Kconfig settings say not to use hard mul and div. Not sure if this is the problem, but you can e.g. check the compiled zephyr.lst and see if there are any mul operations present.

I first tried on v2.4.0 and ran into the MUL bug. Without any fix (i.e. with "illegal" MULs in the binary), k_msleep sleeps for zero time (returns immediately). With your fix from #102 (comment), it shows the same behavior as vanilla 3.5.0. I haven't checked that there are really no more MULs present; I can look into that in the upcoming days just to make sure.

Olof Kindgren · Answer 2 · Mon Dec 04 2023 02:05:05 GMT+0800 (China Standard Time)

I checked now with 2.4.0 and I can reproduce the issue. TBH, I never really understood how this timer API is supposed to work, so there's a fair chance I'm doing something strange but I can't see what it is right away. Will need some debugging. In case you haven't seen already, you can run with --trace_pc and --vcd to dump the PC and waveform of all signals respectively.

--trace_pc generates a trace.bin in the work root. You can run this Python script, passing the Zephyr .elf file as the first argument and trace.bin as the second, to generate a listing of how SERV jumps between functions. It's not the best debug tool out there but can be helpful.

Markus · Answer 3 · Mon Dec 04 2023 02:08:24 GMT+0800 (China Standard Time)

Ooooh nice, thanks for the tips with debugging this. I'll see whether I can figure something out. (I'll probably get to it only next weekend though.)

Markus · Answer 4 · Mon Dec 04 2023 02:28:52 GMT+0800 (China Standard Time)

I think I found the issue. sys_clock_driver_init (formerly `z_clock_driver_init`) is never called. I added the corresponding macro as part of my v3.5.0 PR.

diff --git a/zephyr/drivers/timer/serv_timer.c b/zephyr/drivers/timer/serv_timer.c
index 1691fa1..2039217 100644
--- a/zephyr/drivers/timer/serv_timer.c
+++ b/zephyr/drivers/timer/serv_timer.c
@@ -122,3 +122,6 @@ uint32_t sys_timer_cycle_get_32(void)
 {
        return mtime();
 }
+
+SYS_INIT(sys_clock_driver_init, PRE_KERNEL_2,
+        CONFIG_SYSTEM_CLOCK_INIT_PRIORITY);

I haven't verified that the timing checks out yet though.

Markus · Answer 5 · Mon Dec 04 2023 02:35:36 GMT+0800 (China Standard Time)

Verified timing on CMOD-A7 HW (Xilinx Artix-7 15T): Looks good! Checked only by looking at a blinking LED, so I didn't check down to the microsecond, but I think that's good enough (for now).

Olof Kindgren · Answer 6 · Mon Dec 04 2023 02:46:25 GMT+0800 (China Standard Time)

Well spotted! I'll do a proper review of #111 as soon as I can

Markus · Answer 7 · Mon Dec 04 2023 02:47:53 GMT+0800 (China Standard Time)

Cool, thanks!

Markus · Answer 8 · Mon Dec 04 2023 17:18:46 GMT+0800 (China Standard Time)

Oh oh, actually it's not quite fixed yet. After running the design for a few minutes (~10 minutes is enough, probably less), the application freezes again. Probably something to do with the timer overflowing. I haven't had the time to look at the code yet though.

Markus · Answer 9 · Fri Dec 08 2023 15:29:36 GMT+0800 (China Standard Time)

I think this is a problem in the Verilog timer. Take the following example which resets the Verilog timer close to its overflow value (and adds some debugging):

diff --git a/servant/servant_timer.v b/servant/servant_timer.v
index 45306e9..656c1cd 100644
--- a/servant/servant_timer.v
+++ b/servant/servant_timer.v
@@ -23,15 +23,21 @@ module servant_timer
       o_wb_dat[HIGH:0] = mtimeslice;
    end
 
+   always @(posedge o_irq) begin
+      $display("IRQ @0x%x | 0x%x", mtime, mtimecmp);
+   end
+
    always @(posedge i_clk) begin
-      if (i_wb_cyc & i_wb_we)
+      if (i_wb_cyc & i_wb_we) begin
 	mtimecmp <= i_wb_dat[HIGH:0];
+        $display("set irq = 0x%x", i_wb_dat[HIGH:0]);
+      end
       mtime <= mtime + 'd1;
       o_irq <= (mtimeslice - mtimecmp >= 0);
       if (RESET_STRATEGY != "NONE")
 	if (i_rst) begin
-	   mtime <= 0;
-	   mtimecmp <= 0;
+	   mtime <= 4293916296;
+	   mtimecmp <= 4293916296;
 	end
    end
 endmodule

If you then run my loopy hello_world, you get:

make: Nothing to be done for 'all'.
Loading RAM from /home/markus/workspace/hello.hex
IRQ @0xffeff688 | 0xffeff688
set irq = 0xfff16059
set irq = 0xfff16059
IRQ @0xfff1605a | 0xfff16059
*** Booting Zephyr OS build zephyr-v3.5.0 ***
Hello World! service
set irq = 0xfffffc80
set irq = 0xfffffc80
IRQ @0xfffffc81 | 0xfffffc80
set irq = 0x00000900
set irq = 0x00000900
set irq = 0x00009280
set irq = 0x00009280
set irq = 0x00012240
set irq = 0x00012240
set irq = 0x0001abc0
set irq = 0x0001abc0
set irq = 0x00023b80
set irq = 0x00023b80
set irq = 0x0002cb40
set irq = 0x0002cb40
set irq = 0x00035b00
set irq = 0x00035b00
set irq = 0x0003eac0
set irq = 0x0003eac0
set irq = 0x00047a80
set irq = 0x00047a80
set irq = 0x00050a40
set irq = 0x00050a40
set irq = 0x00059a00
set irq = 0x00059a00
set irq = 0x00062380
set irq = 0x00062380
set irq = 0x0006b340
set irq = 0x0006b340
set irq = 0x00074300
set irq = 0x00074300
set irq = 0x0007d2c0
set irq = 0x0007d2c0
set irq = 0x00086280
set irq = 0x00086280
[... many more, they are printed immediately, IRQ is permanently active]

PS: I think #109 was supposed to address this, but I do not see how it makes a difference. Example: a 4-bit timer with a maximum count of 15.

IRQ! We are at count 13. -> We set the next cmp to 4
Verilog does 13 - 4 = 9, which is greater than zero -> next interrupt fires immediately.
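
The 4-bit arithmetic above can be replayed in a small C sketch. The helper names (wrapped_diff, irq_signed) are hypothetical and are not taken from servant_timer.v; the sketch only illustrates that the outcome depends on how the wrapped difference is interpreted. Read as an unsigned number, 13 - 4 = 9 is trivially >= 0. Read as a 4-bit two's-complement number, the same bit pattern is -7, so a signed check would not fire. Which of the two the Verilog actually does is not settled by this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define TIMER_BITS 4u                          /* toy width from the example */
#define TIMER_MASK ((1u << TIMER_BITS) - 1u)   /* 0xF, i.e. max count 15 */

/* Difference mtime - mtimecmp, truncated to TIMER_BITS bits (wraps around). */
static inline uint32_t wrapped_diff(uint32_t mtime, uint32_t mtimecmp)
{
    return (mtime - mtimecmp) & TIMER_MASK;
}

/*
 * Hypothetical signed interpretation: treat the wrapped difference as a
 * two's-complement value and fire only when its sign bit is clear, i.e.
 * when mtime has reached mtimecmp and is less than half the range past it.
 */
static inline bool irq_signed(uint32_t mtime, uint32_t mtimecmp)
{
    uint32_t sign_bit = 1u << (TIMER_BITS - 1u);
    return (wrapped_diff(mtime, mtimecmp) & sign_bit) == 0u;
}
```

With count 13 and compare value 4, wrapped_diff yields 9 and irq_signed yields false; with count 4 and compare value 4, irq_signed yields true.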

Without having thought it through, wouldn't the following be the proper fix (and even reduce logic):

mtime always increments by 1 (no change)
if mtime == mtimecmp -> set IRQ (compare equal, set latching IRQ flag)
if write to bus -> reset IRQ and update mtimecmp (reset latch on bus access)
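
As a rough behavioural model of the three rules above, here is a minimal C sketch. It is not the Verilog implementation; the struct and function names (timer_model, timer_tick, timer_write_cmp) are made up for illustration, and it assumes a 32-bit counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the proposed timer. */
struct timer_model {
    uint32_t mtime;     /* free-running counter, wraps at 2^32 */
    uint32_t mtimecmp;  /* compare value written over the bus */
    bool irq;           /* latched interrupt flag */
};

/* One clock cycle: increment the counter, then latch the IRQ on a match. */
static inline void timer_tick(struct timer_model *t)
{
    t->mtime++;
    if (t->mtime == t->mtimecmp) {
        t->irq = true;  /* stays set until the next bus write */
    }
}

/* Bus write: update the compare value and clear the latched IRQ. */
static inline void timer_write_cmp(struct timer_model *t, uint32_t cmp)
{
    t->mtimecmp = cmp;
    t->irq = false;
}
```

Because the comparison is for equality, a compare value on the far side of the overflow needs no special handling. One trade-off of this scheme: if software writes a compare value the counter has already passed, the match only occurs after a full wrap of the counter.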

OR: As a more extreme alternative, change the timer to a self-resetting kind that restarts when it reaches cmp (auto retrigger). Advantage: In a tickful kernel, we wouldn't have to do any math in the IRQ (which is important, because there are divisions and multiplications!). Disadvantage: We might lose support for tickless kernels; I'll have to check.

Olof Kindgren · Answer 10 · Fri Dec 08 2023 16:21:04 GMT+0800 (China Standard Time)

Good debugging! Yes, #109 was supposed to fix this, but in hindsight I should definitely have checked myself that it worked correctly. I don't have any strong opinions on what kind of timer we want to have. The idea was to have a 32-bit timer, but otherwise follow the RISC-V machine timer as closely as possible to make the software support easier. Not sure if I succeeded in that. I agree there's an alarming number of math operations in the IRQ handler, and avoiding those sounds like a good idea. And I'm not sure I understand the implications of a tickless kernel, tbh. Power savings seems to be one thing, but there might be other benefits as well?

A third option could be to just implement a proper 64-bit RISC-V machine timer instead. It will cost a little bit more, but since it's an optional feature outside of the CPU, it's not the end of the world. The 2018 SoftCPU contest is already over by now and that was really the main reason for keeping down the size of the whole SoC.

Markus · Answer 11 · Fri Dec 08 2023 20:55:01 GMT+0800 (China Standard Time)

Actually, #109 may be correct after all, and the report above might just be related to flawed testing. If I initialize the last_count variable properly upon boot (instead of assuming it to be zero), it seems to work fine. That doesn't explain the "stops working after 10 mins" though...

diff --git a/zephyr/drivers/timer/serv_timer.c b/zephyr/drivers/timer/serv_timer.c
index 2039217..d48b771 100644
--- a/zephyr/drivers/timer/serv_timer.c
+++ b/zephyr/drivers/timer/serv_timer.c
@@ -60,7 +60,8 @@ static void timer_isr(void *arg)
 int sys_clock_driver_init()
 {
        IRQ_CONNECT(RISCV_MACHINE_TIMER_IRQ, 0, timer_isr, NULL, 0);
-       set_mtimecmp(mtime() + (uint32_t)CYC_PER_TICK);
+       last_count = mtime();
+       set_mtimecmp(last_count + (uint32_t)CYC_PER_TICK);
        irq_enable(RISCV_MACHINE_TIMER_IRQ);
        return 0;
 }

(Added that change to my PR.)
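
To see why the initial value of last_count can matter, here is a minimal sketch. It assumes the ISR derives the number of elapsed ticks from the difference between the current count and last_count, as is common in Zephyr timer drivers; the serv_timer ISR itself is not shown in this thread, so the function names and the CYC_PER_TICK value below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define CYC_PER_TICK 160u  /* hypothetical value, for illustration only */

static uint32_t last_count;  /* count at the last announced tick boundary */

/* Model of driver init: either seed last_count from the counter or leave it 0. */
static inline void model_init(uint32_t now, bool seed_last_count)
{
    last_count = seed_last_count ? now : 0u;
}

/* Model of the ISR bookkeeping: how many whole ticks elapsed since last_count. */
static inline uint32_t model_isr_ticks(uint32_t now)
{
    uint32_t dticks = (now - last_count) / CYC_PER_TICK;
    last_count += dticks * CYC_PER_TICK;
    return dticks;
}
```

If the counter starts at a large value such as 0xffeff688 (the reset value used in the test above) while last_count is left at zero, the first interrupt reports millions of elapsed ticks instead of one. Seeding last_count from the counter at init makes the first interrupt report a single tick.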

Olof Kindgren · Answer 12 · Fri Dec 08 2023 21:32:21 GMT+0800 (China Standard Time)

I can think of a couple of different strategies. One would be to bisect between 2.4 and 3.5 to figure out where it stops working. Or did you say it's not working with 2.4 either? It sucks if we need to wait ten minutes for it to display the problem.

Another way would be to compare the soc support files in serv with the ones upstream and see if there is some issue there.

Replacing the serv_timer with a real RISC-V machine timer is probably also a good idea to avoid having to carry our own driver. I started working on that a bit. Will push something later. Also just pushed a branch with some helpful debug registers to make troubleshooting with waveforms slightly less awful: https://github.com/olofk/serv/tree/dbg

Markus · Answer 13 · Fri Dec 08 2023 23:30:14 GMT+0800 (China Standard Time)

As noted in PR #111 (comment), this problem seems to be related to the toolchain emitting invalid MUL instructions (as part of picolibc, even if disabled), which breaks the timing calculations. The driver and Verilog appear to be fine.