OpenDataPlane / odp

The ODP project is an open-source, cross-platform set of application programming interfaces (APIs) for the networking data plane

Home Page: https://opendataplane.org

fallocate is interrupted by signal at startup

chrhong opened this issue · comments

We are seeing a pool creation failure in our system. The error shows that the fallocate system call was interrupted:
"odp_ishm.c:707:create_file():Huge page memory allocation failed: fd=582, file=/dev/hugepages/0/odp-16-ishm-pool_008_pkt-rx:7-0, err="Interrupted system call""

Would it be better to retry the system call after getting this error return?
Why the signal is raised is not yet known...

@MatiasElo Do you have any comments on this?

Hmm, this is the first time I've seen this failure. Does this happen constantly or was it a random occurrence? Also, what was the return code of fallocate() and the size of allocated shm block?

The error occurs easily in a k8s environment, with about a 10% recurrence rate. I think the fallocate return code is EINTR (Interrupted system call). The size is around 4M.

Thanks for the info. Looks like a good solution would be to add a number of retries if EINTR is received.
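
As a rough illustration of the retry idea, here is a minimal sketch assuming Linux with glibc. The helper name `fallocate_retry` and the `max_retries` parameter are hypothetical and not taken from the ODP sources. The actual fix in ODP may look different.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for the fallocate() declaration in glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper (not an ODP function): call fallocate() and, if it
 * fails with EINTR, try again up to max_retries additional times.
 * Returns 0 on success, or -1 with errno set by the last attempt. */
int fallocate_retry(int fd, off_t len, int max_retries)
{
	int retries = 0;
	int ret;

	do {
		/* mode 0, offset 0: allocate len bytes from the start of the file */
		ret = fallocate(fd, 0, 0, len);
	} while (ret == -1 && errno == EINTR && retries++ < max_retries);

	return ret;
}
```

Any error other than EINTR is returned immediately, so the caller can still report it as before. A bounded retry count avoids spinning forever if signals keep arriving.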

Does this change fix the issue you are seeing?

Strange, the issue is no longer reproduced after I recompile... I will update later.

Update:

  1. When I recompile odp and copy the new libs into my docker container, the issue does not occur even after hundreds of restarts.
  2. When I do not update odp, the issue occurs easily. Most importantly, nothing related to startup changed between the new and old odp libs.
    Matias, do you know any method to trace which signal interrupts the system call, and why? I want to find out why the call is only interrupted with the older libs.
    I used Linux strace to trace my process, but did not see any signal in it...
mkdir("/dev/hugepages/0", 0744)         = -1 EEXIST (File exists)
open("/dev/hugepages/0/odp-48-ishm-far_pool", O_RDWR|O_CREAT|O_TRUNC, 0644) = 602
fallocate(602, 0, 0, 618659840)         = -1 EINTR (Interrupted system call)
write(2, "odp_ishm.c:707:create_file():Hug"..., 151) = 151
close(602)                              = 0
unlink("/dev/hugepages/0/odp-48-ishm-far_pool") = 0
write(2, "odp_ishm.c:1168:_odp_ishm_reserv"..., 112) = 112
mkdir("/dev/shm/0", 0744)               = -1 EEXIST (File exists)
open("/dev/shm/0/odp-48-ishm-far_pool", O_RDWR|O_CREAT|O_TRUNC, 0644) = 602
fallocate(602, 0, 0, 618139648)         = -1 ENOSPC (No space left on device)
write(2, "odp_ishm.c:707:create_file():Nor"..., 147) = 147
close(602)                              = 0
unlink("/dev/shm/0/odp-48-ishm-far_pool") = 0

The other issue, similar to this one, is that I sometimes hit a SIGSEGV in DPDK, called from odp_pktio_start() at startup.
Since the pktio handle is created by odp_pktio_open(), I do not think this is an application code issue.
I wonder if this is related to my environment initialization. Do you have an example of environment initialization?
Currently, we only create hugepages and load the PMD for DPDK.

Thanks.

Hmm, I haven't had to trace signals before, so unfortunately I cannot help much. Usually I just isolate the data plane cores and redirect all signals to a set of control cores.

One thing that stands out in your log is the "No space left on device" error. Perhaps you are running out of space in /dev/shm. In the ODP CI Docker images we set --shm-size 8g to be on the safe side. I don't do any special environment setup for DPDK. I just map the huge pages and bind NICs as you have done.
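
To check the available space from inside the container before ODP reserves memory, something like the following could be used. This is a minimal sketch assuming a POSIX system. The helper name `fs_avail_bytes` is hypothetical and not part of ODP.

```c
#include <sys/statvfs.h>

/* Hypothetical helper: report how many bytes are available to
 * unprivileged users on the filesystem that contains path.
 * Returns 0 and stores the result in *avail, or -1 on error
 * with errno set by statvfs(). */
int fs_avail_bytes(const char *path, unsigned long long *avail)
{
	struct statvfs st;

	if (statvfs(path, &st) != 0)
		return -1;

	/* f_bavail counts free blocks in units of f_frsize bytes */
	*avail = (unsigned long long)st.f_bavail * st.f_frsize;
	return 0;
}
```

Calling it with "/dev/shm" and comparing the result against the size passed to fallocate() in the strace log (618139648 bytes) would show whether the ENOSPC error is expected.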