ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

prov/net: failing fi_rdm_tagged_peek test

shefty opened this issue · comments

CI is failing consistently. Best failure report may be this one:

name:   fi_rdm_tagged_peek -p "net"
16:50:57    timestamp: 20221021-235057+0000
16:50:57    result: Fail
16:50:57    time:   2
16:50:57    server_cmd: /home/cstbuild/ofi-Install/libfabric/ofi_libfabric/PR-8142/4/dbg/bin/fi_rdm_tagged_peek -p "net"   -s ci7-eth2
16:50:57    server_stdout: |
16:50:57    client_cmd: /home/cstbuild/ofi-Install/libfabric/ofi_libfabric/PR-8142/4/dbg/bin/fi_rdm_tagged_peek -p "net"   -s ci8-eth2 ci7-eth2
16:50:57    client_stdout: |
16:50:57      FI_PEEK | FI_DISCARD(): functional/rdm_tagged_peek.c:160, ret=-265 (Truncation error)
16:50:57      fi_rdm_tagged_peek: prov/util/src/util_buf.c:256: ofi_bufpool_destroy: Assertion `(pool->attr.flags & OFI_BUFPOOL_NO_TRACK) || !ofi_atomic_get32(&buf_region->use_cnt)' failed.

Most reports are along the lines of this:

client_cmd: /home/cstbuild/ofi-Install/libfabric/ofi_libfabric/PR-8162/1/reg/bin/fi_rdm_tagged_peek -p "net"   -s ci8-eth2 ci7-eth2
18:27:50    client_stdout: |
18:27:50      Searching for a bad msg
18:27:50      Searching for a bad msg with claim
18:27:50      Searching for first msg
18:27:50      Receiving first msg
18:27:50      Searching for second msg to claim
18:27:50      Receiving second msg
18:27:50      Searching for third msg to peek and discard
18:27:50      FI_PEEK | FI_DISCARD(): functional/rdm_tagged_peek.c:160, ret=-265 (Truncation error)

Debugging shows the problem occurs after processing the 3rd message (peek & discard). It appears to be related to freeing the internally allocated buffer.

Capture of debug output near the crash:

Searching for third msg to peek and discard
xfer alloc 0x697910
malloc (0x697910) 0x688a70 1024
free (0x697910) 0x688be8 648
free(): invalid pointer

The data is repeatable, though other errors can show up. The decrease in iov_len (1024 -> 648, i.e. 376 bytes) matches the increase in iov_base (0x688a70 -> 0x688be8).

@ooststep - The problem is that the receive processing can update the iov stored in the recv_entry. This happens when there's less data available to read from the socket than the msg size. The iov is updated to reflect what's been read, so that the next recv operation will continue filling in the buffer at the correct location. When recv_entry is freed, the iov_base no longer references the malloc location.

We may be able to record the malloc address in iov[1] (assuming that the receive handling doesn't clear it), and free it from there. Though I'd look for a less ugly way to do it, e.g. use a union for the iov, define some constant for '1', or move the malloc into the xnet_rx_alloc() call (to pair with the free).

Does this suggest we need a dedicated member for discarding? Wouldn't a union for the iov have the same issue?