ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

prov/net: failing fi_rdm_tagged_peek test

shefty opened this issue · comments

CI is failing consistently. Best failure report may be this one:

name:   fi_rdm_tagged_peek -p "net"
16:50:57    timestamp: 20221021-235057+0000
16:50:57    result: Fail
16:50:57    time:   2
16:50:57    server_cmd: /home/cstbuild/ofi-Install/libfabric/ofi_libfabric/PR-8142/4/dbg/bin/fi_rdm_tagged_peek -p "net"   -s ci7-eth2
16:50:57    server_stdout: |
16:50:57    client_cmd: /home/cstbuild/ofi-Install/libfabric/ofi_libfabric/PR-8142/4/dbg/bin/fi_rdm_tagged_peek -p "net"   -s ci8-eth2 ci7-eth2
16:50:57    client_stdout: |
16:50:57      FI_PEEK | FI_DISCARD(): functional/rdm_tagged_peek.c:160, ret=-265 (Truncation error)
16:50:57      fi_rdm_tagged_peek: prov/util/src/util_buf.c:256: ofi_bufpool_destroy: Assertion `(pool->attr.flags & OFI_BUFPOOL_NO_TRACK) || !ofi_atomic_get32(&buf_region->use_cnt)' failed.

Most reports are along the lines of this:

client_cmd: /home/cstbuild/ofi-Install/libfabric/ofi_libfabric/PR-8162/1/reg/bin/fi_rdm_tagged_peek -p "net"   -s ci8-eth2 ci7-eth2
18:27:50    client_stdout: |
18:27:50      Searching for a bad msg
18:27:50      Searching for a bad msg with claim
18:27:50      Searching for first msg
18:27:50      Receiving first msg
18:27:50      Searching for second msg to claim
18:27:50      Receiving second msg
18:27:50      Searching for third msg to peek and discard
18:27:50      FI_PEEK | FI_DISCARD(): functional/rdm_tagged_peek.c:160, ret=-265 (Truncation error)

Debugging shows the problem occurs after processing the 3rd message (peek & discard). It appears to be related to freeing the internally allocated buffer.

Capture of debug output near the crash:

Searching for third msg to peek and discard
xfer alloc 0x697910
malloc (0x697910) 0x688a70 1024
free (0x697910) 0x688be8 648
free(): invalid pointer

The data is repeatable, though other errors can show up. The decrease in iov_len (1024 -> 648, i.e. 376 bytes) matches the increase in iov_base (0x688a70 -> 0x688be8).

@ooststep - The problem is that the receive processing can update the iov stored in the recv_entry. This happens when there's less data available to read from the socket than the msg size. The iov is updated to reflect what's been read, so that the next recv operation will continue filling in the buffer at the correct location. When recv_entry is freed, the iov_base no longer references the malloc location.

We may be able to record the malloc address in iov[1] (assuming that the receive handling doesn't clear it), and free it from there. Though I'd look for a less ugly way to do it, e.g. use a union for the iov, define some constant for '1', or move the malloc into the xnet_rx_alloc() call (to pair with the free).

Does this suggest we need a dedicated member for discarding? Wouldn't a union for the iov have the same issue?