ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

prov/tcp: truncation error disables ep

hzhou opened this issue · comments

Describe the bug
I am trying to get tcp;ofi_rxm to work with MPICH. Most tests pass except those that exercise truncation errors. This simple test uses 2 processes: P0 sends 8 bytes to P1, while P1 posts a receive for only 4 bytes. The receive completes with the correct FI_ETRUNC cq entry, but P1 then hangs on later communications. Here is the log:

[omitting messages during init]
[1] libfabric:1035542:1669680299::ofi_rxm:core:rxm_ep_setopt():502<info> FI_OPT_MIN_MULTI_RECV set to 16384
[1] libfabric:1035542:1669680299::ofi_rxm:av:ofi_ip_av_insertv():649<debug> inserting 2 addresses
[1] libfabric:1035542:1669680299::ofi_rxm:av:ip_av_insert_addr():619<debug> av_insert addr: fi_sockaddr_in://140.221.16.19:43631
[1] libfabric:1035542:1669680299::ofi_rxm:av:ip_av_insert_addr():621<debug> av_insert fi_addr: 0
[1] libfabric:1035542:1669680299::ofi_rxm:av:ip_av_insert_addr():619<debug> av_insert addr: fi_sockaddr_in://140.221.16.19:44601
[1] libfabric:1035542:1669680299::ofi_rxm:av:ip_av_insert_addr():621<debug> av_insert fi_addr: 1
[1] libfabric:1035542:1669680299::ofi_rxm:av:ofi_ip_av_insertv():667<debug> 2 addresses successful
[1] libfabric:1035542:1669680299::ofi_rxm:ep_ctrl:rxm_alloc_conn():419<debug> allocated conn 0x55b6823f01b8
[1] libfabric:1035542:1669680299::ofi_rxm:ep_ctrl:rxm_send_connect():286<debug> connecting 0x55b6823f01b8
[1] libfabric:1035542:1669680299::ofi_rxm:ep_ctrl:rxm_open_conn():184<debug> open msg ep 0x55b6823f01b8
[1] libfabric:1035542:1669680299::tcp:core:ofi_check_rx_attr():802<info> Tx only caps ignored in Rx caps
[1] libfabric:1035542:1669680299::tcp:core:ofi_check_tx_attr():900<info> Rx only caps ignored in Tx caps
[1] libfabric:1035542:1669680299::tcp:core:ofi_check_rx_attr():802<info> Tx only caps ignored in Rx caps
[1] libfabric:1035542:1669680299::tcp:core:ofi_check_tx_attr():900<info> Rx only caps ignored in Tx caps
[1] libfabric:1035542:1669680299::tcp:ep_ctrl:tcpx_accept():553<debug> accepting connection
[1] libfabric:1035542:1669680299::tcp:ep_ctrl:tcpx_cm_send_req():502<debug> client send connreq
[1] libfabric:1035542:1669680299::tcp:ep_ctrl:tcpx_cm_recv_req():430<debug> Server receive connect request
[1] libfabric:1035542:1669680299::ofi_rxm:ep_ctrl:rxm_process_connreq():675<info> connreq for 0x55b6823f01b8
[1] libfabric:1035542:1669680299::ofi_rxm:ep_ctrl:rxm_process_connreq():684<info> simultaneous, reject peer 0x55b6823f01b8
[1] libfabric:1035542:1669680299::tcp:ep_ctrl:tcpx_cm_recv_resp():319<debug> Handling accept from server
[1] libfabric:1035542:1669680299::tcp:ep_ctrl:ofi_wait_add_fd():250<debug> Given fd (24) already added to wait list - 0x55b6823c22b0
[1] libfabric:1035542:1669680299::ofi_rxm:ep_ctrl:rxm_process_connect():507<debug> processing connected for handle: 0x55b6823f01b8
[1] libfabric:1035542:1669680299::ofi_rxm:cq:rxm_handle_recv_comp():789<debug> Got TAGGED op
[1] libfabric:1035542:1669680299::ofi_rxm:cq:rxm_cq_write():866<debug> Reporting FI_RECV, FI_TAGGED, FI_REMOTE_CQ_DATA completion
[1] libfabric:1035542:1669680299::tcp:ep_data:tcpx_update_rx_iov():233<warn> dynamically provided rbuf is too small
[1] libfabric:1035542:1669680299::tcp:ep_data:tcpx_process_recv():279<warn> msg recv failed ret = -265 (Truncation error)
[1] libfabric:1035542:1669680299::ofi_rxm:cq:rxm_handle_comp_error():1725<warn> fi_cq_readerr: err: Truncation error (265), prov_err: Operation now in progress (115)
[1] OK
[1] libfabric:1035542:1669680299::ofi_rxm:ep_ctrl:rxm_process_shutdown():761<info> shutdown conn 0x55b6823f01b8 (state 3)
[1] libfabric:1035542:1669680299::ofi_rxm:ep_ctrl:rxm_close_conn():64<debug> closing conn 0x55b6823f01b8
[1] libfabric:1035542:1669680299::tcp:fabric:ofi_wait_del_fd():219<info> Given fd (24) not found in wait list - 0x55b68239fda0
[1] libfabric:1035542:1669680299::ofi_rxm:ep_ctrl:rxm_free_conn():349<debug> free conn 0x55b6823f01b8
[mpiexec@tiger] APPLICATION TIMED OUT, TIMEOUT = 30s

Some tracing shows that tcpx_ep_disable was called after processing the truncation error.

To Reproduce
Build MPICH and run the MPICH testsuite test test/mpi/errors/pt2pt/truncmsg1.c.
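
For reference, a minimal stand-alone sketch of the same scenario (illustrative only, not the actual truncmsg1.c source):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char sbuf[8] = {0}, rbuf[4];
    int rank, err, class;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Return errors instead of aborting so the truncation is observable. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) {
        /* P0 sends 8 bytes. */
        MPI_Send(sbuf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* P1 posts a 4-byte receive: expect MPI_ERR_TRUNCATE. */
        err = MPI_Recv(rbuf, 4, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Error_class(err, &class);
        if (class != MPI_ERR_TRUNCATE)
            printf("expected MPI_ERR_TRUNCATE, got class %d\n", class);
    }

    /* Later communication: with the bug, rank 1 hangs here because the
     * underlying tcp connection was torn down. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("No Errors\n");
    MPI_Finalize();
    return 0;
}
```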

Expected behavior
The test should print "No Errors".

Output
Hang

Environment:
Linux (Ubuntu 20.04)

Additional:
This was tested with libfabric v1.15.2.

The provider is designed to tear down the connection on a receive side truncation error. The provider should try to reconnect on a subsequent send, but any data sent over the old connection is obviously lost.

We can look at treating truncation errors differently, which requires flushing the truncated data from the tcp stream. You may also be able to work around this by disabling 'dynamic receive buffering', but that comes with a performance impact: all data is copied through bounce buffers.
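
For context, at the libfabric level the truncation surfaces as a CQ error entry. A minimal sketch of how a consumer reads it, assuming receive completions are reported on a CQ handle named cq (error handling trimmed):

```c
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static void poll_one(struct fid_cq *cq)
{
    struct fi_cq_tagged_entry comp;
    struct fi_cq_err_entry err_entry = {0};
    ssize_t ret;

    /* Spin until a completion (or completion error) is available. */
    do {
        ret = fi_cq_read(cq, &comp, 1);
    } while (ret == -FI_EAGAIN);

    if (ret == -FI_EAVAIL) {
        /* The operation completed in error; fetch the error entry. */
        fi_cq_readerr(cq, &err_entry, 0);
        if (err_entry.err == FI_ETRUNC) {
            /* olen is the number of bytes that did not fit in the
             * posted receive buffer. */
            printf("truncated recv: %zu bytes overflowed: %s\n",
                   err_entry.olen, fi_strerror(err_entry.err));
        }
    }
}
```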

Thanks, that explains it. Yes, flushing the truncated data seems sensible.

This has been addressed in the latest main for the tcp provider. Truncation errors are reported up to the application, but the tcp stream is kept active. The MPICH testsuite truncation tests failed prior to the changes, but now pass.

Note that for MPI, starting with v1.18, the recommendation is to use tcp by itself without rxm.

👍