"Failed to pull frames" when using multiple DPNIs
mcbridematt opened this issue
Hardware: Ten64
MC firmware: 10.20
Commit: 6efa7d1
When more than one interface / DPNI is transferring data, the following errors appear in the system console / dmesg:
dpaa2_ni0: failed to pull frames: chan_id=15, error=16
dpaa2_ni1: failed to pull frames: chan_id=23, error=16
dpaa2_ni1: failed to pull frames: chan_id=23, error=16
dpaa2_ni1: failed to pull frames: chan_id=23, error=16
dpaa2_ni0: failed to pull frames: chan_id=15, error=16
dpaa2_ni1: failed to pull frames: chan_id=23, error=16
dpaa2_ni0: failed to pull frames: chan_id=15, error=16
dpaa2_ni1: failed to pull frames: chan_id=23, error=16
dpaa2_ni0: failed to pull frames: chan_id=15, error=16
dpaa2_ni1: failed to pull frames: chan_id=23, error=16
dpaa2_ni1: failed to pull frames: chan_id=23, error=16
dpaa2_ni0: failed to pull frames: chan_id=15, error=16
dpaa2_ni1: failed to pull frames: chan_id=23, error=16
dpaa2_ni1: failed to pull frames: chan_id=23, error=16
An example use case is when the system is being used as a router between two network interfaces.
I don't see any evidence of packet loss, which is good.
This message is printed by dpaa2_ni_poll_task, around line 2367 of freebsd-src/sys/dev/dpaa2/dpaa2_ni.c (lines 2364 to 2370 in 6efa7d1).
It seems these errors appear when different DPNIs share the same DPIO (struct dpaa2_swp). The driver keeps the software portal busy executing a Volatile Dequeue command for too long, i.e.
/* Make VDQ command available again. */
atomic_xchg(&swp->vdq.avail, 1);
is set too late, I think.
EDIT: It's my guess though. I'll check it and prepare a patch.
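To illustrate the guess above: error=16 is EBUSY in FreeBSD's errno.h, which would be consistent with a second caller finding the VDQ flag still cleared. The following is a minimal user-space sketch of that pattern using C11 atomics. The struct portal type and the portal_* functions are hypothetical names made up for the illustration. They are not the driver's struct dpaa2_swp or its API.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a software portal shared by several DPNIs. */
struct portal {
	atomic_int vdq_avail;	/* 1 = no Volatile Dequeue command in flight */
};

static void
portal_init(struct portal *p)
{
	atomic_init(&p->vdq_avail, 1);
}

/*
 * Try to claim the single VDQ slot. Returns 0 on success, or EBUSY when
 * another caller still holds it.
 */
static int
portal_pull_begin(struct portal *p)
{
	int expected = 1;

	if (!atomic_compare_exchange_strong(&p->vdq_avail, &expected, 0))
		return (EBUSY);
	return (0);
}

/* Make the VDQ slot available again. */
static void
portal_pull_end(struct portal *p)
{
	atomic_exchange(&p->vdq_avail, 1);
}
```

In this model, the longer the gap between portal_pull_begin() and portal_pull_end(), the more likely a second interface sharing the portal is to get EBUSY.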
@mcbridematt Could you try the latest commit? I no longer see this error as of efe105c.
@dsalychev Yes, no more errors after updating the kernel. I'll move some devices behind this machine and see how it goes.
FYI, for a system that did 4.9TB of traffic over 14 hours I still got a few warnings in dmesg:
dmesg | grep 'failed to pull frames' | wc -l
590
590 / 5TB is a very small rate, but I don't know enough to judge how important the warning message is.
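For scale, 590 warnings over 14 hours is roughly 42 per hour, or about one warning per 8.3 GB transferred. Below is a small sketch of that arithmetic, assuming 1 TB = 10^12 bytes. The helper names are made up for the illustration.

```c
/* Average number of bytes transferred per logged warning. */
static double
bytes_per_warning(double total_bytes, unsigned long warnings)
{
	return (warnings == 0 ? 0.0 : total_bytes / (double)warnings);
}

/* Average number of warnings logged per hour of runtime. */
static double
warnings_per_hour(unsigned long warnings, double hours)
{
	return (hours <= 0.0 ? 0.0 : (double)warnings / hours);
}
```

Calling bytes_per_warning(4.9e12, 590) and warnings_per_hour(590, 14.0) gives the figures quoted above.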
Could you show the output of sysctl dev.dpaa2_ni.0 (for dpni0) for each of the interfaces that reported the errors? I'm particularly interested in
dev.dpaa2_ni.0.stats.in_discarded_frames: 18
dev.dpaa2_ni.0.stats.in_nobuf_discards: 0
After a 1hr iperf run that logged around 70 "failed to pull frames" messages, none of the interfaces had discards:
dev.dpaa2_ni.1.stats.in_all_frames: 75934381
dev.dpaa2_ni.1.stats.in_all_bytes: 5011725630
dev.dpaa2_ni.1.stats.in_multi_frames: 0
dev.dpaa2_ni.1.stats.eg_all_frames: 157009142
dev.dpaa2_ni.1.stats.eg_all_bytes: 237634857392
dev.dpaa2_ni.1.stats.eg_multi_frames: 0
dev.dpaa2_ni.1.stats.in_filtered_frames: 0
dev.dpaa2_ni.1.stats.in_discarded_frames: 0
dev.dpaa2_ni.1.stats.in_nobuf_discards: 0
dev.dpaa2_ni.1.stats.tx_sg_frames: 157009292
dev.dpaa2_ni.1.stats.tx_single_buf_frames: 0
dev.dpaa2_ni.1.stats.rx_ieoi_err_frames: 0
dev.dpaa2_ni.1.stats.rx_enq_rej_frames: 0
dev.dpaa2_ni.1.stats.rx_sg_buf_frames: 0
dev.dpaa2_ni.1.stats.rx_single_buf_frames: 75934511
dev.dpaa2_ni.1.stats.rx_anomaly_frames: 0
dev.dpaa2_ni.1.channels.7.tx_dropped: 0
dev.dpaa2_ni.1.channels.7.tx_frames: 0
dev.dpaa2_ni.1.channels.6.tx_dropped: 0
dev.dpaa2_ni.1.channels.6.tx_frames: 0
dev.dpaa2_ni.1.channels.5.tx_dropped: 0
dev.dpaa2_ni.1.channels.5.tx_frames: 0
dev.dpaa2_ni.1.channels.4.tx_dropped: 0
dev.dpaa2_ni.1.channels.4.tx_frames: 0
dev.dpaa2_ni.1.channels.3.tx_dropped: 0
dev.dpaa2_ni.1.channels.3.tx_frames: 0
dev.dpaa2_ni.1.channels.2.tx_dropped: 0
dev.dpaa2_ni.1.channels.2.tx_frames: 0
dev.dpaa2_ni.1.channels.1.tx_dropped: 0
dev.dpaa2_ni.1.channels.1.tx_frames: 0
dev.dpaa2_ni.1.channels.0.tx_dropped: 0
dev.dpaa2_ni.1.channels.0.tx_frames: 157009437
dev.dpaa2_ni.1.%parent: dpaa2_rc0
dev.dpaa2_ni.1.%pnpinfo:
dev.dpaa2_ni.1.%location:
dev.dpaa2_ni.1.%driver: dpaa2_ni
dev.dpaa2_ni.1.%desc: DPAA2 Network Interface
dev.dpaa2_ni.2.stats.in_all_frames: 165393160
dev.dpaa2_ni.2.stats.in_all_bytes: 250312613260
dev.dpaa2_ni.2.stats.in_multi_frames: 0
dev.dpaa2_ni.2.stats.eg_all_frames: 48486702
dev.dpaa2_ni.2.stats.eg_all_bytes: 3200223070
dev.dpaa2_ni.2.stats.eg_multi_frames: 0
dev.dpaa2_ni.2.stats.in_filtered_frames: 0
dev.dpaa2_ni.2.stats.in_discarded_frames: 0
dev.dpaa2_ni.2.stats.in_nobuf_discards: 0
dev.dpaa2_ni.2.stats.tx_sg_frames: 48486702
dev.dpaa2_ni.2.stats.tx_single_buf_frames: 0
dev.dpaa2_ni.2.stats.rx_ieoi_err_frames: 0
dev.dpaa2_ni.2.stats.rx_enq_rej_frames: 0
dev.dpaa2_ni.2.stats.rx_sg_buf_frames: 0
dev.dpaa2_ni.2.stats.rx_single_buf_frames: 165392672
dev.dpaa2_ni.2.stats.rx_anomaly_frames: 0
dev.dpaa2_ni.2.channels.7.tx_dropped: 0
dev.dpaa2_ni.2.channels.7.tx_frames: 0
dev.dpaa2_ni.2.channels.6.tx_dropped: 0
dev.dpaa2_ni.2.channels.6.tx_frames: 0
dev.dpaa2_ni.2.channels.5.tx_dropped: 0
dev.dpaa2_ni.2.channels.5.tx_frames: 0
dev.dpaa2_ni.2.channels.4.tx_dropped: 0
dev.dpaa2_ni.2.channels.4.tx_frames: 0
dev.dpaa2_ni.2.channels.3.tx_dropped: 0
dev.dpaa2_ni.2.channels.3.tx_frames: 0
dev.dpaa2_ni.2.channels.2.tx_dropped: 0
dev.dpaa2_ni.2.channels.2.tx_frames: 0
dev.dpaa2_ni.2.channels.1.tx_dropped: 0
dev.dpaa2_ni.2.channels.1.tx_frames: 0
dev.dpaa2_ni.2.channels.0.tx_dropped: 0
dev.dpaa2_ni.2.channels.0.tx_frames: 48486702
(This is with the buffer commits reverted: 19d8245, 846462f, 48d302a)
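With output this long, the two discard counters are easy to miss. A minimal sketch of picking them out of saved sysctl output, one "name: value" line at a time, might look like the following. The function names are hypothetical, and the code only parses text. It does not query the kernel.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the value of a "name: value" line if the name ends with the
 * given suffix, or -1 if the line does not match.
 */
static long long
counter_value(const char *line, const char *suffix)
{
	const char *colon = strchr(line, ':');
	size_t suffix_len = strlen(suffix);

	if (colon == NULL || (size_t)(colon - line) < suffix_len)
		return (-1);
	if (strncmp(colon - suffix_len, suffix, suffix_len) != 0)
		return (-1);
	return (strtoll(colon + 1, NULL, 10));
}

/* Sum in_discarded_frames and in_nobuf_discards over all given lines. */
static long long
total_discards(const char *const lines[], size_t n)
{
	long long total = 0, v;
	size_t i;

	for (i = 0; i < n; i++) {
		if ((v = counter_value(lines[i], ".in_discarded_frames")) > 0)
			total += v;
		if ((v = counter_value(lines[i], ".in_nobuf_discards")) > 0)
			total += v;
	}
	return (total);
}
```

A total of zero across all interfaces matches what is reported above for the 1hr iperf run.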
This is good news. I'll try to prepare some debug code to check whether those frames were processed at all, rather than dropped silently after dpaa2_swp_pull() returned an error.
@mcbridematt Could you test with 1a7aba9?
@dsalychev I now see a few 'timeout to consume frames' errors as well. Is that expected?
dpaa2_ni0: dpaa2_ni_poll_task: failed to pull frames: chan_id=16, error=16
dpaa2_ni0: dpaa2_ni_poll_task: failed to pull frames: chan_id=23, error=16
dpaa2_ni0: dpaa2_ni_poll_task: failed to pull frames: chan_id=23, error=16
dpaa2_ni0: dpaa2_ni_poll_task: timeout to consume frames: chan_id=23
dpaa2_ni1: dpaa2_ni_poll_task: failed to pull frames: chan_id=4, error=16
dpaa2_ni0: dpaa2_ni_poll_task: failed to pull frames: chan_id=16, error=16
dpaa2_ni0: dpaa2_ni_poll_task: failed to pull frames: chan_id=23, error=16
dpaa2_ni1: dpaa2_ni_poll_task: timeout to consume frames: chan_id=24
dpaa2_ni0: dpaa2_ni_poll_task: timeout to consume frames: chan_id=23
dpaa2_ni1: dpaa2_ni_poll_task: failed to pull frames: chan_id=4, error=16
dpaa2_ni0: dpaa2_ni_poll_task: failed to pull frames: chan_id=16, error=16
sysctls:
dev.dpaa2_ni.0.stats.in_all_frames: 33739237
dev.dpaa2_ni.0.stats.in_all_bytes: 2227163082
dev.dpaa2_ni.0.stats.in_multi_frames: 0
dev.dpaa2_ni.0.stats.eg_all_frames: 76976026
dev.dpaa2_ni.0.stats.eg_all_bytes: 116515198666
dev.dpaa2_ni.0.stats.eg_multi_frames: 0
dev.dpaa2_ni.0.stats.in_filtered_frames: 0
dev.dpaa2_ni.0.stats.in_discarded_frames: 0
dev.dpaa2_ni.0.stats.in_nobuf_discards: 0
dev.dpaa2_ni.0.stats.buf_free: 0
dev.dpaa2_ni.0.stats.buf_num: 2800
dev.dpaa2_ni.0.stats.tx_sg_frames: 76976026
dev.dpaa2_ni.0.stats.tx_single_buf_frames: 0
dev.dpaa2_ni.0.stats.rx_ieoi_err_frames: 0
dev.dpaa2_ni.0.stats.rx_enq_rej_frames: 0
dev.dpaa2_ni.0.stats.rx_sg_buf_frames: 0
dev.dpaa2_ni.0.stats.rx_single_buf_frames: 33739234
dev.dpaa2_ni.0.stats.rx_anomaly_frames: 0
...
dev.dpaa2_ni.0.channels.0.tx_frames: 76976026
dev.dpaa2_ni.1.stats.in_all_frames: 32743170
dev.dpaa2_ni.1.stats.in_all_bytes: 2161390320
dev.dpaa2_ni.1.stats.in_multi_frames: 0
dev.dpaa2_ni.1.stats.eg_all_frames: 75728550
dev.dpaa2_ni.1.stats.eg_all_bytes: 114619322702
dev.dpaa2_ni.1.stats.eg_multi_frames: 0
dev.dpaa2_ni.1.stats.in_filtered_frames: 0
dev.dpaa2_ni.1.stats.in_discarded_frames: 0
dev.dpaa2_ni.1.stats.in_nobuf_discards: 0
dev.dpaa2_ni.1.stats.buf_free: 0
dev.dpaa2_ni.1.stats.buf_num: 2800
dev.dpaa2_ni.1.stats.tx_sg_frames: 75728550
dev.dpaa2_ni.1.stats.tx_single_buf_frames: 0
dev.dpaa2_ni.1.stats.rx_ieoi_err_frames: 0
dev.dpaa2_ni.1.stats.rx_enq_rej_frames: 0
dev.dpaa2_ni.1.stats.rx_sg_buf_frames: 0
dev.dpaa2_ni.1.stats.rx_single_buf_frames: 32743170
dev.dpaa2_ni.1.stats.rx_anomaly_frames: 0
...
dev.dpaa2_ni.1.channels.0.tx_frames: 75728550
@mcbridematt
I've an experimental branch: https://github.com/mcusim/freebsd-src/tree/ten64
Could you try running a stress test? I've been fighting another panic (Undefined instruction: ..., panic: Unknown kernel exception 0 esr_el1 2000000), and my Ten64 survived last night under a stress test. I wonder whether it also helps with the frame consumption issues.
Not seen on commit a85d6c9