Bad throughput performance in flush-to-persistent
xywoo-cs opened this issue · comments
I have the same perf problem as #1143 (comment) ...
Basically, I just edited "05-flush-to-persistent" so that clients keep writing data to remote PM.
Each request is a 1KB sequential write and the working set size is 1GB, but the resulting throughput is only ~80MB/s...
config:
- 1 * 128GB PM
- DDIO disabled
- ConnectX-4 100Gbps
Pseudocode:
- server:

```c
...
rpma_peer_cfg_new(&pcfg);
rpma_peer_cfg_set_direct_write_to_pmem(pcfg, true);
server_peer_via_address(addr, &peer);
rpma_ep_listen(peer, addr, port, &ep);
rpma_mr_reg(peer, mr_ptr, mr_size,
		RPMA_MR_USAGE_WRITE_DST | RPMA_MR_USAGE_READ_SRC |
		RPMA_MR_USAGE_FLUSH_TYPE_PERSISTENT,
		&mr);
rpma_mr_get_descriptor_size(mr, &mr_desc_size);
rpma_peer_cfg_get_descriptor_size(pcfg, &pcfg_desc_size);
rpma_mr_get_descriptor(mr, &data.descriptors[0]);
rpma_peer_cfg_get_descriptor(pcfg, &data.descriptors[mr_desc_size]);
server_accept_connection(ep, NULL, &pdata, &conn);
common_wait_for_conn_close_and_disconnect(&conn);
...
```
- client:

```c
mr_ptr = malloc_aligned(ONE_GB);
client_peer_via_address(addr, &peer);
client_connect(peer, addr, port, NULL, NULL, &conn);
rpma_mr_reg(peer, mr_ptr, ONE_GB, RPMA_MR_USAGE_WRITE_SRC, &src_mr);
rpma_conn_get_private_data(conn, &pdata);
rpma_peer_cfg_from_descriptor(
		&dst_data->descriptors[dst_data->mr_desc_size],
		dst_data->pcfg_desc_size, &pcfg);
rpma_peer_cfg_get_direct_write_to_pmem(pcfg, &direct_write_to_pmem);
rpma_conn_apply_remote_peer_cfg(conn, pcfg);
rpma_peer_cfg_delete(&pcfg);
rpma_mr_remote_from_descriptor(&dst_data->descriptors[0],
		dst_data->mr_desc_size, &dst_mr);
rpma_mr_remote_get_size(dst_mr, &dst_size);
for (remote_offset = 0; remote_offset < ONE_GB; remote_offset += len) {
	rpma_write(conn, dst_mr, remote_offset, src_mr, 0, len,
			RPMA_F_COMPLETION_ON_ERROR, NULL);
	rpma_flush(conn, dst_mr, remote_offset, len, flush_type,
			RPMA_F_COMPLETION_ALWAYS, FLUSH_ID);
	rpma_cq_wait(cq);
	rpma_cq_get_wc(cq, 1, &wc, NULL);
	...
}
```
I have no idea how to fix this... Thank you so much.
To get the best performance, please avoid rpma_cq_wait().
See the RPMA fio engine implementation:
https://github.com/axboe/fio/blob/6f1a24593c227a4f392f454698aca20e95f0006c/engines/librpma_gpspm.c#L678
GPSPM mode without busy-wait polling manifests performance degradation.
https://github.com/pmem/rpma/releases/download/0.10.0/RPMA_Perf_report_CLX_MLX_CentOS8.2_DEVDAX.pdf
Many thanks for the prompt reply.
So for the APM mode I used, is the reason for this poor performance that the queue size or IO depth is too small?
Try to remove rpma_cq_wait(cq); also from the client side and use a busy-wait polling mechanism instead:

```c
do {
	ret = rpma_cq_get_wc(cq, 1, &wc, NULL);
} while (ret == RPMA_E_NO_COMPLETION);
if (ret)
	goto error_handling;
```

You can also increase the len value and the IO depth to flush only every n write operations.
A bit of confusion here...
Since APM (ARRM) uses a following read to ensure persistence, this procedure should be a synchronous write (just like writing data to the cache and then flushing the cache with local PM).
So does removing rpma_cq_wait(cq) make APM writes async?
rpma_cq_wait() uses a kernel mechanism to sleep until a system event is generated on completion arrival. It saves some CPU power/time but increases the overall operation latency.
To get the best performance, only rpma_cq_get_wc() is needed for completion handling.
Hi @xywoo-cs ,
have you managed to fix your example performance?
Sorry to reply so late. I tried avoiding rpma_cq_wait() and still couldn't get the expected performance.
But thanks a lot for your help!!