Bad throughput performance in flush-to-persistent
xywoo-cs opened this issue · comments
I have the same perf problem as #1143 (comment) ...
Basically, I just edited "05-flush-to-persistent" so that clients keep writing data to remote PM.
Each request is a 1KB sequential write and the working set size is 1GB, but the resulting throughput is only ~80MB/s...
config:
- 1 * 128GB PM
- DDIO disabled
- ConnectX-4 100Gbps
Pseudocode:
- server:

```c
...
rpma_peer_cfg_new(&pcfg);
rpma_peer_cfg_set_direct_write_to_pmem(pcfg, true);
server_peer_via_address(addr, &peer);
rpma_ep_listen(peer, addr, port, &ep);
rpma_mr_reg(peer, mr_ptr, mr_size,
		RPMA_MR_USAGE_WRITE_DST | RPMA_MR_USAGE_READ_SRC |
		RPMA_MR_USAGE_FLUSH_TYPE_PERSISTENT,
		&mr);
rpma_mr_get_descriptor_size(mr, &mr_desc_size);
rpma_peer_cfg_get_descriptor_size(pcfg, &pcfg_desc_size);
rpma_mr_get_descriptor(mr, &data.descriptors[0]);
rpma_peer_cfg_get_descriptor(pcfg, &data.descriptors[mr_desc_size]);
server_accept_connection(ep, NULL, &pdata, &conn);
common_wait_for_conn_close_and_disconnect(&conn);
...
```
- client:

```c
mr_ptr = malloc_aligned(ONE_GB);
client_peer_via_address(addr, &peer);
client_connect(peer, addr, port, NULL, NULL, &conn);
rpma_mr_reg(peer, mr_ptr, ONE_GB, RPMA_MR_USAGE_WRITE_SRC, &src_mr);
rpma_conn_get_private_data(conn, &pdata);
rpma_peer_cfg_from_descriptor(
		&dst_data->descriptors[dst_data->mr_desc_size],
		dst_data->pcfg_desc_size, &pcfg);
rpma_peer_cfg_get_direct_write_to_pmem(pcfg, &direct_write_to_pmem);
rpma_conn_apply_remote_peer_cfg(conn, pcfg);
rpma_peer_cfg_delete(&pcfg);
rpma_mr_remote_from_descriptor(&dst_data->descriptors[0],
		dst_data->mr_desc_size, &dst_mr);
rpma_mr_remote_get_size(dst_mr, &dst_size);
for (remote_offset = 0; remote_offset < ONE_GB; remote_offset += len) {
	rpma_write(conn, dst_mr, remote_offset, src_mr, 0, len,
			RPMA_F_COMPLETION_ON_ERROR, NULL);
	rpma_flush(conn, dst_mr, remote_offset, len, flush_type,
			RPMA_F_COMPLETION_ALWAYS, FLUSH_ID);
	rpma_cq_wait(cq);
	rpma_cq_get_wc(cq, 1, &wc, NULL);
	...
}
```
I have no idea how to fix this... Thank you so much.
To get the best performance, please avoid rpma_cq_wait().
See the RPMA fio engine implementation:
https://github.com/axboe/fio/blob/6f1a24593c227a4f392f454698aca20e95f0006c/engines/librpma_gpspm.c#L678
GPSPM mode without busy-wait polling manifests performance degradation.
https://github.com/pmem/rpma/releases/download/0.10.0/RPMA_Perf_report_CLX_MLX_CentOS8.2_DEVDAX.pdf
Many thanks for the prompt reply.
So for the APM mode I used, is the reason for this poor performance that the queue size or IO depth is too small?
Try to remove rpma_cq_wait(cq); also from the client side and use a busy-wait polling mechanism instead:

```c
do {
	ret = rpma_cq_get_wc(cq, 1, &wc, NULL);
} while (ret == RPMA_E_NO_COMPLETION);
if (ret)
	goto error_handling;
```

You can also increase the len value and the IO depth to flush only every n write operations.
A bit of confusion here...
Since APM (ARRM) uses a following read to ensure persistence, this procedure should be a synchronous write (just like writing data to the cache and then flushing the cache with local PM).
So does removing rpma_cq_wait(cq) make APM writes async?
rpma_cq_wait() uses a kernel mechanism to sleep until a system event is generated on completion arrival. It saves some CPU power/time but increases the overall operation latency.
To get the best performance, only rpma_cq_get_wc() is needed for completion handling.
Hi @xywoo-cs ,
have you managed to fix your example performance?
Sorry to reply so late. I tried avoiding rpma_cq_wait() and still couldn't get the expected performance.
But thanks a lot for your help!!