DLTcollab / dcurl

Hardware-accelerated Multi-threaded IOTA PoW, drop-in replacement for ccurl

Obsolete Performance information

jserv opened this issue

With the integration of AVX acceleration, the thread pool, and the various other tweaks we have made, the diagram and descriptions in the "performance of attachToTangle" section of README.md are out of date. We should generate new material to reflect the recent changes.

Expected output:

  1. Apply AVX2-accelerated PoW for attachToTangle performance gain, on Intel Xeon E5;
  2. Apply AVX1-accelerated PoW for attachToTangle performance gain, on AMD Ryzen Threadripper;
  3. Compare vanilla IRI, SSE-accelerated IRI, and AVX-accelerated IRI for the above machines;

The original performance experiment is based on:

  • Each sample is measured with 30 transaction trytes, and 200 samples are taken in total.
  • MWM = 14, with 26 CPU threads used to find the nonce
  • Settings: 2 PoW tasks enabled on the CPU and 1 PoW task on the GPU at the same time
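
For reference, here is a minimal sketch of how such a sampling could be scripted against an IRI node's HTTP attachToTangle API; the node URL and the all-'9' transaction trytes are placeholders, not the data used in the original experiment:

```python
import json
import time
import urllib.request

IRI_URL = "http://localhost:14265"   # placeholder: the IRI node under test

def attach_to_tangle(trunk, branch, trytes, mwm=14):
    """Time one attachToTangle call, i.e. PoW for every transaction in `trytes`."""
    payload = json.dumps({
        "command": "attachToTangle",
        "trunkTransaction": trunk,
        "branchTransaction": branch,
        "minWeightMagnitude": mwm,
        "trytes": trytes,
    }).encode()
    req = urllib.request.Request(
        IRI_URL, data=payload,
        headers={"Content-Type": "application/json", "X-IOTA-API-Version": "1"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        json.load(resp)
    return time.monotonic() - start

# 200 samples, each attaching 30 placeholder transaction trytes (2673 trytes each)
EMPTY_TX = "9" * 2673
samples = [attach_to_tangle("9" * 81, "9" * 81, [EMPTY_TX] * 30) for _ in range(200)]
print("mean attachToTangle time: %.2f s" % (sum(samples) / len(samples)))
```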

I would like to change the settings to

  • Enable CPU only
  • Enable remote FPGA boards only

And about the expected output

1. Apply AVX2-accelerated PoW for attachToTangle performance gain, on Intel Xeon E5;
2. Apply AVX1-accelerated PoW for attachToTangle performance gain, on AMD Ryzen Threadripper;

Do we still have to run dcurl on the different hardware?

The third request

3. Compare vanilla IRI, SSE-accelerated IRI, and AVX-accelerated IRI for the above machines;

would add another comparison for remote FPGA boards.

Experiment environment:

  • Hardware: node1
  • Connection: IRI and the RabbitMQ broker are located on the same machine, and the connections to the remote workers are on the local network
  • Input data: 200 transaction bundles, each containing 2 transactions; every transaction is unique
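
To illustrate how that input could be driven, here is a rough sketch (reusing the hypothetical attach_to_tangle helper from the earlier sketch; the all-'9' two-transaction bundles are placeholders) that submits bundles in parallel so their PoW requests compete inside IRI:

```python
from concurrent.futures import ThreadPoolExecutor

# 200 bundles, each with 2 placeholder transactions; submitting them from a thread
# pool makes the bundles compete for PoW resources, as in the DLTcollab IRI setup.
bundles = [["9" * 2673, "9" * 2673] for _ in range(200)]

with ThreadPoolExecutor(max_workers=16) as pool:
    times = list(pool.map(lambda b: attach_to_tangle("9" * 81, "9" * 81, b), bundles))

for i, t in enumerate(times, 1):
    print(f"bundle {i:3d}: {t:.2f} s")
```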

Result:
[chart: bundle]

Explanation:

  • The two IRI versions handle attachToTangle() differently:

    | IRI version   | attachToTangle() behavior                       | Effect                                                                      |
    | ------------- | ----------------------------------------------- | --------------------------------------------------------------------------- |
    | IOTA IRI      | one transaction bundle at a time (synchronized) | transactions of the bundle are calculated one by one                        |
    | DLTcollab IRI | multiple transaction bundles at the same time   | transactions of different bundles compete for the PoW calculation resources |

    That is why the execution time of IOTA IRI looks better at the beginning.

  • SSE behaves better than AVX
    This is kind of weird since AVX should be faster than SSE.
    One possible reason is the bundle competition.
    I will do the experiment and check the result again.

Since we have the competition factor in IRI, maybe we should add another performance graph showing the PoW time of each transaction to illustrate dcurl's acceleration power?

Can we conclude that one FPGA cluster consisting of 4 nodes is good enough for accelerating transactions? Compared to PoWsrv, it eliminates network latency and security risks while remaining efficient.

> Can we conclude that one FPGA cluster consisting of 4 nodes is good enough for accelerating transactions? Compared to PoWsrv, it eliminates network latency and security risks while remaining efficient.

Yes.
However, we must be aware that the network latency can be longer if IRI and the RabbitMQ broker are on different machines and the connections to the remote workers are not on the local network.

> - **SSE** behaves better than **AVX**
>   This is kind of weird since **AVX** should be faster than **SSE**.
>   One possible reason is the bundle competition.
>   I will do the experiment and check the result again.

I did the experiment again, and the result is still the same:
the SSE version is faster than the AVX version for multiple attachToTangle() API calls.
@jserv, could this be caused by the characteristics of the registers?

About this issue, I will render a better performance chart and send a pull request to replace the old one.
After that, I think we can close the issue.

I would like to clarify the AVX performance. Do both AVX1 and AVX2 behave worse than the SSE implementation on Xeon E5?

AVX is tricky, since its behavior depends on the microarchitecture of Intel and AMD CPUs.

I will do the experiment on node10 instead of node1 to test the microarchitecture difference.

The experiment was run on node10 with an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz.
Note that this environment is a virtual machine.
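
As a sanity check for runs inside a virtual machine, here is a small Linux-only sketch that reads /proc/cpuinfo to confirm which SIMD extensions the guest actually sees (a hypervisor may hide extensions the host CPU supports):

```python
def cpu_flags(path="/proc/cpuinfo"):
    """Return the feature-flag set reported by the first CPU entry."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for ext in ("sse2", "sse4_2", "avx", "avx2"):
    print(f"{ext:7s} {'yes' if ext in flags else 'no'}")
```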

[chart: bundle]

I do not know the reason for the sudden drop in execution time around the 130th~140th transaction bundle.
However, from the graph we can see that the SSE version is the slowest and the AVX version is the fastest, even compared with the AVX2 version.


There is one thing I need to mention:
the previous performance chart used AVX2 instead of AVX.
[chart: bundle]


I suspect that our AVX version is somehow faster than the AVX2 version, but this is not well tested.
The microarchitecture of the CPU does affect the performance.

I think the last performance measurement should still run on node1 with the AMD Ryzen Threadripper 2990WX 32-core processor, since the remote worker is only connected to node1.
The comparison of AVX and AVX2 will be shown on the graph.

I have re-visualized the previous experiment result as the chart below.

[chart: bundle]
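
For reference, a minimal matplotlib sketch of how such a chart could be regenerated from raw per-bundle timings; the CSV file name and its column layout (one column of execution times per implementation) are assumptions, not the actual experiment artifacts:

```python
import csv
import matplotlib.pyplot as plt

# Assumed layout of bundle_times.csv: a header row naming each implementation
# (e.g. "SSE", "AVX", "AVX2") and one execution time per bundle in each column.
series = {}
with open("bundle_times.csv") as f:
    for row in csv.DictReader(f):
        for impl, value in row.items():
            series.setdefault(impl, []).append(float(value))

for impl, times in series.items():
    plt.plot(range(1, len(times) + 1), times, label=impl)

plt.xlabel("transaction bundle index")
plt.ylabel("attachToTangle execution time (s)")
plt.legend()
plt.savefig("bundle.png", dpi=150)
```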

If the chart is acceptable, I will add some descriptions and replace the chart in README.md via a pull request, then close the issue.

Go ahead, along with descriptions of the experiments and the conclusion.