DLTcollab / dcurl

Hardware-accelerated Multi-threaded IOTA PoW, drop-in replacement for ccurl

Obsolete Performance information

jserv opened this issue

With the integration of AVX acceleration, the thread pool, and the various other tweaks we have made, the diagram and descriptions in the "performance of attachToTangle" section of README.md are out of date. We should generate new material to reflect the recent changes.

Expected output:

  1. Apply AVX2-accelerated PoW for attachToTangle performance gain, on Intel Xeon E5;
  2. Apply AVX1-accelerated PoW for attachToTangle performance gain, on AMD Ryzen Threadripper;
  3. Compare vanilla IRI, SSE-accelerated IRI, and AVX-accelerated IRI for the above machines;

The original performance experiment is based on:

  • Each sample is measured with 30 transaction trytes, and 200 samples are taken in total.
  • MWM = 14, with 26 CPU threads used to find the nonce
  • Settings: 2 PoW tasks enabled on the CPU and 1 PoW task on the GPU at the same time
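
For reference, here is a minimal sketch of how such a sampling could be scripted against an IRI node's HTTP attachToTangle API; the node URL and the all-'9' transaction trytes are placeholders, not the data used in the original experiment:

```python
import json
import time
import urllib.request

IRI_URL = "http://localhost:14265"   # placeholder: the IRI node under test

def attach_to_tangle(trunk, branch, trytes, mwm=14):
    """Time one attachToTangle call, i.e. PoW for every transaction in `trytes`."""
    payload = json.dumps({
        "command": "attachToTangle",
        "trunkTransaction": trunk,
        "branchTransaction": branch,
        "minWeightMagnitude": mwm,
        "trytes": trytes,
    }).encode()
    req = urllib.request.Request(
        IRI_URL, data=payload,
        headers={"Content-Type": "application/json", "X-IOTA-API-Version": "1"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        json.load(resp)
    return time.monotonic() - start

# 200 samples, each attaching 30 placeholder transaction trytes (2673 trytes each)
EMPTY_TX = "9" * 2673
samples = [attach_to_tangle("9" * 81, "9" * 81, [EMPTY_TX] * 30) for _ in range(200)]
print("mean attachToTangle time: %.2f s" % (sum(samples) / len(samples)))
```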

I would like to change the settings to

  • Enable CPU only
  • Enable remote FPGA boards only

And about the expected output

1. Apply AVX2-accelerated PoW for attachToTangle performance gain, on Intel Xeon E5;
2. Apply AVX1-accelerated PoW for attachToTangle performance gain, on AMD Ryzen Threadripper;

Do we still have to run dcurl on the different hardware?

The third request

3. Compare vanilla IRI, SSE-accelerated IRI, and AVX-accelerated IRI for the above machines;

would add another comparison for remote FPGA boards.

Experiment environment:

  • Hardware: node1
  • Connection: IRI and the RabbitMQ broker are located on the same machine, and the connections to the remote workers are on the local network
  • Input data: 200 transaction bundles, each containing 2 transactions; every transaction is unique
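
To illustrate how that input could be driven, here is a rough sketch (reusing the hypothetical attach_to_tangle helper from the earlier sketch; the all-'9' two-transaction bundles are placeholders) that submits bundles in parallel so their PoW requests compete inside IRI:

```python
from concurrent.futures import ThreadPoolExecutor

# 200 bundles, each with 2 placeholder transactions; submitting them from a thread
# pool makes the bundles compete for PoW resources, as in the DLTcollab IRI setup.
bundles = [["9" * 2673, "9" * 2673] for _ in range(200)]

with ThreadPoolExecutor(max_workers=16) as pool:
    times = list(pool.map(lambda b: attach_to_tangle("9" * 81, "9" * 81, b), bundles))

for i, t in enumerate(times, 1):
    print(f"bundle {i:3d}: {t:.2f} s")
```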

Result:
[chart: bundle]

Explanation:

  • The two IRI versions handle attachToTangle() differently:

    | IRI version   | attachToTangle() behavior                       | Effect                                                                      |
    | ------------- | ----------------------------------------------- | --------------------------------------------------------------------------- |
    | IOTA IRI      | one transaction bundle at a time (synchronized) | transactions of the bundle are calculated one by one                        |
    | DLTcollab IRI | multiple transaction bundles at the same time   | transactions of different bundles compete for the PoW calculation resources |

    That is why the execution time of IOTA IRI looks better at the beginning.

  • SSE behaves better than AVX
    This is kind of weird since AVX should be faster than SSE.
    One possible reason is the bundle competition.
    I will do the experiment and check the result again.

Since we have the competition factor in IRI, maybe we should add another performance graph showing the PoW time of each transaction to illustrate dcurl's acceleration power?

Can we conclude that one FPGA cluster consisting of 4 nodes is good enough for accelerating transactions? Compared to PoWsrv, it eliminates network latency and security risks while remaining efficient.

> Can we conclude that one FPGA cluster consisting of 4 nodes is good enough for accelerating transactions? Compared to PoWsrv, it eliminates network latency and security risks while remaining efficient.

Yes.
However, we must be aware that the network latency can be longer if IRI and the RabbitMQ broker are on different machines and the connections to the remote workers are not on the local network.

> - **SSE** behaves better than **AVX**
>   This is kind of weird since **AVX** should be faster than **SSE**.
>   One possible reason is the bundle competition.
>   I will do the experiment and check the result again.

I did the experiment again, and the result is still the same:
the SSE version is faster than the AVX version for multiple attachToTangle() API calls.
@jserv, could this be caused by the characteristics of the registers?

About this issue, I will render a better performance chart and send a pull request to replace the old one.
After that, I think we can close the issue.

I would like to clarify the AVX performance. Do both AVX1 and AVX2 behave worse than the SSE implementation on Xeon E5?

AVX is tricky, since its behavior depends on the microarchitecture of Intel and AMD CPUs.

I will do the experiment on node10 instead of node1 to test the microarchitecture difference.

The experiment was run on node10 with an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz.
Note that this environment is a virtual machine.
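
As a sanity check for runs inside a virtual machine, here is a small Linux-only sketch that reads /proc/cpuinfo to confirm which SIMD extensions the guest actually sees (a hypervisor may hide extensions the host CPU supports):

```python
def cpu_flags(path="/proc/cpuinfo"):
    """Return the feature-flag set reported by the first CPU entry."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for ext in ("sse2", "sse4_2", "avx", "avx2"):
    print(f"{ext:7s} {'yes' if ext in flags else 'no'}")
```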

[chart: bundle]

I do not know the reason for the sudden drop in execution time around the 130th~140th transaction bundle.
However, from the graph we can see that the SSE version is the slowest and the AVX version is the fastest, even compared with the AVX2 version.


There is one thing I need to mention:
the previous performance chart used AVX2 instead of AVX.
[chart: bundle]


I suspect that our AVX version is somehow faster than the AVX2 version, but this is not well tested.
The microarchitecture of the CPU does affect the performance.

I think the last performance measurement should still run on node1 with the AMD Ryzen Threadripper 2990WX 32-core processor, since the remote worker is only connected to node1.
The comparison of AVX and AVX2 will be shown on the graph.

I have re-visualized the previous experiment result as the chart below.

[chart: bundle]
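
For reference, a minimal matplotlib sketch of how such a chart could be regenerated from raw per-bundle timings; the CSV file name and its column layout (one column of execution times per implementation) are assumptions, not the actual experiment artifacts:

```python
import csv
import matplotlib.pyplot as plt

# Assumed layout of bundle_times.csv: a header row naming each implementation
# (e.g. "SSE", "AVX", "AVX2") and one execution time per bundle in each column.
series = {}
with open("bundle_times.csv") as f:
    for row in csv.DictReader(f):
        for impl, value in row.items():
            series.setdefault(impl, []).append(float(value))

for impl, times in series.items():
    plt.plot(range(1, len(times) + 1), times, label=impl)

plt.xlabel("transaction bundle index")
plt.ylabel("attachToTangle execution time (s)")
plt.legend()
plt.savefig("bundle.png", dpi=150)
```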

If the chart is acceptable, I will add some descriptions and replace the chart in README.md via a pull request, then close the issue.

Go ahead, along with descriptions of the experiments and the conclusion.