Iroha performance

Question

truongnmt opened this issue 6 years ago · comments

I set up a scenario to test Iroha performance. This is my environment:

AWS t3.small instance: 2 vCPUs, 2.5 GHz, Intel Skylake P-8175, 2 GiB memory
Iroha docker develop tag
1 host with 4 peers (nodes)
Python SDK with a Flask server API to handle user requests

I set up the test scenario in JMeter: 300 threads (users), each sending one create-user request. The result is very bad: a 95% error rate.

Here are the logs from my 4 peers:
https://gist.github.com/truongnmt/83a179c3cb83f5830f9ac709238eaf13

I also found Caliper, a framework for benchmarking blockchains. I installed everything, but the part about running the benchmark is very confusing.

Truong Nguyen · Answer 1 · Wed Oct 03 2018 14:44:10 GMT+0800 (China Standard Time)

I just tuned some parameters. Here is what I got.

First attempt:

max_proposal_size: 50
proposal_delay: 7000
vote_delay: 100
300 threads, ramp-up 5 seconds (meaning it takes 5 seconds for all threads to send their requests, so roughly 60 tx/s)
Error rate: 96%

Second attempt:

max_proposal_size: 200
proposal_delay: 7000
vote_delay: 100
300 threads, ramp-up 5 seconds
Error rate: 96%

Third attempt:

max_proposal_size: 300
proposal_delay: 10000
vote_delay: 100
300 threads, ramp-up 5 seconds
Error rate: 97%

Fourth attempt:

max_proposal_size: 300
proposal_delay: 1000
vote_delay: 100
300 threads, ramp-up 5 seconds
Error rate: 70.47%

Fifth attempt:

max_proposal_size: 300
proposal_delay: 500
vote_delay: 100
300 threads, ramp-up 5 seconds
Error rate: 35%

Sixth attempt:

max_proposal_size: 300
proposal_delay: 100
vote_delay: 100
300 threads, ramp-up 5 seconds
Error rate: 81.54% :(
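
The six attempts above can be tabulated in a few lines of Python, which also makes the ramp-up arithmetic explicit. This is only a sketch: the `Attempt` tuple and the `offered_rate` helper are names made up for illustration, and the numbers are copied from the list above.

```python
from typing import NamedTuple


class Attempt(NamedTuple):
    """One JMeter run; fields mirror the Iroha config keys (delays in ms)."""
    max_proposal_size: int
    proposal_delay_ms: int
    vote_delay_ms: int
    error_rate_pct: float


# Values copied from the six attempts listed above.
ATTEMPTS = [
    Attempt(50, 7000, 100, 96.0),
    Attempt(200, 7000, 100, 96.0),
    Attempt(300, 10000, 100, 97.0),
    Attempt(300, 1000, 100, 70.47),
    Attempt(300, 500, 100, 35.0),
    Attempt(300, 100, 100, 81.54),
]


def offered_rate(threads: int, ramp_up_s: float, requests_per_thread: int = 1) -> float:
    """Average request rate while JMeter is starting its threads."""
    return threads * requests_per_thread / ramp_up_s


# Pick the run with the lowest error rate.
best = min(ATTEMPTS, key=lambda a: a.error_rate_pct)
print(f"offered load: {offered_rate(300, 5):.0f} tx/s")
print(f"lowest error rate: {best.error_rate_pct}% at proposal_delay={best.proposal_delay_ms}")
```

The error rate does not fall steadily as proposal_delay shrinks: it is lowest at 500 and rises again at 100.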

Truong Nguyen · Answer 2 · Wed Oct 03 2018 15:13:20 GMT+0800 (China Standard Time)

Note that on the sixth attempt I got this error:

[2018-10-03 07:06:12.498330740][th:1261][warning] BlockLoaderImpl Block not found
[2018-10-03 07:06:12.499745529][th:1261][error] YacGate Could not get block from block loader

After this error, Iroha no longer accepts any requests. Even after I restart the Iroha instance, it logs:

[2018-10-03 07:15:19.669143938][th:998][info] AsyncGrpcClient transactions in proposal: 1
[2018-10-03 07:15:19.669978975][th:998][info] OrderingGate Received new proposal, height: 864

It keeps logging that, and the height increases each time I send a request again.

Kitsu · Answer 3 · Fri Oct 05 2018 03:46:46 GMT+0800 (China Standard Time)

That's amazing research you've done, thank you!
But there is still not enough information to draw conclusions and understand the core of the issue.

1 host with 4 peers (nodes)

Am I correct that you've launched one AWS instance with 4 Iroha peers on it? That may cause thread scheduling problems, memory exhaustion, and numerous other issues.
Also where was the client launched? (on the same aws instance or outside)

I setup a scenario using JMeter ... error rate 95%

I'm not familiar with that tool. Could you explain what "error rate" means in this particular context? Dropped network packets, cancellations at the OSI application level, or something else?

Any additional info might be really helpful, so feel free to share anything else (even if you consider it barely useful)!

Truong Nguyen · Answer 4 · Sat Oct 06 2018 19:37:18 GMT+0800 (China Standard Time)

Sorry for my late response.

1 host with 4 peers (nodes)

Yes, one AWS instance with 4 Iroha peers. I set up JMeter (the client) on my local machine. On the AWS instance I set up Iroha and the Iroha Python SDK with a Flask server API to handle user requests.

I also ran a test with 4 AWS instances, each running 1 peer, but I didn't notice any difference compared with 1 instance running 4 peers.

I think the reason is this. Under heavy load, requests wait too long (because of either max_proposal_size or proposal_delay), so most of them time out, even though the transactions themselves are processed very quickly. With the default config:

"max_proposal_size" : 10,
"proposal_delay" : 5000,
"vote_delay" : 5000,
"load_delay" : 5000

When I send a request to the server with Postman, it takes 10 s to complete. With 100 concurrent users, timeouts are inevitable. So with this config:

max_proposal_size: 300
proposal_delay: 500
vote_delay: 100,
load_delay: 5000

The request takes only 600-800 ms @@. That's nuts!
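
One way to read these numbers, offered as a rough model rather than a description of Iroha's internals: a lone request waits about `proposal_delay + vote_delay` before it completes. The helper `estimated_latency_ms` below is hypothetical and only does that addition.

```python
def estimated_latency_ms(proposal_delay_ms: int, vote_delay_ms: int) -> int:
    """Assumed model: one request waits one proposal round plus one vote round."""
    return proposal_delay_ms + vote_delay_ms


# Default config from this post: proposal_delay 5000, vote_delay 5000.
default_ms = estimated_latency_ms(5000, 5000)
# Tuned config from this post: proposal_delay 500, vote_delay 100.
tuned_ms = estimated_latency_ms(500, 100)

print(f"default: ~{default_ms} ms, tuned: ~{tuned_ms} ms")
```

Under that assumption the default config gives about 10 s, matching the Postman measurement, and the tuned config gives 600 ms, the lower end of the observed 600-800 ms.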

As for JMeter, it is a tool for testing web performance. I set up a scenario in which 300 users each create an account: you just provide the parameters and the web server's host IP, and it runs the scenario for you. A successful request returns status 200; anything else (501, 502, ...) counts as a failure. So the error rate is the percentage of failed requests out of the total.
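
The error-rate definition above can be written as a short sketch. The function name `error_rate` and the sample status codes are made up for illustration; only the rule (200 is success, everything else is a failure) comes from the description.

```python
from typing import Iterable


def error_rate(status_codes: Iterable[int]) -> float:
    """Percentage of responses whose HTTP status is not 200."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    failed = sum(1 for code in codes if code != 200)
    return 100.0 * failed / len(codes)


# Illustrative sample only: 300 requests, 15 succeed, 285 come back as 502.
sample = [200] * 15 + [502] * 285
print(f"error rate: {error_rate(sample):.1f}%")
```

Note that this counts any non-200 answer, so failures produced in front of Iroha (for example by the web server) are included in the figure.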

Truong Nguyen · Answer 5 · Mon Oct 15 2018 23:26:00 GMT+0800 (China Standard Time)

max_proposal_size: 300
proposal_delay: 500
vote_delay: 100,
load_delay: 5000

After running with these parameter values for a while, I hit this bug again, without any peer shutting down. Maybe vote_delay is so short that the block hasn't appeared on another peer yet?

Nikita Alekseev · Answer 6 · Tue Oct 16 2018 20:00:07 GMT+0800 (China Standard Time)

Hi, regarding performance numbers: we are currently in the process of optimizing transaction processing.
At the moment we do not have a precise number to show. The latest dev branch can reach a throughput of about 300 tx/sec.

Regarding your benchmark, could you please specify how many transactions/sec you sent to Iroha? Did you send them to a single peer or to the whole network?

Regarding the bug you encountered, we are currently looking into it, please expect fixes to be in the dev branch soon.

Tran B. V. Son · Answer 7 · Thu Oct 18 2018 16:43:28 GMT+0800 (China Standard Time)

@nickaleks Could you send me a configuration for 300 tx/sec?
How many peers and hosts, the config.docker file, etc.

Nikita Alekseev · Answer 8 · Tue Oct 23 2018 19:00:41 GMT+0800 (China Standard Time)

Unfortunately, this number is not confirmed right now. As soon as we have a benchmark to show, it will be published.

Truong Nguyen · Answer 9 · Thu Oct 25 2018 18:38:51 GMT+0800 (China Standard Time)

I read the configuration tips again and tried to implement what they suggest: raise max_proposal_size and proposal_delay to handle a large number of transactions.

Here is my setting:
max_proposal_size: 400
proposal_delay: 5000
vote_delay: 400

Here is the result with an individual request using Postman: 6-9s.

Using JMeter, I sent 400 transactions in 1 second, and here is the result:

95% of the transactions failed. If I understand correctly, max_proposal_size: 400 and proposal_delay: 5000 mean that once 400 transactions have been received, or after 5 s, all pending transactions are processed at once. So I wonder why the first 268 transactions had already failed. They should have waited 🤔
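
The rule as described in this post can be written down as a small toy model. It covers only that description (flush when the batch reaches max_proposal_size, or when proposal_delay has elapsed); it is not Iroha's actual ordering code, and `ProposalBuffer` is a made-up name.

```python
class ProposalBuffer:
    """Toy model: emit a batch when it is full or when the delay has passed."""

    def __init__(self, max_proposal_size: int, proposal_delay_ms: int):
        self.max_proposal_size = max_proposal_size
        self.proposal_delay_ms = proposal_delay_ms
        self.pending = []
        self.window_start_ms = None  # arrival time of the oldest pending tx

    def add(self, tx, now_ms: int):
        """Queue a transaction; return a batch if the size limit is reached."""
        if self.window_start_ms is None:
            self.window_start_ms = now_ms
        self.pending.append(tx)
        if len(self.pending) >= self.max_proposal_size:
            return self._flush()
        return None

    def tick(self, now_ms: int):
        """Return a batch if the delay has elapsed since the oldest pending tx."""
        if self.pending and now_ms - self.window_start_ms >= self.proposal_delay_ms:
            return self._flush()
        return None

    def _flush(self):
        batch, self.pending = self.pending, []
        self.window_start_ms = None
        return batch


# 400 transactions arriving within one second, as in the JMeter run.
buf = ProposalBuffer(max_proposal_size=400, proposal_delay_ms=5000)
batches = []
for i in range(400):
    batch = buf.add(f"tx{i}", now_ms=i * 2)
    if batch is not None:
        batches.append(batch)
print(len(batches), "batch of", len(batches[0]), "transactions")
```

In this model nothing is rejected before the batch is emitted, which is why early failures point to something outside the batching rule itself.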

Regarding your benchmark, could you please specify how many transactions/sec did you send to iroha? Did you send them to a single peer or a whole network?

I sent a total of 400 transactions in 1 second. I have 4 peers, and I distributed the transactions randomly among them.

Nikita Alekseev · Answer 10 · Thu Oct 25 2018 18:43:44 GMT+0800 (China Standard Time)

Are you sure your transactions are correct? What is the response status of those transactions?

Truong Nguyen · Answer 11 · Thu Oct 25 2018 18:54:32 GMT+0800 (China Standard Time)

Yes, I'm sure all transactions are correct. I generate random user names, so no two users have the same name. Here is a sample response:

Thread Name: Create user 1-157
Sample Start: 2018-10-25 17:18:54 ICT
Load time: 384
Connect Time: 192
Latency: 384
Size in bytes: 348
Sent bytes:163
Headers size in bytes: 166
Body size in bytes: 182
Sample Count: 1
Error Count: 1
Data type ("text"|"bin"|""): text
Response code: 502
Response message: Bad Gateway


HTTPSampleResult fields:
ContentType: text/html
DataEncoding: null

All failed transactions have the same response: 502 Bad Gateway.

Truong Nguyen · Answer 12 · Thu Oct 25 2018 19:27:58 GMT+0800 (China Standard Time)

Here is the nginx error log (/var/log/nginx/error.log):

2018/10/25 11:16:11 [error] 31138#31138: *4406 connect() to unix:/home/ubuntu/iroha/app.sock failed
(11: Resource temporarily unavailable) while connecting to upstream, client: <client_request_ip>,
server: <host IP>, request: "GET <api_link> HTTP/1.1", upstream:
"http://unix:/home/ubuntu/iroha/app.sock:<api_link>", host: "<host IP>"

Truong Nguyen · Answer 13 · Thu Oct 25 2018 19:39:43 GMT+0800 (China Standard Time)

Oh, I think I figured it out, give me a second @@

Truong Nguyen · Answer 14 · Sat Oct 27 2018 16:37:21 GMT+0800 (China Standard Time)

I think I could solve half of the problem.

I wonder why first 268 transactions failed already? It should have wait 🤔

So I increased the value of somaxconn. Simply put, somaxconn is the maximum number of connections that can be queued on a listening socket, that is, how long the waiting line is allowed to be. If more clients try to connect than the backlog allows, the extra connections are dropped.

=> $ sudo nano /proc/sys/net/core/somaxconn and increase from 128 to 20000.
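
A minimal sketch of the effect, assuming a Linux host: the kernel caps the backlog a server passes to listen() at net.core.somaxconn. The helper names and the requested backlog of 2048 are illustrative assumptions, not values taken from this setup. The same limit is commonly changed with `sysctl -w net.core.somaxconn=20000`, and a value written under /proc does not survive a reboot.

```python
from pathlib import Path
from typing import Optional

SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")  # Linux only


def read_somaxconn(path: Path = SOMAXCONN_PATH) -> Optional[int]:
    """Return the current kernel limit, or None if the file is unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def effective_backlog(requested: int, somaxconn: int) -> int:
    """Linux silently caps the listen() backlog at net.core.somaxconn."""
    return min(requested, somaxconn)


print("current somaxconn:", read_somaxconn())
# Hypothetical server asking for a backlog of 2048:
print("with somaxconn=128:  ", effective_backlog(2048, 128))
print("with somaxconn=20000:", effective_backlog(2048, 20000))
```

Raising somaxconn only lifts the cap; the server still has to request a larger backlog itself to benefit from it.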

Now all connections are placed in the queue to be processed. This is what I got:

If my understanding is right, why is it that when I send all 400 transactions at once, Iroha returns only 3 results at a time: 3 at 11 s, 3 at 25 s, and 3 at 39 s? I expected the results of all 400 requests to arrive together, since they should all be processed in one batch. 🤔

Sara · Answer 15 · Thu Jan 17 2019 18:21:05 GMT+0800 (China Standard Time)

This is being worked on :) https://jira.hyperledger.org/browse/IR-17 - here's the link

Truong Nguyen · Answer 16 · Thu Jan 17 2019 18:23:27 GMT+0800 (China Standard Time)

Appreciated! Keep up the hard work! 💪💪💪