thumbor-community / aws

Thumbor AWS extensions

tc_aws performance compared to http_loader

seichner opened this issue · comments

Hi all!

@phoet and i are testing the tc_aws backend on a bunch of Heroku instances with quite a lot of traffic. Before that we used the http_loader. I attached Thumbor's response time graph, which shows a significant increase from 19:00, when we switched from the http loader to tc_aws (the orange line is the median, jumping from ~300ms to ~450ms).

We wondered if anybody has an idea about possible reasons. What could be the major difference between tc_aws and the http loader when requesting an S3 file? It looks a bit like tc_aws runs into some kind of resource limit...?
We will move back to the http loader for now, but still intend to switch to tc_aws if we can fix the performance problem and integrate metrics on the s3 connection times.

response_times

(the green and blue lines are the 95th/99th percentiles, which hit our 30-second timeout limit much more frequently with tc_aws)

unfortunately the backend does not have metrics by default, so we don't have S3 request times like we have for the http loader.

what is your topology?
is the s3 bucket region co-located with the ec2 instances?
What HTTP endpoint do you query? does it use a CDN (CloudFront)?

no CDN

heroku dynos run in the EU
the S3 bucket is in eu-west-1

when we load images via the http-loader, it's through s3-eu-west-1.amazonaws.com

we don't have anything configured for the s3 region in boto. so it's probably something like us-west. is there a configuration option for it?

yes, very fresh : #12

@dhardy92 i was asking about the configuration option for region specifically. the only thing i found was passing host='s3-eu-west-1.amazonaws.com' to the S3Connection constructor. is that what you mean?
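
For reference, a minimal sketch of what that looks like with boto 2 (the bucket and key names are placeholders, and #12 may expose this differently as a tc_aws setting):

```python
# Sketch only: point boto at the bucket's regional endpoint instead of the
# default s3.amazonaws.com endpoint, so requests go straight to eu-west-1.
from boto.s3.connection import S3Connection

conn = S3Connection(host='s3-eu-west-1.amazonaws.com')
bucket = conn.get_bucket('my-bucket')            # placeholder bucket name
key = bucket.get_key('images/original.jpg')      # placeholder key
```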

@dhardy92 we've set the endpoint to eu-west-1 (where our herokus run) - didn't change anything on the response times. I also did a short benchmark, requesting one of our S3 files from eu-west and from us-east - nearly no measurable difference. So this looks consistent to me.

we can also see increased memory consumption on our dynos from around 7pm, when we switched to the boto backend:

screen shot 2015-08-07 at 13 49 57

maybe linked with #10 (a python dev could explain this better than me)

I can't chime in for tc_aws but I would guess #10 might be behind it.

The biggest factor we found in Thumbor performance is figuring out the fastest shared storage and result_storage the Thumbor instance can access.

At PopKey we decided to use the http_loader + some custom code to hide the source bucket. We use a redis instance for the storage and result_storage, giving us great overall response times on 4+ virtual machines (auto-scaled with load).
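
One way to combine the http_loader with a private bucket (not necessarily our exact custom code, but in line with the presigned-URL setup mentioned further down) is to hand Thumbor short-lived presigned URLs, e.g. with boto 2:

```python
# Sketch only: generate a time-limited presigned URL for a private S3 object
# so Thumbor's http_loader can fetch it without the bucket being public.
import boto

conn = boto.connect_s3()  # credentials taken from the environment / boto config
url = conn.generate_url(
    expires_in=300,               # URL stays valid for five minutes
    method='GET',
    bucket='my-private-bucket',   # placeholder bucket name
    key='images/original.jpg',    # placeholder key
)
# This URL is what gets passed to Thumbor as the image source.
```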

#10 being fixed now, could anyone check whether that was what was impacting the performance? cc @seichner @phoet
It would be helpful to know whether we need to investigate further :)

we did another test today and it did not impact us as badly as the last time, but the performance is still pretty bad compared to the async http loader.

one thing i noticed today is in our fastly statistics: it looks like requests are blocked (i assume all of them on one dyno) and released at the same time. that might point to something blocking the event loop with the aws backend:

with tc_aws:
miss-latency-pillars

with async http loader and presigned urls:
miss-latency-fanning

There is a high chance the aws loader is blocking the i/o loop. Anything that isn't made for Tornado tends to block the loop (like the redis loader, but it's so fast we don't see the blocking).

One way to work around this would be to use the multiprocessing module and queue requests to a separate process pool that will fetch the images from AWS and pass them back to the main process using queues.

Might require a little bit of refactoring but it would probably help with concurrency.

http://tornado.readthedocs.org/en/latest/process.html might help with that.
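
As an illustration of the idea (assuming Tornado 4.x and boto 2, and using a thread pool from concurrent.futures as a simpler variant of the process pool described above; class and bucket names are made up, this is not the tc_aws code):

```python
# Illustrative only: the blocking boto call is submitted to a worker pool,
# so it no longer stalls Tornado's IOLoop while S3 responds.
from concurrent.futures import ThreadPoolExecutor

import boto
from tornado import gen
from tornado.concurrent import run_on_executor


class S3Fetcher(object):
    executor = ThreadPoolExecutor(max_workers=4)  # pool size chosen arbitrarily

    def __init__(self, bucket_name):
        conn = boto.connect_s3(host='s3-eu-west-1.amazonaws.com')
        self.bucket = conn.get_bucket(bucket_name)

    @run_on_executor
    def fetch(self, key_name):
        # Runs in a worker thread and returns a Future to the caller.
        key = self.bucket.get_key(key_name)
        return key.get_contents_as_string() if key else None


@gen.coroutine
def load_image(fetcher, key_name):
    # Yielding the Future keeps the IOLoop free for other requests
    # while the worker thread talks to S3.
    body = yield fetcher.fetch(key_name)
    raise gen.Return(body)
```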

Created issue #21 to handle this whenever we get the chance :)

i think we can close this. from my point of view it is clearly an issue with blocking io and it can be resolved either by using #22 or by implementing #21

another thing @seichner figured out is that using STORAGE = 'thumbor.storages.no_storage' instead of the default file storage improves response times. this cut our response times by about 40%.

since we have a high traffic site with 20 to 60 dynos, it is unlikely that an instance can re-use the stored image. fastly takes care of the caching.
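
For reference, the relevant thumbor.conf line for this setup (the default is thumbor.storages.file_storage, which writes every original to disk):

```python
# thumbor.conf excerpt: don't persist originals locally; the CDN in front
# (fastly in our case) already caches the rendered results across requests.
STORAGE = 'thumbor.storages.no_storage'
```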

@phoet these reflections & conclusions might be worth a documentation page, to collect some tips on how to improve performance. Would you be up for this?

@Bladrak i'm not sure this is globally applicable. i could give an overview of our use case though.

Maybe not all tricks are for all use cases, but having a use case debrief is often useful to a lot of people :)

i'm going to write something up in the main thumbor wiki

Ok great! We'll add a link to it in here :)

@phoet interesting discovery on the no_storage. At PopKey we use a bunch of redis instances for storage and result_storage.

@masom we tried using the redis storage, but as far as i remember, the redis client has the same blocking io issues that we had with boto. because of that it was completely unusable with our setup.

@phoet we currently face the same issue. We looked at non-blocking redis solutions, and it's somewhat non-trivial to figure out whether we would gain anything.

https://github.com/thefab/tornadis might be worth implementing as a redis-backend.
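
For reference, a minimal sketch of what a tornadis-based lookup looks like (host, port and key are placeholders; an actual Thumbor storage class would need more than this):

```python
# Sketch only: tornadis' call() returns a Future, so the GET is awaited on
# the IOLoop instead of blocking it the way a synchronous redis client would.
import tornadis
from tornado import gen, ioloop

client = tornadis.Client(host="localhost", port=6379, autoconnect=True)


@gen.coroutine
def get_cached_image(key):
    reply = yield client.call("GET", key)
    raise gen.Return(reply)


if __name__ == "__main__":
    body = ioloop.IOLoop.current().run_sync(lambda: get_cached_image("some-key"))
    print(len(body) if body else "miss")
```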