thumbor-community / aws

Thumbor AWS extensions

tc_aws performance compared to http_loader

seichner opened this issue · comments

Hi all!

@phoet and i are testing the tc_aws backend on a bunch of Heroku instances with quite a lot of traffic. Before that we used the http_loader. I attached Thumbor's response time graph, which shows a significant increase from 19:00, when we switched from the http loader to tc_aws (the orange line is the median, jumping from ~300ms to ~450ms).

We wondered if anybody has an idea about possible reasons. What could be the major difference between tc_aws and the http loader when requesting an S3 file? It looks a bit like tc_aws runs into some kind of resource limit...?
We will move back to the http loader for now, but still intend to switch to tc_aws if we can fix the performance problem and integrate metrics on the s3 connection times.

response_times

(the green and blue lines are the 95th/99th percentiles, which hit our 30-second timeout limit much more frequently with tc_aws)

unfortunately the backend does not have metrics by default, so we don't have S3 request times like we have for the http loader.

what is your topology?
is the s3 bucket region co-located with the ec2 instances?
What HTTP endpoint do you query? does it use a CDN (CloudFront)?

no CDN

heroku dynos run in the EU
the S3 bucket is in eu-west-1

when we load images via the http-loader, it's through s3-eu-west-1.amazonaws.com

we don't have anything configured for the s3 region in boto. so it's probably something like us-west. is there a configuration option for it?

yes, very fresh : #12

@dhardy92 i was asking about the configuration option for region specifically. the only thing i found was passing host='s3-eu-west-1.amazonaws.com' to the S3Connection constructor. is that what you mean?
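
For reference, a minimal sketch of what that looks like with boto 2 (the bucket and key names are placeholders, and #12 may expose this differently as a tc_aws setting):

```python
# Sketch only: point boto at the bucket's regional endpoint instead of the
# default s3.amazonaws.com endpoint, so requests go straight to eu-west-1.
from boto.s3.connection import S3Connection

conn = S3Connection(host='s3-eu-west-1.amazonaws.com')
bucket = conn.get_bucket('my-bucket')            # placeholder bucket name
key = bucket.get_key('images/original.jpg')      # placeholder key
```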

@dhardy92 we've set the endpoint to eu-west-1 (where our herokus run) - didn't change anything on the response times. I also did a short benchmark, requesting one of our S3 files from eu-west and from us-east - nearly no measurable difference. So this looks consistent to me.

we can also see increased memory consumption on our dynos from around 7pm, when we switched to the boto backend:

screen shot 2015-08-07 at 13 49 57

maybe linked with #10 (a python dev could explain this better than me)

I can't chime in for tc_aws but I would guess #10 might be behind it.

The biggest factor we found in Thumbor performance is figuring out the fastest shared storage and result_storage the Thumbor instance can access.

At PopKey we decided to use the http_loader + some custom code to hide the source bucket. We use a redis instance for the storage and result_storage, giving us great overall response times on 4+ virtual machines (auto-scaled with load).
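
One way to combine the http_loader with a private bucket (not necessarily our exact custom code, but in line with the presigned-URL setup mentioned further down) is to hand Thumbor short-lived presigned URLs, e.g. with boto 2:

```python
# Sketch only: generate a time-limited presigned URL for a private S3 object
# so Thumbor's http_loader can fetch it without the bucket being public.
import boto

conn = boto.connect_s3()  # credentials taken from the environment / boto config
url = conn.generate_url(
    expires_in=300,               # URL stays valid for five minutes
    method='GET',
    bucket='my-private-bucket',   # placeholder bucket name
    key='images/original.jpg',    # placeholder key
)
# This URL is what gets passed to Thumbor as the image source.
```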

#10 being fixed now, could anyone check whether that was what was impacting the performance? cc @seichner @phoet
It would be helpful to know whether we need to investigate further :)

we did another test today and it did not impact us as badly as the last time, but the performance is still pretty bad compared to the async http loader.

one thing i noticed today is in our fastly statistics: it looks like requests are blocked (i assume all of them on one dyno) and released at the same time. that might point to something blocking the event loop with the aws backend:

with tc_aws:
miss-latency-pillars

with async http loader and presigned urls:
miss-latency-fanning

There is a high chance the aws loader is blocking the i/o loop. Anything that isn't made for Tornado tends to block the loop (like the redis loader, but it's so fast we don't see the blocking).

One way to work around this would be to use the multiprocessing module and queue requests to a separate process pool that will fetch the images from AWS and pass them back to the main process using queues.

Might require a little bit of refactoring but it would probably help with concurrency.

http://tornado.readthedocs.org/en/latest/process.html might help with that.
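
As an illustration of the idea (assuming Tornado 4.x and boto 2, and using a thread pool from concurrent.futures as a simpler variant of the process pool described above; class and bucket names are made up, this is not the tc_aws code):

```python
# Illustrative only: the blocking boto call is submitted to a worker pool,
# so it no longer stalls Tornado's IOLoop while S3 responds.
from concurrent.futures import ThreadPoolExecutor

import boto
from tornado import gen
from tornado.concurrent import run_on_executor


class S3Fetcher(object):
    executor = ThreadPoolExecutor(max_workers=4)  # pool size chosen arbitrarily

    def __init__(self, bucket_name):
        conn = boto.connect_s3(host='s3-eu-west-1.amazonaws.com')
        self.bucket = conn.get_bucket(bucket_name)

    @run_on_executor
    def fetch(self, key_name):
        # Runs in a worker thread and returns a Future to the caller.
        key = self.bucket.get_key(key_name)
        return key.get_contents_as_string() if key else None


@gen.coroutine
def load_image(fetcher, key_name):
    # Yielding the Future keeps the IOLoop free for other requests
    # while the worker thread talks to S3.
    body = yield fetcher.fetch(key_name)
    raise gen.Return(body)
```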

Created issue #21 to handle this whenever we get the chance :)

i think we can close this. from my point of view it is clearly an issue with blocking io and it can be resolved either by using #22 or by implementing #21

another thing @seichner figured out is that using STORAGE = 'thumbor.storages.no_storage' instead of the default file storage improves response times. this cut our response times by about 40%.

since we have a high traffic site with 20 to 60 dynos, it is unlikely that an instance can re-use the stored image. fastly takes care of the caching.
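
For reference, the relevant thumbor.conf line for this setup (the default is thumbor.storages.file_storage, which writes every original to disk):

```python
# thumbor.conf excerpt: don't persist originals locally; the CDN in front
# (fastly in our case) already caches the rendered results across requests.
STORAGE = 'thumbor.storages.no_storage'
```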

@phoet these reflections & conclusions might be worth a documentation page, to collect some tips on how to improve performance. Would you be up for this?

@Bladrak i'm not sure this is globally applicable. i could give an overview of our use case though.

Maybe not all tricks are for all use cases, but having a use case debrief is often useful to a lot of people :)

i'm going to write something up in the main thumbor wiki

Ok great! We'll add a link to it in here :)

@phoet interesting discovery on the no_storage. At PopKey we use a bunch of redis instances for storage and result_storage.

@masom we tried using the redis storage, but as far as i remember, the redis client has the same blocking io issues that we had with boto. because of that it was completely unusable with our setup.

@phoet we currently face the same issue. We looked at non-blocking redis solutions, and it's somewhat non-trivial to figure out whether we would gain anything.

https://github.com/thefab/tornadis might be worth implementing as a redis-backend.
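
For reference, a minimal sketch of what a tornadis-based lookup looks like (host, port and key are placeholders; an actual Thumbor storage class would need more than this):

```python
# Sketch only: tornadis' call() returns a Future, so the GET is awaited on
# the IOLoop instead of blocking it the way a synchronous redis client would.
import tornadis
from tornado import gen, ioloop

client = tornadis.Client(host="localhost", port=6379, autoconnect=True)


@gen.coroutine
def get_cached_image(key):
    reply = yield client.call("GET", key)
    raise gen.Return(reply)


if __name__ == "__main__":
    body = ioloop.IOLoop.current().run_sync(lambda: get_cached_image("some-key"))
    print(len(body) if body else "miss")
```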