tc_aws performance compared to http_loader
seichner opened this issue
Hi all!
@phoet and I are testing the tc_aws backend on a bunch of Heroku instances with quite a lot of traffic. Before that we used the HTTP loader. I attached Thumbor's response time graph, which shows a significant increase at 19:00, when we switched from the http loader to tc_aws (the orange line is the median, jumping from ~300ms to ~450ms).
We were wondering whether anybody has an idea about possible reasons. What could be the major difference between tc_aws and the http-loader when requesting an S3 file? It looks a bit like tc_aws runs into some kind of resource limit...?
We will move back to the http loader for now, but still intend to switch to tc_aws if we can fix the performance problem and integrate metrics on the s3 connection times.
(The green and blue lines are the 95th/99th percentiles, which hit our 30s timeout limit much more frequently with tc_aws.)
Unfortunately the backend does not emit metrics by default, so we don't have S3 request times the way we do for the http-loader.
What is your topology?
Is the S3 bucket region colocated with the EC2 instances?
What HTTP endpoint do you query? Does it use a CDN (CloudFront)?
no CDN
Heroku dynos run in the EU
S3 bucket is in eu-west-1
when we load images via the http-loader, it goes through s3-eu-west-1.amazonaws.com
We don't have anything configured for the S3 region in boto, so it's probably something like us-west. Is there a configuration option for it?
yes, very fresh: #12
@dhardy92 I was asking about the configuration option for the region specifically. The only thing I found was passing host='s3-eu-west-1.amazonaws.com' to the S3Connection constructor. Is that what you mean?
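For reference, here is roughly what I mean as a minimal boto sketch (not tc_aws code; the bucket and key names are made up):

```python
# Minimal boto sketch: point the S3 connection at the eu-west-1 endpoint.
# Bucket and key names are placeholders; credentials come from the environment.
from boto.s3.connection import S3Connection

conn = S3Connection(host='s3-eu-west-1.amazonaws.com')
# boto.s3.connect_to_region('eu-west-1') would be an alternative.
bucket = conn.get_bucket('my-images-bucket', validate=False)
key = bucket.get_key('some/image.jpg')
if key is not None:
    data = key.get_contents_as_string()
```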
@dhardy92 we've set the endpoint to eu-west-1 (where our Heroku dynos run) - it didn't change anything about the response times. I also did a short benchmark, requesting one of our S3 files from eu-west and from us-east - nearly no measurable difference. So this looks consistent to me.
maybe linked to #10 (a Python dev could say better than me)
I can't chime in for tc_aws but I would guess #10 might be behind it.
The biggest factor we found in Thumbor performance is figuring out the fastest shared storage and result_storage the Thumbor instance can access.
At PopKey we decided to use the http_loader + some custom code to hide the source bucket. We use a redis instance for the storage and result_storage, giving us great overall response times on 4+ virtual machines (auto-scaled with load).
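For illustration, a thumbor.conf along those lines might look like this; the module paths and setting names below are assumptions based on a redis storage plugin (e.g. tc_redis) and will differ depending on the plugin and Thumbor version:

```python
# Hypothetical thumbor.conf sketch: redis-backed storage and result_storage.
# Module paths and setting names are assumptions; check your redis plugin's docs.
STORAGE = 'tc_redis.storages.redis_storage'
RESULT_STORAGE = 'tc_redis.result_storages.redis_result_storage'

REDIS_STORAGE_SERVER_HOST = 'localhost'
REDIS_STORAGE_SERVER_PORT = 6379
REDIS_STORAGE_SERVER_DB = 0

REDIS_RESULT_STORAGE_SERVER_HOST = 'localhost'
REDIS_RESULT_STORAGE_SERVER_PORT = 6379
REDIS_RESULT_STORAGE_SERVER_DB = 0
```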
We did another test today and it did not impact us as badly as last time, but the performance is still pretty bad compared to the async http loader.
One thing I noticed today is in our Fastly statistics: it looks like requests are blocked, I assume all of one dyno, and then released at the same time. That might point to something blocking the event loop with the AWS backend:
There is a high chance the AWS loader is blocking the I/O loop. Anything that isn't made for Tornado tends to block the loop (like the redis loader, but it's so fast we don't see the blocking).
One way to go around this would be to use the multiprocessing module and queue requests to a separate process pool that will fetch the images from AWS and pass them back to the main process using queues.
Might require a little bit of refactoring but it would probably help with concurrency.
http://tornado.readthedocs.org/en/latest/process.html might help with that.
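A rough sketch of that idea (not the tc_aws implementation): run the blocking boto call in a process pool and yield the resulting future from a Tornado coroutine, so the I/O loop stays free. Pool size, endpoint, and bucket/key names are placeholders:

```python
# Sketch: keep blocking boto calls off Tornado's I/O loop via a process pool.
from concurrent.futures import ProcessPoolExecutor

from boto.s3.connection import S3Connection
from tornado import gen

_pool = ProcessPoolExecutor(max_workers=4)  # pool size is a guess, tune it

def _fetch(bucket_name, key_name):
    # Runs in a worker process, so blocking I/O is fine here.
    conn = S3Connection(host='s3-eu-west-1.amazonaws.com')
    key = conn.get_bucket(bucket_name, validate=False).get_key(key_name)
    return key.get_contents_as_string() if key is not None else None

@gen.coroutine
def fetch_async(bucket_name, key_name):
    # Tornado (>= 4) can yield concurrent.futures.Future objects directly.
    data = yield _pool.submit(_fetch, bucket_name, key_name)
    raise gen.Return(data)
```

Wiring something like this into the loader interface is where the refactoring would come in.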
Created issue #21 to handle this whenever we get the chance :)
Another thing that @seichner figured out was to improve response time by using STORAGE = 'thumbor.storages.no_storage' instead of the default file storage. This cut our response times by about 40%.
Since we have a high-traffic site with 20 to 60 dynos, it is unlikely that an instance can reuse a stored image; Fastly takes care of the caching.
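For anyone who wants to try the same, the change is a single line in thumbor.conf (whether it fits depends on having a CDN in front, like Fastly in our case):

```python
# thumbor.conf: don't persist source images locally; the CDN in front
# of Thumbor handles caching of the rendered results.
STORAGE = 'thumbor.storages.no_storage'
```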
@phoet these reflections and conclusions might be worth a documentation page with some tips on how to improve performance. Would you be up for this?
@Bladrak I'm not sure this is globally applicable. I could give an overview of our use case, though.
Maybe not all tricks are for all use cases, but having a use case debrief is often useful to a lot of people :)
I'm going to write something up in the main Thumbor wiki.
Ok great! We'll add a link to it in here :)
@phoet interesting discovery on the no_storage. At PopKey we use a bunch of redis instances for storage and result_storage.
@masom we tried using the redis storage, but as far as I remember, the redis client has the same blocking I/O issues that we had with boto. Because of that it was completely unusable with our setup.
@phoet we currently face the same issue. We looked at non-blocking redis solutions and it's somewhat non-trivial to figure out if we would win anything.
https://github.com/thefab/tornadis might be worth implementing as a redis-backend.
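A rough sketch of what a tornadis-based lookup could look like (key naming and connection settings are made up; this is not a full Thumbor storage implementation, and error handling is left out):

```python
# Sketch: non-blocking redis GET using tornadis, a Tornado-native client.
import tornadis
from tornado import gen, ioloop

client = tornadis.Client(host='localhost', port=6379, autoconnect=True)

@gen.coroutine
def get_cached_image(key):
    # client.call returns a future resolved on the Tornado I/O loop.
    data = yield client.call('GET', key)
    raise gen.Return(data)

if __name__ == '__main__':
    loop = ioloop.IOLoop.current()
    image = loop.run_sync(lambda: get_cached_image('thumbor:some/image.jpg'))
    print(len(image) if image else 'cache miss')
```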