bellingcat / wayback-google-analytics

A lightweight tool for scraping current and historic Google Analytics data

Home Page: https://pypi.org/project/wayback-google-analytics/


Better solutions to web.archive.org rate limiting

jclark1913 opened this issue · comments

Overview

The tool works best with smaller requests: fewer than 10 urls and a snapshot limit under 500. Currently, asyncio's built-in semaphore does an OK job of avoiding rate limits when kept within these recommended parameters, but I wonder if there is a better or more dynamic way to deal with this issue. The problem does not appear to be with the CDX api itself, but rather with making numerous requests to web.archive.org when getting snapshots, which triggers a temporary ban. All in all, I'm finding web.archive.org to be a bit unpredictable and can't find consistent documentation for making requests to the site.

Possible solutions

Incorporating a library w/ exponential delays

There are some Python libraries like Backoff and aiohttp_retry that provide wrappers for handling rate limiting. I've messed around with both, but wasn't able to get large requests (>50 urls + >1000 limit) to work reliably.
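The core idea those libraries wrap is simple enough to sketch by hand: on a 429-style failure, retry with a delay that doubles each attempt. This is a minimal stdlib sketch of that pattern, with hypothetical names, not a substitute for either library:

```python
import asyncio

class RateLimited(Exception):
    """Stand-in for an HTTP 429 response."""

async def fetch_with_backoff(fetch, url, max_tries=5, base_delay=0.01):
    """Retry `fetch(url)` with exponentially growing delays on rate-limit errors."""
    for attempt in range(max_tries):
        try:
            return await fetch(url)
        except RateLimited:
            if attempt == max_tries - 1:
                raise
            # Delay doubles each retry: base, 2*base, 4*base, ...
            await asyncio.sleep(base_delay * 2 ** attempt)

# Demo: a fake fetcher that fails twice before succeeding.
calls = {"n": 0}
async def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return f"ok: {url}"

result = asyncio.run(fetch_with_backoff(flaky_fetch, "https://web.archive.org/x"))
```

Backoff and aiohttp_retry add jitter, per-status-code policies, and decorator syntax on top of this loop, but if they're still failing on large jobs the retries may simply be colliding with the hour-long IP block rather than a transient 429.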

Custom solution

There might be a way to determine the best parameters based on the size of the request. Such a solution might dynamically generate a semaphore value or incorporate some kind of jitter between calls, or maybe pause the operation and prompt the user to wait 5 minutes before attempting to resume.
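A rough sketch of that idea: derive the semaphore size from the job size and add random jitter between calls so requests don't fire in lockstep. The sizing heuristic below is purely hypothetical, picked to illustrate the shape of a solution, not a tested recommendation:

```python
import asyncio
import random

def semaphore_size(n_urls, limit):
    # Hypothetical heuristic: fewer concurrent workers for bigger jobs,
    # clamped between 1 and 10.
    workload = n_urls * limit
    return max(1, min(10, 5000 // max(workload, 1)))

async def polite_fetch(sem, url):
    async with sem:
        # Random jitter spreads requests out over time.
        await asyncio.sleep(random.uniform(0.0, 0.01))
        return url

async def run(urls, limit):
    sem = asyncio.Semaphore(semaphore_size(len(urls), limit))
    return await asyncio.gather(*(polite_fetch(sem, u) for u in urls))

results = asyncio.run(run([f"u{i}" for i in range(8)], limit=1000))
```

The pause-and-prompt variant would slot in where the jitter sleep is: catch the 429, tell the user to wait, and resume from the last completed url.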

So I recently discovered that the CDX api has the following rate limit logic:
Requests are limited to an average of 60/min. Above that, we start getting 429s. If the 429s are ignored for more than a minute, the IP gets blocked for 1 hour, and subsequent 429s over a given period double that block time each occurrence.
So ideally, if we can keep api requests under 60/minute we will prevent this from happening.
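Given that limit, one way to stay under 60/minute is to enforce a minimum interval of 1 second between consecutive requests across all workers, regardless of concurrency. A minimal stdlib sketch, assuming the 60/min figure above (the class name and demo interval are illustrative):

```python
import asyncio
import time

class MinIntervalLimiter:
    """Forces at least `min_interval` seconds between consecutive requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        # The lock serializes callers, so the interval holds globally
        # even with many concurrent workers.
        async with self._lock:
            now = time.monotonic()
            delay = self._last + self.min_interval - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()

async def fetch(limiter, url):
    await limiter.wait()  # blocks until we're back under the rate cap
    return url

async def main():
    # Tiny interval so the demo runs fast; min_interval=1.0 would keep
    # the average at or below 60 requests/minute in practice.
    limiter = MinIntervalLimiter(min_interval=0.01)
    start = time.monotonic()
    results = await asyncio.gather(*(fetch(limiter, f"u{i}") for i in range(5)))
    return results, time.monotonic() - start

results, elapsed = asyncio.run(main())
```

This paces every request rather than bursting up to a semaphore's limit, so it addresses the 60/min average directly; combining it with backoff on any residual 429s would cover both failure modes.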