barnybug / cli53

Command line tool for Amazon Route 53

cli53 retries exceed the Route53 API rate limit and stay throttled

jjtroberts opened this issue

Issue type

  • Bug report

cli53 version (cli53 --version)

0.8.12 and 0.8.15

OS / Platform

Linux 386
Darwin

Steps to reproduce

First off, lovely tool. Use it daily to make life better in our group. Thanks for your work on this project.

Up until July 2, our automated Jenkins job that backs up Route 53 zones to S3 was working normally with the script below. It started failing on the morning of July 3. The job runs once a day and keeps the past 30 days of backups in S3.

We have 182 hosted zones in our AWS account.

#!/usr/bin/env bash

# Enter Bash "strict mode"
set -o errexit  # Exit immediately on any non-zero error exit status
set -o nounset  # Trigger error when expanding unset variables
set -o pipefail # Prevent errors in a pipeline from being masked
IFS=$'\n\t'     # Internal Field Separator controls Bash word splitting

# Declare backup path & master zone files
BACKUP_PATH="$(date +%F)"
ZONES_FILE="all-zones.txt"
DNS_FILE="all-dns.txt"

echo "Backing up Route53: ${BACKUP_PATH}"

# Create date-stamped backup directory and enter it
mkdir -p "$BACKUP_PATH"
cd "$BACKUP_PATH"

# Create a list of all hosted zones
cli53 list --debug --format text > "$ZONES_FILE" 2>&1

# Create a list of domain names only (strip the quoting and the trailing dot around each Name: value)
sed '/Name:/!d' "$ZONES_FILE" | cut -d: -f2 | sed 's/^..//' | sed 's/.\{3\}$//' > "$DNS_FILE"

# Create backup files for each domain
while read -r line; do
  cli53 export --debug --full "$line" > "$line.txt"
done < "$DNS_FILE"

cd ..

tar czvf "${BACKUP_PATH}.tgz" "$BACKUP_PATH"

aws s3 cp "${BACKUP_PATH}.tgz" "s3://<bucket-name>/route53/${BACKUP_PATH}.tgz"

rm -rf "$BACKUP_PATH"

# Prune any tgz files older than 30 days
find . -maxdepth 1 -name '*.tgz' -type f -mtime +30 -exec rm -f {} \;

# Exit Bash "strict mode"
set +o errexit
set +o nounset
set +o pipefail

exit 0

Expected behaviour

I expected the list and export commands to complete without error.

Actual behaviour

cli53 exceeds the rate limit, receives a 400 Bad Request response from Route 53, and retries continually, which keeps the rate limit in effect. The Jenkins job had been running for over 6 hours before I discovered it was the reason no one could make changes to zones and records: the constant retries were effectively maintaining the ban.

-----------------------------------------------------
DEBUG: Response route53/ListHostedZones Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 400 Bad Request
Connection: close
Content-Length: 255
Content-Type: text/xml
Date: Thu, 04 Jul 2019 11:29:43 GMT
X-Amzn-Requestid: xxxxxxxxx


-----------------------------------------------------
DEBUG: Request route53/ListHostedZones Details:
---[ REQUEST POST-SIGN ]-----------------------------
GET /2013-04-01/hostedzone HTTP/1.1
Host: route53.amazonaws.com
User-Agent: aws-sdk-go/1.13.34 (go1.11.5; linux; 386)
Authorization: AWS4-HMAC-SHA256 Credential=xxxxxxxxxxxx/20190704/us-east-1/route53/aws4_request, SignedHeaders=host;x-amz-date, Signature=xxxxxxxxxxxx
X-Amz-Date: 20190704T112945Z
Accept-Encoding: gzip

Have you checked if the documentation has the information you require?

Yes. I've googled, read the documentation, and tried a sleep 5 between commands. Once the rate limit is exceeded it remains in effect for an unknown time before the "ban" is lifted. AWS uses the term "throttled", but you are effectively unable to use the API for at least 30 minutes.
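For reference, the spacing I tried looked roughly like this (the wrapper function and the delay value are my own, not anything cli53 provides; Route 53's documented limit is 5 requests per second per account):

#!/usr/bin/env bash
# Hypothetical wrapper: space out cli53 calls to stay under the limit.
throttled_cli53() {
  cli53 "$@"
  sleep 5   # pause after every call; still not enough once already throttled
}

while read -r zone; do
  throttled_cli53 export --full "$zone" > "$zone.txt"
done < all-dns.txt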

Could you contribute a fix or help testing with this issue?

I'd love to, but I don't know Go yet. It would be lovely to have an option to turn off retries and fail on the first error, or an option for a delay between requests. Lovelier still would be a configurable retry/delay/backoff strategy; a caller-side sketch of what I mean follows.
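For illustration, here is the behaviour I'm after, sketched from the shell side (the function name, attempt cap, and base delay are all made up):

#!/usr/bin/env bash
# Hypothetical caller-side backoff: retry a cli53 command a bounded number
# of times, doubling the delay after each failure, then give up.
retry_with_backoff() {
  local max_attempts=5 delay=2 attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after $attempt attempts: $*" >&2
      return 1   # fail fast instead of hammering the API for hours
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))     # exponential backoff: 2s, 4s, 8s, ...
    attempt=$(( attempt + 1 ))
  done
}

retry_with_backoff cli53 export --full example.com > example.com.txt

With errexit on, a final failure from a wrapper like this stops the whole backup run instead of silently maintaining the throttle.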