aws-robotics / cloudwatchlogs-ros1

ROS packages for facilitating the use of AWS cloud services.

Throttling Exception

CycleMark opened this issue · comments

CycleMark commented

Hi,

I've changed the way my robots work. Basically CloudWatch is now running all the time. I suspect it's not the amount of data getting uploaded (given the error) but the frequency.

I've changed publish_frequency from the default 5 sec to 120, but after a while this error still gets thrown and I can't upload more data.

Any suggestions?

Thanks

Mark

[ERROR] [1594043186.700089703]: [AWSClient] HTTP response code: 400
Exception name: ThrottlingException
Error message: Rate exceeded
5 response headers:
connection : close
content-length : 58
content-type : application/x-amz-json-1.1
date : Mon, 06 Jul 2020 13:46:26 GMT
[ WARN] [1594043186.700169487]: [AWSClient] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[ERROR] [1594043186.700320532]: [SendLogsRequest] Send log request failed due to: Rate exceeded, with error code: 13
[ERROR] [1594043186.700381692]: [SendLogsToCloudWatch] Failed to send to CloudWatch in Log Group: mxnet_robotics Log Stream: mxnet_kobuki_robot_log_stream with error code: 1
[ WARN] [1594043186.700663183]: [SendLogs] Unable to send logs to CloudWatch, retrying ...

Hi @CycleMark!

A few questions about your setup:

  1. What does your configuration look like?
  2. Do you have multiple robots and are you using the same account for them?
  3. Are you batching requests at all? Take a look at this AWS article, which can give you some insight into why you're being throttled (see the sketch below this list).
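
For question 3, a minimal sketch of how batching shows up in this node's YAML configuration (the value below is illustrative, not a recommendation):

# the frequency, in seconds, to send a batch of buffered logs to CloudWatch Logs;
# a larger value means fewer, larger uploads
# default value is: 5.0 seconds
publish_frequency: 60.0
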
CycleMark commented

Hi,

Sorry for the late reply - I didn't see the notification.

The setup at the moment is just one testbed robot.

The YAML config file is pasted below, but it looks okay to me. Batching should be every couple of minutes.

Kind Regards

Mark

mxnet_cloudwatch_logger_application.yaml


# the frequency to send a batch of logs to CloudWatch Logs in the log group and log stream specified
# e.g. publish a batch of logs every 6 seconds
#      publish_frequency: 6.0
# default value is: 5.0 seconds
publish_frequency: 120.0

# whether to subscribe to the rosout_agg topic to get logs
# default value is: true
sub_to_rosout: true

# other topics to subscribe to get logs
# e.g. subscribe to two topics, one is named topic1, the other is named topic2
#      topics: ['topic1', 'topic2']
# default value is: empty list
topics: []

# list of node names to ignore logs from
# e.g. To ignore all logs from a node named 'Talker' you would use the following configuration:
#      ignore_nodes: ["/Talker"]
ignore_nodes: ["/cloudwatch_logger", "/cloudwatch_metrics_collector"]

# the minimum log verbosity level for logs to get sent to CloudWatch Logs
# e.g. only send logs with verbosity INFO or higher:
#      min_log_verbosity: INFO
# default value is: DEBUG == logs of all log verbosity levels get sent to CloudWatch Logs
min_log_verbosity: INFO

# The absolute path to a folder that all offline logs will be stored in
storage_directory: "~/.ros/cwlogs/"

# The maximum size of all files in offline storage in KB
storage_limit: 1048576

# This is the AWS Client Configuration used by the AWS service client in the Node. If given, the node will load the
# provided configuration when initializing the client.
aws_client_configuration:
  # in an aws account, you can switch to a different region using the drop-down on the upper right corner
  # logs sent to CloudWatch Logs will appear in the region indicated below
  # default value is: "us-west-2"
  region: "eu-west-2"

  # Values that determine the length of time, in milliseconds, to wait before timing out a request. You can increase
  # this value if you need to transfer large files, such as in Amazon S3 or Amazon CloudFront.
  connect_timeout_ms: 2000
  request_timeout_ms: 2000

  # The retry strategy used when connection requests are attempted. If set to true then requests
  # will fail fast, otherwise will use an exponential retry algorithm defined by the AWS SDK.
  no_retry_strategy: false

@CycleMark what are the contents of the folder ~/.ros/cwlogs/?

@CycleMark any update? Sorry for the delay: we are actively looking at this issue.

@CycleMark hello!

First, another question for our education - which ROS distribution are you running on, Melodic or Kinetic?

To your issue, our current best guess is that the robot lost its connection to the network for some amount of time, and while it was offline, logs were stored to disk to be uploaded later. When it came back online, the backed-up logs were published too quickly and hit the API rate limit. The size of these requests is determined by the file_upload_batch_size parameter, which defaults to 50. We believe your problem may be mitigated by increasing this value to something like 500.
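
A minimal sketch of what that change might look like in your YAML configuration (file_upload_batch_size and the suggested value of 500 are as described above; the exact placement in the file is illustrative):

# the size of each batch of backed-up logs sent when catching up after a period offline
# default value is: 50
file_upload_batch_size: 500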

Meanwhile we are working on introducing a mechanism to rate-limit this "offline catch-up" case.

Any extra context that helps us confirm whether this is actually the case you've run into would help us solve your problem more effectively.

We've closed this issue through aws-robotics/cloudwatch-common#61, which limits the rate at which CloudWatch logs may be uploaded, preventing the throttling exception.

We will be bloom-releasing a new version 1.1.5 of cloudwatch-common shortly, which will make it into relevant distributions in the next sync.

If there is any remaining issue, please feel free to reopen this or open a new ticket.