nasa / opera-sds-int


Metrics collection

riverma opened this issue · comments

Brief Description

We need methods to capture some metrics to support our load testing goals. In some cases, there may be existing tools we can leverage already available in the PCM or by TPS tools, but in other cases, we may need to write custom scripts for metric collection.

Expected Behavior

  • Capability to capture the following metrics:
    • Accumulated size (in bytes) of a given AWS S3 bucket over a given time frequency (down to minutes)
    • Throughput (in bytes/sec) of a given AWS S3 bucket over a given time frequency (down to minutes)
    • Elasticsearch statistics (num docs, query time, etc.) for a given index over a given time frequency (down to minutes)
    • Elasticsearch instance health over time (RAM, storage)
    • PCM queue sizes (QUEUED / PENDING jobs especially) over a given time frequency (down to minutes)
    • AWS EC2 spot errors (insufficient capacity/terminations)
  • Capability to generate a CSV with metric values over time to support plotting
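For the Elasticsearch bullet above, the per-index document count and store size can be polled from the standard index-stats endpoint. A minimal sketch, assuming a plain HTTP ES cluster on port 9200 (the host and index names are placeholders); the parsing helper is separated so it can be exercised without a live cluster:

```python
import json
import urllib.request

def parse_index_stats(stats_json):
    """Pull docs count and store size out of an ES index-stats response
    (GET /<index>/_stats). Field paths follow the standard ES stats shape."""
    total = stats_json["_all"]["primaries"]
    return {
        "num_docs": total["docs"]["count"],
        "store_bytes": total["store"]["size_in_bytes"],
    }

def fetch_index_stats(es_host, index):
    """Live call against a cluster (hypothetical host/index)."""
    with urllib.request.urlopen(f"http://{es_host}:9200/{index}/_stats") as resp:
        return parse_index_stats(json.load(resp))

# Offline demo with a minimal stats payload:
sample = {"_all": {"primaries": {"docs": {"count": 1200},
                                 "store": {"size_in_bytes": 5 << 20}}}}
print(parse_index_stats(sample))  # → {'num_docs': 1200, 'store_bytes': 5242880}
```

Polling this on a timer at the desired frequency (down to minutes) would give the time series we need.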

Current Behavior

These metrics currently have to be calculated by hand.

Suggested Ideas on Resolution

There are likely tools already available to capture all of the above. We probably just need to create a script that obtains this information and writes it to a CSV file for easy plotting.
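The collection-script skeleton could be as simple as the following sketch: poll a set of metric callables on an interval and append one CSV row per sample. The metric functions here are stubs (placeholders for real boto3/Elasticsearch collectors), and the file path and interval are arbitrary:

```python
import csv
import time

def collect_metrics_to_csv(metric_fns, out_path, interval_sec, iterations):
    """Poll each named metric and append one CSV row per interval.

    metric_fns: dict of column name -> zero-arg callable returning a number.
    (The callables here are placeholders; real ones would wrap
    boto3 / Elasticsearch / queue queries.)
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp"] + list(metric_fns))
        for _ in range(iterations):
            row = [time.time()] + [fn() for fn in metric_fns.values()]
            writer.writerow(row)
            time.sleep(interval_sec)

# Demo with stub metrics; interval 0 so it finishes instantly:
collect_metrics_to_csv(
    {"s3_bytes": lambda: 0, "queued_jobs": lambda: 0},
    "load_test_metrics.csv",
    interval_sec=0,
    iterations=3,
)
```

The resulting CSV (one timestamp column plus one column per metric) plots directly in pandas or a spreadsheet.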

Some suggested resources to evaluate:

@hhlee445 @chrisjrd @maseca - FYI and to provide guidance to @philipjyoon.

A lot of these metrics would be directly impacted by the number of autoscaling fleets and their maximum sizes. Do we want to standardize these?

Some ideas from @niarenaw

  • Accumulated size (in bytes) of a given AWS S3 bucket over a given time frequency (down to minutes)
    • can do this programmatically by running the following command before and after the test and taking the difference: aws s3 ls s3://$BUCKET --recursive --summarize --human-readable
    • can also compute s3 size on the aws s3 console
  • Throughput (in bytes/sec) of a given AWS S3 bucket over a given time frequency (down to minutes)
    • can derive from previous metric and total length of load test
    • better granularity with Metrics tab on aws s3 console
  • Elasticsearch statistics (num docs, query time, etc.) for a given index over a given time frequency (down to minutes)
    • can use elasticsearch sdk or use the web ui to generate queries and filter by time range
    • I’m pretty horrible at the elasticsearch DSL syntax, but might be time I learn it properly
  • PCM queue sizes (QUEUED / PENDING jobs especially) over a given time frequency (down to minutes)
    • probably easiest to get these using Figaro and Lucene queries (ex. “job_queue:<> AND timestamp:<>” for each queue)
    • can make these programmatic by querying ES directly instead
  • AWS EC2 spot errors (insufficient capacity/terminations)
    • using the AWS CloudTrail console, can search for BidEvictedEvent events in a given time range
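The `aws s3 ls --recursive --summarize` approach above can be done in-process with boto3 by summing object sizes across `list_objects_v2` pages. A sketch, with the pure summation split out so it can be tested offline; the live function assumes boto3 is installed and AWS credentials with `s3:ListBucket` on the bucket:

```python
def sum_page_sizes(pages):
    """Sum the Size of every object across list_objects_v2 response pages."""
    return sum(obj["Size"] for page in pages for obj in page.get("Contents", []))

def bucket_size_bytes(bucket):
    """Live version: equivalent of
    `aws s3 ls s3://$BUCKET --recursive --summarize`.
    Requires boto3 + AWS credentials (assumption: s3:ListBucket allowed)."""
    import boto3  # imported lazily so the pure helper works offline
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    return sum_page_sizes(paginator.paginate(Bucket=bucket))

# Offline demo with fake response pages:
fake_pages = [
    {"Contents": [{"Size": 100}, {"Size": 250}]},
    {"Contents": [{"Size": 50}]},
    {},  # a page with no Contents key (empty result)
]
print(sum_page_sizes(fake_pages))  # → 400
```

Running `bucket_size_bytes` before and after the test and differencing gives the accumulated size, as Nick described.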

Thoughts on S3 size: What Nick has found seems to be the only way we can get near-real-time, high-frequency metrics on S3 bucket size. However, it can get very slow and costly for large buckets; I think it's something like 0.005 cents per object query?

I did find an alternative using CloudWatch, but it only works at daily frequency, so it is not very useful to us:

aws --profile saml-pub cloudwatch get-metric-statistics --namespace AWS/S3 --start-time 2022-06-08T23:22:00 --end-time 2022-06-08T23:59:00 --period 86400 --statistics Average --metric-name BucketSizeBytes --dimensions Name=BucketName,Value=opera-dev-isl-fwd-pyoon Name=StorageType,Value=StandardStorage
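The same daily-frequency CloudWatch query can be made from boto3. A sketch mirroring the CLI call above, with the request kwargs built by a pure helper (the bucket name is the one from the command; credentials/profile handling is left to the boto3 defaults):

```python
import datetime

def bucket_size_request_params(bucket, start, end):
    """Build get_metric_statistics kwargs mirroring the CLI call above.
    Note the 86400s period: BucketSizeBytes is only reported daily."""
    return {
        "Namespace": "AWS/S3",
        "MetricName": "BucketSizeBytes",
        "StartTime": start,
        "EndTime": end,
        "Period": 86400,
        "Statistics": ["Average"],
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
    }

def daily_bucket_size(bucket, start, end):
    """Live call; requires boto3 and AWS credentials (assumption)."""
    import boto3
    cw = boto3.client("cloudwatch")
    return cw.get_metric_statistics(
        **bucket_size_request_params(bucket, start, end))["Datapoints"]

# Build (but don't send) the request from the CLI example's arguments:
params = bucket_size_request_params(
    "opera-dev-isl-fwd-pyoon",
    datetime.datetime(2022, 6, 8, 23, 22),
    datetime.datetime(2022, 6, 8, 23, 59),
)
print(params["MetricName"], params["Period"])
```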

Perhaps there are other metrics we can measure instead that would give us the same or similar insight into what's happening in the PCM and where the bottlenecks lie. If we are looking to see whether the ingest workers are lagging behind the download workers (this is what a high-frequency ISL S3 accumulated-size metric would tell us), we could instead measure the length of the queue that the ingest workers consume from. I don't know whether those queue entries include file sizes; however, at least for HSLS and HSLL data, file sizes seem to be quite uniform.
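If the PCM's job queues live in RabbitMQ (as in stock HySDS — an assumption here), the queue depth the ingest workers consume from can be polled via the management HTTP API. A sketch; host, queue name, and credentials are all placeholders, 15672 is the default management port, and the parsing helper is separated so it can be tested without a broker:

```python
import base64
import json
import urllib.request

def queue_depth_from_stats(stats):
    """Extract ready + unacked message counts from a RabbitMQ
    management-API queue object."""
    return stats.get("messages_ready", 0) + stats.get("messages_unacknowledged", 0)

def fetch_queue_depth(host, queue, user, password, vhost="%2F"):
    """Live version: GET /api/queues/<vhost>/<queue> on the management port.
    All connection details here are hypothetical placeholders."""
    url = f"http://{host}:15672/api/queues/{vhost}/{queue}"
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return queue_depth_from_stats(json.load(resp))

# Offline demo with a fake management-API response:
print(queue_depth_from_stats(
    {"messages_ready": 12, "messages_unacknowledged": 3}))  # → 15
```

Sampling this every minute and writing it to the CSV would show directly whether the ingest queue is growing faster than it drains.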