nasa / opera-sds-int


Metrics collection

riverma opened this issue · comments

Brief Description

We need methods to capture some metrics to support our load testing goals. In some cases, there may be existing tools we can leverage already available in the PCM or by TPS tools, but in other cases, we may need to write custom scripts for metric collection.

Expected Behavior

  • Capability to capture the following metrics:
    • Accumulated size (in bytes) of a given AWS S3 bucket over a given time frequency (down to minutes)
    • Throughput (in bytes/sec) of a given AWS S3 bucket over a given time frequency (down to minutes)
    • Elasticsearch statistics (num docs, query time, etc.) for a given index over a given time frequency (down to minutes)
    • Elasticsearch instance health over time (RAM, storage)
    • PCM queue sizes (QUEUED / PENDING jobs especially) over a given time frequency (down to minutes)
    • AWS EC2 spot errors (insufficient capacity/terminations)
  • Capability to generate a CSV with metric values over time to support plotting
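For the Elasticsearch bullet above, the per-index document count and store size can be polled from the standard index-stats endpoint. A minimal sketch, assuming a plain HTTP ES cluster on port 9200 (the host and index names are placeholders); the parsing helper is separated so it can be exercised without a live cluster:

```python
import json
import urllib.request

def parse_index_stats(stats_json):
    """Pull docs count and store size out of an ES index-stats response
    (GET /<index>/_stats). Field paths follow the standard ES stats shape."""
    total = stats_json["_all"]["primaries"]
    return {
        "num_docs": total["docs"]["count"],
        "store_bytes": total["store"]["size_in_bytes"],
    }

def fetch_index_stats(es_host, index):
    """Live call against a cluster (hypothetical host/index)."""
    with urllib.request.urlopen(f"http://{es_host}:9200/{index}/_stats") as resp:
        return parse_index_stats(json.load(resp))

# Offline demo with a minimal stats payload:
sample = {"_all": {"primaries": {"docs": {"count": 1200},
                                 "store": {"size_in_bytes": 5 << 20}}}}
print(parse_index_stats(sample))  # → {'num_docs': 1200, 'store_bytes': 5242880}
```

Polling this on a timer at the desired frequency (down to minutes) would give the time series we need.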

Current Behavior

These metrics currently have to be calculated by hand.

Suggested Ideas on Resolution

There are likely tools already available to capture all of the above. We probably just need to create a script that obtains this information and writes it to a CSV file for easy plotting.
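The collection-script skeleton could be as simple as the following sketch: poll a set of metric callables on an interval and append one CSV row per sample. The metric functions here are stubs (placeholders for real boto3/Elasticsearch collectors), and the file path and interval are arbitrary:

```python
import csv
import time

def collect_metrics_to_csv(metric_fns, out_path, interval_sec, iterations):
    """Poll each named metric and append one CSV row per interval.

    metric_fns: dict of column name -> zero-arg callable returning a number.
    (The callables here are placeholders; real ones would wrap
    boto3 / Elasticsearch / queue queries.)
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp"] + list(metric_fns))
        for _ in range(iterations):
            row = [time.time()] + [fn() for fn in metric_fns.values()]
            writer.writerow(row)
            time.sleep(interval_sec)

# Demo with stub metrics; interval 0 so it finishes instantly:
collect_metrics_to_csv(
    {"s3_bytes": lambda: 0, "queued_jobs": lambda: 0},
    "load_test_metrics.csv",
    interval_sec=0,
    iterations=3,
)
```

The resulting CSV (one timestamp column plus one column per metric) plots directly in pandas or a spreadsheet.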

Some suggested resources to evaluate:

@hhlee445 @chrisjrd @maseca - FYI and to provide guidance to @philipjyoon.

A lot of these metrics would be directly impacted by the number of autoscaling fleets and their maximum sizes. Do we want to standardize these?

Some ideas from @niarenaw

  • Accumulated size (in bytes) of a given AWS S3 bucket over a given time frequency (down to minutes)
    • can do this programmatically by running the following command before and after the test and taking the difference: aws s3 ls s3://$BUCKET --recursive --summarize --human-readable
    • can also compute s3 size on the aws s3 console
  • Throughput (in bytes/sec) of a given AWS S3 bucket over a given time frequency (down to minutes)
    • can derive from previous metric and total length of load test
    • better granularity with Metrics tab on aws s3 console
  • Elasticsearch statistics (num docs, query time, etc.) for a given index over a given time frequency (down to minutes)
    • can use elasticsearch sdk or use the web ui to generate queries and filter by time range
    • I’m pretty horrible at the elasticsearch DSL syntax, but might be time I learn it properly
  • PCM queue sizes (QUEUED / PENDING jobs especially) over a given time frequency (down to minutes)
    • probably easiest to get these using Figaro and Lucene queries (ex. “job_queue:<> AND timestamp:<>” for each queue)
    • can make these programmatic by querying ES directly instead
  • AWS EC2 spot errors (insufficient capacity/terminations)
    • using the AWS CloudTrail console, can search for BidEvictedEvent events in a given time range
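The `aws s3 ls --recursive --summarize` approach above can be done in-process with boto3 by summing object sizes across `list_objects_v2` pages. A sketch, with the pure summation split out so it can be tested offline; the live function assumes boto3 is installed and AWS credentials with `s3:ListBucket` on the bucket:

```python
def sum_page_sizes(pages):
    """Sum the Size of every object across list_objects_v2 response pages."""
    return sum(obj["Size"] for page in pages for obj in page.get("Contents", []))

def bucket_size_bytes(bucket):
    """Live version: equivalent of
    `aws s3 ls s3://$BUCKET --recursive --summarize`.
    Requires boto3 + AWS credentials (assumption: s3:ListBucket allowed)."""
    import boto3  # imported lazily so the pure helper works offline
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    return sum_page_sizes(paginator.paginate(Bucket=bucket))

# Offline demo with fake response pages:
fake_pages = [
    {"Contents": [{"Size": 100}, {"Size": 250}]},
    {"Contents": [{"Size": 50}]},
    {},  # a page with no Contents key (empty result)
]
print(sum_page_sizes(fake_pages))  # → 400
```

Running `bucket_size_bytes` before and after the test and differencing gives the accumulated size, as Nick described.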

Thoughts on S3 size: What Nick has found seems to be the only way we can get near-real-time, high-frequency metrics on S3 bucket size. However, it can get very slow and costly for large buckets; I think it's something like 0.005 cents per object query?

I did find an alternative using CloudWatch, but it only works at daily frequency, so it is not very useful to us:

aws --profile saml-pub cloudwatch get-metric-statistics --namespace AWS/S3 --start-time 2022-06-08T23:22:00 --end-time 2022-06-08T23:59:00 --period 86400 --statistics Average --metric-name BucketSizeBytes --dimensions Name=BucketName,Value=opera-dev-isl-fwd-pyoon Name=StorageType,Value=StandardStorage
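The same daily-frequency CloudWatch query can be made from boto3. A sketch mirroring the CLI call above, with the request kwargs built by a pure helper (the bucket name is the one from the command; credentials/profile handling is left to the boto3 defaults):

```python
import datetime

def bucket_size_request_params(bucket, start, end):
    """Build get_metric_statistics kwargs mirroring the CLI call above.
    Note the 86400s period: BucketSizeBytes is only reported daily."""
    return {
        "Namespace": "AWS/S3",
        "MetricName": "BucketSizeBytes",
        "StartTime": start,
        "EndTime": end,
        "Period": 86400,
        "Statistics": ["Average"],
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
    }

def daily_bucket_size(bucket, start, end):
    """Live call; requires boto3 and AWS credentials (assumption)."""
    import boto3
    cw = boto3.client("cloudwatch")
    return cw.get_metric_statistics(
        **bucket_size_request_params(bucket, start, end))["Datapoints"]

# Build (but don't send) the request from the CLI example's arguments:
params = bucket_size_request_params(
    "opera-dev-isl-fwd-pyoon",
    datetime.datetime(2022, 6, 8, 23, 22),
    datetime.datetime(2022, 6, 8, 23, 59),
)
print(params["MetricName"], params["Period"])
```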

Perhaps there are other metrics we can measure instead that would give us the same or similar insight into what's happening in the PCM and where the bottlenecks lie. If we are looking to see whether the ingest workers are lagging behind the download workers (this is what a high-frequency ISL S3 accumulated-size metric would tell us), we could instead measure the length of the queue that the ingest workers consume from. I don't know whether those queue entries include file sizes; however, at least for HSLS and HSLL data, file sizes seem to be quite uniform.
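If the PCM's job queues live in RabbitMQ (as in stock HySDS — an assumption here), the queue depth the ingest workers consume from can be polled via the management HTTP API. A sketch; host, queue name, and credentials are all placeholders, 15672 is the default management port, and the parsing helper is separated so it can be tested without a broker:

```python
import base64
import json
import urllib.request

def queue_depth_from_stats(stats):
    """Extract ready + unacked message counts from a RabbitMQ
    management-API queue object."""
    return stats.get("messages_ready", 0) + stats.get("messages_unacknowledged", 0)

def fetch_queue_depth(host, queue, user, password, vhost="%2F"):
    """Live version: GET /api/queues/<vhost>/<queue> on the management port.
    All connection details here are hypothetical placeholders."""
    url = f"http://{host}:15672/api/queues/{vhost}/{queue}"
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return queue_depth_from_stats(json.load(resp))

# Offline demo with a fake management-API response:
print(queue_depth_from_stats(
    {"messages_ready": 12, "messages_unacknowledged": 3}))  # → 15
```

Sampling this every minute and writing it to the CSV would show directly whether the ingest queue is growing faster than it drains.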