Add K8s metrics fetching options, allowing configuration and exposing these metrics to the metrics collection stage
danielmapar opened this issue · comments
Is your feature request related to a problem? Please describe.
I am currently trying to access pods' CPU and memory consumption in order to write my own autoscaler. However, I can't seem to find that data inside `spec = json.loads(sys.stdin.read())`.
This is the JSON dump that I get from my `metric.py` script:
{"resource": {"metadata": {"name": "frontend-6b64dc9665-gx64z", "generateName": "frontend-6b64dc9665-", "namespace": "default", "uid": "8b63127d-ed2a-46c7-a2ac-b188b986d480", "resourceVersion": "82988", "creationTimestamp": "2021-02-14T04:28:38Z", "labels": {"app": "frontend", "pod-template-hash": "6b64dc9665"}, "annotations": {"sidecar.istio.io/rewriteAppHTTPProbers": "true"}, "ownerReferences": [{"apiVersion": "apps/v1", "kind": "ReplicaSet", "name": "frontend-6b64dc9665", "uid": "737d8ebd-1fd4-460b-b6c3-0a07d1981869", "controller": true, "blockOwnerDeletion": true}], "managedFields": [{"manager": "k3s", "operation": "Update", "apiVersion": "v1", "time": "2021-02-14T04:28:52Z", "fieldsType": "FieldsV1", "fieldsV1": {"f:metadata": {"f:annotations": {".": {}, "f:sidecar.istio.io/rewriteAppHTTPProbers": {}}, "f:generateName": {}, "f:labels": {".": {}, "f:app": {}, "f:pod-template-hash": {}}, "f:ownerReferences": {".": {}, "k:{\"uid\":\"737d8ebd-1fd4-460b-b6c3-0a07d1981869\"}": {".": {}, "f:apiVersion": {}, "f:blockOwnerDeletion": {}, "f:controller": {}, "f:kind": {}, "f:name": {}, "f:uid": {}}}}, "f:spec": {"f:containers": {"k:{\"name\":\"server\"}": {".": {}, "f:env": {".": {}, "k:{\"name\":\"AD_SERVICE_ADDR\"}": {".": {}, "f:name": {}, "f:value": {}}, "k:{\"name\":\"CART_SERVICE_ADDR\"}": {".": {}, "f:name": {}, "f:value": {}}, "k:{\"name\":\"CHECKOUT_SERVICE_ADDR\"}": {".": {}, "f:name": {}, "f:value": {}}, "k:{\"name\":\"CURRENCY_SERVICE_ADDR\"}": {".": {}, "f:name": {}, "f:value": {}}, "k:{\"name\":\"ENV_PLATFORM\"}": {".": {}, "f:name": {}, "f:value": {}}, "k:{\"name\":\"PORT\"}": {".": {}, "f:name": {}, "f:value": {}}, "k:{\"name\":\"PRODUCT_CATALOG_SERVICE_ADDR\"}": {".": {}, "f:name": {}, "f:value": {}}, "k:{\"name\":\"RECOMMENDATION_SERVICE_ADDR\"}": {".": {}, "f:name": {}, "f:value": {}}, "k:{\"name\":\"SHIPPING_SERVICE_ADDR\"}": {".": {}, "f:name": {}, "f:value": {}}}, "f:image": {}, "f:imagePullPolicy": {}, "f:livenessProbe": {".": {}, 
"f:failureThreshold": {}, "f:httpGet": {".": {}, "f:httpHeaders": {}, "f:path": {}, "f:port": {}, "f:scheme": {}}, "f:initialDelaySeconds": {}, "f:periodSeconds": {}, "f:successThreshold": {}, "f:timeoutSeconds": {}}, "f:name": {}, "f:ports": {".": {}, "k:{\"containerPort\":8080,\"protocol\":\"TCP\"}": {".": {}, "f:containerPort": {}, "f:protocol": {}}}, "f:readinessProbe": {".": {}, "f:failureThreshold": {}, "f:httpGet": {".": {}, "f:httpHeaders": {}, "f:path": {}, "f:port": {}, "f:scheme": {}}, "f:initialDelaySeconds": {}, "f:periodSeconds": {}, "f:successThreshold": {}, "f:timeoutSeconds": {}}, "f:resources": {".": {}, "f:limits": {".": {}, "f:cpu": {}, "f:memory": {}}, "f:requests": {".": {}, "f:cpu": {}, "f:memory": {}}}, "f:terminationMessagePath": {}, "f:terminationMessagePolicy": {}}}, "f:dnsPolicy": {}, "f:enableServiceLinks": {}, "f:restartPolicy": {}, "f:schedulerName": {}, "f:securityContext": {}, "f:serviceAccount": {}, "f:serviceAccountName": {}, "f:terminationGracePeriodSeconds": {}}, "f:status": {"f:conditions": {"k:{\"type\":\"ContainersReady\"}": {".": {}, "f:lastProbeTime": {}, "f:lastTransitionTime": {}, "f:status": {}, "f:type": {}}, "k:{\"type\":\"Initialized\"}": {".": {}, "f:lastProbeTime": {}, "f:lastTransitionTime": {}, "f:status": {}, "f:type": {}}, "k:{\"type\":\"Ready\"}": {".": {}, "f:lastProbeTime": {}, "f:lastTransitionTime": {}, "f:status": {}, "f:type": {}}}, "f:containerStatuses": {}, "f:hostIP": {}, "f:phase": {}, "f:podIP": {}, "f:podIPs": {".": {}, "k:{\"ip\":\"10.42.0.42\"}": {".": {}, "f:ip": {}}}, "f:startTime": {}}}}]}, "spec": {"volumes": [{"name": "default-token-mgjm5", "secret": {"secretName": "default-token-mgjm5", "defaultMode": 420}}], "containers": [{"name": "server", "image": "gcr.io/google-samples/microservices-demo/frontend:v0.2.1", "ports": [{"containerPort": 8080, "protocol": "TCP"}], "env": [{"name": "PORT", "value": "8080"}, {"name": "PRODUCT_CATALOG_SERVICE_ADDR", "value": "productcatalogservice:3550"}, 
{"name": "CURRENCY_SERVICE_ADDR", "value": "currencyservice:7000"}, {"name": "CART_SERVICE_ADDR", "value": "cartservice:7070"}, {"name": "RECOMMENDATION_SERVICE_ADDR", "value": "recommendationservice:8080"}, {"name": "SHIPPING_SERVICE_ADDR", "value": "shippingservice:50051"}, {"name": "CHECKOUT_SERVICE_ADDR", "value": "checkoutservice:5050"}, {"name": "AD_SERVICE_ADDR", "value": "adservice:9555"}, {"name": "ENV_PLATFORM", "value": "gcp"}], "resources": {"limits": {"cpu": "200m", "memory": "128Mi"}, "requests": {"cpu": "100m", "memory": "64Mi"}}, "volumeMounts": [{"name": "default-token-mgjm5", "readOnly": true, "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount"}], "livenessProbe": {"httpGet": {"path": "/_healthz", "port": 8080, "scheme": "HTTP", "httpHeaders": [{"name": "Cookie", "value": "shop_session-id=x-liveness-probe"}]}, "initialDelaySeconds": 10, "timeoutSeconds": 1, "periodSeconds": 10, "successThreshold": 1, "failureThreshold": 3}, "readinessProbe": {"httpGet": {"path": "/_healthz", "port": 8080, "scheme": "HTTP", "httpHeaders": [{"name": "Cookie", "value": "shop_session-id=x-readiness-probe"}]}, "initialDelaySeconds": 10, "timeoutSeconds": 1, "periodSeconds": 10, "successThreshold": 1, "failureThreshold": 3}, "terminationMessagePath": "/dev/termination-log", "terminationMessagePolicy": "File", "imagePullPolicy": "IfNotPresent"}], "restartPolicy": "Always", "terminationGracePeriodSeconds": 30, "dnsPolicy": "ClusterFirst", "serviceAccountName": "default", "serviceAccount": "default", "nodeName": "eecs6446proj1-1", "securityContext": {}, "schedulerName": "default-scheduler", "tolerations": [{"key": "node.kubernetes.io/not-ready", "operator": "Exists", "effect": "NoExecute", "tolerationSeconds": 300}, {"key": "node.kubernetes.io/unreachable", "operator": "Exists", "effect": "NoExecute", "tolerationSeconds": 300}], "priority": 0, "enableServiceLinks": true, "preemptionPolicy": "PreemptLowerPriority"}, "status": {"phase": "Running", "conditions": 
[{"type": "Initialized", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-02-14T04:28:38Z"}, {"type": "Ready", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-02-14T04:28:52Z"}, {"type": "ContainersReady", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-02-14T04:28:52Z"}, {"type": "PodScheduled", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-02-14T04:28:38Z"}], "hostIP": "192.168.23.92", "podIP": "10.42.0.42", "podIPs": [{"ip": "10.42.0.42"}], "startTime": "2021-02-14T04:28:38Z", "containerStatuses": [{"name": "server", "state": {"running": {"startedAt": "2021-02-14T04:28:39Z"}}, "lastState": {}, "ready": true, "restartCount": 0, "image": "gcr.io/google-samples/microservices-demo/frontend:v0.2.1", "imageID": "gcr.io/google-samples/microservices-demo/frontend@sha256:ee13a82435b5031a657d6dbc76327b95186e103d28dce5a47c5d835d040e441b", "containerID": "containerd://ed293f79974f4a24b70edb55d9610c4f5430a0b2403f20d8547bf0df527da177", "started": true}], "qosClass": "Burstable"}}, "runType": "scaler"}
Describe the solution you'd like
I would like to be able to access CPU and memory consumption by default inside the `spec` JSON object.
Describe alternatives you've considered
Looking at examples, I saw some autoscalers calling other pods' `/metrics` endpoint. I suppose this is a solution, but I wish this information could be gathered by the framework via the Kubernetes REST API.
Extra context
metrics.py

```python
import json
import sys


def main():
    # Parse spec into a dict
    spec = json.loads(sys.stdin.read())
    metric(spec)


def metric(spec):
    sys.stderr.write(json.dumps(spec))
    sys.stderr.write("No 'numPods' label on resource being managed")
    exit(1)


if __name__ == "__main__":
    main()
```
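The script above only dumps the spec and exits; for reference, a version that actually pulls values out of the payload shown earlier might look like this. Note this is a sketch, and it can only read the *configured* resource requests/limits, which are present in the dump, not live usage, which is the point of this feature request (`read_limits`, `parse_cpu`, and `parse_memory` are hypothetical helper names):

```python
import json


def parse_cpu(value):
    # Kubernetes CPU quantities: "200m" means 0.2 cores
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


def parse_memory(value):
    # Handles the binary suffixes seen in the dump above (Ki, Mi, Gi)
    units = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)


def read_limits(spec):
    # Extract the first container's resource limits from the CPA spec payload
    container = spec["resource"]["spec"]["containers"][0]
    limits = container["resources"]["limits"]
    return {
        "cpu_limit_cores": parse_cpu(limits["cpu"]),
        "memory_limit_bytes": parse_memory(limits["memory"]),
    }


# Minimal stand-in for the payload shown above
demo = {"resource": {"spec": {"containers": [
    {"resources": {"limits": {"cpu": "200m", "memory": "128Mi"}}}]}}}
print(json.dumps(read_limits(demo)))
```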
cpa.yaml

```yaml
apiVersion: custompodautoscaler.com/v1
kind: CustomPodAutoscaler
metadata:
  name: python-custom-autoscaler-frontend
spec:
  template:
    spec:
      containers:
        - name: python-custom-autoscaler-frontend
          image: danielmapar/python-custom-autoscaler:v1.3
          imagePullPolicy: IfNotPresent
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  config:
    - name: interval
      value: "10000"
```
config.yaml

```yaml
evaluate:
  type: "shell"
  timeout: 2500
  shell:
    entrypoint: "python"
    command:
      - "/evaluate.py"
metric:
  type: "shell"
  timeout: 2500
  shell:
    entrypoint: "python"
    command:
      - "/metric.py"
runMode: "per-pod"
```
Hey, thanks for raising this issue.
At the minute the Custom Pod Autoscaler does not gather any data from the K8s metrics server, but I definitely think this would be a good feature, since I imagine many autoscalers would use this info. Perhaps, as you've suggested, this data could be included in the JSON payload sent to the metric gathering stage of the pipeline.
The Horizontal Pod Autoscaler as a Custom Pod Autoscaler queries the metrics server, so some of the code could be taken from there and adapted for inclusion in the Custom Pod Autoscaler pipeline.
A simplified view of this pipeline would be:
- Gather K8s resource information (Pod/Deployment/StatefulSet etc.).
- Query the metrics server if the metrics server is available using the resource information retrieved.
- If the metrics server is not available, skip this step and continue.
- If the resource information is not available yet in the metrics server, skip this step and continue (for example if there are no metrics for the resource uploaded yet).
- Combine the K8s resource information with any gathered metrics information, serialising into JSON.
- Call the metrics stage, providing the JSON generated in the previous step.
- Get the results of the metrics stage, continue as normal.
- ...
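The steps above could be sketched roughly as follows (hypothetical function names for illustration; the real Custom Pod Autoscaler is written in Go, so this is only a shape, not the implementation):

```python
import json


def gather_and_run_metric_stage(resource, metrics_client, run_metric_stage):
    """Sketch of the proposed pipeline step.

    resource: the K8s resource info already gathered (Pod/Deployment/etc.)
    metrics_client: callable querying the metrics server by namespace/name
    run_metric_stage: callable invoking the user's metric stage with JSON
    """
    payload = {"resource": resource}
    try:
        # Query the metrics server for the managed resource, if reachable
        pod_metrics = metrics_client(resource["metadata"]["namespace"],
                                     resource["metadata"]["name"])
    except Exception:
        # Metrics server unavailable, or no metrics uploaded for the
        # resource yet: skip this step and continue without metrics
        pod_metrics = None
    if pod_metrics is not None:
        payload["metrics"] = pod_metrics
    # Combine resource info with any gathered metrics, serialise to JSON,
    # and call the metric stage with the result
    return run_metric_stage(json.dumps(payload))
```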
This could probably be built on further to introduce some flexibility, with extra configuration options:
- `includeResourceMetrics` = If `true`, will include CPU/memory/metrics server metrics if available; if `false`, will skip and not include them (default `false`).
- `requireResourceMetrics` = If `true`, will require metrics server information; if the server is not available, or the metrics for the resource are not available, it will fail with an appropriate error. If `false`, this will not occur (default `false`). Only applies if `includeResourceMetrics` is `true`.
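With these proposed options, the autoscaler's config.yaml above might grow by two lines, something like this (hypothetical; option names taken from the proposal above, not from a released version):

```yaml
includeResourceMetrics: true
requireResourceMetrics: false
```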
Does this feature sound like it would address your needs? If so I will set it as next on my to-do list for this project.
Yes, that would be perfect. Otherwise, another solution would be a step-by-step guide on how to integrate the metric step (metrics.py) with Prometheus or any popular metrics gatherer.
Based on your answer, I will just call the K8s API inside the metric step and get that information.
Thanks for the quick response, appreciate it.
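For reference, calling the metrics API from inside the metric script could look roughly like this. A sketch, not a tested implementation: it assumes a metrics server is installed and that the autoscaler's service account has been granted `get` on pods in the `metrics.k8s.io` API group; `metrics_url` and `fetch_pod_metrics` are hypothetical helper names:

```python
import json
import ssl
import urllib.request

# Standard in-cluster service account credential paths
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"


def metrics_url(namespace, pod, api_server="https://kubernetes.default.svc"):
    # Resource metrics endpoint served via the metrics.k8s.io API group
    return (f"{api_server}/apis/metrics.k8s.io/v1beta1"
            f"/namespaces/{namespace}/pods/{pod}")


def fetch_pod_metrics(namespace, pod):
    # Authenticate with the pod's service account token
    with open(TOKEN_PATH) as f:
        token = f.read().strip()
    ctx = ssl.create_default_context(cafile=CA_PATH)
    req = urllib.request.Request(
        metrics_url(namespace, pod),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.loads(resp.read())
```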
This is now available in the v1.1.0 release of the Custom Pod Autoscaler. Make sure you use the new v1.1.0 release of the Custom Pod Autoscaler Operator too, since it includes the `roleRequiresMetricsServer` flag for Custom Pod Autoscalers, which should make things a bit easier, as it will handle setting up access to the K8s metrics server for your autoscaler.