xlgao-zju / argo-chaos-mesh-plugin

An Argo plugin for executing Chaos Mesh experiments.



[Question] using plugin VS using resource template

zhuwenxing opened this issue

Hi, I came across this project at the Chaos Mesh monthly meetup.
It is an interesting project, and this is the first time I have learned that we can build plugins for Argo Workflows.

My team also uses Argo Workflows with Chaos Mesh, but we do it by using the resource template in Argo Workflows directly.

Example:

    - name: deploy-chaos
      resource:
        action: apply
        manifest: |
          kind: StressChaos
          apiVersion: chaos-mesh.org/v1alpha1
          metadata:
            name: memory-stress-{{workflow.parameters.label-instance}}-{{workflow.parameters.label-component}}
            namespace: default
          spec:
            selector:
              namespaces:
                - default
              labelSelectors:
                app.kubernetes.io/name: milvus
                app.kubernetes.io/instance: {{workflow.parameters.label-instance}}
                app.kubernetes.io/component: {{workflow.parameters.label-component}}
            mode: all
            value: "2"
            stressors:
              memory:
                workers: {{workflow.parameters.memory-workers}}
                size: {{workflow.parameters.memory-size}}
            duration: {{workflow.parameters.chaos-duration}}
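
For comparison, what the resource template actually applies is this manifest after Argo has substituted the `{{workflow.parameters.NAME}}` placeholders. The Python sketch below mimics only that substitution step. It is a simplified stand-in, not Argo's templating engine, and the `render` helper and the example values (`my-release`, `querynode`, `5m`) are hypothetical.

```python
import re
from typing import Dict

# Matches the {{workflow.parameters.NAME}} placeholders used in the manifest above.
_PARAM = re.compile(r"\{\{workflow\.parameters\.([A-Za-z0-9_-]+)\}\}")


def render(manifest: str, params: Dict[str, str]) -> str:
    """Substitute workflow parameters into a manifest string.

    Hypothetical, simplified stand-in for Argo's own templating: it only
    handles {{workflow.parameters.NAME}} and raises KeyError on unknown names.
    """
    def sub(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing workflow parameter: {name}")
        return params[name]

    return _PARAM.sub(sub, manifest)


if __name__ == "__main__":
    snippet = (
        "name: memory-stress-{{workflow.parameters.label-instance}}-"
        "{{workflow.parameters.label-component}}\n"
        "duration: {{workflow.parameters.chaos-duration}}\n"
    )
    print(render(snippet, {
        "label-instance": "my-release",
        "label-component": "querynode",
        "chaos-duration": "5m",
    }))
```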

I briefly went through the demo and the code. It seems that you wrap the manifest in a request body, parse it, and then use a client to apply it.
This seems to make things more complicated.
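
A rough sketch of that plugin-side flow (receive a request body, extract the manifest, apply it with a client), assuming the request/response shape of Argo's executor-plugin protocol as we understand it: a JSON body with `workflow` and `template` keys, answered with a `node` object carrying a `phase` and a `message`. The `chaos-mesh` plugin key, the `manifest` field, and the `apply` callback (a stand-in for a real Kubernetes client call) are hypothetical, and the HTTP server wiring is omitted.

```python
import json
from typing import Any, Callable, Dict

PLUGIN_KEY = "chaos-mesh"  # hypothetical plugin name under template.plugin


def node(phase: str, message: str) -> Dict[str, Any]:
    """Build a minimal Argo-style node result (shape assumed, see lead-in)."""
    return {"node": {"phase": phase, "message": message}}


def handle_execute(body: bytes, apply: Callable[[Dict[str, Any]], None]) -> Dict[str, Any]:
    """Parse one template.execute request body and apply the embedded manifest.

    `apply` is a placeholder for the Kubernetes client call that creates the
    Chaos Mesh object; it is passed in so the handler can run without a cluster.
    """
    request = json.loads(body)
    workflow = request.get("workflow", {}).get("metadata", {}).get("name", "")
    args = request.get("template", {}).get("plugin", {}).get(PLUGIN_KEY)
    manifest = args.get("manifest") if isinstance(args, dict) else None
    # Sketch only: treat a missing or malformed manifest as a failed step.
    if not isinstance(manifest, dict) or "kind" not in manifest:
        return node("Failed", "no Chaos Mesh manifest in plugin arguments")
    try:
        apply(manifest)
    except Exception as exc:  # report client errors as a failed step
        return node("Failed", f"apply failed: {exc}")
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    return node("Succeeded", f"{manifest['kind']} {name} created for workflow {workflow}")


if __name__ == "__main__":
    applied = []
    body = json.dumps({
        "workflow": {"metadata": {"name": "chaos-test"}},
        "template": {"plugin": {PLUGIN_KEY: {"manifest": {
            "apiVersion": "chaos-mesh.org/v1alpha1",
            "kind": "StressChaos",
            "metadata": {"name": "memory-stress", "namespace": "default"},
        }}}},
    }).encode()
    print(handle_execute(body, applied.append))
```

Injecting `apply` keeps the handler testable without a cluster.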

@zhuwenxing Yes, we can create Chaos Mesh experiments your way, and it is convenient. But it is a bit too simple.

This project is still at a very early stage, so for now we can only inject and recover experiments.

However, we can extend the capabilities of the plugin in several ways that a resource template in Argo Workflows cannot.

e.g.

  • We can store the details of the experiment in a database.
  • Even if the Chaos Mesh experiment is created successfully in Kubernetes, the injection itself may fail. We can compute the state of the experiment and return an error when the injection fails, so that Argo knows this workflow step has failed and terminates the workflow (depending on its settings).
  • We can add experiment-specific logic. For example, after running a pod-kill experiment, we can use a label selector to find the new pods and verify the self-recovery of the system from their state.
  • When an Argo workflow is terminated, we can recover the experiment to ensure that all injected faults are cleaned up.
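
The second and third points could be sketched as two small functions a plugin calls while polling the cluster. This is a sketch under assumptions, not this project's code: the `AllInjected` condition type is taken from the `status.conditions` of Chaos Mesh 2.x chaos objects, and the function names, the 60-second timeout, and the idea of passing in pods already fetched with the experiment's label selector are illustrative choices. Returning `Running` assumes the plugin is polled again (Argo's executor-plugin reply has a `requeue` field for this, as we understand it).

```python
from typing import Any, Dict, List, Optional, Set


def condition(chaos: Dict[str, Any], ctype: str) -> Optional[str]:
    """Return the status ("True"/"False") of a status condition, or None if absent."""
    status = chaos.get("status") or {}
    for cond in status.get("conditions") or []:
        if cond.get("type") == ctype:
            return cond.get("status")
    return None


def injection_phase(chaos: Dict[str, Any], elapsed_s: float, timeout_s: float = 60.0) -> str:
    """Map a Chaos Mesh object's status to an Argo node phase.

    Succeeded once the AllInjected condition is "True" (assumed condition type);
    Failed if that has not happened within timeout_s seconds of creation;
    otherwise Running, meaning the plugin should check again later.
    """
    if condition(chaos, "AllInjected") == "True":
        return "Succeeded"
    if elapsed_s >= timeout_s:
        return "Failed"
    return "Running"


def recovered(old_uids: Set[str], pods: List[Dict[str, Any]], expected: int) -> bool:
    """True if at least `expected` pods that did not exist before the pod-kill
    experiment are Running. `pods` is whatever a label-selector query returned,
    shaped like Kubernetes Pod objects (hypothetical input)."""
    fresh = [
        p for p in pods
        if p["metadata"]["uid"] not in old_uids
        and (p.get("status") or {}).get("phase") == "Running"
    ]
    return len(fresh) >= expected


if __name__ == "__main__":
    chaos = {"status": {"conditions": [
        {"type": "Selected", "status": "True"},
        {"type": "AllInjected", "status": "True"},
    ]}}
    print(injection_phase(chaos, elapsed_s=5))  # Succeeded
    pods = [{"metadata": {"uid": "new-pod"}, "status": {"phase": "Running"}}]
    print(recovered({"old-pod"}, pods, expected=1))  # True
```

Using a timeout rather than failing on the first non-injected status avoids marking a step as failed while the Chaos Mesh controller is still selecting and injecting targets.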