argoproj / argo-workflows

Workflow Engine for Kubernetes

Home Page: https://argo-workflows.readthedocs.io/

EKS Pod Identity S3 Artifact is not working on 3.5.5

rafilkmp3 opened this issue

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

I have the following configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflows-artifact-repository
  namespace: workflows
data:
  v2-s3-artifact-repository: |
    s3:
      bucket: redacted-prod-artifacts
      endpoint: s3.amazonaws.com
      region: us-east-2
      useSDKCreds: true

The service account is already set up correctly, and the minio client is assuming the correct role, which has more than enough permissions on S3. The role also already has the following trust relationship:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:TagSession",
                "sts:AssumeRole"
            ]
        }
    ]
}
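
As a quick sanity check of a trust policy like the one above, the following sketch verifies that it lets the `pods.eks.amazonaws.com` service principal perform both `sts:AssumeRole` and `sts:TagSession`, the two actions EKS Pod Identity needs. The function name `allows_pod_identity` is hypothetical and not part of any AWS or Argo tooling, and the check deliberately ignores `Condition` blocks and wildcard actions.

```python
import json

# Actions and principal taken from the trust policy shown in the issue.
REQUIRED_ACTIONS = {"sts:AssumeRole", "sts:TagSession"}
POD_IDENTITY_PRINCIPAL = "pods.eks.amazonaws.com"


def _as_list(value):
    """IAM accepts either a single string or a list; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def allows_pod_identity(policy_json: str) -> bool:
    """Hypothetical helper: True if the policy grants both required actions
    to the Pod Identity service principal. Conditions are not evaluated."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]

    granted = set()
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        if not isinstance(principal, dict):
            continue
        if POD_IDENTITY_PRINCIPAL not in _as_list(principal.get("Service")):
            continue
        granted.update(_as_list(stmt.get("Action")))
    return REQUIRED_ACTIONS <= granted
```

Passing the policy above as a string returns `True`; a policy that grants only `sts:AssumeRole` returns `False`.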

Version

v3.5.5

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

spec:
  workflowSpec:
    templates:
      - name: main
        inputs: {}
        outputs: {}
        nodeSelector:
          kubernetes.io/arch: amd64
        metadata: {}
        steps:
          - - name: generate-data
              template: generate-data
              arguments:
                parameters:
                  - name: lookback_hours
                    value: '1'

and this is the template used:

      - name: generate-data
        inputs:
          parameters:
            - name: lookback_hours
              value: '1'
        outputs:
          artifacts:
            - name: interactions-csv
              path: /app/interactions.csv
            - name: items-csv
              path: /app/items.csv
            - name: users-csv
              path: /app/users.csv
        nodeSelector:
          kubernetes.io/arch: amd64
        metadata: {}
        script:
          name: ''
          image: >-
            redacted.dkr.ecr.us-east-2.amazonaws.com/personalize-updater:latest
          command:
            - bash
          resources:
            limits:
              memory: 12Gi
            requests:
              cpu: '2'
              memory: 12Gi
          source: >
            set -exu

            poetry run python -m src.main --lookback_hours
            {{inputs.parameters.lookback_hours}}
        serviceAccountName: personalize-updater-serviceaccount
        podSpecPatch: >-
          {"containers":[{"name":"wait","resources":{"limits":{"cpu":"{{workflow.parameters.cpu-limit}}","memory":"{{workflow.parameters.mem-limit}}"}}}]}

Logs from the workflow controller

I believe they are irrelevant in this case; the controller logs show nothing related to the issue.

Logs from your workflow's wait container

time="2024-04-17T19:35:44.681Z" level=info msg="Starting Workflow Executor" version=v3.5.5
time="2024-04-17T19:35:44.683Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-04-17T19:35:44.683Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=workflows podName=personalize-hourly-updater-2gpmv-generate-data-3777018436 templateName=generate-data version="&Version{Version:v3.5.5,BuildDate:2024-02-29T20:59:20Z,GitCommit:c80b2e91ebd7e7f604e88442f45ec630380effa0,GitTag:v3.5.5,GitTreeState:clean,GoVersion:go1.21.7,Compiler:gc,Platform:linux/amd64,}"
time="2024-04-17T19:35:44.696Z" level=info msg="Starting deadline monitor"
time="2024-04-17T19:36:08.706Z" level=info msg="Main container completed" error="<nil>"
time="2024-04-17T19:36:08.706Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-04-17T19:36:08.706Z" level=info msg="No output parameters"
time="2024-04-17T19:36:08.706Z" level=info msg="Saving output artifacts"
time="2024-04-17T19:36:08.706Z" level=info msg="Staging artifact: interactions-csv"
time="2024-04-17T19:36:08.706Z" level=info msg="Copying /app/interactions.csv from container base image layer to /tmp/argo/outputs/artifacts/interactions-csv.tgz"
time="2024-04-17T19:36:08.706Z" level=info msg="/var/run/argo/outputs/artifacts/app/interactions.csv.tgz -> /tmp/argo/outputs/artifacts/interactions-csv.tgz"
time="2024-04-17T19:36:08.706Z" level=info msg="S3 Save path: /tmp/argo/outputs/artifacts/interactions-csv.tgz, key: personalize-hourly-updater-2gpmv/personalize-hourly-updater-2gpmv-generate-data-3777018436/interactions-csv.tgz"
time="2024-04-17T19:36:08.714Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::redacted:role/prod-cluster-v11-personalize-20240411174459608400000001"
2024/04/17 19:36:08 Ignoring, HTTP credential provider invalid endpoint host, "169.254.170.23", only loopback hosts are allowed. <nil>
time="2024-04-17T19:36:08.778Z" level=warning msg="Non-transient error: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-04-17T19:36:08.778Z" level=info msg="Save artifact" artifactName=interactions-csv duration=72.185513ms error="failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" key=personalize-hourly-updater-2gpmv/personalize-hourly-updater-2gpmv-generate-data-3777018436/interactions-csv.tgz
time="2024-04-17T19:36:08.778Z" level=error msg="executor error: failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-04-17T19:36:08.802Z" level=info msg="Alloc=9319 TotalAlloc=16464 Sys=23653 NumGC=4 Goroutines=8"
time="2024-04-17T19:36:08.810Z" level=fatal msg="failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
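
The line that matters most in this log is the one saying the HTTP credential provider ignored host "169.254.170.23" because "only loopback hosts are allowed". The sketch below illustrates the kind of host check that message describes; it is a Python illustration, not the AWS SDK's actual Go code. The function name `is_loopback_endpoint` is hypothetical, and the full URL is an assumption, since the log shows only the host. A check like this rejects the link-local address, so the provider is skipped, which would explain the `NoCredentialProviders` error that follows.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_endpoint(url: str) -> bool:
    """Hypothetical helper: True only if the URL's host is a loopback address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (e.g. an arbitrary DNS name): treat as non-loopback.
        return False


# The host comes from the log line above; the path is an assumed example.
pod_identity_url = "http://169.254.170.23/v1/credentials"

print(is_loopback_endpoint(pod_identity_url))             # False
print(is_loopback_endpoint("http://127.0.0.1:8080/creds"))  # True
```

169.254.170.23 falls in the link-local range 169.254.0.0/16 rather than the loopback range 127.0.0.0/8, so a loopback-only rule will never accept it.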

v3.5.5

#12651 (the fix for #12650 that you commented on) was not backported to 3.5.x.
You could test it with the :latest image

  • I can confirm the issue exists when I tested with :latest

v3.5.5 is not :latest, it is the latest stable.

Thank you, it works on :latest. I was using 3.5.5, which was the latest stable release at the time. Do you know if this will land in a stable release any time soon? I see that 3.5.6 was released recently, but it doesn't include that change, and I'd like to pin a defined tag instead of :latest.

We generally follow this doc: https://argo-workflows.readthedocs.io/en/latest/releases/

#12651 wasn't a security patch, so it didn't get backported. Unless it gets backported, it won't be released till 3.6.
We are just about due for the next minor release cycle, so at the last Contributor Meeting I asked about starting it with alphas, since 3.5 is still buggy/unstable (primarily #12025 and related issues, due to #11121). It's currently deferred because of the 3.5 bugginess. As such, I imagine a 3.6 RC and then a stable release won't be out for at least a few months.

You can build your own custom images with that cherry-picked into v3.5.6, for instance. Or you can use the commit hash instead of :latest if you're fine with a dev build.

Thank you, it works on :latest

Closing since it does work as intended.

Also, since you got it working, would you be interested in documenting it yourself? It's still missing docs, and I haven't used it, so I can't write them myself.