EKS Pod Identity S3 Artifact is not working on 3.5.5
rafilkmp3 opened this issue · comments
Pre-requisites
- I have double-checked my configuration
- I can confirm the issue exists when I tested with :latest
- I have searched existing issues and could not find a match for this bug
- I'd like to contribute the fix myself (see contributing guide)
What happened/what did you expect to happen?
I have the following configmap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflows-artifact-repository
  namespace: workflows
data:
  v2-s3-artifact-repository: |
    s3:
      bucket: redacted-prod-artifacts
      endpoint: s3.amazonaws.com
      region: us-east-2
      useSDKCreds: true
The service account is already set up correctly, and the minio client is assuming the correct role, which has more than enough permissions on S3. The role also already has the following trust relationship:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
        "sts:TagSession",
        "sts:AssumeRole"
      ]
    }
  ]
}
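One way to double-check that the Pod Identity association is actually reaching the pod is to look for the two environment variables that EKS Pod Identity injects into associated pods. This is a minimal sketch, assuming it is run inside the workflow pod; the helper name `pod_identity_env` is hypothetical and not part of Argo or the AWS SDK:

```python
import os

# Environment variables that EKS Pod Identity injects into associated pods.
POD_IDENTITY_VARS = (
    "AWS_CONTAINER_CREDENTIALS_FULL_URI",
    "AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE",
)


def pod_identity_env(env=None):
    """Return the Pod Identity variables that are set, mapped to their values."""
    env = os.environ if env is None else env
    return {name: env[name] for name in POD_IDENTITY_VARS if name in env}


if __name__ == "__main__":
    found = pod_identity_env()
    missing = [name for name in POD_IDENTITY_VARS if name not in found]
    print("found:", found)
    print("missing:", missing)
```

If both variables are present, the association itself is probably fine and the problem is more likely in how the client builds its credential chain.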
Version
v3.5.5
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
spec:
  workflowSpec:
    templates:
      - name: main
        inputs: {}
        outputs: {}
        nodeSelector:
          kubernetes.io/arch: amd64
        metadata: {}
        steps:
          - - name: generate-data
              template: generate-data
              arguments:
                parameters:
                  - name: lookback_hours
                    value: '1'
and this is the template used:
- name: generate-data
  inputs:
    parameters:
      - name: lookback_hours
        value: '1'
  outputs:
    artifacts:
      - name: interactions-csv
        path: /app/interactions.csv
      - name: items-csv
        path: /app/items.csv
      - name: users-csv
        path: /app/users.csv
  nodeSelector:
    kubernetes.io/arch: amd64
  metadata: {}
  script:
    name: ''
    image: >-
      redacted.dkr.ecr.us-east-2.amazonaws.com/personalize-updater:latest
    command:
      - bash
    resources:
      limits:
        memory: 12Gi
      requests:
        cpu: '2'
        memory: 12Gi
    source: >
      set -exu
      poetry run python -m src.main --lookback_hours
      {{inputs.parameters.lookback_hours}}
  serviceAccountName: personalize-updater-serviceaccount
  podSpecPatch: >-
    {"containers":[{"name":"wait","resources":{"limits":{"cpu":"{{workflow.parameters.cpu-limit}}","memory":"{{workflow.parameters.mem-limit}}"}}}]}
Logs from the workflow controller
I believe they are irrelevant in this case; the controller logs show nothing related to the issue.
Logs from your workflow's wait container
time="2024-04-17T19:35:44.681Z" level=info msg="Starting Workflow Executor" version=v3.5.5
time="2024-04-17T19:35:44.683Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-04-17T19:35:44.683Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=workflows podName=personalize-hourly-updater-2gpmv-generate-data-3777018436 templateName=generate-data version="&Version{Version:v3.5.5,BuildDate:2024-02-29T20:59:20Z,GitCommit:c80b2e91ebd7e7f604e88442f45ec630380effa0,GitTag:v3.5.5,GitTreeState:clean,GoVersion:go1.21.7,Compiler:gc,Platform:linux/amd64,}"
time="2024-04-17T19:35:44.696Z" level=info msg="Starting deadline monitor"
time="2024-04-17T19:36:08.706Z" level=info msg="Main container completed" error="<nil>"
time="2024-04-17T19:36:08.706Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-04-17T19:36:08.706Z" level=info msg="No output parameters"
time="2024-04-17T19:36:08.706Z" level=info msg="Saving output artifacts"
time="2024-04-17T19:36:08.706Z" level=info msg="Staging artifact: interactions-csv"
time="2024-04-17T19:36:08.706Z" level=info msg="Copying /app/interactions.csv from container base image layer to /tmp/argo/outputs/artifacts/interactions-csv.tgz"
time="2024-04-17T19:36:08.706Z" level=info msg="/var/run/argo/outputs/artifacts/app/interactions.csv.tgz -> /tmp/argo/outputs/artifacts/interactions-csv.tgz"
time="2024-04-17T19:36:08.706Z" level=info msg="S3 Save path: /tmp/argo/outputs/artifacts/interactions-csv.tgz, key: personalize-hourly-updater-2gpmv/personalize-hourly-updater-2gpmv-generate-data-3777018436/interactions-csv.tgz"
time="2024-04-17T19:36:08.714Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::redacted:role/prod-cluster-v11-personalize-20240411174459608400000001"
2024/04/17 19:36:08 Ignoring, HTTP credential provider invalid endpoint host, "169.254.170.23", only loopback hosts are allowed. <nil>
time="2024-04-17T19:36:08.778Z" level=warning msg="Non-transient error: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-04-17T19:36:08.778Z" level=info msg="Save artifact" artifactName=interactions-csv duration=72.185513ms error="failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors" key=personalize-hourly-updater-2gpmv/personalize-hourly-updater-2gpmv-generate-data-3777018436/interactions-csv.tgz
time="2024-04-17T19:36:08.778Z" level=error msg="executor error: failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
time="2024-04-17T19:36:08.802Z" level=info msg="Alloc=9319 TotalAlloc=16464 Sys=23653 NumGC=4 Goroutines=8"
time="2024-04-17T19:36:08.810Z" level=fatal msg="failed to create new S3 client: NoCredentialProviders: no valid providers in chain. Deprecated.\n\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors"
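The "only loopback hosts are allowed" line in the wait container log looks like the key one. The credentials endpoint host it reports, 169.254.170.23, is a link-local address and not a loopback one, so a provider that accepts only loopback hosts skips it and the chain ends up with no valid providers. The following is an illustrative sketch of that kind of host check, not the AWS SDK's actual code; `is_loopback_host` is a hypothetical name:

```python
import ipaddress


def is_loopback_host(host):
    """Return True if host is 'localhost' or a loopback IP literal."""
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; this sketch does not resolve DNS names.
        return False


if __name__ == "__main__":
    for host in ("127.0.0.1", "::1", "localhost", "169.254.170.23"):
        print(host, is_loopback_host(host))
```

Under this check the host from the log is rejected, which matches the message.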
v3.5.5
#12651 (the fix for #12650 that you commented on) was not backported to 3.5.x. You could test it with the :latest image.
- I can confirm the issue exists when I tested with :latest
v3.5.5 is not :latest; it is the latest stable.
Thank you, it works on :latest. I was using 3.5.5, which was the latest stable release at the time. Do you know if this fix will reach a stable release any time soon? I see that 3.5.6 was released recently but doesn't include that change, and I'd like to pin a defined tag instead of :latest.
We generally follow this doc: https://argo-workflows.readthedocs.io/en/latest/releases/
#12651 wasn't a security patch, so it didn't get backported. Unless it gets backported, it won't be released till 3.6.
We are just about on the next minor release cycle, so I asked about starting it in the last Contributor Meeting with alphas since 3.5 is still buggy/unstable (primarily #12025 and related due to #11121). It's currently deferred due to the 3.5 bugginess. As such I imagine a 3.6 RC and then stable won't be out for a few months at least.
You can build your own custom images with that cherry-picked into v3.5.6, for instance. Or you can use the commit hash instead of :latest if you're fine with a dev build.
Thank you, it works on :latest
Closing since it does work as intended.
Also, since you got it working, would you be interested in documenting it yourself? It's still missing docs, and I haven't used it, so I can't write them myself.