kubeflow / pipelines

Machine Learning Pipelines for Kubeflow

Home Page: https://www.kubeflow.org/docs/components/pipelines/


S3 support in Kubeflow Pipelines

Jeffwan opened this issue · comments

Creating this parent issue to track all known S3 problems in KFP; I'd also like to discuss more feature requests in this ticket.

Use cases

Replace Minio with S3 for Argo artifact store and KFP Pipeline store.

- [ ] Manifest changes in kubeflow/manifests and the standalone kubeflow/pipelines manifests to easily switch from Minio to S3
- [ ] IRSA - bump the minio-sdk-go version to support IRSA
- [ ] IRSA - bump the Argo Workflows version to make sure the runner can persist artifacts to S3

Related issues and PRs

UI

- [x] KFP UI should be able to read source files in S3, for example:

mlpipeline-ui-metadata -> minio://mlpipeline/artifacts/pipeline-A/pipeline-A-run-id/mlpipeline-ui-metadata.tgz

{
  "outputs": [
    {
      "source": "s3://your_bucket/README.md",
      "type": "markdown"
    }
  ]
}

- [ ] KFP UI should be able to read artifact files in S3

mlpipeline-ui-metadata -> s3://mlpipeline/artifacts/pipeline-A/pipeline-A-run-id/mlpipeline-ui-metadata.tgz

{
  "outputs": [
    {
      "source": "s3://your_bucket/README.md",
      "type": "markdown"
    }
  ]
}

- [ ] IRSA - bump the minio-js version to support IRSA

SDK

- [x] User can declare a custom S3 artifact location inside the pipeline.
- [x] User can apply AWS credentials to pipeline pods to get access to S3.
- [x] User can specify a service account for the pipeline (IRSA).

More examples

/kind improvement
/area frontend
/area backend

/cc @eterna2 @discordianfish

@Jeffwan: The label(s) kind/improvement cannot be applied, because the repository doesn't have them


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SDK: User can declare a custom S3 artifact location inside the pipeline.

This is deprecated. The supported method is to configure the artifact location in the workflow-configmap

There is also an option to use Minio gateway to proxy requests to the actual storage system (Google Cloud Storage or Amazon S3). WDYT?

@Ark-kun artifact location in workflow-configmap works for us now.
Do you have any docs or code reference on minio gateway?

Do you have any docs or code reference on minio gateway?

Here is the Minio doc: https://docs.minio.io/docs/minio-gateway-for-s3.html
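For reference, a minimal sketch of what running Minio in S3 gateway mode could look like as a container spec in a Kubernetes Deployment. This is illustrative only (the container name, image tag behavior, and port are assumptions, not from this thread); in gateway mode the Minio access/secret keys double as the AWS credentials the gateway uses to reach S3, and an instance profile can be used instead.

```yaml
# Hypothetical container spec: Minio proxying requests to Amazon S3.
containers:
  - name: minio
    image: minio/minio
    args: ["gateway", "s3", "https://s3.amazonaws.com"]
    env:
      - name: MINIO_ACCESS_KEY
        valueFrom:
          secretKeyRef: {name: mlpipeline-minio-artifact, key: accesskey}
      - name: MINIO_SECRET_KEY
        valueFrom:
          secretKeyRef: {name: mlpipeline-minio-artifact, key: secretkey}
    ports:
      - containerPort: 9000
```

With this in place, KFP components keep talking to the in-cluster Minio endpoint as before, and the gateway forwards object operations to S3.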

@Jeffwan

Update metadata writer service to support S3 artifacts
The metadata writer service currently hardcodes metadata artifact URLs to Minio. We need to investigate how to infer whether the artifact store endpoint is S3 or Minio for the metadata service.

def argo_artifact_to_uri(artifact: dict) -> str:

See also

https://kubeflow.slack.com/archives/CE10KS9M4/p1586408517128600

The metadata writer service currently hardcodes metadata artifact URLs to Minio.

The frontend translates this to ReadArtifact backend calls.

Regarding S3:
I think there are two distinct cases:
1. Artifacts: URIs should be based on the workflow-controller-configmap S3 bucket configuration.
2. mlpipeline-ui-metadata entries, which might contain any kind of URIs.

@Ark-kun the Minio gateway seems to be an extra layer on top of Minio? Users need to use the Minio service with the gateway to delegate requests to S3. Is that correct? I think that solves some of the requirements and is fairly easy, but it adds an extra layer.

mlpipeline-ui-metadata entries, which might contain any kind of URIs.

Yes. The problem is that it's always Minio here; if a user wants to replace Minio with S3, it currently doesn't work well.

I'm confused as well. If somebody could outline what steps need to be taken to support s3, I can help implementing them. But reading the comments here leaves me none the wiser.

@Ark-kun the Minio gateway seems to be an extra layer on top of Minio? Users need to use the Minio service with the gateway to delegate requests to S3. Is that correct? I think that solves some of the requirements and is fairly easy, but it adds an extra layer.

Yes. It serves as a proxy. The advantage of using it is that all services (UI, backend, Argo, etc.) only need to talk to one kind of service - the in-cluster Minio. And the configuration (credentials, etc.) can be specified in one place. @IronPan or @numerology might have more information.

@Ark-kun I checked the gateway and ran some tests. It's working fine. I can file a PR to the AWS overlay. The open question is that some users don't want to use Minio as a gateway - do we want to support direct object store access? The PRs listed above aim to solve this problem.

@rmgogogo the e2fyi repo cannot replace Minio entirely. It still needs some changes.

Isn't Minio supposed to be compatible with S3? I don't understand the need for the gateway.

How do we authenticate clients connecting to the minio gateway? Does that require an additional layer of authentication? Or is this enforced by istio on a networking level?

Not convinced this is better than granting individual components IAM permissions.

@discordianfish The gateway approach is fairly easy with the current manifest structure. KFP just needs S3 permissions, and Minio becomes a proxy to S3. There's no need to make code changes everywhere to check the protocol. As you can see from the PRs above, it's fairly painful right now. We would also need to provide credentials or attach roles to every KFP component that talks to S3.

I agree this won't meet everyone's needs; we definitely need low-level object storage configuration.

@Jeffwan What I don't understand is why we need to distinguish between Minio and S3 at all, instead of just tracking the bucket name and the endpoint. The Minio client would be configured with auth credentials or fall back to the instance role. That should be the same config for all components, no?

But okay, if people want to go down the gateway route: How do all components that need s3/minio access authenticate against the gateway? Not using auth here would be a no go for us.

@discordianfish

What I don't understand is why we need to distinguish between minio and s3 at all, instead of just tracking the bucket name and the endpoint.

I see. It's not consistent across all components. KFP has lots of references with a fixed protocol like minio://, which won't work for S3.

But okay, if people want to go down the gateway route: How do all components that need s3/minio access authenticate against the gateway? Not using auth here would be a no go for us.

The only difference between Minio and the Minio gateway with S3 is that the user needs to give credentials with an S3 policy to the Minio pod. An instance profile will work as well. I have not checked IRSA or Kube2IAM for fine-grained control yet.

Just want to add a note.

One main reason why we moved to S3 instead of the Minio gateway (the gateway was our original deployment) is that tensorboard doesn't work with Minio.

It only supports GCS and S3. It was using boto3 instead of minio the last time I checked (maybe 6 months ago?).

Other than that, I think the Minio gateway is fine as long as the throughput is fine. But the concern is that you can't have fine-grained RBAC control with the Minio gateway.

And it is confusing to our data scientists when they have to mix S3 and Minio (i.e. S3 for our data, Minio for data generated within KFP).

I see. It's not consistent across all components. KFP has lots of references with a fixed protocol like minio://, which won't work for S3.

Yes, but why? The protocol part in that URL is just for the client to know where to connect. So for Minio and S3, I suggest we always use the endpoint, bucket name, and path to refer to an object. It shouldn't matter to the client whether this is S3 or Minio. Does that make sense?

The only difference between Minio and the Minio gateway with S3 is that the user needs to give credentials with an S3 policy to the Minio pod. An instance profile will work as well. I have not checked IRSA or Kube2IAM for fine-grained control yet.

I understand that, but we need to authenticate access from Kubeflow components to the Minio gateway. If I understand you correctly, with the Minio gateway everything that can reach Minio will have access to all S3 content without auth. That's not acceptable for us.

I see. It's not consistent across all components. KFP has lots of references with a fixed protocol like minio://, which won't work for S3.

Yes, but why? The protocol part in that URL is just for the client to know where to connect. So for Minio and S3, I suggest we always use the endpoint, bucket name, and path to refer to an object. It shouldn't matter to the client whether this is S3 or Minio. Does that make sense?

https://github.com/kubeflow/pipelines/pull/3531/files#diff-4e5a9c52bbed342532f527bae3aae002R15-R20
This is one example; lots of code relies on the protocol for custom logic. I get your point, and that would be the ideal case, but we don't always have the endpoint passed along with the artifact URL everywhere. If there's a refactor to use an object with endpoint, bucket, and path, then all the customization can be changed to rely on the endpoint. It would work but take more effort.
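The refactor discussed here - tracking endpoint, bucket, and path instead of branching on a minio:// vs s3:// scheme - could be sketched roughly like this. This is hypothetical illustration code, not KFP's actual implementation; the names and the default-endpoint mapping are assumptions.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class ArtifactLocation:
    """Hypothetical protocol-agnostic artifact reference."""
    endpoint: str  # e.g. "minio-service.kubeflow:9000" or "s3.amazonaws.com"
    bucket: str
    key: str

def parse_artifact_uri(uri: str, default_endpoint: str) -> ArtifactLocation:
    """Parse minio:// or s3:// URIs into a single location object.

    The scheme is only used to pick an endpoint; after parsing, clients
    talk to `endpoint` the same way regardless of the original scheme.
    """
    parsed = urlparse(uri)
    endpoints = {"minio": default_endpoint, "s3": "s3.amazonaws.com"}
    return ArtifactLocation(
        endpoint=endpoints.get(parsed.scheme, default_endpoint),
        bucket=parsed.netloc,
        key=parsed.path.lstrip("/"),
    )

loc = parse_artifact_uri(
    "minio://mlpipeline/artifacts/run-id/mlpipeline-ui-metadata.tgz",
    default_endpoint="minio-service.kubeflow:9000",
)
# loc.bucket == "mlpipeline"
# loc.key == "artifacts/run-id/mlpipeline-ui-metadata.tgz"
```

Once every component receives an object like this, the scheme-specific branches can collapse into a single client configured with endpoint + credentials.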

The only difference between Minio and the Minio gateway with S3 is that the user needs to give credentials with an S3 policy to the Minio pod. An instance profile will work as well. I have not checked IRSA or Kube2IAM for fine-grained control yet.

I understand that, but we need to authenticate access from Kubeflow components to the Minio gateway. If I understand you correctly, with the Minio gateway everything that can reach Minio will have access to all S3 content without auth. That's not acceptable for us.

Right, that's what @eterna2 mentioned as well. Fine-grained control doesn't work with the Minio gateway.

If there's a refactor to use an object with endpoint, bucket, and path, then all the customization can be changed to rely on the endpoint. It would work but take more effort.

Yes, that's what I'm suggesting. Using the gateway, and therefore allowing every pod in the cluster (including user notebooks, etc.) access to any S3 resource via the gateway without further auth, is (hopefully?) a no-go for most users.

I have used the Minio gateway as a front for Azure Blob Storage, so it does have value until that protocol is supported as well. But I'm definitely in favor of making Minio optional ASAP.

So what would be the right way to get consensus on how to go forward? Is this gw based approach on the roadmap? Do we want/need to go down that route first, before eventually having native support?

To address the access control concern, another alternative would be running the gateway as a sidecar in each pod that needs it. But at that point we can probably just refactor this to track endpoint+bucket+path.

So this is becoming a bigger concern for us. Can I submit a PR that would make the endpoint configurable, at least as a stopgap? It seems like that would be sufficient to allow the frontend to fetch the outputs.
Currently I'm getting:

Response for path: artifacts/get?source=minio&bucket=xx&key=yy%2Fzz%2Fmlpipeline-ui-metadata.tgz was not 'ok'

Response was: Failed to get object in bucket xx at path yy/zz/mlpipeline-ui-metadata.tgz: Error: getaddrinfo ENOTFOUND minio-service.kubeflow

Hrm actually it looks like I've configured the apiserver already in /etc/ml-pipeline-config/config.json:

{
  "DBConfig": {
    "DriverName": "mysql",
    "DataSourceName": "",
    "DBName": "mlpipeline"
  },
  "ObjectStoreConfig": {
    "Secure": true,
    "Host": "s3.amazonaws.com",
    "BucketName": "xx"
  },
  "InitConnectionTimeout": "6m",
  "DefaultPipelineRunnerServiceAccount": "pipeline-runner",
  "ML_PIPELINE_VISUALIZATIONSERVER_SERVICE_HOST": "ml-pipeline-ml-pipeline-visualizationserver",
  "ML_PIPELINE_VISUALIZATIONSERVER_SERVICE_PORT": 8888
}

So this is becoming a bigger concern for us. Can I submit a PR that would make the endpoint configurable, at least as a stopgap? It seems like that would be sufficient to allow the frontend to fetch the outputs.
Currently I'm getting:

Response for path: artifacts/get?source=minio&bucket=xx&key=yy%2Fzz%2Fmlpipeline-ui-metadata.tgz was not 'ok'

Response was: Failed to get object in bucket xx at path yy/zz/mlpipeline-ui-metadata.tgz: Error: getaddrinfo ENOTFOUND minio-service.kubeflow

It is configurable. The frontend UI has a Node.js server which does a few things - most of it is proxying to ml-pipeline, but artifact retrieval is done in the Node.js server.

https://github.com/kubeflow/pipelines/blob/master/frontend/server/configs.ts

I am thinking there are many places to configure for KFP. Should we have a unified ConfigMap for everything? Or a CRD to configure the KFP deployment?

Ah, thanks! Maybe consider moving this functionality into the apiserver? It seems excessive to have pipelines split into a frontend, a frontend server, and a backend server.

I am trying to use ceph in place of minio.

I have set appropriate values in ml-pipeline-config and workflow-controller-configmap configmap.

I have also set the below env variables in the ml-pipeline-ui deployment:
ARGO_ARCHIVE_ARTIFACTORY = 's3'
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY

When I run the pipeline I can see the data is stored in ceph (verified by connecting via client)

But from the UI, the source still shows minio:
pipeline/artifacts/get?source=minio&bucket=mlpipelinejun8&key=artifacts%2Fconditional-execution-pipeline-f9m2m%2Fconditional-execution-pipeline-f9m2m-4255012536%2Fflip-coin-output.tgz

Also, when I manually change the source in the above URI to s3, it goes to some default port and not the one configured in the configmap:
Failed to get object in bucket mlpipelinejun8 at path artifacts/conditional-execution-pipeline-f9m2m/conditional-execution-pipeline-f9m2m-4255012536/flip-coin-output.tgz: Error: connect ECONNREFUSED xx.xx.xx.xx:443

Am I missing something?
@eterna2 @discordianfish @Jeffwan

@SatwikBhandiwad
I am not very familiar with ceph. Is it fully compatible with minio/s3 - i.e. will the minio client work with ceph?

You should probably look at these env vars instead:

    MINIO_ACCESS_KEY = 'minio',
    MINIO_SECRET_KEY = 'minio123',
    MINIO_PORT = '9000',
    MINIO_HOST = 'minio-service',
    MINIO_NAMESPACE = 'kubeflow',
    MINIO_SSL = 'false',

Because for S3, we hardcoded (actually I forget what I wrote; it is either hardcoded or some regex check) the official AWS endpoint. The differentiation is specifically for AWS IAM instance profile credentials (i.e. if you deploy on AWS, we can get the credentials from the metadata endpoint directly).

In a non-AWS env, you should be using the MINIO_* settings. There is no integration with the workflow controller ConfigMap (sadly; maybe we should? - I will raise a ticket for this). So you will need to set the six env vars to the same values as the ConfigMap.

ARGO_ARCHIVE_ARTIFACTORY is specifically for the pod log archive, and only as a failover when we can't get the pod logs in the usual manner (from k8s or from the workflow CR status).

@eterna2 Thanks for the info.
I configured those env vars and it worked.

I did not understand the workflow controller configmap part

There is no integration with the workflow controller configmap (sadly, maybe we should? - I will raise a ticket for this).

I have set the workflow controller configmap with appropriate values and ran a workflow creating artifacts, and it worked fine. Am I missing something?

Nope, you're not missing anything. I am just commenting that the problem you encountered is not uncommon - missing the correct configs.

E.g. ml-pipeline, ml-pipeline-ui, and Argo (aka the workflow controller) all share the same artifact repository settings, but you still need to configure them separately.

Ideally, all KFP components should be able to read the ConfigMap to learn the artifact repository settings?


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@Jeffwan What's the current state of this?

Creating this parent issue to track all known S3 problems in KFP; I'd also like to discuss more feature requests in this ticket.

I'm adding a link to tie this "parent" S3 support issue to a related "child" issue we recently encountered with getting tensorboard visualizations that specify S3 sources to work within KFP using AWS IAM roles: #4364

It took our team a few months of trial-and-error configuration changes, which we periodically abandoned and resumed between KFP version bumps (using v1.0.1 now), to get tensorboard visualizations working.

I think the biggest gap that this tensorboard issue surfaced is the missing documentation, examples, and support for configuring KFP to work on top of AWS rather than GCP. In addition, there appear to be some general inaccuracies in the Python SDK documentation (see the boldface portion of my comment here), which further complicated things.

@discordianfish There are some dependency issues on the Minio JavaScript side with using IAM Roles for Service Accounts. If you use credentials, there are no blocking issues.

@lucinvitae Nice work, thanks for the contribution. I see your published issue.

What's the current state of S3 support in Kubeflow Pipelines? It's a little hard to follow, with some work that seems to have been done toward it and the discussion around using Minio as a gateway.

I'm still struggling to get ml-pipeline to initialize correctly against S3 instead of Minio, using IRSA instead of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and I can't seem to get more logs than this, even when running the apiserver with -v=10:

$ kubectl logs -f ml-pipeline-7c6d7486d5-dglmh -n kubeflow
I1124 14:26:23.480484       6 client_manager.go:134] Initializing client manager
I1124 14:26:23.480626       6 config.go:50] Config DBConfig.ExtraParams not specified, skipping
F1124 14:26:23.884780       6 client_manager.go:356] Failed to check if Minio bucket exists. Error: Access Denied.

Is IRSA supposed to be supported, as suggested in e2fyi/kubeflow-aws?

initMinioClient in pipelines/backend/src/apiserver/client_manager.go seems to suggest that it isn't - or am I missing something obvious?

@jphuynh see https://www.kubeflow.org/docs/aws/pipeline/
Besides IRSA, all the rest of the support has been added. IRSA requires Minio client support; I think the KFP service backend supports it as well. However, the JavaScript part does not support it (minio/minio-js#841), which means the pipeline UI won't be able to use it at this moment.

Thanks for your quick answer @Jeffwan.

I think KFP service backend supports it as well.

I don't think it does. It looks like the project is using github.com/minio/minio-go v6.0.14+incompatible, and we would need at least v6.0.45:

minio/minio-go@a544e9f

I'm not sure if it would be easy to bump the version.

@jphuynh My mistake - I thought it got merged. Yeah, the backend side needs to bump the minio version as well. I documented it in issue #3398. I will see if I get time to continue on the stale PR.

Let me put some effort here to solve the issue.

/assign

I would like to bump this. Has there been any progress here, @PatrickXYS? At the moment it is really hard to figure out what is needed to make pipelines work with S3. There seem to be several approaches that community members have hacked together themselves, but no clear consensus yet on the standard way to do this.

Also, why is this line hardcoding the S3 endpoint as the Amazon endpoint? We have a different S3 instance not hosted on Amazon, and the minio endpoint doesn't work.

return re.search('^.*s3.*amazonaws.com.*$', endpoint)

and here:

return !!endpoint.match(/s3.{0,}\.amazonaws\.com\.?.{0,}/i);

lastly:

return !!endpoint.match(/s3.{0,}\.amazonaws\.com\.?.{0,}/i);
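To illustrate the complaint: the Python check quoted above only recognizes AWS-hosted endpoints, so a self-hosted S3-compatible endpoint (such as the Ceph one mentioned earlier in this thread) falls through to the Minio code path. The wrapper function below is added here for clarity and is not KFP's actual code.

```python
import re

def is_aws_s3_endpoint(endpoint: str) -> bool:
    # The regex quoted above: matches only *.amazonaws.com endpoints.
    return bool(re.search('^.*s3.*amazonaws.com.*$', endpoint))

print(is_aws_s3_endpoint("s3.us-west-2.amazonaws.com"))  # True
print(is_aws_s3_endpoint("rook-ceph-rgw-my-store.rook-ceph.svc.cluster.local"))  # False
```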


Is assume-web-identity-role (using IAM Roles for Service Accounts) still an open issue? The thread is not clear on whether this works or not. I am looking to use an IAM Role for Service Account to connect ML Pipeline and Argo artifacts to S3 using the annotation eks.amazonaws.com/role-arn: <your-arn-role> @Jeffwan

It doesn't work yet. The Minio version used by pipelines is currently still sitting at github.com/minio/minio-go v6.0.14+incompatible. I think this needs to be merged: #3549

Also, this issue needs to be resolved: minio/minio-js#841. As far as I can tell from piecing together the various PRs and issues on this topic, we need both minio-go and minio-js to implement this, and then we need the versions to be bumped on KFP's side.


Thanks for the response. Hopefully there is a resolution to this soon. I know a lot of people would love to use IRSA for S3 to avoid using credentials or third-party K8s IAM providers...

Hi everyone, for KFP v2 compatible mode, we've switched to using https://gocloud.dev/ to access S3. Can someone help me verify whether it supports your use cases, especially IRSA, and how?
You can find more about the design at bit.ly/kfp-v2-compatible

Hey @Bobgy, I could help with this. Could you give me some more info on what you would like to verify exactly? I guess this will replace the Minio Go and JS clients that are currently in use?

The link bit.ly/kfp-v2-compatible gives me a 404 error.

Thanks @karlschriek! Yes, we'd like to replace MinIO Go with Go CDK (KFP tasks in pipelines will also use Go CDK to upload/download from storage). So it would be helpful to verify whether Go CDK (as an alternative client for S3) can support the IRSA you are talking about.

Re the 404 error, sorry I made a mistake, fixed the link.

Ok, it looks promising at least: google/go-cloud#2773.

IRSA works by attaching annotations to the ServiceAccounts that get attached to Pods. So all we would need to do is attach the right annotations to the ServiceAccounts for the pipelines API etc., as well as whatever ServiceAccount is used for the pipeline jobs.

I'll provide you with some small examples and also try out the Go SDK.

Is there already an issue that tracks this? I guess the current issue is probably not the right place to continue this discussion.
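For reference, the IRSA mechanism described above boils down to one annotation on the ServiceAccount. The role ARN, account ID, and role name below are placeholders; the `pipeline-runner` name comes from the `DefaultPipelineRunnerServiceAccount` config shown earlier in this thread.

```yaml
# Hypothetical ServiceAccount with an IRSA annotation; pods that use
# this ServiceAccount receive web-identity credentials for the role.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pipeline-runner
  namespace: kubeflow
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/kfp-s3-access
```

The role itself must grant the needed S3 permissions and trust the cluster's OIDC provider for this ServiceAccount.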

I'm struggling to access my object storage bucket created using rook-ceph (S3) and would like to use it as my default artifact and metadata store instead of MinIO. Can anyone please share what files need to be modified to use MinIO as an S3 interface?

I've tried updating the workflow-controller-configmap to use my object store:

apiVersion: v1
data:
  artifactRepository: |
    archiveLogs: true
    s3:
      endpoint: "rook-ceph-rgw-my-store.rook-ceph.svc.cluster.local:80"
      bucket: "ceph-bkt-xxxx-xxxxx"
      keyFormat: "artifacts/{{workflow.name}}/{{pod.name}}"
      # insecure will disable TLS. Primarily used for minio installs not configured with TLS
      insecure: true
      accessKeySecret:
        name: mlpipeline-minio-artifact
        key: accesskey
      secretKeySecret:
        name: mlpipeline-minio-artifact
        key: secretkey
  containerRuntimeExecutor: docker

I also replaced the accesskey and secretkey in the mlpipeline-minio-artifact secret.

Any clue on this would be really helpful.

@Jeffwan Do you have a PR or an example for using Minio as a gateway to S3? I've been standing up Kubeflow on AWS, and I'm a bit concerned about the number of places where patches need to go in for S3. Instead of patching all these deployments, I'd prefer to just patch the Minio components and be done with it.

@Jeffwan Do you have a PR or an example for using Minio as a gateway to S3? I've been standing up Kubeflow on AWS, and I'm a bit concerned about the number of places where patches need to go in for S3. Instead of patching all these deployments, I'd prefer to just patch the Minio components and be done with it.

@vinayan3 Have you checked this: https://docs.min.io/docs/minio-gateway-for-s3.html? You can make changes in the Minio manifests to achieve that. The Minio gateway works with S3 out of the box.

Hi,

Is the assume-role-identity STS flow (temporary access) currently supported for the workflow to upload into S3, and are there any reference docs for it?

By temporary access I mean an access key, secret key & security token.

Is there a time plan for when and in which release the following feature will be supported? @Jeffwan
UI
KFP UI should be able to read artifact files in S3
mlpipeline-ui-metadata -> s3://mlpipeline/artifacts/pipeline-A/pipeline-A-run-id/mlpipeline-ui-metadata.tgz

Hi everyone, please see the proposal/design for using IRSA/IAM roles to access S3 on #8502 and let us know your thoughts.

For anyone looking to use S3 with Kubeflow Pipelines (using IAM user credentials), please refer to the latest AWS distribution of Kubeflow docs - https://awslabs.github.io/kubeflow-manifests/

Hey all, I just wanted to share that deployKF uses S3 directly (no minio gateway), which also means that it comes with support for IRSA!

It's going to be documented better soon, but these are the values you need to configure to use S3 - happy to help anyone who has trouble getting it working.

If you want, you can also read a bit more about deployKF in this comment: kubeflow/manifests#2451 (comment)