Netflix / metaflow

:rocket: Build and manage real-life ML, AI, and data science projects with ease!

Home Page: https://metaflow.org

"Service token file does not exist" error when deploying flow to Argo from CI

leogargu opened this issue

Description

When running argo-workflows create from a Bitbucket Cloud pipeline, the following error is raised:

$ python main.py --production argo-workflows create --only-json
Metaflow 2.7.19 executing DummyFlow for user:bitbucket
Project: myproj, Branch: prod
Validating your flow...
    The graph looks good!
Deploying myproj.prod.dummyflow to Argo Workflows...
    Internal error
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/metaflow/cli.py", line 1172, in main
    start(auto_envvar_prefix="METAFLOW", obj=state)
  File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/decorators.py", line 33, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/argo/argo_workflows_cli.py", line 156, in create
    token = resolve_token(
  File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/argo/argo_workflows_cli.py", line 382, in resolve_token
    workflow = ArgoWorkflows.get_existing_deployment(name)
  File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/argo/argo_workflows.py", line 201, in get_existing_deployment
    workflow_template = ArgoClient(
  File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/argo/argo_client.py", line 16, in __init__
    self._kubernetes_client = KubernetesClient()
  File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/kubernetes/kubernetes_client.py", line 30, in __init__
    self._refresh_client()
  File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/kubernetes/kubernetes_client.py", line 37, in _refresh_client
    config.load_incluster_config()
  File "/usr/local/lib/python3.10/site-packages/kubernetes/config/incluster_config.py", line 121, in load_incluster_config
    try_refresh_token=try_refresh_token).load_and_set(client_configuration)
  File "/usr/local/lib/python3.10/site-packages/kubernetes/config/incluster_config.py", line 54, in load_and_set
    self._load_config()
  File "/usr/local/lib/python3.10/site-packages/kubernetes/config/incluster_config.py", line 73, in _load_config
    raise ConfigException("Service token file does not exist.")
kubernetes.config.config_exception.ConfigException: Service token file does not exist.

Reason

The KUBERNETES_SERVICE_HOST environment variable happens to be set in the Bitbucket pipeline (the build container presumably runs inside a Kubernetes cluster under the hood). This makes Metaflow think it should target that same cluster for deployment (see here):

if os.getenv("KUBERNETES_SERVICE_HOST"):
    # We are inside a pod, authenticate via ServiceAccount assigned to us
    config.load_incluster_config()
else:
    # Use kubeconfig, likely $HOME/.kube/config
    # TODO (savin):
    #  1. Support generating kubeconfig on the fly using boto3
    #  2. Support auth via OIDC - https://docs.aws.amazon.com/eks/latest/userguide/authenticate-oidc-identity-provider.html
    config.load_kube_config()
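
The in-cluster loader then looks for a ServiceAccount token at /var/run/secrets/kubernetes.io/serviceaccount/token, which does not exist in the Bitbucket build container, hence the "Service token file does not exist" error. A quick diagnostic sketch (assuming the default token path used by the kubernetes Python client) to confirm this from inside the CI container:

import os

# Set by whatever cluster the CI build container happens to be scheduled on
print("KUBERNETES_SERVICE_HOST:", os.getenv("KUBERNETES_SERVICE_HOST"))

# Default token path checked by kubernetes.config.load_incluster_config()
print("token file exists:", os.path.exists("/var/run/secrets/kubernetes.io/serviceaccount/token"))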

Proposed solution

The solution proposed in the Slack channel (see here) is to check whether a KUBECONFIG environment variable is set and, if so, make it take precedence over KUBERNETES_SERVICE_HOST.
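
A minimal sketch of that precedence change, assuming it lands in KubernetesClient._refresh_client (the actual patch may differ); note that load_kube_config() already honors KUBECONFIG when it is set:

if os.getenv("KUBECONFIG"):
    # An explicit kubeconfig was provided (e.g. by the CI environment) - honor it
    config.load_kube_config()
elif os.getenv("KUBERNETES_SERVICE_HOST"):
    # We are inside a pod, authenticate via ServiceAccount assigned to us
    config.load_incluster_config()
else:
    # Use kubeconfig, likely $HOME/.kube/config
    config.load_kube_config()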

@shrinandj @savingoyal FYI - I'll submit a fix shortly 🙂