"Service token file does not exist" error when deploying flow to Argo from CI
leogargu opened this issue · comments
leogargu commented
Description
When running `argo-workflows create` from a Bitbucket Cloud pipeline, the following error is raised:
$ python main.py --production argo-workflows create --only-json
Metaflow 2.7.19 executing DummyFlow for user:bitbucket
Project: myproj, Branch: prod
Validating your flow...
The graph looks good!
Deploying myproj.prod.dummyflow to Argo Workflows...
Internal error
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/metaflow/cli.py", line 1172, in main
start(auto_envvar_prefix="METAFLOW", obj=state)
File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/metaflow/_vendor/click/decorators.py", line 33, in new_func
return f(get_current_context().obj, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/argo/argo_workflows_cli.py", line 156, in create
token = resolve_token(
File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/argo/argo_workflows_cli.py", line 382, in resolve_token
workflow = ArgoWorkflows.get_existing_deployment(name)
File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/argo/argo_workflows.py", line 201, in get_existing_deployment
workflow_template = ArgoClient(
File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/argo/argo_client.py", line 16, in __init__
self._kubernetes_client = KubernetesClient()
File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/kubernetes/kubernetes_client.py", line 30, in __init__
self._refresh_client()
File "/usr/local/lib/python3.10/site-packages/metaflow/plugins/kubernetes/kubernetes_client.py", line 37, in _refresh_client
config.load_incluster_config()
File "/usr/local/lib/python3.10/site-packages/kubernetes/config/incluster_config.py", line 121, in load_incluster_config
try_refresh_token=try_refresh_token).load_and_set(client_configuration)
File "/usr/local/lib/python3.10/site-packages/kubernetes/config/incluster_config.py", line 54, in load_and_set
self._load_config()
File "/usr/local/lib/python3.10/site-packages/kubernetes/config/incluster_config.py", line 73, in _load_config
raise ConfigException("Service token file does not exist.")
kubernetes.config.config_exception.ConfigException: Service token file does not exist.
Reason
The KUBERNETES_SERVICE_HOST
environment variable happens to be set in the Bitbucket pipeline (the build container is presumably running in a Kubernetes cluster under the hood). This makes Metaflow think it should target that same cluster for the deployment (see here):
if os.getenv("KUBERNETES_SERVICE_HOST"):
# We are inside a pod, authenticate via ServiceAccount assigned to us
config.load_incluster_config()
else:
# Use kubeconfig, likely $HOME/.kube/config
# TODO (savin):
# 1. Support generating kubeconfig on the fly using boto3
# 2. Support auth via OIDC - https://docs.aws.amazon.com/eks/latest/userguide/authenticate-oidc-identity-provider.html
config.load_kube_config()
Proposed solution
The solution proposed in the Slack channel (see here) is to check whether a KUBECONFIG
environment variable is set and, if so, make it take precedence over KUBERNETES_SERVICE_HOST.
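A minimal sketch of that precedence check, reduced to the decision logic only. `choose_config_loader` is a hypothetical helper for illustration, not Metaflow's actual API; the real change would touch `_refresh_client` in `metaflow/plugins/kubernetes/kubernetes_client.py` and call `config.load_kube_config()` / `config.load_incluster_config()` accordingly:

```python
import os


def choose_config_loader():
    """Decide which kubernetes config loader to use (hypothetical sketch).

    Returns "kubeconfig" or "incluster". An explicit KUBECONFIG wins even
    when KUBERNETES_SERVICE_HOST is set, which is the case on CI runners
    that themselves execute inside a Kubernetes pod.
    """
    if os.getenv("KUBECONFIG"):
        # An explicit kubeconfig was supplied (e.g. by the CI pipeline);
        # honor it even though we appear to be inside a cluster.
        return "kubeconfig"
    if os.getenv("KUBERNETES_SERVICE_HOST"):
        # We really are inside a pod with no override, so authenticate
        # via the ServiceAccount assigned to it.
        return "incluster"
    # Fall back to the default kubeconfig, likely $HOME/.kube/config.
    return "kubeconfig"
```

With this ordering, the Bitbucket pipeline only needs to export KUBECONFIG pointing at the target cluster and the incluster path is never taken.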
leogargu commented
@shrinandj @savingoyal FYI - I'll submit a fix shortly 🙂