Example using kubernetes provider?
jkitchin opened this issue · comments
I am trying to get a Kubernetes provider to work with parsl.
I have a working Kubernetes cluster, with kubectl set up. I can create pods with kubectl and open shells in them. Based on https://parsl.readthedocs.io/en/stable/stubs/parsl.providers.KubernetesProvider.html, I have set this up:
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.providers import KubernetesProvider
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            label='PM_HTEX_multinode',
            cores_per_worker=2,
            provider=KubernetesProvider(
                image='jkitchin/pycse',
                namespace='jkitchin',
                pod_name='jk-',
                user_id='1000',
                group_id='100'
            ),
        )
    ]
)

# load the Parsl config
parsl.load(config)

@python_app
def exc():
    import socket
    return socket.gethostname()

exc().result()
It does run, and it tries to create a pod, but the pod fails, and the logs indicate:
Traceback (most recent call last):
--
Tue, Oct 17 2023 2:09:37 pm | File "/opt/conda/bin/process_worker_pool.py", line 687, in <module>
Tue, Oct 17 2023 2:09:37 pm | os.makedirs(os.path.join(args.logdir, "block-{}".format(args.block_id), args.uid), exist_ok=True)
Tue, Oct 17 2023 2:09:37 pm | File "/opt/conda/lib/python3.9/os.py", line 215, in makedirs
Tue, Oct 17 2023 2:09:37 pm | makedirs(head, exist_ok=exist_ok)
Tue, Oct 17 2023 2:09:37 pm | File "/opt/conda/lib/python3.9/os.py", line 215, in makedirs
Tue, Oct 17 2023 2:09:37 pm | makedirs(head, exist_ok=exist_ok)
Tue, Oct 17 2023 2:09:37 pm | File "/opt/conda/lib/python3.9/os.py", line 215, in makedirs
Tue, Oct 17 2023 2:09:37 pm | makedirs(head, exist_ok=exist_ok)
Tue, Oct 17 2023 2:09:37 pm | [Previous line repeated 7 more times]
Tue, Oct 17 2023 2:09:37 pm | File "/opt/conda/lib/python3.9/os.py", line 225, in makedirs
Tue, Oct 17 2023 2:09:37 pm | mkdir(name, mode)
Tue, Oct 17 2023 2:09:37 pm | PermissionError: [Errno 13] Permission denied: '/Users'
Tue, Oct 17 2023 2:09:37 pm | /bin/bash: -c: line 4: syntax error near unexpected token `;'
Tue, Oct 17 2023 2:09:37 pm | /bin/bash: -c: line 4: `;'
I can see there is a permission error related to making a directory /Users. I don't see anywhere obvious to change this.
Are there any examples of using Kubernetes with parsl somewhere? (I looked, but did not find anything).
addendum:
Digging into the YAML for the pod, I see the following command, which looks wrong. The logdir is a local directory on my machine, but the pod is created on a remote Kubernetes cluster, so that logdir won't exist there.
process_worker_pool.py -a Johns-iMac-4.local,128.2.149.108,172.31.61.78 -p 0 -c 2 -m None --poll 10 --task_port=54918 --result_port=54542 --logdir=/Users/jkitchin/example/runinfo/034/PM_HTEX_multinode --block_id=2 --hb_period=30 --hb_threshold=120 --cpu-affinity none --available-accelerators --start-method spawn
I guess this means something is not set up correctly here.
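To make the mismatch concrete, here is a minimal sketch that pulls the `--logdir` value out of the worker command shown above. The helper name `extract_logdir` is hypothetical (it is not part of parsl), and the sketch uses only the standard library.

```python
import shlex
from typing import Optional

# Worker command copied from the pod spec above.
WORKER_CMD = (
    "process_worker_pool.py -a Johns-iMac-4.local,128.2.149.108,172.31.61.78 "
    "-p 0 -c 2 -m None --poll 10 --task_port=54918 --result_port=54542 "
    "--logdir=/Users/jkitchin/example/runinfo/034/PM_HTEX_multinode "
    "--block_id=2 --hb_period=30 --hb_threshold=120 --cpu-affinity none "
    "--available-accelerators --start-method spawn"
)


def extract_logdir(command: str) -> Optional[str]:
    """Return the value of --logdir from a worker command line, or None."""
    tokens = shlex.split(command)
    for i, token in enumerate(tokens):
        # Handle the "--logdir=PATH" form used in the pod spec.
        if token.startswith("--logdir="):
            return token.split("=", 1)[1]
        # Also handle the "--logdir PATH" form, in case it is written that way.
        if token == "--logdir" and i + 1 < len(tokens):
            return tokens[i + 1]
    return None


print(extract_logdir(WORKER_CMD))
```

The extracted path starts with `/Users`, which is the home directory root on the submitting Mac. That matches the `PermissionError` on `'/Users'` in the pod log: the worker is trying to create the submit-side run directory inside the container.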
update 2:
It does work if I run it from inside a pod on the Kubernetes cluster, although it seems to create 4 pods, and they don't close when the job is done. It seems like they should.
Is there a way to make it work remotely?
I have made a smidge of progress getting this to work.
Some prerequisites that weren't obvious:
- The Python and parsl versions have to be the same on the local and remote machines.
- You have to set a worker_logdir_root in the executor for the remote path.
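One way to check the first prerequisite is to print the versions on both sides and compare them. The sketch below is only an illustration: the function names `environment_versions` and `version_mismatches` are hypothetical, and comparing Python at the major.minor level is an assumption about how strict the match needs to be.

```python
import sys
from importlib import metadata
from typing import Dict, List, Optional


def environment_versions() -> Dict[str, Optional[str]]:
    """Collect the Python (major.minor) and parsl versions of this interpreter."""
    try:
        parsl_version = metadata.version("parsl")
    except metadata.PackageNotFoundError:
        parsl_version = None  # parsl is not installed in this environment
    return {
        # Assumption: major.minor is the level at which Python must match.
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "parsl": parsl_version,
    }


def version_mismatches(
    local: Dict[str, Optional[str]], remote: Dict[str, Optional[str]]
) -> List[str]:
    """Return the names of the components whose versions differ."""
    return [name for name in ("python", "parsl") if local.get(name) != remote.get(name)]


print(environment_versions())
```

Running the same snippet on the local machine and inside a container started from the worker image gives two dictionaries that can be passed to `version_mismatches`. An empty list means the two environments agree on both versions.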
Here is a minimal example that works for me.
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.providers import KubernetesProvider
from parsl.executors import HighThroughputExecutor
import logging

logging.captureWarnings(True)

config = Config(
    executors=[
        HighThroughputExecutor(
            label='HTE',
            cores_per_worker=2,
            worker_logdir_root='/home/jovyan/logs/',
            provider=KubernetesProvider(
                image='jkitchin/pycse',
                pod_name='jrk-',
                # this does not work
                # persistent_volumes=[('shared-scratch', '/home/jovyan/shared-scratch/')]
            ),
        )
    ]
)

# load the Parsl config
parsl.load(config)

@python_app
def exc():
    import os, socket
    return socket.gethostname()

print('done', exc().result())
I can't get persistent volumes to work. I see messages indicating that the name can't be found (30 persistentvolumeclaim "shared-scratch" not found.), even though this is a volume that I mount in other pods.
Also for some reason, this makes 4 pods. When nothing goes wrong, one of them terminates (the one that returns from the app), but the other 3 are left running. Every so often, the ones left running seem to error and restart.
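A possible direction for both problems, offered as an untested sketch rather than a confirmed fix: as far as I can tell from the documented `KubernetesProvider` signature, `init_blocks` defaults to 4 and `namespace` defaults to `'default'`. Persistent volume claims are namespaced in Kubernetes, so a claim that lives in the `jkitchin` namespace would not be found by pods created in `default`. The helper `kubernetes_provider_kwargs` below is hypothetical; it only collects keyword arguments so that the settings can be inspected without a cluster.

```python
from typing import Any, Dict, List, Tuple


def kubernetes_provider_kwargs(
    image: str,
    namespace: str,
    pod_name: str,
    persistent_volumes: List[Tuple[str, str]],
) -> Dict[str, Any]:
    """Build keyword arguments for parsl's KubernetesProvider (hypothetical helper)."""
    return {
        "image": image,
        # Assumption: the claim lives in this namespace, not in 'default'.
        "namespace": namespace,
        "pod_name": pod_name,
        # Assumption: the default of init_blocks=4 is why four pods appear.
        "init_blocks": 1,
        "min_blocks": 0,
        "max_blocks": 1,
        # (claim name, mount path) pairs.
        "persistent_volumes": persistent_volumes,
    }


kwargs = kubernetes_provider_kwargs(
    image="jkitchin/pycse",
    namespace="jkitchin",
    pod_name="jrk-",
    persistent_volumes=[("shared-scratch", "/home/jovyan/shared-scratch/")],
)
print(kwargs)

# With parsl installed and a reachable cluster, these would be passed as:
#   provider = KubernetesProvider(**kwargs)
```

For the pods that stay running, calling `parsl.dfk().cleanup()` at the end of the script may also help, since it asks the executors to shut down and scale in their blocks. Whether that removes every leftover pod in this setup is something I have not verified.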