Parsl / parsl

Parsl - a Python parallel scripting library

Home Page: http://parsl-project.org

Example using kubernetes provider?

jkitchin opened this issue

I am trying to get the Kubernetes provider to work with Parsl.

I have a working Kubernetes cluster with kubectl set up. I can create pods with kubectl and open shells in them. Based on https://parsl.readthedocs.io/en/stable/stubs/parsl.providers.KubernetesProvider.html, I have set this up:

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.providers import KubernetesProvider
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[
        HighThroughputExecutor(
            label='PM_HTEX_multinode',
            cores_per_worker=2,
            provider=KubernetesProvider(
                image='jkitchin/pycse',
                namespace='jkitchin',
                pod_name='jk-',
                user_id='1000',
                group_id='100'
            ),
        )
    ]
)

# load the Parsl config
parsl.load(config)


@python_app
def exc():
    import socket
    return socket.gethostname()

exc().result()

It runs and tries to create a pod, but the pod fails, and its logs show:

Traceback (most recent call last):
--
Tue, Oct 17 2023 2:09:37 pm | File "/opt/conda/bin/process_worker_pool.py", line 687, in <module>
Tue, Oct 17 2023 2:09:37 pm | os.makedirs(os.path.join(args.logdir, "block-{}".format(args.block_id), args.uid), exist_ok=True)
Tue, Oct 17 2023 2:09:37 pm | File "/opt/conda/lib/python3.9/os.py", line 215, in makedirs
Tue, Oct 17 2023 2:09:37 pm | makedirs(head, exist_ok=exist_ok)
Tue, Oct 17 2023 2:09:37 pm | File "/opt/conda/lib/python3.9/os.py", line 215, in makedirs
Tue, Oct 17 2023 2:09:37 pm | makedirs(head, exist_ok=exist_ok)
Tue, Oct 17 2023 2:09:37 pm | File "/opt/conda/lib/python3.9/os.py", line 215, in makedirs
Tue, Oct 17 2023 2:09:37 pm | makedirs(head, exist_ok=exist_ok)
Tue, Oct 17 2023 2:09:37 pm | [Previous line repeated 7 more times]
Tue, Oct 17 2023 2:09:37 pm | File "/opt/conda/lib/python3.9/os.py", line 225, in makedirs
Tue, Oct 17 2023 2:09:37 pm | mkdir(name, mode)
Tue, Oct 17 2023 2:09:37 pm | PermissionError: [Errno 13] Permission denied: '/Users'
Tue, Oct 17 2023 2:09:37 pm | /bin/bash: -c: line 4: syntax error near unexpected token `;'
Tue, Oct 17 2023 2:09:37 pm | /bin/bash: -c: line 4: `;'

The failure is a permission error when creating the directory /Users. I don't see an obvious setting to change this path.

Are there any examples of using Kubernetes with parsl somewhere? (I looked, but did not find anything).

Addendum:

Digging into the YAML for the pod, I see the command below, which does not look right. The logdir is a local directory on my machine, but the pod is created on a remote Kubernetes cluster, so that path won't exist there.

  process_worker_pool.py   -a Johns-iMac-4.local,128.2.149.108,172.31.61.78 -p 0 -c 2 -m None --poll 10 --task_port=54918 --result_port=54542 --logdir=/Users/jkitchin/example/runinfo/034/PM_HTEX_multinode --block_id=2 --hb_period=30  --hb_threshold=120 --cpu-affinity none --available-accelerators  --start-method spawn

I guess this means something is not set up correctly here.
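The link between that command and the traceback can be checked with a short sketch. It pulls the `--logdir` value out of the launch command (abbreviated here to the flags that matter) and shows that its first path component is the `/Users` directory named in the PermissionError. The helper name `extract_logdir` is made up for illustration.

```python
import shlex
from pathlib import PurePosixPath


def extract_logdir(command):
    """Return the value passed to --logdir in a launch command, or None."""
    tokens = shlex.split(command)
    for i, token in enumerate(tokens):
        if token.startswith("--logdir="):
            return token.split("=", 1)[1]
        if token == "--logdir" and i + 1 < len(tokens):
            return tokens[i + 1]
    return None


# Launch command copied from the pod YAML, shortened to the relevant flags.
command = (
    "process_worker_pool.py -a Johns-iMac-4.local,128.2.149.108,172.31.61.78 "
    "-p 0 -c 2 --task_port=54918 --result_port=54542 "
    "--logdir=/Users/jkitchin/example/runinfo/034/PM_HTEX_multinode "
    "--block_id=2"
)

logdir = extract_logdir(command)
# parts[0] is the root "/", parts[1] is the first directory below it.
top_level = PurePosixPath(logdir).parts[1]

print(logdir)     # the client-side run directory
print(top_level)  # the directory os.makedirs has to create first in the pod
```

Because `os.makedirs` creates missing ancestors from the top down, a container user who cannot write to `/` fails on that first component, which matches the log.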

Update 2:
It does work if I run it from a pod on the Kubernetes cluster, although it seems to create 4 pods, and they don't shut down when the job is done. It seems like they should.

Is there a way to make it work remotely?

I have made a smidge of progress getting this to work.

Some prerequisites that weren't obvious:

  1. The Python and Parsl versions have to be the same on the local and remote machines.
  2. You have to set worker_logdir_root on the executor to a path that the worker can create inside the pod.
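The first prerequisite can be checked with a small sketch like the one below. It reports the Python and Parsl versions of whichever interpreter runs it and lists any differences between two such reports. The function names are hypothetical, and the comparison is an exact string match, which may be stricter than Parsl actually requires.

```python
def describe_environment():
    """Report the Python and parsl versions of the running interpreter."""
    # Imports live inside the function so it stays self-contained if it is
    # copied into the container and run there.
    import platform
    from importlib import metadata

    try:
        parsl_version = metadata.version("parsl")
    except metadata.PackageNotFoundError:
        parsl_version = None  # parsl is not installed in this environment
    return {"python": platform.python_version(), "parsl": parsl_version}


def version_mismatches(local, remote):
    """Return (name, local_version, remote_version) for every difference."""
    names = sorted(set(local) | set(remote))
    return [
        (name, local.get(name), remote.get(name))
        for name in names
        if local.get(name) != remote.get(name)
    ]


local = describe_environment()
# Hypothetical values standing in for a report collected inside the image.
remote = {"python": "3.9.18", "parsl": "2023.10.16"}

for name, here, there in version_mismatches(local, remote):
    print(f"{name}: local {here} != remote {there}")
```

Running `describe_environment()` once on the submitting machine and once inside the image (for example from a shell opened in a pod) gives the two dictionaries to compare. A worker with mismatched versions may never get far enough to run an app, so collecting the remote report through Parsl itself is not reliable.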

Here is a minimal example that works for me.

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.providers import KubernetesProvider
from parsl.executors import HighThroughputExecutor

import logging
logging.captureWarnings(True)

config = Config(
    executors=[
        HighThroughputExecutor(
            label='HTE',
            cores_per_worker=2,            
            worker_logdir_root='/home/jovyan/logs/',
            provider=KubernetesProvider(
                image='jkitchin/pycse',
                pod_name='jrk-',
                # this does not work:
                # persistent_volumes=[('shared-scratch', '/home/jovyan/shared-scratch/')]
            ),
        )
    ]
)

# load the Parsl config
parsl.load(config)


@python_app
def exc():
    import socket
    return socket.gethostname()

print('done', exc().result())

I can't get persistent volumes to work. I see messages indicating that the name can't be found (30 persistentvolumeclaim "shared-scratch" not found.), even though this is a volume that I mount in other pods.
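One thing that may be worth ruling out: PersistentVolumeClaims are namespaced, and the working example no longer passes `namespace`, so the pods may be created in a different namespace than the claim. The sketch below assumes the claim lives in the `jkitchin` namespace from the first attempt, which is a guess. The helper names are hypothetical, and `claim_exists` needs the `kubernetes` Python package and a valid kubeconfig.

```python
def provider_options(image, pod_name, namespace, claims):
    """Build keyword arguments for KubernetesProvider (hypothetical helper).

    `claims` is a list of (claim name, mount directory) pairs, the shape
    already used for persistent_volumes in the example above.
    """
    volumes = []
    for claim_name, mount_path in claims:
        if not mount_path.startswith("/"):
            raise ValueError(f"mount path must be absolute: {mount_path!r}")
        volumes.append((claim_name, mount_path))
    return {
        "image": image,
        "pod_name": pod_name,
        "namespace": namespace,  # assumption: same namespace as the claim
        "persistent_volumes": volumes,
    }


def claim_exists(claim_name, namespace):
    """Ask the cluster whether the claim is visible in the given namespace."""
    # Imported lazily so provider_options() works without the kubernetes package.
    from kubernetes import client, config
    from kubernetes.client.rest import ApiException

    config.load_kube_config()
    try:
        client.CoreV1Api().read_namespaced_persistent_volume_claim(
            claim_name, namespace
        )
    except ApiException as err:
        if err.status == 404:
            return False
        raise
    return True


options = provider_options(
    image="jkitchin/pycse",
    pod_name="jrk-",
    namespace="jkitchin",
    claims=[("shared-scratch", "/home/jovyan/shared-scratch/")],
)
print(options)
```

The dictionary can be passed as `KubernetesProvider(**options)`. Calling `claim_exists("shared-scratch", "jkitchin")` first shows whether the claim is visible in that namespace.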

Also, for some reason, this creates 4 pods. When nothing goes wrong, one of them terminates (the one that returns from the app), but the other 3 are left running. Every so often, the ones left running seem to error and restart.
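A possible explanation for the four pods is the provider's initial block count. `KubernetesProvider` takes the usual `init_blocks`, `min_blocks` and `max_blocks` scaling arguments, and its default for `init_blocks` appears to be greater than one, though that should be checked against the installed version. The sketch below requests a single block and shuts Parsl down explicitly when the work is done, which should also scale the remaining pods in. `run_and_cleanup` is a hypothetical helper, not part of Parsl.

```python
# Assumed fix: ask for exactly one block (pod) instead of the provider default.
SINGLE_BLOCK = {"init_blocks": 1, "min_blocks": 0, "max_blocks": 1}


def run_and_cleanup(app, image, pod_name, worker_logdir_root):
    """Load a one-block config, run `app`, then shut Parsl down."""
    # Imported lazily so this module can be read without a cluster at hand.
    import parsl
    from parsl.config import Config
    from parsl.executors import HighThroughputExecutor
    from parsl.providers import KubernetesProvider

    config = Config(
        executors=[
            HighThroughputExecutor(
                label="HTE",
                cores_per_worker=2,
                worker_logdir_root=worker_logdir_root,
                provider=KubernetesProvider(
                    image=image,
                    pod_name=pod_name,
                    **SINGLE_BLOCK,
                ),
            )
        ]
    )
    parsl.load(config)
    try:
        return app().result()
    finally:
        # Shut down executors and providers once the result is back.
        parsl.dfk().cleanup()
```

With the app defined as in the example above, `run_and_cleanup(exc, 'jkitchin/pycse', 'jrk-', '/home/jovyan/logs/')` would run it once and then tear the pod down.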