DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.

Dsub and multiprocessing

lm-jkominek opened this issue

Hi, I am using dsub to submit a bunch of jobs to GCE, which run inside a Docker image and use Python's multiprocessing library for further parallelization inside of that. The jobs themselves run fine, but a certain fraction (anywhere between 10-20%) finish their work, then freeze at the very end and just keep endlessly logging without actually terminating, e.g.:

2023-08-04 13:16:19 INFO: gsutil -h Content-Type:text/plain -mq cp /tmp/continuous_logging_action/stdout gs://bucket/outdir/test/test_dir/003/log/test---user--230804-130707-40.1-stdout.log

What's weird is that this behavior is non-deterministic - I can submit the same jobs multiple times and each round different jobs will freeze, so it's not an issue with the input data. I know my test case is good in general, because the jobs all run fine if I skip the internal multiprocessing altogether and just run all the pieces sequentially; it just takes longer, obviously.

I found a somewhat similar issue here from a few years back related to delocalization (#165), but I ran some tests, added more memory and varied the CPU counts, to no avail, so it doesn't seem to be an issue with the executing VM.
My jobs use the logging library for console output, and there is apparently a well-known issue with using that together with multiprocessing, but I've commented all that logging code out and it didn't help. I've also tried reducing the size of the multiprocessing pool (e.g. 6 processes on an 8-core VM), and no change either.

Any insights or experiences that could help my situation would be GREATLY appreciated!

Hi @jacekzkominek,

Sorry to hear about the battle with multiprocessing. Your note that "this behavior is non-deterministic" points to a painful part of multiprocessing: figuring out race conditions is often quite the time sink.

I don't have a lot of insight into issues with the multiprocessing library specifically. I am biased toward single-threaded solutions on cheap (preferably preemptible) VMs to avoid these issues. Not sure if any of that helps you for this case.

I do like that you attempted to use fewer processes than cores. That's one thing that has cropped up before: it can help to leave an "idle" core to service system needs.
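
For example (just a sketch, nothing dsub-specific, with run_one standing in for your own work):

import multiprocessing
import os

def run_one(task):
    # Placeholder for a single unit of work.
    return task

if __name__ == "__main__":
    # Leave one core free to service logging, delocalization, and the OS.
    workers = max(1, (os.cpu_count() or 2) - 1)
    with multiprocessing.Pool(processes=workers) as pool:
        results = pool.map(run_one, range(100))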

The only other bits of help I think I can offer:

  • Does the likelihood of hanging decrease with the number of processes used?
  • Are you using the latest versions of Python/multiprocessing?
  • Are you using a "standard" Docker image?
  • Can you set up a kill/retry based on a maximum timeout?

A few years back, we had some workflows that for some inputs would run longer than 24 hours. We wanted to make sure they didn't get preempted at 24 hours and retried (since they'd just fail again), so we used the Linux timeout command to generate a hard (non-preemption) failure. It looked something like:

timeout 23.5h STAR ...

(We actually wanted to kill it at 23 1/2 hours since delocalization could take 1/2 hour.)

I mention using "timeout" because "hanging" means sitting idle, burning CPU/memory/disk hours. Not sure if you can time-bound your commands.
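
If it's easier to bound things from inside your Python code instead, here's a minimal sketch of the same idea using subprocess (run_one_analysis.sh and the 600-second limit are just placeholders):

import subprocess

# Kill the child command if it runs longer than 600 seconds; TimeoutExpired
# can then be caught and the command retried or skipped.
try:
    subprocess.run(["./run_one_analysis.sh"], check=True, timeout=600)
except subprocess.TimeoutExpired:
    print("analysis timed out; retry or skip it")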

Thank you for the response and insights @mbookman!
I hear you on preferring individual jobs, but my project is ~100,000 analyses, grouped into 1,000 batches of 100 analyses each. I set it up this way (A) to reduce overhead and (B) because GCE would only let me run ~2,000 jobs at once tops, which would make running 100K of them a mess.

As to your questions/points:
#1 - Not by any significant measure that I've seen. I tried cutting 1, 2, or 4 processes on 8- and 16-CPU VMs with no difference, and it starts bordering on cost-inefficiency there...
#2 and #3 - I am running my analyses in a custom-built Docker image, using google/cloud-sdk:427.0.1-slim as the base, which has Python 3.9.2, and since multiprocessing is part of the standard library, I presume it's also at that version. I could try bumping that up to 3.10 or 3.11, or try a different base, but there's a bunch of different tools and libraries in that Docker image and I am not sure the dependencies would all agree (biopython likely won't...).
#4 - That might actually be an option, since (A) the time it takes a single analysis to run is pretty consistently about 4-5 minutes, so I could cap that with a timeout of 600 seconds, and (B) even the jobs that hang actually do their thing as expected, they just don't exit once all is done.

The one thing where I think I have seen some impact was reducing the number of analyses run within a multiprocessing pool. I have 100 analyses per job, and my testing showed that running only 20 of them is much less likely to hang than, say, running 50 or all 100 of them. So I tried running a few smaller pools sequentially to get around that, but the issue was still present...
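
For the timeout route (#4), I'm thinking of something along these lines -- just a sketch, with run_analysis standing in for one of my analyses and a 600-second cap per result:

import multiprocessing

def run_analysis(item):
    # Placeholder for one ~4-5 minute analysis.
    return item

if __name__ == "__main__":
    items = list(range(100))
    with multiprocessing.Pool(processes=6) as pool:
        # Submit all analyses up front, then bound how long we wait on each
        # result so one hung worker can't stall the whole job.
        pending = [pool.apply_async(run_analysis, (i,)) for i in items]
        results = []
        for task in pending:
            try:
                results.append(task.get(timeout=600))
            except multiprocessing.TimeoutError:
                results.append(None)  # mark this analysis as timed out
    # Leaving the "with" block calls pool.terminate(), which also kills any
    # workers that are still stuck.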

Just a small follow-up, I experimented with the base image and Python versions, and got some improvement by switching from google/cloud-sdk (which has Python 3.9) to ubuntu:23.04 (which has Python 3.11), but still ~5% of the jobs keep freezing. I guess it's just something to accept and either implement a hard timeout or simply run things sequentially.

I also tried switching from standard Python multiprocessing to pathos.multiprocessing and joblib, but no luck there either.
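
If I end up just accepting it, the bluntest workaround I can think of -- a sketch only, and admittedly heavy-handed -- is to call os._exit() once all results are written, since the work itself completes and only the teardown hangs:

import os
import sys

def main():
    # ... run all the analyses and write every output file first ...
    return 0

if __name__ == "__main__":
    status = main()
    # Make sure everything buffered has been written before bailing out.
    sys.stdout.flush()
    sys.stderr.flush()
    # os._exit() terminates immediately, skipping atexit handlers and the
    # multiprocessing cleanup that appears to be where things wedge.
    os._exit(status)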

@lm-jkominek I know dsub has retries, but if you want complete control while your command is running under multiple processes, here are a few wrapper scripts you can launch your commands through to get around stuck processes:

1) The first is the wrapper program that checks the status of the Python program, looking for either "success" or "retry" (here it's called wrapper.sh):

#!/bin/bash

# Usage: ./wrapper.sh MAX_RETRIES SECONDS_BETWEEN_STATUS_CHECKS LOOPS
MAX_RETRIES=$1
SECONDS_BETWEEN_STATUS_CHECKS=$2
LOOPS=$3

PROGRAM_STATUS=''
retries=0

while [ "$retries" -le "$MAX_RETRIES" ]
do

  # program_with_status.py prints "success", "retry", or "continue"
  PROGRAM_STATUS=$(python program_with_status.py "$SECONDS_BETWEEN_STATUS_CHECKS" "$LOOPS")

  if [[ "$PROGRAM_STATUS" == "success" ]]; then
     echo "$PROGRAM_STATUS"
     exit 0
  fi

  ((retries++))

done

# All retries exhausted; report the last status ("retry")
echo "$PROGRAM_STATUS"

2) The second is the program you are interested in running -- this stands in for your dsub command -- which gets launched from Python (here it's called test.sh):

#!/bin/bash

# Stand-in workload: busy-loops $loop times when called as "./test.sh loop N".
# Replace this with the command you actually want to run (e.g. your dsub call).
command=$1
loop=$2
counter=0

if [[ "$command" == "loop" ]]; then

  while [ "$counter" -le "$loop" ]
  do
    #echo $counter
    ((counter++))
  done

fi

3) The third is the Python program that runs your dsub script in a forked child process, which the parent periodically checks on -- returning either "success" or "retry" (here it's called program_with_status.py):

import os
import psutil  # third-party: pip install psutil
import signal
import sys
import time

seconds_between_checks = int(sys.argv[1])
loops = sys.argv[2]  # passed through to test.sh as a string


def is_defunct(pid):
    # A zombie ("defunct") process has already exited but has not been reaped.
    proc = psutil.Process(pid)
    return proc.status() == psutil.STATUS_ZOMBIE


def retry_or_not(pid, checks, max_checks):

    if psutil.pid_exists(pid):

        if is_defunct(pid):
            # The child finished on its own; treat it as a success.
            try:
                os.kill(pid, signal.SIGKILL)
                return "success"
            except OSError:
                return "success"

        if checks >= max_checks:
            # Still running after the allotted checks: kill it and ask the
            # wrapper to retry.
            try:
                os.kill(pid, signal.SIGKILL)
            except OSError:
                return "retry"
            return "retry"
        return "continue"
    else:
        return "success"


# This creates a fork: the parent monitors, the child exec's the workload.
pid = os.fork()

# This is the parent process
if pid > 0:
    #print("This is the parent process checking on the child...")
    #print("Process ID:", os.getpid())
    #print("Child's process ID:", pid)
    child_pid = pid
    max_checks = 3
    checks = 0
    child_status = ""
    while True:
        time.sleep(seconds_between_checks)
        child_status = retry_or_not(child_pid, checks, max_checks)
        #print(['child_status', child_status])
        if child_status.startswith("success"):
            print(child_status)
            sys.exit()
        checks = checks + 1
        if checks > max_checks:
            print(child_status)
            sys.exit()

# This is the child process
else:
    #print("\nThis is the child process:")
    #print("Process ID:", os.getpid())
    #print("Parent's process ID:", os.getppid())
    args = ("loop", str(loops))
    program = './test.sh'
    os.execlp(program, program, *args)

You can launch your program like this:

./wrapper.sh NUMBER_OF_RETRIES SECONDS_BETWEEN_STATUS_CHECKS NUMBER_OF_LOOPS

Examples:

$ ./wrapper.sh 3 3 1000
success
$
$ ./wrapper.sh 3 3 1000000
retry
$

Feel free to modify and adapt as necessary.

Hope it helps,
Paul

Thank you @pgrosu, @mbookman, appreciate the scripts! They're helpful, even if they don't directly address the freezes I've been seeing. After trying multiple parallel libs (pathos, joblib, concurrent.futures), I also tried rewriting the code with set_start_method('spawn') rather than the default 'fork'; still no difference, so I am slowly resigning myself here.

I'm curious about something though - are there bandwidth/transfer limitations in either dsub or Google Cloud? My jobs write their results (~10 MB each) to a Google Cloud bucket through an --output-recursive arg to dsub. Normally, all of the 100 analyses in a job run sequentially, so each job ends up writing its results at a different time, even when I have 1,000 jobs running concurrently. When I run the analyses within a job in parallel though, the jobs end up finishing at almost the same time - could that result in the jobs freezing because they are endlessly waiting for some throughput/bandwidth, a lock release or something else from Google Cloud, or because they hit some transfer quota?

Hi @lm-jkominek,

It could be, but I doubt it based on the quotas listed here -- and knowing the throughput of Google Cloud Storage (and the different types of errors that would be reported otherwise):

https://cloud.google.com/batch/quotas

https://cloud.google.com/life-sciences/quotas

https://github.com/DataBiosphere/dsub/blob/main/docs/compute_quotas.md

So spawn starts a whole new Python interpreter process, which is heavier resource-wise compared to fork, which copies the current process, as described here.
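
For completeness, here's a minimal sketch of opting into spawn for a single pool via get_context, without touching the global default:

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # "spawn" launches a fresh interpreter per worker instead of copying the
    # parent via fork, which avoids inheriting locks/state but costs more to
    # start up.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(square, range(8)))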

This is partly why I wrote the Python/wrapper scripts the way I did, so that they retry even under heavy load. In fact, you can adapt the Python script to check on the operation that you get back via dstat if that provides better granularity.
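
For example, here's a rough sketch of polling a job through dstat from Python -- the provider and project values are placeholders, and the exact flags/output format should be checked against your dsub version:

import json
import subprocess
import time

def job_is_running(job_id):
    # Ask dstat for still-RUNNING tasks of this job; assumes "--format json"
    # returns a JSON list of matching tasks (verify with your dsub version).
    out = subprocess.run(
        ["dstat",
         "--provider", "google-cls-v2",   # placeholder provider
         "--project", "my-project",       # placeholder project
         "--jobs", job_id,
         "--status", "RUNNING",
         "--format", "json"],
        capture_output=True, text=True, check=True).stdout
    return len(json.loads(out)) > 0

def wait_or_retry(job_id, timeout_seconds=3600, poll_seconds=60):
    # Returns "success" when the job is no longer running, or "retry" if it
    # exceeds the time bound (the caller could then ddel and resubmit it).
    deadline = time.time() + timeout_seconds
    while job_is_running(job_id):
        if time.time() > deadline:
            return "retry"
        time.sleep(poll_seconds)
    return "success"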

In any case, just out of curiosity, why do you think my scripts do not directly address the freezing issue? The Python and wrapper scripts can be adapted to check and retry, so even under heavy parallelism you can serialize them such that some wait if too many files are being updated. Have you tried a binary search on the number of submitted jobs -- say 500, 250, or 125 jobs -- to see if you experience the same issue? You have to remember that most of the reading and writing is network-based and thus best-effort, which is why retries are necessary at higher throughput.

Hope it helps,
Paul