nipreps / smriprep

Structural MRI PREProcessing (sMRIPrep) workflows for NIPreps (NeuroImaging PREProcessing tools)

Home Page: https://nipreps.github.io/smriprep

smriprep unable to allocate job

girishmm opened this issue

Describe the bug

sMRIPrep was run on a BIDS-compliant dataset containing over 500 anatomical T1w images. The workflow was created, jobs were scheduled, and, judging from a casual inspection of the log, some were allocated resources. The first week of run time went as planned, with the expected outputs of the initial pipeline steps saved, but around the second week sMRIPrep was no longer using as many resources as expected. Inspection with htop showed that many threads belonging to the smriprep process were idle.

[Screenshot of htop output, 2023-04-21, showing many idle smriprep threads]

The standard output says:

                 Cannot allocate job 153261 (5.00GB, 8 threads).
        230421-14:50:37,273 nipype.workflow DEBUG:
                 Cannot allocate job 153262 (5.00GB, 8 threads).
        230421-14:50:37,273 nipype.workflow DEBUG:
                 Cannot allocate job 153263 (5.00GB, 8 threads).
        230421-14:50:37,273 nipype.workflow DEBUG:

and so on for all jobs.

  1. The jobs are not being allocated resources despite CPU (--ncpus 80 was provided) and memory being available.
  2. Is this due to not specifying an upper bound for memory? The process and its threads are stuck at the 12.4 GB visible in the htop output. If so, is there a warning message printed to the standard output that I should look for? (See the sketch right after this list.)
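
A minimal sketch of what a re-run might look like, assuming sMRIPrep exposes an explicit memory cap through a --mem-gb flag (the flag name and the 64 GB value are my guesses; smriprep --help would confirm both), with the output piped through tee into an arbitrarily named log file so the full verbose output is kept:

    smriprep --ncpus 80 --mem-gb 64 -vvv --resource-monitor \
        ../bids derivative participant 2>&1 | tee smriprep-run.log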

Exact command line executed

smriprep --ncpus 80 -vvv --resource-monitor ../bids derivative participant

Are you positive that the input dataset is BIDS-compliant?

  • I have used the online BIDS-Validator
  • I have run a local installation of the BIDS-Validator (1.8.9).
  • I let sMRIPrep check it for me (in other words, I didn't set the --skip-bids-validation argument).
  • No, I haven't checked myself AND used the --skip-bids-validation argument.

sMRIPrep feedback information
Please attach the full log written to the standard output and the crashfile(s), if generated.
Help needed in locating the log files, as I am unable to find them.
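
The only clue I have is that nipype crashfiles are normally named starting with crash- ; is a search along these lines (starting from the directory the command was launched in, which is a guess on my part) the right approach?

    # hypothetical search for nipype crashfiles from the launch directory
    find . -name "crash-*" 2>/dev/null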

Installation type (please complete the following information):

  • "Bare-metal"
  • Singularity
  • Docker

This would seem to indicate that you have jobs that have finished but failed to clean up properly, so nipype thinks it has fewer resources than it actually does. You should see lines showing which jobs are currently running and how many free resources you have. Something like:

230420-16:00:02,599 nipype.workflow INFO:
        [MultiProc] Running 13 tasks, and 41 jobs ready. Free memory (GB): 55.97/56.29, Free processors: 7/20.
                    Currently running:                                                                                                                                                                                                                       
                      * _threshold88                                                                                                                                                                                                                         
                      * _threshold87
                      * _threshold86                                                                                                                                                                                                                         
                      * _threshold85                                                                                                                                                                                                                         
                      * _threshold84                                                                                                                                                                                                                         
                      * _threshold83                                                                                                                                                                                                                         
                      * _threshold82                                                                                                                                                                                                                         
                      * _threshold81                                                                                                                                                                                                                         
                      * _threshold80                                                                                                                                                                                                                         
                      * _threshold79                                                                                                                                                                                                                         
                      * _threshold78                                                                                                                                                                                                                         
                      * _threshold77                
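
If the standard output had been captured to a file (say smriprep-run.log, to pick an arbitrary name), the relevant accounting lines could be pulled out afterwards with something as simple as:

    grep "Free memory" smriprep-run.log | tail -n 20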

I am unable to report back with this output: my tmux session is limited to a scrollback of 2,000 lines and I had not directed the output to a file. Since the workflow was started with verbose logging, well over 150,000 lines of standard output arrive in one go every few seconds, so the lines reporting free memory and processors scroll out of the buffer almost immediately. My attempts to reparent the running process or redirect its output in the hope of recovering this information were not successful either. I suggest closing this issue because of this.
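
For any future run I will keep the full log around; a sketch of what I have in mind (the history-limit value is arbitrary):

    # raise the tmux scrollback before launching anything long-running
    tmux set-option -g history-limit 200000
    # and/or dump whatever history the current pane still holds (only ~2000 lines here)
    tmux capture-pane -p -S - > tmux-scrollback.txt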

Can you share some common causes of a failed cleanup?

The main thing I can think of is getting killed by the OS for attempting to allocate excess memory. When that happens, Python does not always mark the process as having terminated, so the nipype scheduler would not know that the resources are now free or that the workflow has been interrupted.
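
One thing worth checking even without the nipype logs is whether the kernel's OOM killer fired during those two weeks. Something along these lines should show it (may need root; journalctl -k is an alternative on systemd machines):

    dmesg -T | grep -iE "out of memory|killed process"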

Without logs, it's going to be impossible to debug.

Yep, that makes sense. Apologies, but I tried and failed to retrieve the logs. Since we cannot proceed further, I am closing this issue.