nipreps / smriprep

Structural MRI PREProcessing (sMRIPrep) workflows for NIPreps (NeuroImaging PREProcessing tools)

Home Page: https://nipreps.github.io/smriprep

smriprep unable to allocate job

girishmm opened this issue

Describe the bug

sMRIPrep was run on a BIDS-compliant dataset containing over 500 anatomical T1w images. The workflow was created, jobs were scheduled, and, judging from a casual inspection of the log, some were allocated resources. The first week of run time went as planned, with the expected outputs of the initial pipeline steps saved, but around the second week sMRIPrep was no longer using as many resources as expected. Inspection with htop showed that many threads belonging to the smriprep process were idle.

[Screenshot of htop output, 2023-04-21, showing many idle smriprep threads]

The standard output says:

                 Cannot allocate job 153261 (5.00GB, 8 threads).
        230421-14:50:37,273 nipype.workflow DEBUG:
                 Cannot allocate job 153262 (5.00GB, 8 threads).
        230421-14:50:37,273 nipype.workflow DEBUG:
                 Cannot allocate job 153263 (5.00GB, 8 threads).
        230421-14:50:37,273 nipype.workflow DEBUG:

and so on for all jobs.

  1. The jobs are not being allocated resources despite CPU (--ncpus 80 was provided) and memory being available.
  2. Is this due to not specifying an upper bound for memory? The process and its threads are stuck at the 12.4 GB visible in the htop output. If so, is there a warning message printed to the standard output that I should look for? (See the sketch right after this list.)
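
A minimal sketch of what a re-run might look like, assuming sMRIPrep exposes an explicit memory cap through a --mem-gb flag (the flag name and the 64 GB value are my guesses; smriprep --help would confirm both), with the output piped through tee into an arbitrarily named log file so the full verbose output is kept:

    smriprep --ncpus 80 --mem-gb 64 -vvv --resource-monitor \
        ../bids derivative participant 2>&1 | tee smriprep-run.log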

Exact command line executed

smriprep --ncpus 80 -vvv --resource-monitor ../bids derivative participant

Are you positive that the input dataset is BIDS-compliant?

  • I have used the online BIDS-Validator
  • I have run a local installation of the BIDS-Validator (1.8.9).
  • I let sMRIPrep check it for me (in other words, I didn't set the --skip-bids-validation argument).
  • No, I haven't checked myself AND used the --skip-bids-validation argument.

sMRIPrep feedback information
Please attach the full log written to the standard output and the crashfile(s), if generated.
Help needed in locating the log files, as I am unable to find them.
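
The only clue I have is that nipype crashfiles are normally named starting with crash- ; is a search along these lines (starting from the directory the command was launched in, which is a guess on my part) the right approach?

    # hypothetical search for nipype crashfiles from the launch directory
    find . -name "crash-*" 2>/dev/null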

Installation type (please complete the following information):

  • "Bare-metal"
  • Singularity
  • Docker

This would seem to indicate that you have jobs that have finished but failed to clean up properly, so nipype thinks it has fewer resources than it actually does. You should see lines showing which jobs are currently running and how many free resources you have. Something like:

230420-16:00:02,599 nipype.workflow INFO:
        [MultiProc] Running 13 tasks, and 41 jobs ready. Free memory (GB): 55.97/56.29, Free processors: 7/20.
                    Currently running:                                                                                                                                                                                                                       
                      * _threshold88                                                                                                                                                                                                                         
                      * _threshold87
                      * _threshold86                                                                                                                                                                                                                         
                      * _threshold85                                                                                                                                                                                                                         
                      * _threshold84                                                                                                                                                                                                                         
                      * _threshold83                                                                                                                                                                                                                         
                      * _threshold82                                                                                                                                                                                                                         
                      * _threshold81                                                                                                                                                                                                                         
                      * _threshold80                                                                                                                                                                                                                         
                      * _threshold79                                                                                                                                                                                                                         
                      * _threshold78                                                                                                                                                                                                                         
                      * _threshold77                
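
If the standard output had been captured to a file (say smriprep-run.log, to pick an arbitrary name), the relevant accounting lines could be pulled out afterwards with something as simple as:

    grep "Free memory" smriprep-run.log | tail -n 20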

I am unable to report back with this output: my tmux session is limited to a scrollback of 2,000 lines and I had not directed the output to a file. Since the workflow was started with verbose logging, well over 150,000 lines of standard output arrive in one go every few seconds, so the lines reporting free memory and processors scroll out of the buffer almost immediately. My attempts to reparent the running process or redirect its output in the hope of recovering this information were not successful either. I suggest closing this issue because of this.
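
For any future run I will keep the full log around; a sketch of what I have in mind (the history-limit value is arbitrary):

    # raise the tmux scrollback before launching anything long-running
    tmux set-option -g history-limit 200000
    # and/or dump whatever history the current pane still holds (only ~2000 lines here)
    tmux capture-pane -p -S - > tmux-scrollback.txt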

Can you share some common causes of a failed cleanup?

The main thing I can think of is getting killed by the OS for attempting to allocate excess memory. When that happens, Python does not always mark the process as having terminated, so the nipype scheduler would not know that the resources are now free or that the workflow has been interrupted.
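
One thing worth checking even without the nipype logs is whether the kernel's OOM killer fired during those two weeks. Something along these lines should show it (may need root; journalctl -k is an alternative on systemd machines):

    dmesg -T | grep -iE "out of memory|killed process"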

Without logs, it's going to be impossible to debug.

Yep, that makes sense. Apologies, but I tried and failed to retrieve the logs. Since we cannot proceed further, I am closing this issue.