khanlab / hippunfold

BIDS App for Hippunfold (automated hippocampal unfolding and subfield segmentation)

Home Page: https://hippunfold.readthedocs.io

run inference duration

OliverWarrington opened this issue

Hello hippunfold!

I'm unsure whether the run_inference rule has got stuck or whether it is still running. It has been going for 5 hours and counting, and I cannot see any log file or other output related to its progress.

I got to the same point in a previous run, which I aborted after 3 hours as I needed the computing power for something else. The current run was a completely fresh start. All output from the previous run was deleted.

This might be expected behaviour, in which case please disregard, but if not, could you help me figure out what's happening?

Run command

docker run -it --rm \
    -v /Users/OliverW/Dev/hippunfold/bids:/bids:ro \
    -v /Users/OliverW/Dev/hippunfold/output:/output \
    khanlab/hippunfold:latest \
    /bids /output participant -p \
    --modality T1w --cores all \
    --path_T1w "bids/sub-001/sub-{subject}_T1w.nii.gz" \
    --path_T2w "bids/sub-001/sub-{subject}_T2w.nii.gz"

Computer resources

Mac:
macOS Monterey version 12.3.1
M1 Pro chip

Docker resources allocated:
CPUs: 8
Mem: 12GB

I can see that 100% of the CPU and 4GB of memory are currently being used in Docker.
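
For completeness, this is roughly how I'm checking what the container is doing (a rough sketch; <container_id> is a placeholder for whatever docker ps reports for this run):

# find the running hippunfold container
docker ps --filter ancestor=khanlab/hippunfold:latest

# list the processes inside it; nnUNet_predict should appear if inference is running
docker top <container_id>

# live per-container CPU/memory usage
docker stats <container_id>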

Stdout

[Wed Jun 15 06:41:36 2022]
rule run_inference:
    input: work/sub-001/anat/sub-001_hemi-R_space-corobl_desc-preproc_T1w.nii.gz, /opt/hippunfold_cache/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar
    output: work/sub-001/anat/sub-001_hemi-R_space-corobl_desc-nnunet_dseg.nii.gz
    log: logs/sub-001/sub-001_hemi-R_space-corobl_nnunet.txt
    jobid: 76
    wildcards: subject=001, hemi=R
    threads: 8
    resources: tmpdir=/tmp, gpus=0, mem_mb=16000, time=60

mkdir -p tempmodel tempimg templbl && cp work/sub-001/anat/sub-001_hemi-R_space-corobl_desc-preproc_T1w.nii.gz tempimg/temp_0000.nii.gz && tar -xf /opt/hippunfold_cache/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar -C tempmodel && export RESULTS_FOLDER=tempmodel && export nnUNet_n_proc_DA=8 && nnUNet_predict -i tempimg -o templbl -t Task101_hcp1200_T1w -chk model_best --disable_tta &> logs/sub-001/sub-001_hemi-R_space-corobl_nnunet.txt && cp templbl/temp.nii.gz work/sub-001/anat/sub-001_hemi-R_space-corobl_desc-nnunet_dseg.nii.gz

Sorry for the late reply (I was on holiday at the time, and am now at OHBM). That certainly sounds strange. The fact that you get no log output in logs/sub-001/sub-001_hemi-R_space-corobl_nnunet.txt suggests there might be an issue in the shell commands leading up to the actual nnUNet_predict call, i.e. somewhere in:

mkdir -p tempmodel tempimg templbl && cp work/sub-001/anat/sub-001_hemi-R_space-corobl_desc-preproc_T1w.nii.gz tempimg/temp_0000.nii.gz &&
tar -xf /opt/hippunfold_cache/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar -C tempmodel 
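
If you get a chance, something like the following might help narrow down which step it is stuck on. This is just a sketch: <container_id> is a placeholder from docker ps, it assumes bash is available in the image, and the temp/log paths are relative to the Snakemake working directory (which should be under your mounted output directory).

# attach a shell to the already-running container
docker exec -it <container_id> bash

# verify the cached model tarball is present and lists cleanly
ls -lh /opt/hippunfold_cache/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar
tar -tf /opt/hippunfold_cache/trained_model.3d_fullres.Task101_hcp1200_T1w.nnUNetTrainerV2.model_best.tar | head

# from the directory containing work/ and logs/: if tempimg/temp_0000.nii.gz exists
# and is non-empty, the cp and tar steps already finished and the job is sitting
# in nnUNet_predict itself
ls -lh tempimg tempmodel templbl
ls -lh logs/sub-001/sub-001_hemi-R_space-corobl_nnunet.txt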

I'm going to try to reproduce your error, as I have an M1 Mac too, but I haven't really been using it for hippunfold.

No problem! Thanks for looking into it. Just to confirm, it never finished running.

OK, I had a chance to try running it myself on my Apple M1.

It is really slow (at least 10x slower, just based on how long it takes to run the first several rules), but that seems to be expected since it is an amd64 container running in emulated mode. The inference step is the most demanding one, so it may simply be consuming more CPU and memory than is available.
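
For anyone who wants to confirm this on their own machine, a quick sketch (the --entrypoint override assumes uname is available in the image, which it should be):

# architecture of the pulled image; reports amd64 for the current container
docker image inspect khanlab/hippunfold:latest --format '{{.Architecture}}'

# architecture seen from inside the container; on an M1 host this still reports
# x86_64 because Docker runs the amd64 image under qemu emulation
docker run --rm --entrypoint uname khanlab/hippunfold:latest -m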

So until we get a chance to build a new container with arm64 architecture, I would say Apple M1 is unsupported for hippunfold. Sorry!

Another option (if you don't have access to an Intel-based system) is to use CBRAIN. We have hippunfold 1.0.0 set up there; if you sign up for an account you should be able to run it through the web-based UI:
https://portal.cbrain.mcgill.ca

No problem! I look forward to trying it whenever a new arm64 container is available.

Thanks for checking this out. I can leave this open for others with Apple M1 machines to see, or feel free to close it.