AndreWeiner / ml-cfd-lecture

Lecture material for machine learning applied to computational fluid mechanics

Exercise 3: Error with ./Allrun in renumberMesh

hyzahw opened this issue

Hi,

I followed the steps in tutorial 3 to run the cylinder case on OpenFOAM. The ./Allrun worked fine with blockMesh and decomposePar but came out with an error in renumberMesh and consequently with pimpleFoam.

The renumberMesh log file shows the following:


INFO: squashfuse not found, will not be able to mount SIF or other squashfs files
INFO: fuse2fs not found, will not be able to mount EXT3 filesystems
INFO: gocryptfs not found, will not be able to use gocryptfs
INFO: Converting SIF file to temporary sandbox...
INFO: squashfuse not found, will not be able to mount SIF or other squashfs files
INFO: fuse2fs not found, will not be able to mount EXT3 filesystems
INFO: gocryptfs not found, will not be able to use gocryptfs
INFO: Converting SIF file to temporary sandbox...
FATAL: while extracting /home/hyzahw/MLCFD/ml-cfd-lecture/of2206-py1.12.1-cpu.sif: root filesystem extraction failed: extract command failed: WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
WARNING: Skipping mount /etc/hosts [binds]: /etc/hosts doesn't exist in container
WARNING: Skipping mount /etc/localtime [binds]: /etc/localtime doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/tmp [tmp]: /tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/var/tmp [tmp]: /var/tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container

FATAL ERROR:write_file: failed to create file /image/root/opt/libtorch/include/ATen/ops/combinations_compositeimplicitautograd_dispatch.h, because Too many open files
Parallel unsquashfs: Using 8 processors
74903 inodes (86780 blocks) to write

: exit status 1
FATAL: while extracting /home/hyzahw/MLCFD/ml-cfd-lecture/of2206-py1.12.1-cpu.sif: root filesystem extraction failed: extract command failed: WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
WARNING: Skipping mount /etc/hosts [binds]: /etc/hosts doesn't exist in container
WARNING: Skipping mount /etc/localtime [binds]: /etc/localtime doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/tmp [tmp]: /tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/var/tmp [tmp]: /var/tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container

FATAL ERROR:write_file: failed to create file /image/root/opt/libtorch/include/ATen/ops/as_strided_copy.h, because Too many open files
Parallel unsquashfs: Using 8 processors
74903 inodes (86780 blocks) to write

: exit status 1

The first four INFO lines also appear in my log files for blockMesh and decomposePar. I am not sure whether that is normal, but I believe it is related to the fatal errors in the log above.
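
For reference, the "Too many open files" message appears while Apptainer unpacks the SIF image into a temporary sandbox with unsquashfs, which opens a large number of files at once. A minimal sketch for inspecting (and, if necessary, raising) the per-process open-file limit in a bash shell is shown below; the limit value is only illustrative.

# show the current soft and hard limits for open file descriptors
ulimit -Sn
ulimit -Hn

# temporarily raise the soft limit for the current shell session (illustrative value)
ulimit -n 4096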

Thank you in advance!

Hi @hyzahw,

it looks like an issue with the Apptainer container to me. To help you better, could you check the following:

  1. are you running Linux natively, or are you using WSL (Windows Subsystem for Linux)?
  2. did you follow the setup explained in exercise 2, especially the part regarding Apptainer and building the container?
  3. is the container located at the top level of the repository? (A quick way to check points 2 and 3 is sketched below.)
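
A minimal sanity check for points 2 and 3, run from the top level of the repository, could look like the following sketch; it assumes the image keeps the file name of2206-py1.12.1-cpu.sif from the logs above.

# run from the top level of the ml-cfd-lecture repository
apptainer --version               # Apptainer itself should be found
which squashfuse fuse2fs          # both were reported missing in the log above
ls -lh of2206-py1.12.1-cpu.sif    # the image should be located at the repository root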

Regards,
Janis

Hi Janis,

Thanks for your answer. I am using Ubuntu as my native OS. I went through the Apptainer installation steps again and no longer get the errors concerning squashfuse and fuse2fs, so I assume the Apptainer problem is solved.

I now get the following error when the runParallel renumberMesh step in ./Allrun starts:

--> FOAM FATAL ERROR: (openfoam-2206)
attempt to run parallel on 1 processor

    From static bool Foam::UPstream::init(int&, char**&, bool)
    in file UPstream.C at line 286.

FOAM aborting

I also tried the ./Allrun command in another tutorial that runs in parallel. As before, decomposePar executed normally, and the same error appeared as soon as the runParallel step was reached. Do you have any recommendations?

This may be an issue with MPI. Can you check which version is installed by executing mpiexec --version in a terminal? You can also check whether the simulation runs on a single CPU. To do so, change the Allrun script from:

# decompose and run case
runApplication decomposePar
runParallel renumberMesh -overwrite
runParallel $(getApplication)

to:

# decompose and run case
# runApplication decomposePar
# runParallel renumberMesh -overwrite
runApplication $(getApplication)

Edit: I get this error when numberOfSubdomains in system/decomposeParDict is set to 1. Can you check whether this parameter is set to 2 in your case? It should look like this: numberOfSubdomains 2;
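
If in doubt, the value can be checked directly from the case directory; this is only a convenience one-liner.

grep numberOfSubdomains system/decomposeParDict
# expected output: numberOfSubdomains  2;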

Regards
Janis

I was using MPI version 3.3.2 when I encountered the problem. I have now upgraded to 5.0.1 and still get the same error. Which version are you using? Does OF2206 require a specific MPI version for compatibility?

The case runs normally if I don't decompose.

The decomposeParDict is similar to the original file in test_cases:

numberOfSubdomains  2;

method              hierarchical;

coeffs
{
    n               (2 1 1);
}

Hi @hyzahw,

I have MPI version 4.0.3 installed; as far as I know, the Apptainer container uses MPI version 4.1.2. In the past, I have encountered such issues when the MPI versions on the host and in the container differ too much.

If it is not too much trouble, you could try to install the MPI version used in the container (you can check which version that is by executing mpiexec --version inside the container). Alternatively, maybe @AndreWeiner has additional ideas or tips that may help with the issue.
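
For reference, a minimal sketch for comparing the two versions, assuming the image keeps the file name from the logs above:

# MPI version on the host
mpiexec --version

# MPI version inside the container image
apptainer exec of2206-py1.12.1-cpu.sif mpiexec --version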

Regards
Janis

After reinstalling MPI, this time version 4.1.2, the simulation ran in parallel normally.
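
(The exact installation route is not stated here; one possible way on an apt-based Ubuntu is to use the distribution packages, which on Ubuntu 22.04 currently provide OpenMPI 4.1.2.)

# one hedged option on Ubuntu 22.04 (its packages currently ship OpenMPI 4.1.2)
sudo apt update
sudo apt install openmpi-bin libopenmpi-dev
mpiexec --version    # verify the reported version afterwards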

Thanks @JanisGeise! ;)