AndreWeiner / ml-cfd-lecture

Lecture material for machine learning applied to computational fluid mechanics

Exercise 3: Error with ./Allrun in renumberMesh

hyzahw opened this issue

Hi,

I followed the steps in tutorial 3 to run the cylinder case on OpenFOAM. The ./Allrun worked fine with blockMesh and decomposePar but came out with an error in renumberMesh and consequently with pimpleFoam.

The renumberMesh log file shows the following:


INFO: squashfuse not found, will not be able to mount SIF or other squashfs files
INFO: fuse2fs not found, will not be able to mount EXT3 filesystems
INFO: gocryptfs not found, will not be able to use gocryptfs
INFO: Converting SIF file to temporary sandbox...
INFO: squashfuse not found, will not be able to mount SIF or other squashfs files
INFO: fuse2fs not found, will not be able to mount EXT3 filesystems
INFO: gocryptfs not found, will not be able to use gocryptfs
INFO: Converting SIF file to temporary sandbox...
FATAL: while extracting /home/hyzahw/MLCFD/ml-cfd-lecture/of2206-py1.12.1-cpu.sif: root filesystem extraction failed: extract command failed: WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
WARNING: Skipping mount /etc/hosts [binds]: /etc/hosts doesn't exist in container
WARNING: Skipping mount /etc/localtime [binds]: /etc/localtime doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/tmp [tmp]: /tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/var/tmp [tmp]: /var/tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container

FATAL ERROR:write_file: failed to create file /image/root/opt/libtorch/include/ATen/ops/combinations_compositeimplicitautograd_dispatch.h, because Too many open files
Parallel unsquashfs: Using 8 processors
74903 inodes (86780 blocks) to write

: exit status 1
FATAL: while extracting /home/hyzahw/MLCFD/ml-cfd-lecture/of2206-py1.12.1-cpu.sif: root filesystem extraction failed: extract command failed: WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
WARNING: Skipping mount /etc/hosts [binds]: /etc/hosts doesn't exist in container
WARNING: Skipping mount /etc/localtime [binds]: /etc/localtime doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/tmp [tmp]: /tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/var/tmp [tmp]: /var/tmp doesn't exist in container
WARNING: Skipping mount /usr/local/var/apptainer/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container

FATAL ERROR:write_file: failed to create file /image/root/opt/libtorch/include/ATen/ops/as_strided_copy.h, because Too many open files
Parallel unsquashfs: Using 8 processors
74903 inodes (86780 blocks) to write

: exit status 1

The first four INFO lines also appear in my log files for blockMesh and decomposePar. I am not sure whether that is normal, but I believe it is related to the fatal errors in the log above.
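
For reference, the "Too many open files" message appears while Apptainer unpacks the SIF image into a temporary sandbox with unsquashfs, which opens a large number of files at once. A minimal sketch for inspecting (and, if necessary, raising) the per-process open-file limit in a bash shell is shown below; the limit value is only illustrative.

# show the current soft and hard limits for open file descriptors
ulimit -Sn
ulimit -Hn

# temporarily raise the soft limit for the current shell session (illustrative value)
ulimit -n 4096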

Thank you in advance!

Hi @hyzahw,

it looks like an issue with the Apptainer container to me. To help you better, could you check the following:

  1. are you running Linux natively, or are you using WSL (Windows Subsystem for Linux)?
  2. did you follow the setup explained in exercise 2, especially the part regarding Apptainer and building the container?
  3. is the container located at the top level of the repository? (A quick way to check points 2 and 3 is sketched below.)
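
A minimal sanity check for points 2 and 3, run from the top level of the repository, could look like the following sketch; it assumes the image keeps the file name of2206-py1.12.1-cpu.sif from the logs above.

# run from the top level of the ml-cfd-lecture repository
apptainer --version               # Apptainer itself should be found
which squashfuse fuse2fs          # both were reported missing in the log above
ls -lh of2206-py1.12.1-cpu.sif    # the image should be located at the repository root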

Regards,
Janis

Hi Janis,

Thanks for your answer. I am using Ubuntu as my native OS. I went through the Apptainer installation steps again and no longer get the errors concerning squashfuse and fuse2fs, so I assume the Apptainer problem is solved.

I now get the following error when the runParallel renumberMesh step in ./Allrun starts:

--> FOAM FATAL ERROR: (openfoam-2206)
attempt to run parallel on 1 processor

    From static bool Foam::UPstream::init(int&, char**&, bool)
    in file UPstream.C at line 286.

FOAM aborting

I also tried the ./Allrun command in another tutorial that runs in parallel. As before, decomposePar executed normally, and the same error appeared as soon as the runParallel step was reached. Do you have any recommendations?

This may be an issue with MPI. Can you check which version is installed by executing mpiexec --version in a terminal? You can also check whether the simulation runs on a single CPU. To do so, change the Allrun script from:

# decompose and run case
runApplication decomposePar
runParallel renumberMesh -overwrite
runParallel $(getApplication)

to:

# decompose and run case
# runApplication decomposePar
# runParallel renumberMesh -overwrite
runApplication $(getApplication)

Edit: I get this error when numberOfSubdomains in system/decomposeParDict is set to 1. Can you check whether this parameter is set to 2 in your case? It should look like this: numberOfSubdomains 2;
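
If in doubt, the value can be checked directly from the case directory; this is only a convenience one-liner.

grep numberOfSubdomains system/decomposeParDict
# expected output: numberOfSubdomains  2;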

Regards
Janis

I was using MPI version 3.3.2 when I encountered the problem. I have now upgraded to 5.0.1 and still get the same error. Which version are you using? Does OF2206 require a specific MPI version for compatibility?

The case runs normally if I don't decompose.

The decomposeParDict is similar to the original file in test_cases:

numberOfSubdomains  2;

method              hierarchical;

coeffs
{
    n               (2 1 1);
}

Hi @hyzahw,

I have MPI version 4.0.3 installed; as far as I know, the Apptainer container uses MPI version 4.1.2. In the past, I have encountered such issues when the MPI versions on the host and in the container differ too much.

If it is not too much trouble, you could try to install the MPI version used in the container (you can check which version that is by executing mpiexec --version inside the container). Alternatively, maybe @AndreWeiner has additional ideas or tips that may help with the issue.
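
For reference, a minimal sketch for comparing the two versions, assuming the image keeps the file name from the logs above:

# MPI version on the host
mpiexec --version

# MPI version inside the container image
apptainer exec of2206-py1.12.1-cpu.sif mpiexec --version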

Regards
Janis

After reinstalling MPI, this time version 4.1.2, the simulation ran in parallel normally.
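
(The exact installation route is not stated here; one possible way on an apt-based Ubuntu is to use the distribution packages, which on Ubuntu 22.04 currently provide OpenMPI 4.1.2.)

# one hedged option on Ubuntu 22.04 (its packages currently ship OpenMPI 4.1.2)
sudo apt update
sudo apt install openmpi-bin libopenmpi-dev
mpiexec --version    # verify the reported version afterwards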

Thanks @JanisGeise! ;)