Defect: Image number dependent MPI_Win_lock error
Oiubrab opened this issue
System Information:
- OpenCoarrays Version: OpenCoarrays Coarray Fortran Compiler Wrapper (caf version 2.9.2)
- Fortran Compiler: GNU Fortran (GCC) 11.1.0
- C compiler used for building lib:
- Installation method: install.sh
- All flags & options passed to the installer: all default
- Output of uname -a: Linux manjaro 5.13.5-1-MANJARO #1 SMP PREEMPT Mon Jul 26 07:43:29 UTC 2021 x86_64 GNU/Linux
- MPI library being used: openmpi 4.1.1-2
- Machine architecture and number of physical cores: 8th/9th Gen Core 8-core Desktop Processor Host Bridge/DRAM Registers [Coffee Lake S] (16 threads)
- cmake version 3.21.1
Note: running mpicc -show fails with /opt/nvidia/hpc_sdk/Linux_x86_64/21.1/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/mpicc: error while loading shared libraries: libnvcpumath.so: cannot open shared object file: No such file or directory
The issue
What I was trying to do
I was trying to run four concurrent images of my compiled code, found at https://github.com/Oiubrab/byinheritance, by executing sudo chmod u+x i_am_in_command.zsh && ./i_am_in_command.zsh clean 2 test print. The why is described in the GitHub README at that link, but basically I have created a neural network in Fortran that computes a trading action. Ultimately, the pertinent execution is this line of the aforementioned shell script: cafrun -n 4 --use-hwthread-cpus ./lack_of_comprehension $3
What Happened
When this line is run, an MPI error is generated. Having put in two print statements to catch the error, where Place is the order in which the statements appear in the code, I get:
Invalid Trades:
[1, 0, 0, 0, 0]
0
Network Choice:
[0, 0, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0]
[0, 0, 0]
Market Prices and Info:
{'stock_identifier': 'SE1', 'stock_number': 1, 'stock_price': 0.33, 'units_owned': 0}
{'stock_identifier': 'ADV', 'stock_number': 2, 'stock_price': 0.001, 'units_owned': 0}
{'stock_identifier': 'SBR', 'stock_number': 3, 'stock_price': 0.115, 'units_owned': 0}
Account Position:
{'account': 'test', 'account_value': 3000.0, 'time': 1628241351.6863492}
run: 1
image number: 1 Place: 1
image number: 2 Place: 1
image number: 3 Place: 1
image number: 4 Place: 1
image number: 2 Place: 2
image number: 3 Place: 2
image number: 1 Place: 2
[manjaro:25102] *** An error occurred in MPI_Win_detach
[manjaro:25102] *** reported by process [3482124289,0]
[manjaro:25102] *** on win rdma window 5
[manjaro:25102] *** MPI_ERR_UNKNOWN: unknown error
[manjaro:25102] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[manjaro:25102] *** and potentially your MPI job)
[manjaro:25098] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[manjaro:25098] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
`/usr/bin/mpiexec -n 4 --use-hwthread-cpus ./lack_of_comprehension test`
failed to run.
Invalid Trades:
[0, 0, 1, 0, 0]
2
Network Choice:
[0, 0, 0, 0, 0, 0, 0] [1, 0, 0, 0, 0, 0, 0] [0, 1, 0, 1, 0, 0, 0]
[0, -1, -10]
Market Prices and Info:
{'stock_identifier': 'SE1', 'stock_number': 1, 'stock_price': 0.33, 'units_owned': 0}
{'stock_identifier': 'ADV', 'stock_number': 2, 'stock_price': 0.001, 'units_owned': 0}
{'stock_identifier': 'SBR', 'stock_number': 3, 'stock_price': 0.115, 'units_owned': 0}
Account Position:
{'account': 'test', 'account_value': 3000.0, 'time': 1628241360.06828}
run: 2
image number: 1 Place: 1
image number: 2 Place: 1
image number: 3 Place: 1
image number: 4 Place: 1
image number: 1 Place: 2
image number: 2 Place: 2
image number: 3 Place: 2
[manjaro:25174] *** An error occurred in MPI_Win_detach
[manjaro:25174] *** reported by process [3486973953,2]
[manjaro:25174] *** on win rdma window 5
[manjaro:25174] *** MPI_ERR_UNKNOWN: unknown error
[manjaro:25174] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[manjaro:25174] *** and potentially your MPI job)
[manjaro:25168] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[manjaro:25168] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
`/usr/bin/mpiexec -n 4 --use-hwthread-cpus ./lack_of_comprehension test`
failed to run.
Invalid Trades:
[0, 0, 1, 0, 0]
2
Network Choice:
[1, 0, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0] [0, 1, 0, 0, 0, 0, 0]
[-1, 0, -2]
Market Prices and Info:
{'stock_identifier': 'SE1', 'stock_number': 1, 'stock_price': 0.39, 'units_owned': 0}
{'stock_identifier': 'ADV', 'stock_number': 2, 'stock_price': 0.001, 'units_owned': 0}
{'stock_identifier': 'SBR', 'stock_number': 3, 'stock_price': 0.12, 'units_owned': 0}
Account Position:
{'account': 'test', 'account_value': 3000.0, 'time': 1628241366.8761365}
What I expected to happen
Markets and network choices vary; this is expected. What is not expected is the error, and the fact that the fourth image does not reach the second place. I should see the output above, but without the error and with an image number: 4 Place: 2 line. This exact code (minus the print statements) ran without a hitch with the previous version of openmpi (openmpi-4.0.5-3-x86_64). I have since tried to run other OpenCoarrays programs I have written and found various errors when running with fewer than six threads.
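A note on the Place markers: output that is still sitting in a buffer when MPI aborts the job can be lost, so a missing image number: 4 Place: 2 line does not necessarily prove that image 4 never got there. The sketch below shows one way to write such markers with an explicit flush. The module place_markers and the procedures place_line and mark_place are hypothetical names for illustration; they are not the print statements used in lack_of_comprehension.

```fortran
! Sketch only: place_markers, place_line and mark_place are hypothetical
! names, not taken from the byinheritance repository.
module place_markers
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none
contains

  ! Build the marker text, e.g. "image number: 4 Place: 2".
  function place_line(image, place) result(line)
    integer, intent(in) :: image, place
    character(len=:), allocatable :: line
    character(len=64) :: buffer
    write (buffer, '(a,i0,a,i0)') 'image number: ', image, ' Place: ', place
    line = trim(buffer)
  end function place_line

  ! Print the marker for the calling image and flush it straight away,
  ! so it is not left in a buffer if the job aborts shortly afterwards.
  subroutine mark_place(place)
    integer, intent(in) :: place
    character(len=:), allocatable :: line
    line = place_line(this_image(), place)
    write (output_unit, '(a)') line
    flush (output_unit)
  end subroutine mark_place

end module place_markers

program demo_place_markers
  use place_markers, only: mark_place
  implicit none
  call mark_place(1)
  ! ... the code under suspicion would run here ...
  call mark_place(2)
end program demo_place_markers
```

If image 4 still never reports Place 2 with flushed markers, that points more firmly at the code between the two markers on that image.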
Step by step reproduction
This error can be reproduced by following the execution above. As the error seems to be code agnostic, you can also run the process below to reproduce a similar error (again, this code ran before the update):
step 1
Take the following code and save it as an f95 file (e.g. test_arraymove.f95):
program test_arraycom
  real, dimension(10), codimension[*] :: x, y
  integer :: num_img, me
  num_img = num_images()
  me = this_image()
  print *, me, num_img
  ! Some code here
  x(2) = x(3)[6]        ! get value from image 6
  x(6)[4] = x(1)        ! put value on image 4
  x(:)[2] = y(:)        ! put array on image 2
  sync all
  ! Remote-to-remote array transfer
  if (me == 1) then
    y(:)[num_img] = x(:)[4]
    sync images (num_img)
  else if (me == num_img) then
    sync images ([1])
  end if
  x(1:10:2) = y(1:10:2)[4]   ! strided get from 4
end program
step 2
compile the code with caf test_arraymove.f95 -o programname
step 3
run the code with the number of CPU threads, $2, below 6,
i.e. cafrun -n $2 --use-hwthread-cpus ./programname
step 4
get an error of the form:
1 4
2 4
3 4
4 4
[manjaro:28814] *** An error occurred in MPI_Win_lock
[manjaro:28814] *** reported by process [3708682241,1]
[manjaro:28814] *** on win rdma window 6
[manjaro:28814] *** MPI_ERR_RANK: invalid rank
[manjaro:28814] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[manjaro:28814] *** and potentially your MPI job)
[manjaro:28809] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[manjaro:28809] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
`/usr/bin/mpiexec -n 4 --use-hwthread-cpus ./testarraymove`
failed to run.
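One thing worth ruling out for this smaller reproduction: the test addresses images 6 and 4 by literal index, and an image index greater than num_images() is not valid in coarray Fortran. With -n 4, x(3)[6] names an image that does not exist, which would fit both the MPI_ERR_RANK: invalid rank report and the fact that the failure only shows up below six images. This is an assumption rather than a confirmed diagnosis, and it says nothing about the MPI_Win_detach error in the main program. A minimal sketch of a guarded variant follows; safe_image is a hypothetical helper, not part of OpenCoarrays, and the program is cut down to the first two transfers.

```fortran
! Sketch only: safe_image is a hypothetical helper, not part of OpenCoarrays.
module image_guard
  implicit none
contains
  ! Clamp a requested image index into the valid range 1..n_images.
  pure integer function safe_image(wanted, n_images)
    integer, intent(in) :: wanted, n_images
    safe_image = max(1, min(wanted, n_images))
  end function safe_image
end module image_guard

program test_arraycom_guarded
  use image_guard, only: safe_image
  implicit none
  real, dimension(10), codimension[*] :: x, y
  integer :: num_img, me, src, dst

  num_img = num_images()
  me = this_image()

  ! Initialise local data so the transfers move defined values.
  x = real(me)
  y = real(me)
  sync all

  ! Images 6 and 4 only exist if that many images were launched.
  src = safe_image(6, num_img)
  dst = safe_image(4, num_img)

  x(2) = x(3)[src]                 ! get value from image src
  sync all
  if (me == 1) x(6)[dst] = x(1)    ! only image 1 puts, avoiding a write race
  sync all

  print *, me, num_img, src, dst
end program test_arraycom_guarded
```

This can be compiled and run the same way as in steps 2 and 3. Whether it avoids the MPI_Win_lock error on openmpi 4.1.1-2 has not been tested here; if it does, the original test was relying on behaviour outside the standard when run with fewer than six images.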