HPCToolkit / hpctoolkit

HPCToolkit performance tools: measurement and analysis components

HPCRUN: Segmentation fault (core dumped)

Lynd98 opened this issue

The nowait clause on an OpenMP offloading directive causes a segmentation fault when the program runs under hpcrun. This is with AMD's ROCm 5.1.0 and clang++. If nowait is removed, hpcrun works OK.

To build and run the executable:

#!/bin/bash
export OMP_NUM_THREADS=2
export PATH=$PATH:/opt/rocm-5.1.0/llvm/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm-5.1.0/llvm/lib
echo $OMP_NUM_THREADS
rm ./a.out
which clang
clang++ -O2 -g -DNDEBUG -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a nowait_offload_EnterData.cpp
./a.out

Here is the source code:

// Conversion of nowait_offload.f95 to CPP.

#include
#include <sys/time.h>
#include <omp.h>
using namespace std;

void foo(float a[],float b[],float c[]) {
    const int nsize = 1000;
    int i,j;
    #pragma omp target enter data map(to:a[0:nsize],b[0:nsize],c[0:nsize])
    for (j=0; j<200; j++) {
        #pragma omp target teams distribute parallel for private(i) nowait
        for (i=1;i<nsize;i++) {
            a[i] = b[i] * c[i] + i;
        }
    }
    #pragma omp target exit data map(from:a[0:nsize])
}

double mysecond()
{
    struct timeval tp;
    struct timezone tzp;
    int i;

    i = gettimeofday(&tp,&tzp);
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

int main() {
    const int nsize = 1000;
    float a[nsize], b[nsize], c[nsize];
    int i;
    double start, finish;
    char* numThreads;

    for (i=0; i<nsize; i++) {
        a[i] = 0.0;
        b[i] = i;
        c[i] = 10.0;
    }

    start = mysecond();
    #pragma parallel
    (void) foo(a,b,c);
    finish = mysecond();
    numThreads = getenv ("OMP_NUM_THREADS");
    printf("(# of foos)= %s, time=%f\n", numThreads, finish-start);

    printf("a(1)=%f, a(2)=%f\n",a[1], a[2]);
    if (a[1] != 11 || a[2] != 22) {
        printf("ERROR: wrong answers\n");
        return -1;
    }
    printf("Success: if a diagnostic line starting with DEVID was output\n");
}

To create the error:

hpcrun -e CPUTIME -e gpu=openmp -t ./a.out

Error message: Program received signal SIG37, Real-time event 37.

hpcrun -e CPUTIME -e gpu=openmp -t rocgdb ./a.out

Program received signal SIG37, Real-time event 37.
0x00007fffec620e89 in ?? () from /opt/rocm-5.1.0/llvm/bin/../lib/../../lib/libhsa-runtime64.so.1
(gdb) bt
#0 0x00007fffec620e89 in ?? () from /opt/rocm-5.1.0/llvm/bin/../lib/../../lib/libhsa-runtime64.so.1
#1 0x00007ffff7fe0b8a in ?? () from /lib64/ld-linux-x86-64.so.2
#2 0x00007ffff7fe0c91 in ?? () from /lib64/ld-linux-x86-64.so.2
#3 0x00007ffff757e915 in __GI__dl_catch_exception (exception=, operate=, args=) at dl-error-skeleton.c:182
#4 0x00007ffff7fe50bf in ?? () from /lib64/ld-linux-x86-64.so.2
#5 0x00007ffff757e8b8 in __GI__dl_catch_exception (exception=, operate=, args=) at dl-error-skeleton.c:208
#6 0x00007ffff7fe45fa in ?? () from /lib64/ld-linux-x86-64.so.2
#7 0x00007ffff7fab34c in dlopen_doit (a=a@entry=0x7fffffffd760) at dlopen.c:66
#8 0x00007ffff757e8b8 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffd700, operate=, args=) at dl-error-skeleton.c:208
#9 0x00007ffff757e983 in __GI__dl_catch_error (objname=0x24d1c0, errstring=0x24d1c8, mallocedp=0x24d1b8, operate=, args=) at dl-error-skeleton.c:227
#10 0x00007ffff7fabb59 in _dlerror_run (operate=operate@entry=0x7ffff7fab2f0 <dlopen_doit>, args=args@entry=0x7fffffffd760) at dlerror.c:170
#11 0x00007ffff7fab3da in __dlopen (file=, mode=) at dlopen.c:87
#12 0x00007ffff76a2f48 in RTLsTy::LoadRTLs() () from /opt/rocm-5.1.0/llvm/bin/../lib/libomptarget.so
#13 0x00007ffff761f47f in __pthread_once_slow (once_control=0x24d03c, init_routine=0x7ffff7aa1c20 <__once_proxy>) at pthread_once.c:116
#14 0x00007ffff76937dd in __tgt_register_lib () from /opt/rocm-5.1.0/llvm/bin/../lib/libomptarget.so
#15 0x000000000020a6fd in __libc_csu_init ()
#16 0x00007ffff7442040 in __libc_start_main (main=0x7ffff7bd60b0 <monitor_main>, argc=1, argv=0x7fffffffdb28, init=0x20a6b0 <__libc_csu_init>, fini=, rtld_fini=,
stack_end=0x7fffffffdb18) at ../csu/libc-start.c:264
#17 0x00007ffff7bd5757 in __libc_start_main () from /opt/tools/hpctoolkit/hpctoolkit-build/../hpctoolkit-install/lib/hpctoolkit/ext-libs/libmonitor.so
#18 0x0000000000209e1e in _start ()

Thanks for your test code. It indeed highlights a problem but I am not sure whether the problem is with hpctoolkit or the openmp implementation.

There were two issues with the program you pasted:

  1. There is a line that contains simply '#include'. I updated it to '#include <stdio.h>'
  2. There is a line that says '#pragma parallel'. I am not sure what you intend here.

If I change that line to '#pragma omp parallel', the program runs to completion with hpcrun, though it prints out 'ERROR: wrong answers'. That is exactly the same behavior I get without hpcrun after making the above two changes.

Leaving the '#pragma parallel' as is (a no-op):

If I drop the "nowait" from your directive "#pragma omp target teams distribute parallel for private(i) nowait", the program runs fine with hpcrun.

With the nowait, hpcrun dumps core. hpcrun expects that any task is created within the scope of a parallel region. hpcrun knows when a parallel region has been entered because it receives a callback from the OpenMP OMPT tools API. However, the task created by the nowait lacks an enclosing parallel region. I don't know whether this is a problem with the OpenMP implementation or hpcrun's expectations. I looked at the OpenMP spec and don't know whether anyone has thought this through. I will post an inquiry to the OpenMP language committee mailing list.
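To make the failing expectation concrete, here is a minimal sketch of the kind of bookkeeping described above. It is a simulation in plain C++, not HPCToolkit's code and not the OMPT API: `RegionTable`, `on_parallel_begin`, `enclosing_call_path`, and `task_finds_parent_path` are hypothetical names, and region id 0 is assumed to stand in for a region that no callback ever announced.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a profiler's region bookkeeping.
// These names are illustrative; they are not HPCToolkit or OMPT identifiers.
struct RegionTable {
    std::unordered_map<std::uint64_t, std::string> call_path_by_region;

    // Would be driven by a parallel-begin callback: remember where the
    // region was created.
    void on_parallel_begin(std::uint64_t region_id, const std::string& call_path) {
        call_path_by_region[region_id] = call_path;
    }

    // Would be driven by a task-create callback. Returns nullptr when no
    // parallel-begin was ever seen for the task's enclosing region.
    const std::string* enclosing_call_path(std::uint64_t region_id) const {
        auto it = call_path_by_region.find(region_id);
        return it == call_path_by_region.end() ? nullptr : &it->second;
    }
};

// Region id 1 is announced by a callback (an explicit parallel region).
// Region id 0 plays the role of a region that was never announced.
inline bool task_finds_parent_path(bool inside_explicit_region) {
    RegionTable table;
    if (inside_explicit_region) {
        table.on_parallel_begin(1, "main > foo");
    }
    const std::uint64_t parent = inside_explicit_region ? 1 : 0;
    return table.enclosing_call_path(parent) != nullptr;
}
```

In this sketch, `task_finds_parent_path(true)` finds a recorded path and `task_finds_parent_path(false)` does not. A tool that dereferences the lookup result without checking it would crash in the second case.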

FYI: with our installation of rocm/5.1.0, there is no clang++. The compiler appears to have been renamed 'amdclang++'. I don't know why my rocm 5.1.0 differs from yours.

FYI: SIG37 is a realtime signal that is set up by hpctoolkit for asynchronous sampling using a Linux CPUTIME timer. When debugging programs run using hpctoolkit, you can use the following gdb command to forward those signals directly to hpctoolkit's measurement subsystem:
handle SIG37 nostop noprint pass

Subject: A question about OpenMP tasks for target nowait
Date: June 5, 2022 at 12:48:54 PM CDT
To: omp-lang@openmp.org, omp-lang@mailman.openmp.org

I received a bug report for HPCToolkit’s support for the following directive

#pragma omp target teams distribute parallel for private(i) nowait

and I am not sure whether there is a problem with our tool’s expectations or the OpenMP implementation.

When compiled with amdclang++ for offloading on an AMD GPU, the above statement in red in the context of the program below calls __kmpc_omp_task in the LLVM OpenMP runtime outside the scope of a parallel region.

HPCToolkit expects that every task is created within the scope of a parallel region. In this case, the LLVM OpenMP implementation doesn’t bracket this task creation with OMPT callbacks to begin/end a parallel region.

Is my expectation wrong? Is it fine for the construct in red to create an OpenMP task outside the scope of a parallel region? Or, is there some implicit parallel region in which the task is created?

// Conversion of nowait_offload.f95 to CPP.

#include <stdio.h>
#include <sys/time.h>
#include <omp.h>
using namespace std;

void foo(float a[],float b[],float c[]) {
    const int nsize = 1000;
    int i,j;
    #pragma omp target enter data map(to:a[0:nsize],b[0:nsize],c[0:nsize])
    for (j=0; j<200; j++) {
        #pragma omp target teams distribute parallel for private(i) nowait
        for (i=1;i<nsize;i++) {
            a[i] = b[i] * c[i] + i;
        }
    }
    #pragma omp target exit data map(from:a[0:nsize])
}

double mysecond()
{
    struct timeval tp;
    struct timezone tzp;
    int i;

    i = gettimeofday(&tp,&tzp);
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

int main() {
    const int nsize = 1000;
    float a[nsize], b[nsize], c[nsize];
    int i;
    double start, finish;
    char* numThreads;

    for (i=0; i<nsize; i++) {
        a[i] = 0.0;
        b[i] = i;
        c[i] = 10.0;
    }

    start = mysecond();
    (void) foo(a,b,c);
    finish = mysecond();
    numThreads = getenv ("OMP_NUM_THREADS");
    printf("(# of foos)= %s, time=%f\n", numThreads, finish-start);

    printf("a(1)=%f, a(2)=%f\n",a[1], a[2]);
    if (a[1] != 11 || a[2] != 22) {
        printf("ERROR: wrong answers\n");
        return -1;
    }
}

A consultation with the OpenMP language committee yielded the following:

A target nowait can create a task outside an explicit parallel region. Its parallel region parent is the implicit parallel region surrounding the implicit task.

Since the implicit parallel region doesn't get a parallel begin callback, hpctoolkit hadn't recorded a call path for the parallel region. The lack of a call path for the enclosing (implicit) parallel region caused the core dump. The solution for a target nowait task nested only in the implicit parallel region is for the new task to unwind the call stack in its OMPT task creation callback to determine the creation calling context for itself.
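The fallback described here can be sketched roughly as follows. This is a plain C++ illustration under stated assumptions, not the code from the actual fix: `TaskContextResolver` and its methods are hypothetical names, the unwinder is a caller-supplied function standing in for a real stack walk, and reusing the region's recorded path for tasks inside an explicit region is an assumption made only to show the contrast.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of the fallback; not HPCToolkit's actual code.
class TaskContextResolver {
public:
    // `unwind` stands in for a real stack unwinder and is supplied by the caller.
    explicit TaskContextResolver(std::function<std::string()> unwind)
        : unwind_(std::move(unwind)) {}

    // Would be driven by a parallel-begin callback (explicit regions only).
    void on_parallel_begin(std::uint64_t region_id, std::string call_path) {
        region_paths_[region_id] = std::move(call_path);
    }

    // Would be driven by a task-create callback. If the enclosing region has
    // a recorded call path, use it (an assumption of this sketch). Otherwise
    // the task sits only in the implicit region, so unwind at creation time
    // to obtain the task's own creation context.
    std::string on_task_create(std::uint64_t parent_region_id) const {
        auto it = region_paths_.find(parent_region_id);
        if (it != region_paths_.end()) {
            return it->second;
        }
        return unwind_();
    }

private:
    std::unordered_map<std::uint64_t, std::string> region_paths_;
    std::function<std::string()> unwind_;
};
```

With a resolver built from a stub unwinder, a task whose parent region was announced gets that region's path, while a task whose parent was never announced gets whatever the unwinder returns instead of a missing entry.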

Fixed in hpctoolkit master by #586.