ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

fabtests/shm_sighandler_test does not have consistent behavior

wckzhang opened this issue

I noticed that executing the shm_sighandler_test twice in a row causes inconsistent behavior.

[ec2-user@compute-st-c5n18xlarge-1 bin]$ ./shm_sighandler_test 
Pass: child caught SIGINT and exited as expected
[ec2-user@compute-st-c5n18xlarge-1 bin]$ ./shm_sighandler_test 
fi_getinfo(): common/shared.c:903, ret=-61 (No data available)
Failed to initialize shm provider
Fail: child killed by SIGKILL or exited with error

I noticed that ft_free_res() is not called, so even though the child catches SIGINT, something is preventing shm from initializing again.

I ran it just now about 20 times in a row and don't see an issue. Can you provide a debug log?

I ran this test on an Ubuntu machine and it didn't show the same behavior (i.e., it worked every time). It hit this issue when running on Amazon Linux (I was able to reproduce it with the exact same commands I ran on my Ubuntu machine). Looking at the debug log right now.

This is the debug log:

libfabric:9683:1653000816::core:core:fi_getinfo_():1119<debug> hints prov_name: shm
libfabric:9683:1653000816::core:core:ofi_layering_ok():1074<info> Skipping shm;ofi_rxm layering
libfabric:9683:1653000816::core:core:ofi_layering_ok():1074<info> Skipping shm;ofi_rxd layering
libfabric:9683:1653000816::shm:core:util_getinfo():149<debug> checking info
libfabric:9683:1653000816::shm:core:smr_getinfo():172<info> mr_mode does not match FI_HMEM capability.
libfabric:9683:1653000816::core:core:fi_getinfo_():1153<warn> fi_getinfo: provider shm returned -61 (No data available)
libfabric:9683:1653000816::core:core:ofi_layering_ok():1074<info> Skipping shm;ofi_mrail layering
fi_getinfo(): common/shared.c:903, ret=-61 (No data available)
Failed to initialize shm provider

How does the execution of one process impact the fi_getinfo() call of a separate process run after the first one exits? That makes no sense.

Code from smr_getinfo():

131:        mr_mode = hints && hints->domain_attr ? hints->domain_attr->mr_mode :
132:                                                FI_MR_VIRT_ADDR | FI_MR_HMEM;
            ......
147:        for (cur = *info; cur; cur = cur->next) {
                    ......
167:                if (cur->caps & FI_HMEM) {
168:                        if (!(mr_mode & FI_MR_HMEM)) {
169:                                fi_freeinfo(cur);
170:                                FI_INFO(&smr_prov, FI_LOG_CORE,
171:                                        "mr_mode does not match FI_HMEM capability.\n");
172:                                return -FI_ENODATA;
173:                        } 
174:                        cur->domain_attr->mr_mode |= FI_MR_HMEM;
175:                } else {
176:                        cur->domain_attr->mr_mode &= ~FI_MR_HMEM;
177:                }
178:         }

hints is allocated with fi_allocinfo(), so hints->domain_attr is non-NULL and hints->domain_attr->mr_mode is initialized to 0. That should always lead to the failure logged at line 171, but apparently the successful runs didn't hit it. Am I missing something here?
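
For reference, a minimal sketch of how the test presumably queries the provider (inferred from this thread, not quoted from the test source; the API version constant here is arbitrary):

#include <stdlib.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

static int query_shm(struct fi_info **info)
{
	/* fi_allocinfo() allocates all sub-structures, so domain_attr is
	 * non-NULL and domain_attr->mr_mode starts at 0. */
	struct fi_info *hints = fi_allocinfo();
	int ret;

	if (!hints)
		return -FI_ENOMEM;

	hints->fabric_attr->prov_name = strdup("shm");
	/* caps and domain_attr->mr_mode are left at 0, matching the test. */

	/* With mr_mode == 0, the smr_getinfo() loop quoted above takes the
	 * !(mr_mode & FI_MR_HMEM) branch for the hmem fi_info and returns
	 * -FI_ENODATA, failing the whole call. */
	ret = fi_getinfo(FI_VERSION(1, 14), NULL, NULL, 0, hints, info);
	fi_freeinfo(hints);
	return ret;
}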

You could try adding hints->domain_attr->mr_mode = FI_MR_VIRT_ADDR | FI_MR_HMEM; to the test to see if it helps.

The test doesn't set any caps, which is unusual, since it does specify the provider name.

shm has 2 fi_info structs, one for normal use, and another for hmem. The hmem failure in the log looks like a correct failure. But the other fi_info (smr_info) should have been returned.

At line 172, the failed info struct should be removed from the result list instead of returning -FI_ENODATA immediately.
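
For illustration, a sketch of that change (it would replace the loop quoted above; variable names follow the snippet, and this is not the merged patch):

	struct fi_info *prev = NULL, *cur, *next;

	for (cur = *info; cur; cur = next) {
		next = cur->next;
		if ((cur->caps & FI_HMEM) && !(mr_mode & FI_MR_HMEM)) {
			/* Unlink only this entry and keep the rest. */
			if (prev)
				prev->next = next;
			else
				*info = next;
			cur->next = NULL; /* fi_freeinfo() frees the whole chain */
			fi_freeinfo(cur);
			continue;
		}
		if (cur->caps & FI_HMEM)
			cur->domain_attr->mr_mode |= FI_MR_HMEM;
		else
			cur->domain_attr->mr_mode &= ~FI_MR_HMEM;
		prev = cur;
	}
	if (!*info) /* fail only if nothing matched at all */
		return -FI_ENODATA;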

util_getinfo() should have checked the shm_info list against the hints and only selected matching entries. One problem is that shm didn't set up a domain attribute with the mr_mode bit set correctly. That would remove the need for this post-processing of the info list for hmem support.

I think the reason it was done after the util_getinfo() call was to avoid duplicating the whole domain_attr for the hmem info. We could duplicate it and have util_getinfo() deal with it, or keep this post-processing.

I think the issue is that it returns ENODATA right away. In theory that shouldn't matter, since the hmem info is last and this case handles an app that requested the hmem info (i.e., got it back from util_getinfo()) but didn't set the MR hmem mode. But it does not handle the caps = 0 case, which ends up just returning the provider's info caps and leads to this failure. We'll probably still need this post-processing even after splitting up the domain_attrs, because of the caps = 0 case, so I'm not sure the split is worth it. I'll fix it up and see if it fixes this issue.

We duplicated the other structures just to OR in a flag, so I'd duplicate the domain_attr as well and push the checks into util_getinfo(). util_getinfo() should set the caps for the output correctly. With hints->caps of 0, the only caps returned will be the secondary caps, which are pretty useless by themselves, but hey, I guess that's supported by the API.
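
A rough sketch of that direction (smr_hmem_info is an assumed name for the provider's hmem fi_info template, and the shallow copy is only illustrative; a real patch would deep-copy pointer fields such as domain_attr->name):

	/* Give the hmem fi_info its own domain_attr with FI_MR_HMEM already
	 * set, so util_getinfo() can match hints->domain_attr->mr_mode
	 * directly instead of relying on post-processing. */
	struct fi_domain_attr *attr = calloc(1, sizeof(*attr));

	if (!attr)
		return -FI_ENOMEM;

	*attr = *smr_info.domain_attr;    /* shallow copy of the base attrs */
	attr->mr_mode |= FI_MR_HMEM;      /* the hmem info requires FI_MR_HMEM */
	smr_hmem_info.domain_attr = attr; /* assumed hmem template name */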

But I don't see how any of this is the issue. The behavior is different between runs, but only on a specific OS? That still doesn't make sense.

I'm going to be OOO the rest of the day but I'll take a closer look this weekend at what the util code is doing and create a better fix. For now, I'm curious if #7776 fixes this issue. That will provide more information as to what's going on here.

But I agree, the differences between runs and between OSes don't make sense for this issue. Will investigate.

Can you also check #7777 and let us know if that makes a difference? And if it does, why? :)

Is there any chance the (more general libfabric) hmem support fails the second time the test is run on this system? That is, is something going wrong with the GPU runtime? If you wait several seconds between runs, will the test pass again?

Oh, now that I think about it, my Ubuntu machine has a CUDA device, so that might be it.

Both #7776 and #7777 fix the issue as far as I can tell.

Attached full debug logs.
shmsighandleroutput.txt

What I notice is that the first run will typically succeed and any subsequent runs will fail (i.e., with two compute nodes sharing a libfabric/fabtests install, if I ssh into the first node and run the test, it passes, and then any further runs fail. If I then ssh into the second node, I see the same behavior).

Neither #7776 nor #7777 should fix this issue. Looking at shm, there is a problem in the code, but the results (right or wrong) should be consistent from run to run. After the first run, are any processes left running?

I think there's some other path here that's failing.

The output logs from the first run look very different from the second. Why is EFA showing up in the second run, but not the first?

I could reproduce the error consistently on machines with Amazon Linux 2 and Ubuntu 18.04:

(env) [ec2-user@ip-172-31-27-163 fabtests]$ ./install/bin/shm_sighandler_test
fi_getinfo(): common/shared.c:903, ret=-61 (No data available)
Failed to initialize shm provider
Fail: child killed by SIGKILL or exited with error
ubuntu@ip-172-31-59-117:~/libfabric/fabtests$ ./install/bin/shm_sighandler_test
fi_getinfo(): common/shared.c:903, ret=-61 (No data available)
Failed to initialize shm provider
Fail: child killed by SIGKILL or exited with error

@wckzhang what is the Ubuntu AMI you were using?

ami-0f8c1b9de5e8d8095 - Deep Learning AMI (Ubuntu 18.04) Version 53.0. This was on a p3dn with EFA and a CUDA device.

Closing the issue since the fix has been merged.