ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

fabtests/shm_sighandler_test does not have consistent behavior

wckzhang opened this issue

I noticed that executing the shm_sighandler_test twice in a row causes inconsistent behavior.

[ec2-user@compute-st-c5n18xlarge-1 bin]$ ./shm_sighandler_test 
Pass: child caught SIGINT and exited as expected
[ec2-user@compute-st-c5n18xlarge-1 bin]$ ./shm_sighandler_test 
fi_getinfo(): common/shared.c:903, ret=-61 (No data available)
Failed to initialize shm provider
Fail: child killed by SIGKILL or exited with error

I noticed that ft_free_res() is not called, so even though the child catches SIGINT, something is preventing shm from initializing again.

I ran it just now about 20 times in a row and don't see an issue. Can you provide a debug log?

I ran this test on an Ubuntu machine and it didn't show the same behavior (i.e., it worked every time). It hit this issue when running on Amazon Linux (I was able to reproduce it with the exact same commands I ran on my Ubuntu machine). Looking at the debug log right now.

This is the debug log:

libfabric:9683:1653000816::core:core:fi_getinfo_():1119<debug> hints prov_name: shm
libfabric:9683:1653000816::core:core:ofi_layering_ok():1074<info> Skipping shm;ofi_rxm layering
libfabric:9683:1653000816::core:core:ofi_layering_ok():1074<info> Skipping shm;ofi_rxd layering
libfabric:9683:1653000816::shm:core:util_getinfo():149<debug> checking info
libfabric:9683:1653000816::shm:core:smr_getinfo():172<info> mr_mode does not match FI_HMEM capability.
libfabric:9683:1653000816::core:core:fi_getinfo_():1153<warn> fi_getinfo: provider shm returned -61 (No data available)
libfabric:9683:1653000816::core:core:ofi_layering_ok():1074<info> Skipping shm;ofi_mrail layering
fi_getinfo(): common/shared.c:903, ret=-61 (No data available)
Failed to initialize shm provider

How does the execution of one process impact the fi_getinfo() call of a separate process run after the first one exits? That makes no sense.

Code from smr_getinfo():

131:        mr_mode = hints && hints->domain_attr ? hints->domain_attr->mr_mode :
132:                                                FI_MR_VIRT_ADDR | FI_MR_HMEM;
            ......
147:        for (cur = *info; cur; cur = cur->next) {
                    ......
167:                if (cur->caps & FI_HMEM) {
168:                        if (!(mr_mode & FI_MR_HMEM)) {
169:                                fi_freeinfo(cur);
170:                                FI_INFO(&smr_prov, FI_LOG_CORE,
171:                                        "mr_mode does not match FI_HMEM capability.\n");
172:                                return -FI_ENODATA;
173:                        } 
174:                        cur->domain_attr->mr_mode |= FI_MR_HMEM;
175:                } else {
176:                        cur->domain_attr->mr_mode &= ~FI_MR_HMEM;
177:                }
178:         }

hints is allocated with fi_allocinfo(), so hints->domain_attr is non-NULL and hints->domain_attr->mr_mode is initialized to 0. That should always lead to the failure logged at line 171, but apparently the successful runs didn't hit it. Am I missing something here?
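
For reference, a minimal sketch of how the test presumably queries the provider (inferred from this thread, not quoted from the test source; the API version constant here is arbitrary):

#include <stdlib.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

static int query_shm(struct fi_info **info)
{
	/* fi_allocinfo() allocates all sub-structures, so domain_attr is
	 * non-NULL and domain_attr->mr_mode starts at 0. */
	struct fi_info *hints = fi_allocinfo();
	int ret;

	if (!hints)
		return -FI_ENOMEM;

	hints->fabric_attr->prov_name = strdup("shm");
	/* caps and domain_attr->mr_mode are left at 0, matching the test. */

	/* With mr_mode == 0, the smr_getinfo() loop quoted above takes the
	 * !(mr_mode & FI_MR_HMEM) branch for the hmem fi_info and returns
	 * -FI_ENODATA, failing the whole call. */
	ret = fi_getinfo(FI_VERSION(1, 14), NULL, NULL, 0, hints, info);
	fi_freeinfo(hints);
	return ret;
}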

You could try adding hints->domain_attr->mr_mode = FI_MR_VIRT_ADDR | FI_MR_HMEM; to the test to see if it helps.

The test doesn't set any caps, which is unusual, since it does specify the provider name.

shm has 2 fi_info structs, one for normal use, and another for hmem. The hmem failure in the log looks like a correct failure. But the other fi_info (smr_info) should have been returned.

At line 172, the failed info struct should be removed from the result list instead of returning -FI_ENODATA immediately.
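
For illustration, a sketch of that change (it would replace the loop quoted above; variable names follow the snippet, and this is not the merged patch):

	struct fi_info *prev = NULL, *cur, *next;

	for (cur = *info; cur; cur = next) {
		next = cur->next;
		if ((cur->caps & FI_HMEM) && !(mr_mode & FI_MR_HMEM)) {
			/* Unlink only this entry and keep the rest. */
			if (prev)
				prev->next = next;
			else
				*info = next;
			cur->next = NULL; /* fi_freeinfo() frees the whole chain */
			fi_freeinfo(cur);
			continue;
		}
		if (cur->caps & FI_HMEM)
			cur->domain_attr->mr_mode |= FI_MR_HMEM;
		else
			cur->domain_attr->mr_mode &= ~FI_MR_HMEM;
		prev = cur;
	}
	if (!*info) /* fail only if nothing matched at all */
		return -FI_ENODATA;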

util_getinfo() should have checked the shm_info list against the hints and only selected matching entries. One problem is that shm didn't set up a domain attribute with the mr_mode bit set correctly. That would remove the need for this post-processing of the info list for hmem support.

I think the reason it was done after the util_getinfo() call was to avoid duplicating the whole domain_attr for the hmem info. We could duplicate it and have util_getinfo() deal with it, or keep this post-processing.

I think the issue is that it returns ENODATA right away. In theory that shouldn't matter, since the hmem info is last and this case handles an app that requested the hmem info (i.e., got it back from util_getinfo()) but didn't set the MR hmem mode. But it does not handle the caps = 0 case, which ends up just returning the provider's info caps and leads to this failure. We'll probably still need this post-processing even after splitting up the domain_attrs, because of the caps = 0 case, so I'm not sure the split is worth it. I'll fix it up and see if it fixes this issue.

We duplicated the other structures just to OR in a flag, so I'd duplicate the domain_attr as well and push the checks into util_getinfo(). util_getinfo() should set the caps for the output correctly. With hints->caps of 0, the only caps returned will be the secondary caps, which are pretty useless by themselves, but hey, I guess that's supported by the API.
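
A rough sketch of that direction (smr_hmem_info is an assumed name for the provider's hmem fi_info template, and the shallow copy is only illustrative; a real patch would deep-copy pointer fields such as domain_attr->name):

	/* Give the hmem fi_info its own domain_attr with FI_MR_HMEM already
	 * set, so util_getinfo() can match hints->domain_attr->mr_mode
	 * directly instead of relying on post-processing. */
	struct fi_domain_attr *attr = calloc(1, sizeof(*attr));

	if (!attr)
		return -FI_ENOMEM;

	*attr = *smr_info.domain_attr;    /* shallow copy of the base attrs */
	attr->mr_mode |= FI_MR_HMEM;      /* the hmem info requires FI_MR_HMEM */
	smr_hmem_info.domain_attr = attr; /* assumed hmem template name */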

But I don't see how any of this is the issue. The behavior is different between runs, but only on a specific OS? That still doesn't make sense.

I'm going to be OOO the rest of the day but I'll take a closer look this weekend at what the util code is doing and create a better fix. For now, I'm curious if #7776 fixes this issue. That will provide more information as to what's going on here.

But I agree, the differences between runs and between OSes don't make sense for this issue. Will investigate.

Can you also check #7777 and let us know if that makes a difference? And if it does, why? :)

Is there any chance the (more general libfabric) hmem support fails the second time the test is run on this system? That is, is something going wrong with the GPU runtime? If you wait several seconds between runs, will the test pass again?

Oh, now that I think about it, my Ubuntu machine has a CUDA device, so that might be it.

Both #7776 and #7777 fix the issue as far as I can tell.

Attached full debug logs.
shmsighandleroutput.txt

What I notice is that the first run will typically succeed and any subsequent runs will fail (i.e., with two compute nodes sharing a libfabric/fabtests install, if I ssh into the first node and run the test, it passes, and then any further runs fail. If I then ssh into the second node, I see the same behavior).

Neither #7776 nor #7777 should fix this issue. Looking at shm, there is a problem in the code, but the results (right or wrong) should be consistent from run to run. After the first run, are any processes left running?

I think there's some other path here that's failing.

The output logs from the first run look very different from the second. Why is EFA showing up in the second run, but not the first?

I could reproduce the error consistently on machines with Amazon Linux 2 and Ubuntu 18.04:

(env) [ec2-user@ip-172-31-27-163 fabtests]$ ./install/bin/shm_sighandler_test
fi_getinfo(): common/shared.c:903, ret=-61 (No data available)
Failed to initialize shm provider
Fail: child killed by SIGKILL or exited with error
ubuntu@ip-172-31-59-117:~/libfabric/fabtests$ ./install/bin/shm_sighandler_test
fi_getinfo(): common/shared.c:903, ret=-61 (No data available)
Failed to initialize shm provider
Fail: child killed by SIGKILL or exited with error

@wckzhang what is the Ubuntu AMI you were using?

ami-0f8c1b9de5e8d8095 - Deep Learning AMI (Ubuntu 18.04) Version 53.0. This was on a p3dn with EFA and a CUDA device.

Closing the issue since the fix has been merged.