Setting process as leader on Win7 results in access error

Question

lmtierney opened this issue 7 years ago

With a recent change in selenium (e7d4f5e), a user was getting access denied errors on Win7 but not Win8 or Win10 (see SeleniumHQ/selenium#3512). Chromedriver would start, but the resulting Chrome instance would start and die.

I traced it down to the assign_process_to_job_object method call here. In the documentation for the hProcess [in] parameter, I found a note about Windows 7 that might be relevant to the issue:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms681949(v=vs.85).aspx
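
For context, my reading of that note is that on Windows 7 a process can belong to only one job, so assigning a process that is already in a job fails with an access denied error, while Windows 8 and later allow nested jobs. A minimal sketch of that version split, assuming the usual NT version numbers (Windows 7 is 6.1, Windows 8 is 6.2). `nested_jobs_supported?` is a hypothetical helper for illustration, not part of childprocess:

```ruby
# Hypothetical helper, not part of childprocess.
# Assumption: nested job objects are available from Windows 8 (NT 6.2) on;
# Windows 7 reports NT 6.1.
def nested_jobs_supported?(major, minor)
  # Array#<=> compares element by element, so [6, 1] sorts before [6, 2].
  ([major, minor] <=> [6, 2]) >= 0
end

# Example calls:
#   nested_jobs_supported?(6, 1)   # Windows 7
#   nested_jobs_supported?(10, 0)  # Windows 10
```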

@titusfortner or @p0deje anything else to add?

Eric Kessler · Answer 1 · Mon Feb 20 2017 09:22:10 GMT+0800 (China Standard Time)

Thanks for bringing this to my attention! I'll take a look at it this coming week.

Eric Kessler · Answer 2 · Mon Feb 27 2017 00:16:02 GMT+0800 (China Standard Time)

@lmtierney That flag does look like it would fix the issue. Unfortunately, I do not have any good way of testing on Windows 7 in order to know if the fix works. I have recently added Appveyor to the project in order to get Windows builds but, although I can effectively control 32/64 bit status and Ruby versions, I have yet to figure out if it is possible to use it to test against a particular version of Windows.

Any help in testing this fix would be appreciated.

Lucas Tierney · Answer 3 · Mon Feb 27 2017 09:44:02 GMT+0800 (China Standard Time)

I got a Win7 image from Microsoft here if you want one. Otherwise, I believe I still have everything in the correct state if you want me to check it on a branch

Eric Kessler · Answer 4 · Wed Mar 01 2017 22:19:11 GMT+0800 (China Standard Time)

@lmtierney I was able to lay my hands on a Windows 7 machine, but I'm not seeing the error. Maybe I'm not reproducing it right or maybe there's more to the problem. Are you able to reliably reproduce the issue? And can it be reproduced using only child_process, or do we have to go through selenium to trigger it?

Lucas Tierney · Answer 5 · Wed Mar 01 2017 22:45:31 GMT+0800 (China Standard Time)

I can reproduce it with selenium, but I will have to think about how to reproduce it with just child_process. I believe the issue occurs when the child process tries to spawn a new process itself.

With selenium-webdriver 3.1.0:

require 'selenium-webdriver'
$DEBUG = true
driver = Selenium::WebDriver.for(:chrome)
driver.quit

Eric Kessler · Answer 6 · Wed Mar 01 2017 22:51:31 GMT+0800 (China Standard Time)

Right. And there are tests in the test suite that spawn new children (including detached children). Detaching a child is what runs through the 'leader' logic and so I was expecting the suite to fail on a Win7 machine.

Eric Kessler · Answer 7 · Thu Mar 02 2017 00:11:11 GMT+0800 (China Standard Time)

Yeah, I'm able to reproduce it with selenium now. I'll try the fix tonight.

Eric Kessler · Answer 8 · Thu Mar 02 2017 21:57:35 GMT+0800 (China Standard Time)

Good news: We are on the right track and the inclusion of the CREATE_BREAKAWAY_FROM_JOB flag during process creation does fix the issue and the test suite runs green again. I was expecting to have to use JOB_OBJECT_LIMIT_BREAKAWAY_OK as well but just the one flag seems to get things passing.

Bad news: This is not the first time that this project has been down this path. This problem has come up multiple times before and varying flavors of flags have been used to fix, unfix, and maybe fix things in the past. The most recent attempt was mere months before I took over as maintainer of the project.

For your reference: #76 #96 #99

After looking over the historical issues and the MSDN documentation, I am inclined to say that the correct approach is to use both CREATE_BREAKAWAY_FROM_JOB and JOB_OBJECT_LIMIT_BREAKAWAY_OK when creating processes and jobs, respectively (always for jobs so that leader children can be created and sometimes for the process if they are a leader). My concern at this moment is whether or not some other behavior will stop working with the reintroduction of these flags, as was hinted at in #76.
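
As a rough sketch of that approach (not the actual childprocess code): the job always allows breakaway, and a process only asks to break away when it is a leader. The constant values are the ones from the Windows SDK headers. The helper names `creation_flags` and `job_limit_flags` are made up for illustration, and pairing the breakaway limit with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE is an assumption about how the job kills its process tree.

```ruby
# Constant values as defined in the Windows SDK headers.
CREATE_UNICODE_ENVIRONMENT         = 0x00000400
CREATE_BREAKAWAY_FROM_JOB          = 0x01000000
JOB_OBJECT_LIMIT_BREAKAWAY_OK      = 0x00000800
JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE = 0x00002000

# Hypothetical helper: the creation flags passed to CreateProcess.
# Only a leader asks to break away from the job it would otherwise inherit.
def creation_flags(leader:)
  flags = CREATE_UNICODE_ENVIRONMENT
  flags |= CREATE_BREAKAWAY_FROM_JOB if leader
  flags
end

# Hypothetical helper: the limit flags set on every job object.
# BREAKAWAY_OK is always set so that leader children can be created later.
# KILL_ON_JOB_CLOSE is an assumption about how the tree gets killed.
def job_limit_flags
  JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE | JOB_OBJECT_LIMIT_BREAKAWAY_OK
end

# Print the resulting masks in hex for a leader and a non-leader.
puts format('leader:     0x%08X', creation_flags(leader: true))
puts format('non-leader: 0x%08X', creation_flags(leader: false))
puts format('job limits: 0x%08X', job_limit_flags)
```

Computing the masks in one place keeps the leader condition in a single spot, which might make it harder for the flags to drift apart again.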

@lmtierney You've been a great help so far and I appreciate the continued presence of a second set of eyes on this. However, because this particular problem has tended to not stay solved, I'd also like to pull @jarib back into this one, if only for a sanity check and any additional historical perspective.

Jari Bakken · Answer 9 · Thu Mar 02 2017 22:13:04 GMT+0800 (China Standard Time)

I can't remember all the details here, but reading through the issues it seems like it stranded on the issue of fixing #96 without re-introducing the problem in #76 – where a PR was promised but never submitted.

I would tread carefully here, since not killing child processes correctly can really screw things up for users – especially if they're starting a lot of browsers with Selenium. Hopefully it's possible to write some test scripts that explore the behaviour of the various flags to make sure the older issues aren't re-introduced (especially #76).

Sorry to not be of more use!

EDIT: You could also consider pushing a pre-release gem with the change and ask Windows users to test it.

Eric Kessler · Answer 10 · Mon Mar 06 2017 13:54:52 GMT+0800 (China Standard Time)

The fix is in on its own branch (win7_leader_fixes).

@lmtierney , please try using that branch and see if you encounter any problems with selenium. The sample code that you provided seems to be working fine with the fix.

@jarib , regarding older issues:

#76 The test code that the two of you and @asiazhang provided seems to be working fine and as expected. A parent process creates a child process that, in turn, creates a grandchild process (which is Notepad). In your code, you would not see Notepad because the child process doesn't wait on the grandchild to finish and immediately returns. Because the child process is a leader, it takes the grandchild process with it before it can even bring up a window. In @asiazhang's code, they added a #wait, which caused the child process to stick around until the user closed Notepad themselves, which is why they actually saw Notepad launch. @asiazhang's additional comment that creating a breakaway process would prevent the entire tree from being killed is something that I can't explain, because killing the entire tree can only happen by using a new job group (which requires that flag). TLDR: there is no problem to begin with.
#96 Conditionalizing the breakaway flag based on @leader as @pythianemord suggested is the route that I ended up taking. I may look into the process handle race condition but I wouldn't expect a PR to ever happen because it looks like @pythianemord is inactive.
#99 I'm still trying to figure this one out. We've got a test that ensures that using #detach lets the child process live on, and that test passes on Windows. The code provided by @hferentschik, however, does seem to encounter some kind of detachment problem. It doesn't look particularly related to this issue, in any case.

Oddities:

Without the fix, the test suite will fail on Win7 if I run it through another program (e.g. RubyMine), but not when I run it from the command line. This is good in that it means we have process tree killing tests that confirm the behavior is working. However, the fact that the failure only shows up when running through another program is a little concerning, because it means that the CI builds won't fail if this becomes a problem (not that they would anyway because they don't build on Win7, but you know what I mean).
The JOB_OBJECT_LIMIT_BREAKAWAY_OK flag doesn't seem to be needed. I've included it because, according to the documentation and other users' fix attempts, it should be used. However, it doesn't seem to make a difference one way or the other (even when doing several levels of child jobs, all of which are themselves leaders and thus constantly breaking away to new job groups).
Speaking of nesting leader processes: if you then kill one of them, all of their child processes die as well. While I get that the point of being a leader is to make sure that all of your children die with you, shouldn't subsequent children have avoided that when they themselves were declared a leader and thus got their own group to live in? Or would they have to be declared #detached in order to avoid death (and declared as both detached and leader if they wanted to both live on yet try to kill their own children later)? I'll admit that I'm not entirely clear on the line between those two properties.

Eric Kessler · Answer 11 · Wed Mar 08 2017 23:34:02 GMT+0800 (China Standard Time)

@lmtierney, have you gotten a chance to try out the new branch?

Lucas Tierney · Answer 12 · Wed Mar 08 2017 23:37:53 GMT+0800 (China Standard Time)

@enkessler Sorry, no. Things are a bit busy, but I'll try to get it on my VM today.

Eric Kessler · Answer 13 · Thu Mar 09 2017 00:02:24 GMT+0800 (China Standard Time)

No worries about being busy. The community has waited for a fix to this problem for years, so I suppose that a few more days couldn't hurt. ;)

Lucas Tierney · Answer 14 · Thu Mar 09 2017 00:03:14 GMT+0800 (China Standard Time)

@enkessler I tried it and it looks like it's working fine. I was able to spawn multiple chrome instances off of chromedriver as well

Eric Kessler · Answer 15 · Fri Mar 10 2017 00:20:31 GMT+0800 (China Standard Time)

@lmtierney This evening, I'm going to try and publish a pre-release version of the gem. Would you and the Selenium gang be interested in playing with the pre-release version for a bit in order to see if the fix works in a wider scope?

Eric Kessler · Answer 16 · Fri Mar 10 2017 14:37:09 GMT+0800 (China Standard Time)

@lmtierney Beta release done. Go nuts.

Lucas Tierney · Answer 17 · Fri Mar 10 2017 19:54:34 GMT+0800 (China Standard Time)

@enkessler Thanks! I'll see if the guy who keeps finding Windows issues for us can try this one out.

Eric Kessler · Answer 18 · Tue Mar 21 2017 00:34:13 GMT+0800 (China Standard Time)

@lmtierney Is there any news on the Win7 testing with the beta version of the gem? Good? Bad? Other?

Lucas Tierney · Answer 19 · Tue Mar 21 2017 00:44:21 GMT+0800 (China Standard Time)

@enkessler The selenium test suite runs without issue with the beta version. I was waiting for the guy who experienced the issue originally to try it out but he hasn't gotten around to it.

Eric Kessler · Answer 20 · Tue Mar 21 2017 00:52:21 GMT+0800 (China Standard Time)

Is that Selenium suite with or without the workaround?

Lucas Tierney · Answer 21 · Tue Mar 21 2017 02:05:26 GMT+0800 (China Standard Time)

I removed that conditional and ran it without

Eric Kessler · Answer 22 · Tue Mar 21 2017 02:23:50 GMT+0800 (China Standard Time)

Splendid! Now we just have to wait on that guy. The joys of a volunteer workforce...

Lucas Tierney · Answer 23 · Thu Mar 23 2017 00:55:22 GMT+0800 (China Standard Time)

@enkessler he has tried it and says everything looks good on his end 👍

Eric Kessler · Answer 24 · Thu Mar 23 2017 01:21:24 GMT+0800 (China Standard Time)

@lmtierney Okay. I'll re-release it tonight as an official release.

Thanks for the help!

Eric Kessler · Answer 25 · Sat Mar 25 2017 20:39:11 GMT+0800 (China Standard Time)

@lmtierney It's out the door. Enjoy!