jonhoo / fantoccini

A high-level API for programmatically interacting with web pages through WebDriver.


panicked at 'internal error: entered unreachable code: received unknown error (timeout)

leaty opened this issue · comments

commented

When crawling a website, I get this when it happens upon a certain page:

thread 'tokio-runtime-worker' panicked at 
'internal error: entered unreachable code: received unknown error (timeout) for INTERNAL_SERVER_ERROR status code',
/home/spooder/.cargo/registry/src/github.com-1ecc6299db9ec823/fantoccini-0.15.0/src/session.rs:806:34

It also seems geckodriver dies at this point, as I'll get the following on the next pages.

webdriver connection lost: WebDriver session was closed while waiting
webdriver connection lost: WebDriver session has been closed
webdriver connection lost: WebDriver session has been closed
// etc

Is there a way to circumvent this error? Anything I could do about it?

commented

Also, to be clear: it doesn't matter if I'm unable to scrape that specific page; I just want to keep geckodriver from dying.

Huh, that's interesting. The webdriver spec does say that "timeout" is a valid error code, specifically with the meaning:

An operation did not complete before its timeout expired.

What operation were you trying to do when this error occurred?

commented

Sorry, at this time I don't know exactly which operation causes it, but these are the only ones I use:

client.goto(url).await?;
client.find_all(Locator::Css("a")).await?;

// Then for each <a> tag
link.attr("href").await?;

The error suggests to me that it's the browser window that basically ends up hanging. What do you see in the window?

commented

Interesting thought, I'll try running it non-headless.

commented

Hello! I got some time for this again, very sorry for the late update.

So apparently, when running it non-headless, I saw a download window pop up asking me to save something somewhere. After that, the page just sits there, and fantoccini eventually times out the connection to the webdriver; the webdriver itself, however, stays alive and well until my crawler reaches the finish line.

Thus the "error" is clearly not caused by fantoccini; it just waits until it times out because the webdriver did not respond in time. But if at all possible, I'd be very happy to hear some ideas on how one could circumvent this.

Could you for example:

  1. Disable downloads in their entirety? Limiting certain links is impossible, since any one of them could redirect to a download. This is obviously a matter for the webdriver itself, though, and not fantoccini.
  2. Instruct fantoccini to tell the webdriver to cancel the previous action after x amount of time?
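For reference, something along these lines is what I had in mind for idea 1: a capabilities payload that pre-sets Firefox download preferences via geckodriver's moz:firefoxOptions (in fantoccini 0.15 this could presumably be passed when creating the session with Client::with_capabilities). The pref names below are the commonly suggested ones; as far as I can tell, no combination of them disables downloads outright, they only reduce the cases where a save dialog appears:

    {
      "moz:firefoxOptions": {
        "args": ["-headless"],
        "prefs": {
          "browser.download.folderList": 2,
          "browser.download.dir": "/tmp",
          "browser.download.useDownloadDir": true,
          "browser.helperApps.neverAsk.saveToDisk": "application/octet-stream,application/pdf",
          "pdfjs.disabled": true
        }
      }
    }

The neverAsk.saveToDisk list only covers the MIME types named in it, so an unexpected Content-Type can still trigger the dialog.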

Thanks in advance!

commented

After looking around a bit, I've seen no clear solution for disabling it. In fact, it gets worse: apparently this would happen with any sort of browser prompt, e.g. push notifications, downloads, printing, HTTP auth, and so on. Not all of these can (from what I've found) be disabled, so whenever any of these prompts appears, fantoccini will be waiting for a response and will remain stuck until it decides to time out the connection.

My ideas:

  1. If fantoccini could simply return an error after a timeout instead of killing the connection, perhaps a new .goto() on a different link would cancel these dialogs. Regardless, I'll at least attempt a full reconnect followed by a .goto() with a different link when this problem occurs; I'm hopeful that the dialog vanishes and the crawler continues on its merry way.

  2. Since manually pressing e.g. ESC gets rid of the prompts I've tried so far, doing that programmatically could be an option, but I don't know the extent of the webdriver API or whether it offers any such control.
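On that second idea: the WebDriver spec does have a Perform Actions endpoint (POST /session/{session id}/actions) that can synthesize an Escape key press; in the spec's key-code table, Escape is "\uE00C". A payload would look roughly like this:

    {
      "actions": [
        {
          "type": "key",
          "id": "keyboard",
          "actions": [
            { "type": "keyDown", "value": "\uE00C" },
            { "type": "keyUp", "value": "\uE00C" }
          ]
        }
      ]
    }

Two caveats, though: I don't believe fantoccini 0.15 exposes the actions API, so this would need a raw HTTP call to the driver, and since download prompts are browser chrome rather than page content, a synthesized key event may never reach them.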

commented

Okay, I've narrowed down which timeout causes fantoccini to drop the connection: it's the pageLoad timeout, which defaults to 5 minutes. I've now set it to 5 seconds for testing, and I get the same error after those 5 seconds. Unless fantoccini really is just being thrown out once that timeout hits, it might be a bug. The geckodriver debug log is clean.

If it is a bug, would it be possible to make fantoccini simply return the error without destroying the connection? As far as I can tell, geckodriver and the session within it are still running, though I can't be sure the session is still usable. I'm only assuming (and hoping) that a subsequent .goto() would invalidate the previous request. I've been unable to test this since I can't reconnect to the same session, but I saw #100, which I might try.
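For anyone reproducing this: the spec-standard way to set that timeout is the timeouts object, either as a capability at session creation or later via POST /session/{session id}/timeouts. This is the payload I used to shrink pageLoad from its 300000 ms default to 5 seconds (the implicit and script values shown are just the spec defaults; I'm not sure whether fantoccini 0.15 exposes a method for this, so I set it at session creation):

    {
      "timeouts": {
        "implicit": 0,
        "pageLoad": 5000,
        "script": 30000
      }
    }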