Active mode infinite loop caused by unescaped "gathered" URLs
GoogleCodeExporter opened this issue
Google Code Exporter commented
This is a similar problem to the previously noted infinite loop issue in
active mode, but with a different fix.
In this case the two causes appear to be:
1) the requested URL includes an HTML-escapable character (e.g. '<', '>', '&')
2) the returned HTML includes an escaped version of the requested URL
(e.g. if the requested URL was www.test.com/testpage.html?a=1&b=2, then the
page would include a link to
www.test.com/testpage.html?a=1&amp;b=2, or some variation that includes &amp;
in the link)
The loop is created because jsunpack-n doesn't recognize the second URL, with
&amp; instead of &, as the same as the first URL already fetched, so it makes
another request for the "new" link, which then returns yet another link
including &amp;amp; (because the first & is again escaped by the server). This
goes on until you run out of memory or patience.
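To illustrate the loop and one way to break it, here is a minimal sketch in Python 3. It is not the code from the attached diff: `normalize_url`, `should_fetch`, and the `seen` set are hypothetical names, and the sketch assumes it is enough to compare URLs after undoing HTML escaping.

```python
from html import unescape

def normalize_url(url):
    """Undo HTML escaping repeatedly until the URL stops changing.

    Hypothetical helper: it collapses '&amp;', '&amp;amp;', and so on
    down to a plain '&' so all variants compare equal.
    """
    previous = None
    while url != previous:
        previous = url
        url = unescape(url)
    return url

# Normalized forms of every URL already requested (assumed bookkeeping).
seen = set()

def should_fetch(url):
    """Return True only the first time a URL (in any escaped form) is seen."""
    key = normalize_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this check, www.test.com/testpage.html?a=1&b=2 is fetched once, and the &amp; variants that come back in the page are treated as already seen. One caveat: `html.unescape` also decodes a few legacy entities written without a semicolon (a query parameter named `copy`, for instance), so the sketch uses the normalized form only as a comparison key.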
You can recreate this (at this moment) with the following command and url:
./jsunpackn.py -au 'http://search.twitter.com/search?q=Seth+s+Blog+Where+Do+Ideas+Come+From'
I've attached a diff that also includes the previous fix for issue #3
(https://code.google.com/p/jsunpack-n/issues/detail?id=3)
The fix is a little kludgy ATM; it would be better if all URLs were created and
cleaned in one method, rather than ad hoc for each type.
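As a rough sketch of what a single "create and clean" method could look like (Python 3, standard library only): the name `clean_url` and the choice to drop fragments are assumptions for illustration, not how jsunpack-n currently builds its URLs.

```python
from html import unescape
from urllib.parse import urljoin, urldefrag

def clean_url(base_url, raw_link):
    """Build one canonical absolute URL from a link found in a page.

    Hypothetical helper: every gathered link, whatever its type,
    would pass through here before being queued.
    """
    # Undo HTML escaping such as '&amp;' that the page applied to the link.
    link = unescape(raw_link.strip())
    # Resolve relative links against the URL of the page they came from.
    absolute = urljoin(base_url, link)
    # Assumption: the fragment does not change what the server returns.
    absolute, _fragment = urldefrag(absolute)
    return absolute
```

Routing every link type through one function like this means an escaping fix only has to be made in one place.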
Also, I noticed that there are time-based timeouts, but I may go ahead and add a
"max-depth" parameter as well for active mode, as this would address a
different set of issues than the timeouts.
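A depth limit of that kind could be sketched roughly as follows (Python 3). Everything here is hypothetical: `crawl`, `fetch_links`, and `max_depth` are illustrative names rather than existing jsunpack-n options, and `fetch_links` stands in for whatever fetches a page and returns the links gathered from it.

```python
from collections import deque

def crawl(start_url, fetch_links, max_depth=3):
    """Breadth-first crawl that stops following links past max_depth.

    fetch_links(url) is an assumed callable returning the links found
    at url. Returns the URLs visited, in visiting order.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        # Pages at max_depth are recorded, but their links are not followed.
        if depth >= max_depth:
            continue
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Even if every page keeps producing a "new" link, as in the escaping loop, the crawl ends after max_depth hops instead of waiting for a timeout.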
Original issue reported on code.google.com by ryanwsm...@gmail.com
on 29 Nov 2010 at 8:09
Google Code Exporter commented
Weak sauce, it appears you can't edit after you've submitted:
The other issue should be issue #4, rather than issue #3, but I'm sure everyone
could've caught that as well
Original comment by ryanwsm...@gmail.com
on 29 Nov 2010 at 8:12
Google Code Exporter commented
Hi Ryanwsmith,
After I used urlEscapingFix.diff to patch jsunpack-n,
I found that active mode still goes into an infinite loop.
For example:
jsunpackn.py -au http://www.barefeetshoes.com/
Original comment by Hsiao.ch...@gmail.com
on 13 Jan 2011 at 6:38
Google Code Exporter commented
Hey Chris,
Indeed you are correct, there are still outstanding issues, but with different
root causes (see the description before the patch). I was knocking each case
out one by one to build a viable crawling function, but the lack of initial
response made me think my time was better spent elsewhere. If more people are
interested, I'd commit to releasing a patch for a more completely tested
crawling module.
Original comment by ryanwsm...@gmail.com
on 13 Jan 2011 at 4:20