nxbdi / jsunpack-n

Automatically exported from code.google.com/p/jsunpack-n

Active mode infinite loop caused by unescaped "gathered" URLs

GoogleCodeExporter opened this issue · comments

This is a similar problem to the previously noted infinite loop issue in 
active mode, but it needs a different fix.

In this case the two causes appear to be:
1) the requested URL includes an HTML-escapable character (e.g. '<', '>', '&') 
2) the returned HTML includes an escaped version of the requested URL 
(e.g. if the requested URL was www.test.com/testpage.html?a=1&b=2, then the 
page would include a link to 
www.test.com/testpage.html?a=1&amp;b=2, or some variation that includes &amp; 
in the link)

The loop is created because jsunpack-n doesn't recognize the second URL (with 
&amp; instead of &) as the same as the first URL already fetched, so it makes 
another request for the "new" link, which then returns yet another link 
including &amp;amp; (because the first & is again escaped by the server). This 
goes on until you run out of memory or patience.
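
To illustrate the idea (this is not the attached diff), here is a minimal sketch that reduces every gathered URL to one canonical form before checking whether it was already fetched. The names canonical_url, should_fetch and seen are hypothetical and are not jsunpack-n functions, and html.unescape assumes Python 3.4 or later.

```python
import html

def canonical_url(url):
    """Undo HTML entity escaping repeatedly until the URL stops changing."""
    previous = None
    while previous != url:
        previous = url
        url = html.unescape(url)  # "&amp;amp;" -> "&amp;" -> "&"
    return url

# Hypothetical registry of URLs that were already requested.
seen = set()

def should_fetch(url):
    """Return True only the first time a URL is seen in canonical form."""
    key = canonical_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this check, www.test.com/testpage.html?a=1&amp;b=2 maps to the same key as the original request and is skipped. One caveat: html.unescape also decodes a few legacy entities that lack the trailing semicolon (for example &copy), so a query parameter with such a name could be altered; a stricter version might replace only &amp;, &lt; and &gt;.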

You can reproduce this (at this moment) with the following command and URL: 
./jsunpackn.py -au 'http://search.twitter.com/search?q=Seth+s+Blog+Where+Do+Ideas+Come+From'

I've attached a diff that also includes the previous fix for issue #3 
(https://code.google.com/p/jsunpack-n/issues/detail?id=3)

The fix is a little kludgy ATM; it would be better if all URLs were created and 
cleaned in one method, rather than ad hoc for each type.

Also, I noticed that there are time-based timeouts, but I may go ahead and add 
a "max-depth" parameter for active mode as well, as this would address a 
different set of issues than the time-based timeouts.
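
A rough sketch of what a depth limit could look like, assuming a breadth-first crawl. The crawl function, its max_depth argument and the fetch_links callable are all hypothetical; jsunpack-n has no such option as far as this report states.

```python
from collections import deque

def crawl(start_url, fetch_links, max_depth=2):
    """Breadth-first crawl that never expands links deeper than max_depth.

    fetch_links is a hypothetical callable: url -> iterable of linked URLs.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, distance from the start URL)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not expand this page's links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Even if every page keeps producing a "new" link, as in the escaping loop described above, the crawl stops after max_depth hops instead of running until a time limit or memory is exhausted.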

Original issue reported on code.google.com by ryanwsm...@gmail.com on 29 Nov 2010 at 8:09

Weak sauce, it appears you can't edit after you've submitted:

The other issue should be issue #4, rather than issue #3, but I'm sure everyone 
could've caught that as well

Original comment by ryanwsm...@gmail.com on 29 Nov 2010 at 8:12

Hi Ryanwsmith,

After I used urlEscapingFix.diff to patch jsunpack-n, 
I found that active mode still runs into an infinite loop. 
For example:
jsunpackn.py -au http://www.barefeetshoes.com/

Original comment by Hsiao.ch...@gmail.com on 13 Jan 2011 at 6:38

Hey Chris,

Indeed you are correct, there are still outstanding issues, but with different 
root causes (see the description before the patch). I was knocking each case 
out one by one to build a viable crawling function, but the lack of initial 
response made me think my time was better spent elsewhere. If more people are 
interested, I'd commit to releasing a patch for a more completely tested 
crawling module.

Original comment by ryanwsm...@gmail.com on 13 Jan 2011 at 4:20