postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

Page#to_absolute raises URI::InvalidURIError: path conflicts with opaque

buren opened this issue

Thanks for an awesome gem! 🌟

When crawling a site, this exception was raised:

URI::InvalidURIError: path conflicts with opaque
from $HOME/.rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/generic.rb:761:in `check_path'
from $HOME/.rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/generic.rb:817:in `path='
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:283:in `to_absolute'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:239:in `block in each_url'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:182:in `block in each_link'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:189:in `block in each_link'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:186:in `upto'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:186:in `each'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:188:in `each_link'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:238:in `each_url'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:679:in `block in visit_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:605:in `block in get_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:788:in `prepare_request'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:599:in `get_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:665:in `visit_page'

The line in spidr that raises this error is lib/spidr/page/html.rb:283:

new_url.path = URI.expand_path(path)

URI#path= calls URI#check_path, which raises the error; see the Ruby docs for URI::Generic#check_path.
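For anyone who wants to see the check in isolation, here is a minimal sketch using only Ruby's standard library. It assumes that a plain mailto address parses as an opaque URI, and it uses an arbitrary placeholder path; it is not the spidr code itself.

```ruby
require 'uri'

# A mailto address parses as an opaque URI, so it has no hierarchical path.
uri = URI.parse("mailto:a@b.com")

raised = nil
begin
  # URI::Generic#path= validates the value first and rejects any
  # non-nil path when the URI already has an opaque part.
  uri.path = "/anything" # placeholder path, chosen only for illustration
rescue URI::InvalidURIError => e
  raised = e.message
end

puts raised
```

The same check fires for an empty string, because only nil is accepted as a path on an opaque URI.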

I'm not sure what the best fix would be. Perhaps catching URI::InvalidURIError and returning nil is sensible, since Page#to_absolute can already return nil?
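A rough sketch of that rescue-and-return-nil idea follows. The helper name `to_absolute_or_nil` is hypothetical and this is not spidr's actual implementation; it re-assigns the path only to exercise the same URI::Generic#path= check that fails in the trace above.

```ruby
require 'uri'

# Hypothetical helper (not part of spidr): resolve +link+ against +base+
# and return nil instead of raising when the URI library rejects it.
def to_absolute_or_nil(base, link)
  new_url = URI.join(base, link)

  # Stand-in for the path normalization step. Assigning any non-nil path
  # to an opaque URI (such as a mailto link) raises URI::InvalidURIError.
  new_url.path = new_url.path.to_s

  new_url
rescue URI::InvalidURIError
  # Treat unusable links the same way as links that cannot be resolved.
  nil
end

p to_absolute_or_nil("https://www.bls.gov/", "/about")
p to_absolute_or_nil("https://www.bls.gov/", "mailto:a@b.com")
```

Note that this sketch discards every opaque link, which may be broader than wanted; checking for the opaque part explicitly would be more targeted.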

Just curious: do you know which page and link cause this exception?

@postmodern I'm sorry I don't have the exact URL :/ the only thing I know is that it happened on this domain https://www.bls.gov. Real sorry that I can't be more specific.

I encountered this as well and was eventually able to track down the cause.

Page#to_absolute only calls URI::expand_path if the path is not nil. mailto links normally have a nil path. However, there appears to be a bug in the Ruby URI module that returns an empty string rather than nil if a mailto link has headers but no to component:

2.4.1 :011 > URI("mailto:a@b.com?a").path
 => nil
2.4.1 :012 > URI("mailto:a@b.com").path
 => nil
2.4.1 :013 > URI("mailto:?").path
 => ""
2.4.1 :014 > URI("mailto:?a").path
 => ""

This causes #to_absolute to attempt to set the URI's path to URI.expand_path("") (an empty string), which raises an exception because it's an opaque URI.

I'm seeing this bug in the wild with mailto URLs generated by the WordPress Custom Facebook Feed plugin; however, I can consistently replicate it with any mailto URL of that form.

I'm going to report this bug to the Ruby development team unless I find that it's been reported already. In the meantime, maybe it's best to directly check whether URI#opaque is not nil. I'll submit a PR later this weekend if I have the time.
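As a sketch of the opaque check suggested here: the helper below skips path handling whenever URI#opaque is set. The name `normalize_path` and the "default to /" behaviour are assumptions made for illustration, not the code that was eventually merged.

```ruby
require 'uri'

# Hypothetical guard (not the merged spidr patch): only touch the path
# of hierarchical URIs and leave opaque ones (mailto:, urn:, ...) alone.
def normalize_path(url)
  # Opaque URIs carry no hierarchical path, so assigning one would raise.
  return url unless url.opaque.nil?

  # Illustrative normalization: give path-less URLs a root path.
  url.path = "/" if url.path.to_s.empty?
  url
end

puts normalize_path(URI.parse("https://www.bls.gov"))
puts normalize_path(URI.parse("mailto:a@b.com"))
```

Checking the opaque part rather than the path sidesteps the question of whether the parser reports the path as nil or as an empty string.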

What happened with this issue? I'm still experiencing it.

I never did remember to open that PR, did I? I already wrote the code; still have it lying around somewhere. Let me see if I can find it.

Merged!

Finally released in 0.6.1.