Page#to_absolute raises URI::InvalidURIError: path conflicts with opaque
buren opened this issue
Thanks for an awesome gem!
When crawling a site, this exception was raised:
URI::InvalidURIError: path conflicts with opaque
from $HOME/.rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/generic.rb:761:in `check_path'
from $HOME/.rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/generic.rb:817:in `path='
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:283:in `to_absolute'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:239:in `block in each_url'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:182:in `block in each_link'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:189:in `block in each_link'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:186:in `upto'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:186:in `each'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:188:in `each_link'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:238:in `each_url'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:679:in `block in visit_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:605:in `block in get_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:788:in `prepare_request'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:599:in `get_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:665:in `visit_page'
The line in spidr that raises this error is lib/spidr/page/html.rb:283:
new_url.path = URI.expand_path(path)
URI#path= calls URI#check_path, which raises the error; see the Ruby docs for URI::Generic#check_path.
I'm not really sure what the best way to go about this would be. Perhaps catching URI::InvalidURIError and returning nil is sensible, since Page#to_absolute can already return nil?
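A minimal sketch of that rescue-and-return-nil idea, using only the standard uri library. The helper name set_path_or_nil is hypothetical and is not part of spidr; it only illustrates the shape such a fix could take:

```ruby
require 'uri'

# Hypothetical helper (not spidr's API): returns a copy of the URI with the
# given path, or nil when the URI rejects the assignment.
def set_path_or_nil(uri, path)
  new_uri = uri.dup
  new_uri.path = path # raises URI::InvalidURIError when uri.opaque is set
  new_uri
rescue URI::InvalidURIError
  nil
end
```

With this, a caller that already handles a nil result would simply skip the offending link instead of aborting the crawl.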
Just curious, do you know the page and link that cause this exception?
@postmodern I'm sorry I don't have the exact URL :/ the only thing I know is that it happened on this domain https://www.bls.gov. Real sorry that I can't be more specific.
I encountered this as well and was eventually able to track down the cause.
Page#to_absolute only calls URI::expand_path if the path is not nil. mailto links normally have a nil path. However, there appears to be a bug in the Ruby URI module that returns an empty string rather than nil if a mailto link has headers but no to component:
2.4.1 :011 > URI("mailto:a@b.com?a").path
=> nil
2.4.1 :012 > URI("mailto:a@b.com").path
=> nil
2.4.1 :013 > URI("mailto:?").path
=> ""
2.4.1 :014 > URI("mailto:?a").path
=> ""
This causes #to_absolute to attempt to set the URI's path to URI.expand_path("") (an empty string), which raises an exception because it's an opaque URI.
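The failure can be sketched without spidr by assigning an empty path directly. This assumes the assignment behaves the same as the one in #to_absolute, and the helper name empty_path_error is made up for illustration. What it reports for the header-only mailto form may depend on the Ruby version in use:

```ruby
require 'uri'

# Hypothetical helper: returns the error message raised when an empty path
# is assigned to the parsed URL, or nil if the assignment succeeds.
def empty_path_error(url)
  uri = URI(url)
  uri.path = ""
  nil
rescue URI::InvalidURIError => e
  e.message
end

puts empty_path_error("mailto:?a").inspect          # header-only mailto link
puts empty_path_error("http://example.com").inspect # ordinary hierarchical URL
```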
I'm seeing this bug in the wild with mailto URLs generated by the WordPress Custom Facebook Feed plugin; however, I can consistently replicate it with any mailto URL of that form.
I'm going to report this bug to the Ruby development team unless I find that it's been reported already. In the meantime, maybe it's best to directly check whether URI#opaque is not nil. I'll submit a PR later this weekend if I have the time.
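As a rough sketch of that opaque check (not the code that was eventually merged, which isn't shown in this thread), a guard could leave opaque URIs untouched and only rewrite the path of hierarchical ones. The method name normalize_path is an assumption for illustration:

```ruby
require 'uri'

# Hypothetical guard: opaque URIs (mailto: and similar) have no hierarchical
# path to rewrite, so they are returned unchanged.
def normalize_path(uri, path)
  return uri unless uri.opaque.nil?

  new_uri = uri.dup
  new_uri.path = path
  new_uri
end
```

Checking URI#opaque rather than URI#path sidesteps the empty-string path, since the conflict check inside URI#path= is driven by opaque being set.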
What happened with this issue? I'm still experiencing it.
I never did remember to open that PR, did I? I already wrote the code; still have it lying around somewhere. Let me see if I can find it.
Merged!
Finally released in 0.6.1.