mechanize + nokogiri --> mechanize + oga (or better: mechanize + any other html parser)
muescha opened this issue · comments
I use Mechanize, which requires Nokogiri (which is a PITA to install).
Is it possible to have Mechanize + Oga?
Or, better and more generic: Mechanize + any other HTML parser?
PS:
requested before in: yorickpeterse/oga#106
It's certainly possible, and the HTML parser is actually technically pluggable. That said, Mechanize relies on the Nokogiri API so I think this would be non-trivial. In the case of Oga, you could write a proxy parser. Something like this:
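require 'forwardable'
require 'mechanize'
require 'oga'

# Proxy that exposes an Oga document through the parser interface Mechanize calls.
class OgaParser
  extend Forwardable

  attr_reader :errors, :doc

  def_delegators :doc, :css, :xpath, :at_css, :at_xpath

  def initialize
    # actually use this
    @errors = []
  end

  def parse(*args)
    # TODO handle rest of the args (encoding, etc)
    # Plain assignment: with `||=` this shared instance would keep
    # returning the first document it ever parsed.
    @doc = Oga.parse_html(args.shift)
    self
  end

  # Try the pattern as CSS first and fall back to XPath if Oga cannot parse it.
  def search(pattern)
    css(pattern)
  rescue LL::ParserError
    # lol wtf don't do this
    xpath(pattern)
  end
end

# Mechanize calls inner_text on node sets; Oga only provides text.
class Oga::XML::NodeSet
  alias inner_text text
end

Mechanize.html_parser = OgaParser.new

mech = Mechanize.new
page = mech.get "https://github.com"
p page.title
p page.at_xpath("//body").attributes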
Unfortunately, as you can see, it requires some monkey patching and awkward exception handling. I think the Mechanize code would need significant refactoring, and it would probably have to come bundled with adapters for handling the different HTML parsers.
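One way to avoid the `rescue LL::ParserError` fallback in `search` would be to decide up front whether a pattern looks like XPath or CSS. The following is only a sketch: `SelectorKind` and its methods are hypothetical names, not part of Mechanize or Oga, and the heuristic (a leading `/`, `./`, `../` or `(` means XPath) is an assumption that will misclassify some unusual selectors.

```ruby
# Hypothetical helper, not part of Mechanize or Oga.
module SelectorKind
  # Assumption: XPath expressions start with "/", "./", "../" or "(".
  # Everything else is treated as a CSS selector.
  LOOKS_LIKE_XPATH = %r{\A\s*(?:\.{0,2}/|\()}

  def self.xpath?(pattern)
    LOOKS_LIKE_XPATH.match?(pattern)
  end

  # Dispatch to doc.xpath or doc.css without relying on exceptions.
  # doc is any object that responds to both methods.
  def self.search(doc, pattern)
    xpath?(pattern) ? doc.xpath(pattern) : doc.css(pattern)
  end
end
```

With this in place, the proxy's `search` could call `SelectorKind.search(doc, pattern)` instead of rescuing a parser error.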
I would love to one day see mechanize support oga instead of nokogiri. I can't volunteer yet, as I'm still a Ruby novice, and I'm busy with school. Perhaps by Q3 2017 I might try to do patches towards this goal, if no one else has made progress with this.
Still, there is a runtime dependency in the gemspec, so I cannot install Mechanize without installing Nokogiri (which fails on my MacBook because of libxml and others).
@leejarvis I fixed many issues with your implementation. Unfortunately, Oga fails on a lot of non-standard HTML...
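require 'forwardable'
require 'mechanize'
require 'oga'

##
# Analogue to Nokogiri::HTML
#
class OgaParser
  extend Forwardable

  attr_reader :doc

  def_delegators :doc, :css, :xpath, :at_css, :at, :at_xpath, :to_html, :to_xml

  # Returns an Oga::XML::Document; url and encoding are accepted but ignored for now.
  def parse(string_or_io, url = nil, encoding = nil)
    # Transcode to UTF-8, dropping invalid or undefined characters
    # (assumes a String here, not an IO).
    html = string_or_io.encode 'UTF-8', invalid: :replace, undef: :replace, replace: ''
    # TODO handle rest of the args (encoding, etc)
    Oga.parse_html html
  end
end

# Patch Oga's classes with the Nokogiri-style methods Mechanize calls.
module Oga::XML
  class NodeSet
    alias_method :inner_text, :text
  end

  class Document
    alias_method :at, :at_css
    alias_method :to_html, :to_xml

    def text
      children.text
    end

    # Oga does not collect parse errors the way Nokogiri does.
    def errors
      []
    end

    # Try the pattern as CSS first and fall back to XPath if Oga cannot parse it.
    def search(pattern)
      css pattern
    rescue LL::ParserError
      # lol wtf don't do this
      xpath(pattern)
    end
  end

  class Element
    alias_method :at, :at_css

    # Returns the attribute's value, or an empty string if it is missing.
    def attr(name)
      a = attribute name
      a ? a.value : ''
    end
  end
end

Mechanize.html_parser = OgaParser.new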
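The `string_or_io.encode` call in `parse` above has two gaps worth noting: an IO object does not respond to `encode`, and transcoding a string that is already tagged as UTF-8 to UTF-8 leaves invalid bytes in place. A possible way to handle both is sketched below. `to_utf8` is a hypothetical helper name, and treating the `encoding` argument as a label for the raw bytes is an assumption about how Mechanize passes it.

```ruby
# Hypothetical helper: normalise parser input to a valid UTF-8 String.
def to_utf8(string_or_io, encoding = nil)
  # Accept either an IO-like object or a String.
  raw = string_or_io.respond_to?(:read) ? string_or_io.read : string_or_io.to_s
  str = raw.dup

  # Assumption: a given encoding describes the raw bytes, so only relabel them.
  str.force_encoding(encoding) if encoding

  if str.encoding == Encoding::UTF_8
    # Already tagged UTF-8: String#encode would be a no-op, so scrub instead.
    str.scrub('')
  else
    str.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
  end
end
```

Inside `parse`, this would replace the `encode` line with `html = to_utf8(string_or_io, encoding)`.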
Thanks. Yeah, I don't think it's reasonable to try and pull this into Mechanize in any official capacity, so I'm going to close this issue.