sparklemotion / mechanize

Mechanize is a ruby library that makes automated web interaction easy.

Home Page: https://www.rubydoc.info/gems/mechanize/

mechanize + nokogiri --> mechanize + oga (or better: mechanize + any other html parser)

muescha opened this issue

I use Mechanize, which requires Nokogiri (which is a PITA to install).

Is it possible to have Mechanize + Oga?

Or, better and more generic: Mechanize + any other HTML parser?

PS:
requested before in: yorickpeterse/oga#106

It's certainly possible, and the HTML parser is actually technically pluggable. That said, Mechanize relies on the Nokogiri API so I think this would be non-trivial. In the case of Oga, you could write a proxy parser. Something like this:

require 'forwardable'
require 'mechanize'
require 'oga'

class OgaParser
  extend Forwardable

  attr_reader :errors, :doc

  def_delegators :doc, :css, :xpath, :at_css, :at_xpath

  def initialize
    # TODO: collect real parse errors here instead of always exposing an empty list
    @errors = []
  end

  def parse(*args)
    # TODO handle rest of the args (encoding, etc)
    # Parse on every call so each page gets its own document
    @doc = Oga.parse_html(args.shift)
    self
  end

  def search(pattern)
    css(pattern)
  rescue LL::ParserError
    # lol wtf don't do this
    xpath(pattern)
  end
end

class Oga::XML::NodeSet
  alias inner_text text
end

Mechanize.html_parser = OgaParser.new

mech = Mechanize.new
page = mech.get "https://github.com"

p page.title
p page.at_xpath("//body").attributes

Unfortunately, as you can see, it requires some monkey patching and awkward exception handling. I think the Mechanize code would need significant refactoring, and would probably need to come bundled with adapters for the different HTML parsers.
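As a rough illustration of what such an adapter layer could look like, here is a minimal sketch. Everything in it is hypothetical: `HtmlParserAdapter`, `FakeDocument`, `FakeAdapter` and the list of required methods are made up for this example and are not part of Mechanize. The fake backend only exists so the sketch runs without Nokogiri or Oga installed.

```ruby
# Hypothetical adapter contract; none of these names exist in Mechanize.
module HtmlParserAdapter
  # Methods a document must answer so page code can stay parser-agnostic.
  # (Assumed list; the real set Mechanize relies on is larger.)
  REQUIRED_DOCUMENT_METHODS = %i[search at errors].freeze

  class IncompleteAdapter < StandardError; end

  # Raise early if a backend returns a document that breaks the contract.
  def self.check!(document)
    missing = REQUIRED_DOCUMENT_METHODS.reject { |m| document.respond_to?(m) }
    raise IncompleteAdapter, "missing: #{missing.join(', ')}" unless missing.empty?
    document
  end
end

# Stand-in document so the sketch runs without a real HTML parser.
class FakeDocument
  attr_reader :errors

  def initialize(html)
    @html = html
    @errors = []
  end

  # Toy "selector": returns the inner text of a bare tag name such as "title".
  def search(tag)
    name = Regexp.escape(tag)
    @html.scan(%r{<#{name}[^>]*>(.*?)</#{name}>}m).flatten
  end

  def at(tag)
    search(tag).first
  end
end

class FakeAdapter
  # Same call shape as the parsers in this thread: parse(body, url, encoding).
  def parse(string_or_io, url = nil, encoding = nil)
    body = string_or_io.respond_to?(:read) ? string_or_io.read : string_or_io
    HtmlParserAdapter.check!(FakeDocument.new(body))
  end
end

doc = FakeAdapter.new.parse("<html><title>Hi</title></html>")
puts doc.at("title")
```

The point of `check!` is that a missing method shows up when the page is parsed rather than later, deep inside page-handling code.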

I would love to one day see mechanize support oga instead of nokogiri. I can't volunteer yet, as I'm still a Ruby novice, and I'm busy with school. Perhaps by Q3 2017 I might try to do patches towards this goal, if no one else has made progress with this.

There is still a runtime dependency in the gemspec, so I cannot install Mechanize without installing Nokogiri (which fails on my MacBook because of libxml and others).
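For anyone unfamiliar with how that works, the sketch below builds a made-up gemspec with the RubyGems API and lists its runtime dependencies. The gem name, version and constraints are invented for illustration and are not taken from Mechanize's real gemspec.

```ruby
require 'rubygems'

# Hypothetical gemspec for illustration; not Mechanize's actual one.
spec = Gem::Specification.new do |s|
  s.name    = 'example-scraper'
  s.version = '0.0.1'
  s.summary = 'Illustration only'
  s.authors = ['nobody']
  # Runtime dependencies are installed together with the gem itself.
  s.add_runtime_dependency 'nokogiri', '>= 1.0'
  # Development dependencies are not pulled in by a plain `gem install`.
  s.add_development_dependency 'minitest', '>= 5.0'
end

spec.runtime_dependencies.each do |dep|
  puts "#{dep.name} (#{dep.requirement}) is required to install #{spec.name}"
end
```

So as long as the parser is declared as a runtime dependency, swapping `Mechanize.html_parser` at run time does not remove the install-time requirement.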

@leejarvis I fixed many issues with your implementation. Unfortunately, Oga fails on a lot of non-standard HTML...

require 'forwardable'
require 'mechanize'
require 'oga'

##
# Analogue to Nokogiri::HTML
#
class OgaParser
  extend Forwardable

  attr_reader :doc

  def_delegators :doc, :css, :xpath, :at_css, :at, :at_xpath, :to_html, :to_xml

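  def parse string_or_io, url = nil, encoding = nil
    # Accept either a String or an IO-like object, as the parameter name promises
    html = string_or_io.respond_to?(:read) ? string_or_io.read : string_or_io
    html = html.encode 'UTF-8', invalid: :replace, undef: :replace, replace: ''

    # TODO handle rest of the args (url, encoding)
    Oga.parse_html html
  end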

end

module Oga::XML
  class NodeSet
    alias_method :inner_text, :text
  end

  class Document
    alias_method :at, :at_css
    alias_method :to_html, :to_xml

    def text
      children.text
    end

    def errors
      []
    end

    def search pattern
      css pattern
    rescue LL::ParserError
      # lol wtf don't do this
      xpath(pattern)
    end
  end

  class Element
    alias_method :at, :at_css

    def attr name
      a = attribute name
      if a then a.value else '' end
    end
  end
end

Mechanize.html_parser = OgaParser.new
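The `search` method above still picks between CSS and XPath by rescuing `LL::ParserError`. One possible alternative, sketched here on the assumption that a simple prefix check is good enough, is to classify the pattern first and dispatch explicitly. `SelectorKind`, `search_with` and `RecordingDoc` are hypothetical names, and the heuristic will misclassify some inputs (for example a CSS pseudo-element such as `a::before`).

```ruby
# Hypothetical helper; the heuristic is an assumption, not how any parser decides.
module SelectorKind
  # Treat patterns starting with "/", "./", "../", "(" or an axis like
  # "descendant::" as XPath; everything else is assumed to be CSS.
  XPATH_HINT = %r{\A\s*(?:\.{0,2}/|\(|[a-z-]+::)}

  def self.xpath?(pattern)
    XPATH_HINT.match?(pattern)
  end
end

# Dispatch to the right query method instead of rescuing a parser error.
def search_with(doc, pattern)
  SelectorKind.xpath?(pattern) ? doc.xpath(pattern) : doc.css(pattern)
end

# Stand-in document that records which method was called, so the sketch
# runs without Oga installed.
class RecordingDoc
  def css(pattern)
    [:css, pattern]
  end

  def xpath(pattern)
    [:xpath, pattern]
  end
end

doc = RecordingDoc.new
p search_with(doc, "//body")
p search_with(doc, "div.content > a")
```

With a real Oga document, `doc.css` and `doc.xpath` would be the same methods the delegators above already forward to.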

Thanks. Yeah, I don't think it's reasonable to try to pull this into Mechanize in any official capacity, so I'm going to close this issue.