sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.

Home Page: https://nokogiri.org/

[bug] SAX `start_element` behavior changed in libxml v2.12.0

flavorjones opened this issue

Please describe the bug

Originally reported at searls/eiwa#10

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", "~>1.15.0" # switch to "~>1.16.0" to reproduce the changed behavior
end

class Document < Nokogiri::XML::SAX::Document
  def start_element(name, attrs)
    puts "#{__FILE__}:#{__LINE__}:#{__method__}: name=#{name.inspect}, attrs=#{attrs.inspect}"
  end
end

fixture = <<~XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
  <!ATTLIST foo xml:lang CDATA "eng">
]>
<root>
  <foo xml:lang="ger">Ja</foo>
</root>
XML

parser = Nokogiri::XML::SAX::Parser.new(Document.new)
parser.parse(fixture)

# with nokogiri < 1.16.0:
# ./10-sax-issue.rb:12:start_element: name="root", attrs=[]
# ./10-sax-issue.rb:12:start_element: name="foo", attrs=[["xml:lang", "ger"]]
# 
# with nokogiri >= 1.16.0:
# ./10-sax-issue.rb:12:start_element: name="root", attrs=[]
# ./10-sax-issue.rb:12:start_element: name="foo", attrs=[["xml:lang", "ger"], ["xml:lang", "eng"]]
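
Until a fixed libxml2 is in use, one possible stopgap is to drop the repeated attribute inside the handler. The following is only a sketch, not a Nokogiri API: `dedupe_attrs` is a hypothetical helper, and it assumes (based solely on the output above) that the value written in the document is reported before the DTD default.

```ruby
# Hypothetical helper (not part of Nokogiri): keep only the first pair seen
# for each attribute name. Assumes the explicit value precedes the DTD
# default, as in the output shown above.
def dedupe_attrs(attrs)
  attrs.uniq { |name, _value| name }
end

affected = [["xml:lang", "ger"], ["xml:lang", "eng"]]
unaffected = [["xml:lang", "ger"]]

p dedupe_attrs(affected)
p dedupe_attrs(unaffected)
```

A handler could call `attrs = dedupe_attrs(attrs)` as the first line of `start_element`; on unaffected versions the call leaves the list unchanged.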

Just confirming that this seems to be an upstream issue. I can reproduce it using xmllint and am going to git bisect.

Upstream commit is https://gitlab.gnome.org/GNOME/libxml2/-/commit/e0dd330b, which first appeared in libxml2 2.12.0

commit e0dd330b (HEAD)
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   2023-09-29 00:18:44 +0200

    parser: Use hash tables to avoid quadratic behavior

    Use a hash table to lookup namespaces by prefix. The hash table stores
    an index into the namespace table. Auxiliary data for namespaces is
    stored in a separate array along the main namespace table.

    Use a hash table to verify attribute uniqueness. The hash table stores
    an index into the attribute table.

    Reuse hash value from the dictionary to avoid computing them twice.

    See #346.

Linked issue is https://gitlab.gnome.org/GNOME/libxml2/-/issues/346

Fixed upstream in https://gitlab.gnome.org/GNOME/libxml2/-/commit/186562a182d2e27f90631d1a1f63ad5079fe62fb

Not sure whether Nick will make a release soon, but if not, I can patch this fix into the vendored version in a bugfix release.

The fix was released upstream in libxml2 v2.12.6. I'm working on a Nokogiri release that includes it (unrelated blockers exist, so it may take a day or two).
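
For anyone wanting to check whether a given libxml2 falls in the window described in this thread (first seen in 2.12.0, fixed in 2.12.6), a version comparison along these lines may help. This is a sketch only: `affected_libxml2?` and the two constants are hypothetical names, and the range is taken purely from the comments above.

```ruby
require "rubygems" # Gem::Version; normally already loaded

# Hypothetical constants: the range this thread identifies as affected.
FIRST_AFFECTED = Gem::Version.new("2.12.0")
FIRST_FIXED = Gem::Version.new("2.12.6")

# Hypothetical helper (not part of Nokogiri): true when the libxml2 version
# string is at least 2.12.0 and below 2.12.6.
def affected_libxml2?(version)
  v = Gem::Version.new(version)
  v >= FIRST_AFFECTED && v < FIRST_FIXED
end

%w[2.11.7 2.12.0 2.12.5 2.12.6].each do |version|
  puts "#{version}: #{affected_libxml2?(version)}"
end
```

With Nokogiri loaded, the version string to test should be available as `Nokogiri::VERSION_INFO["libxml"]["loaded"]`; when Nokogiri is built against a system libxml2, that value can differ from the vendored version.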

Release imminent; please follow #3151.