soulcutter / saxerator

A SAX-based XML parser for parsing large files into manageable chunks

Nested DocumentFragments

exAspArk opened this issue

Hi, I have a question:
Can we somehow change the DocumentFragment#each method so that it stops parsing the source?
What do you think about a new behavior that would let us use it like this:

parser.for_tag(:item).each do |item|
  # where the xml contains <item><author><name>...</name></author></item>
  item.for_tag(:author).each do |author|
    puts author.to_h["name"]
  end
end

Maybe it's hard to implement, and of course these changes would break compatibility. But I personally prefer to keep using Saxerator's objects while iterating, instead of dealing with nested hashes before I actually need them. What do you think? :)

Apologies for taking so long to reply - holidays and other things have been keeping me busy.

I do see some appeal to that style of API. It mimics the nested structure of XML itself, eliminates the need for some predicates such as within, and makes the conversion to hash or string more explicit.

On the other hand, I think this approach clashes with the idea of processing the XML as a stream, holding as little of the document as possible in memory. I'm having a hard time seeing how this structure could be implemented without requiring the outer block to keep the whole <item> in memory. Where <item> is small, that may be ok, but an API that mimics the document structure might lead users to naively begin parsing at the root element, which would then attempt to hold the entire document in memory (I would like to prevent this from being possible). Nested blocks would also need to do an extra traversal of the structure (probably no big deal, just working it out in my head).
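To make the memory concern concrete, here is a minimal pure-Ruby sketch of what nested fragments would imply. It is not Saxerator's implementation: Event, BufferedFragment, el and txt are hypothetical names, and a plain array of events stands in for the callbacks a real SAX parser would deliver one at a time.

```ruby
# Hypothetical sketch, not Saxerator code. A toy event list stands in for
# SAX callbacks (start tag, text, end tag).
Event = Struct.new(:type, :name, :text)

# A fragment that has buffered every event between <tag> and </tag>.
class BufferedFragment
  attr_reader :events

  def initialize(events)
    @events = events
  end

  # Replays the buffered events (the extra traversal) and yields one
  # nested fragment per matching element. Assumes well-formed input.
  def for_tag(tag)
    return enum_for(:for_tag, tag) unless block_given?

    depth = 0
    buffer = nil
    @events.each do |ev|
      if ev.type == :start && ev.name == tag
        buffer ||= []
        depth += 1
      end
      buffer << ev if buffer
      next unless ev.type == :end && ev.name == tag

      depth -= 1
      if depth.zero?
        yield BufferedFragment.new(buffer)
        buffer = nil
      end
    end
  end

  # Concatenated text content of the fragment.
  def text
    @events.select { |ev| ev.type == :text }.map(&:text).join
  end
end

# Helpers that build the toy event list for an element and a text node.
def el(name, *children)
  [Event.new(:start, name), *children.flatten, Event.new(:end, name)]
end

def txt(string)
  Event.new(:text, nil, string)
end

stream = el(:items,
            el(:item, el(:author, el(:name, txt("Ann")))),
            el(:item, el(:author, el(:name, txt("Bob")))))

document = BufferedFragment.new(stream)
names = []
max_buffered = 0

document.for_tag(:item) do |item|
  # The whole <item> must already be buffered before the inner block can run.
  max_buffered = [max_buffered, item.events.size].max
  item.for_tag(:author) do |author|
    names << author.for_tag(:name).first.text
  end
end
```

In this sketch each yielded fragment owns a complete copy of its element's events, so calling for_tag on the root element would buffer the entire document, which is the failure mode described above.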

This is an intriguing idea. I haven't been very actively working on this lately, but I'll leave this open for further consideration.

After consideration, I think this is outside the scope of this library.