shepmaster / sxd-document

An XML library in Rust


Extremely large documents

infogulch opened this issue · comments

Hi! I recently got it into my head to play around with offline copies of Wikipedia. The Wikimedia foundation very helpfully provides downloadable dumps of the full content of Wikipedia. The dump itself is a 20 GB file that unzips to a single .xml file measuring 78 GB. You read that right: 1 XML file, 84,602,863,258 bytes. You can probably see where this is going...

Alas, I am many gigabytes short of fitting this entire mountain of a document in memory at once, let alone twice plus overhead (once as a string and once parsed). If I have any hope of consuming this thing with precision (as opposed to regex, shudder), I believe a streaming parser and query engine will be necessary; however, I did not see a streaming interface in the sxd_document::parser docs or in sxd_xpath. Is that a correct assessment? Have you considered building a streaming interface to handle such cases? (In my experience, opting for a streaming solution can lead to the fastest implementation even when memory pressure is not a concern; such a use case may be valuable just for the speed.)

Thoughts?

I am many gigabytes short of fitting this entire mountain of a document in memory at once

Heh, yeah. A document-based interface will almost never work for such an XML file. 😇

I did not see a streaming interface

That's correct. There's a pseudo-streaming interface buried in there, but it's not public and it's not really a great choice to make public.

The good news is that I've been slowly working on a ground-up rewrite that I believe will be measurably faster (early benchmarks show that it has the possibility of being on par with libxml2). The bad news is that it's so early that I haven't even done any kind of release for it.

If you have a bit of free time, I'd really appreciate it if you could clone it. There's a demo utility that will simply run the parser as fast as it can, counting the number of tokens it sees. I'd love to know what it says for your Very Big File:

time QUIET=1 cargo run --release path/to/your/file.xml

Future

Supporting some kind of Serde-like annotations would be very neat. Combined with the concept of streaming, I'd love to have some pseudo-Rust like this:

#[derive(sxd::Deserialize)]
struct Page {
    #[sxd::attribute]
    id: String,               // taken from the id="..." attribute
    content: Vec<sxd::Value>, // the element's child content
}

let mut p = Parser::new();
p.enter_element();
let pg = p.deserialize::<Page>();

For some pseudo-XML:

<wrapper>
  <page id="abc"></page> <!-- repeated -->
</wrapper>

There could potentially be some other things that allow access to the string interning that would exist anyway, to reduce the number of copies further.

Cool, yes I'll try to check it out this week and report back my results.

I think your future design would work fine in my case:

let mut p = Parser::new(file);
p.enter_element();
let pg = p.deserialize::<Page>();

I think it would be fun (for some definition of "fun" 😇) if we could generate Rust types straight from the XSD for documents with a well-defined schema. Take the Wikipedia dump schema, export-0.10.xsd, for example:

<!-- Our root element -->
<element name="mediawiki" type="mw:MediaWikiType">
    ...
</element>

<complexType name="MediaWikiType">
    <sequence>
        <element name="siteinfo" type="mw:SiteInfoType"
                 minOccurs="0" maxOccurs="1" />
        <element name="page" type="mw:PageType"
                 minOccurs="0" maxOccurs="unbounded" />
        <element name="logitem" type="mw:LogItemType"
                 minOccurs="0" maxOccurs="unbounded" />
    </sequence>
    <attribute name="version" type="string" use="required" />
    <attribute ref="xml:lang" use="required" />
</complexType>

I assume this means that the root is a <mediawiki /> element that contains up to one <siteinfo /> element, any number of <page /> elements, and then any number of <logitem /> elements. Perhaps this would translate into an iterable of an enum with mw:SiteInfoType/mw:PageType/mw:LogItemType variants (a rough sketch of what I mean is below)? Then again, it may not be possible to automatically derive an efficient, streamable Rust type that matches an arbitrary XML schema. (Or it would be so unwieldy it would be painful to actually use.) So maybe this idea can be relaxed to just XSD schema + XPath query = Rust type, checked at compile time. Just trying to push compile-time checks as far as possible. 😄 What do you think is feasible in this direction, and what would the biggest roadblocks be?
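To make that concrete, here's roughly what I imagine the generated types looking like; every name in this sketch is invented:

// Every name here is hypothetical; SiteInfoType/PageType/LogItemType
// stand in for structs the generator would also emit.
struct SiteInfoType { /* ... */ }
struct PageType { /* ... */ }
struct LogItemType { /* ... */ }

struct MediaWiki {
    version: String, // required `version` attribute
    lang: String,    // required `xml:lang` attribute
}

// The <sequence> children become an enum so they can be yielded one at a
// time instead of materializing the whole document.
enum MediaWikiChild {
    SiteInfo(SiteInfoType), // minOccurs="0" maxOccurs="1"
    Page(PageType),         // minOccurs="0" maxOccurs="unbounded"
    LogItem(LogItemType),   // minOccurs="0" maxOccurs="unbounded"
}

// Hypothetical usage with the streaming deserializer sketched earlier:
// let mut p = Parser::new(file);
// let mw = p.enter_element::<MediaWiki>()?;
// for child in p.children::<MediaWikiChild>() {
//     if let MediaWikiChild::Page(page) = child? { /* one page at a time */ }
// }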

There could potentially be some other things that allow access to the string interning that would exist anyway, to reduce the number of copies further.

If it's intended to be streamed, perhaps iterable elements can be borrowed and only valid during the inner iteration, which would allow the whole stream to be zero-copy: the data structures would be filled with pointers directly into an underlying buffer.

generate Rust types straight from the XSD for documents with a well-defined schema

Yep, completely agree. Amusingly, it shouldn't be terrible to implement the first few passes of that. Basically you'd parse the XML for the XSD and then generate some Rust code based on that. I'm sure there are gotchas (circular data structures are the first that come to mind).

iterable elements can be borrowed and only valid during the inner iteration

This is how the internals of the new parser are implemented, but it's not really possible for Rust to express this in a generic manner right now (that requires generic associated types).
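For reference, the trait shape that GATs would unlock is roughly this (a sketch using the unstable GAT syntax; it doesn't compile on stable today):

// Sketch only: each item borrows from the iterator itself, so it's only
// valid until the next call to `next`, which is exactly the "borrowed
// during the inner iteration" idea above.
trait LendingIterator {
    type Item<'a> where Self: 'a;
    fn next(&mut self) -> Option<Self::Item<'_>>;
}

// An implementation can hand out slices of its own reusable buffer,
// making the iteration zero-copy:
struct Tokens { buf: Vec<u8> }

impl LendingIterator for Tokens {
    type Item<'a> = &'a [u8] where Self: 'a;

    fn next(&mut self) -> Option<Self::Item<'_>> {
        // (a real implementation would refill `self.buf` from the input here)
        if self.buf.is_empty() { None } else { Some(&self.buf[..]) }
    }
}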

One interesting thing is that the command I suggested you run uses two fixed buffers of 16 MiB each (input, output). There are some environment variables you can set to adjust those. I'd expect that you could set them down to 1 KiB without much performance impact, and even down to 16 bytes and still be functional.

Another problem comes up in exactly your case: if your buffer is N bytes and you want to look at something that is N+1 bytes, there's no way to do it.

The revised parser approaches this by yielding values like Token::ElementName(Streaming::Incomplete(...)). It's then up to the consumer to handle those in an efficient manner.
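As a rough sketch (the names here only approximate the real tokens, this isn't the exact API), a consumer wanting a complete element name could accumulate pieces only while a name straddles a buffer boundary:

// Rough sketch, not the exact API: a value larger than the buffer arrives
// as zero or more Incomplete pieces followed by a Complete one.
enum Streaming<T> { Incomplete(T), Complete(T) }

enum Token<'a> {
    ElementName(Streaming<&'a str>),
    Text(Streaming<&'a str>),
    // ...
}

// Accumulate an element name across buffer refills. A smarter consumer
// would borrow directly when the first piece is already Complete.
fn read_element_name<'a>(
    mut next_token: impl FnMut() -> Option<Token<'a>>,
) -> Option<String> {
    let mut name = String::new();
    loop {
        match next_token()? {
            Token::ElementName(Streaming::Incomplete(part)) => name.push_str(part),
            Token::ElementName(Streaming::Complete(part)) => {
                name.push_str(part);
                return Some(name);
            }
            _ => return None, // simplified error handling
        }
    }
}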

We use the string interning to handle things like ensuring close tags match open tags and that attributes aren't repeated.
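Heavily simplified, the idea is that once every name is interned to a small integer, both of those checks become integer comparisons instead of repeated string comparisons. This isn't the actual implementation, just the shape of it:

use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct Validator {
    ids: HashMap<String, u32>, // interner: name -> small id
    open: Vec<u32>,            // stack of open element ids
    attrs: HashSet<u32>,       // attribute ids seen on the current element
}

impl Validator {
    fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.ids.len() as u32;
        self.ids.insert(name.to_owned(), id);
        id
    }

    fn open_element(&mut self, name: &str) {
        let id = self.intern(name);
        self.open.push(id);
        self.attrs.clear();
    }

    fn attribute(&mut self, name: &str) -> Result<(), String> {
        let id = self.intern(name);
        if self.attrs.insert(id) {
            Ok(())
        } else {
            Err(format!("attribute '{}' repeated", name))
        }
    }

    fn close_element(&mut self, name: &str) -> Result<(), String> {
        let id = self.intern(name);
        match self.open.pop() {
            Some(open) if open == id => Ok(()),
            _ => Err(format!("close tag '{}' doesn't match open tag", name)),
        }
    }
}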

Running under WSL with cargo 1.54.0 (5ae8d74b3 2021-06-22):

 ➜ time QUIET=1 cargo run --release /mnt/c/projects/static.wiki/enwiki-20210720-pages-articles-multistream.xml
    Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `target/release/sxd /mnt/c/projects/static.wiki/enwiki-20210720-pages-articles-multistream.xml`
Parsed 3313537136 tokens

real    15m15.445s
user    6m54.255s
sys     0m11.748s

It read from the SSD drive at about 100 MB/s.
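(Checking the math: 84,602,863,258 bytes / 915 s ≈ 92 MB/s.)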

parse the XML for the XSD and then generate some Rust code based on that

it's not really possible for Rust to express this in a generic manner right now (that requires generic associated types).

If you're generating the specific Rust code for an XSD, do you really need GATs? Or string interning for tags and attributes?

real 15m15.445s
user 6m54.255s
sys 0m11.748s

I'm a bit surprised that user + sys doesn't add up to real...

I'm of mixed emotions here. 100 MB/s is pretty reasonable to me; how do you feel about that rough speed?

For the use case of parsing XML over the network, that should be fast enough here in 2021. It's not fully saturating the IO of a local disk, however, so there might be some tweaks to improve the speed for cases like yours.

generating the specific Rust code for an XSD

Perhaps. It's all about what level the generated code operates at and what it has to reimplement. For example, the validation layer (e.g. opening and closing element names match, no duplicate attribute names, etc.) uses string interning. The generated code could reimplement that at the cost of... reimplementing it.

I haven't done deep thinking on this :-)

I'm a bit surprised that user + sys doesn't add up to real...

Good observation. Perhaps due to blocking on IO? I'll try some different buffer sizes; 16 MB is pretty big, but maybe the issue is waiting for the IO to go through, and perhaps there's an opportunity to queue up the next buffer read while we're consuming the current one. Another complication is reading through the WSL syscall emulation, so I'll also try running directly on Windows. Unfortunately I don't have Linux set up on this system.
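Something like this sketch is what I mean by queuing up the next read: a reader thread fills the next chunk while the parser consumes the current one. The buffer handling is simplified (a real version would recycle two fixed buffers instead of allocating per chunk), and `parser.feed` is made up:

use std::fs::File;
use std::io::Read;
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

// A dedicated thread reads fixed-size chunks so that reading the next
// chunk overlaps with parsing the current one.
fn overlapped_chunks(mut file: File, chunk_size: usize) -> Receiver<Vec<u8>> {
    let (tx, rx) = sync_channel::<Vec<u8>>(1); // one chunk queued ahead
    thread::spawn(move || loop {
        let mut buf = vec![0u8; chunk_size];
        match file.read(&mut buf) {
            Ok(0) | Err(_) => break, // EOF or error: stop producing
            Ok(n) => buf.truncate(n),
        }
        if tx.send(buf).is_err() {
            break; // consumer hung up
        }
    });
    rx
}

// Usage: parsing the current chunk overlaps with reading the next.
// for chunk in overlapped_chunks(file, 16 * 1024 * 1024) {
//     parser.feed(&chunk); // `feed` is hypothetical
// }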

A friend mentioned:

run either natively on windows or fully inside wsl, not mounting files through the 9p share

I'm a macOS user, and my WSL knowledge is light at best. Sounds like straddling the boundary might cause some degradation though.

I'm not sure of the time equivalent for native Windows, however 😉

That did seem to help. It seemed to read at about 150 MB/s this time.

 ➜ time QUIET=1 cargo run --release enwiki-20210720-pages-articles-multistream.xml
    Finished release [optimized + debuginfo] target(s) in 0.15s
     Running `target/release/sxd enwiki-20210720-pages-articles-multistream.xml`
Parsed 3313537136 tokens

real    9m5.861s
user    6m31.176s
sys     0m21.371s
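(That works out to 84,602,863,258 bytes / 546 s ≈ 155 MB/s.)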

Either way is fast enough for me, but what I'm looking for is a way to run XPath queries on it.

Just in case you are unaware, there is another Rust streaming XML parser named xml-rs. It reads at 3 MB/s though xD