shepmaster / sxd-document

An XML library in Rust

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Add at least rudimentary DTD handling

frp opened this issue · comments

commented

sxd-document does not work if the document has internal DTD.

Trying to parse the example document (taken from here):

<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend</body>
</note>

I get the following error: (37, [Expected("SYSTEM")]). 37 is the position of the first [.
Adding full parsing of internal DTD might be quite a big deal, but, for my purposes (parsing JMdict) just ignoring the DTD content would be fine.

Mentoring Instructions

The DTD grammar and tokens are defined in the XML spec.

Work for this will probably largely live in parse_document_type_declaration:

fn parse_document_type_declaration<'a>(pm: &mut XmlMaster<'a>, xml: StringPoint<'a>) -> XmlProgress<'a, Token<'a>> {

For now, we aren't really worried about amazing performance or even collecting the data, all we need to make sure of is that the parser can accept the input data.

A small example to add to the tests:

<?xml version="1.0"?>
<!DOCTYPE cXML SYSTEM "http://xml.cxml.org/schemas/cXML/1.2.014/cXML.dtd">
<cXML />

Strangely, it kind of seems like this much of it should already be supported; I'm guessing that there's some small mistake in what the parser accepts.

OK, so I've been tinkering and learning. So far, I have found that the presence of a / character is enough to cause breakage.

I traced the failure back to parse_one_quoted_value

sxd-document/src/parser.rs

Lines 385 to 387 in 5e2253d

let (xml, _) = try_parse!(xml.consume_literal(quote).map_err(|_| Error::ExpectedOpeningQuote(quote)));
let (xml, value) = try_parse!(f(xml));
let (xml, _) = try_parse!(xml.consume_literal(quote).map_err(|_| Error::ExpectedClosingQuote(quote)));

I'm assuming that the 2nd try_parse!() here is meant to consume all characters up until the token passed in (in this case either a ' or a "), yet it's terminating prematurely at the /.

I don't really see a reason for this, but at this superficial level I suspect the bug is actually in the peresil crate which provides the try_parse!() macro. I may pivot over to write some tests for peresil and see if I can reproduce with a simple case.

It's actually because the grammar for parse_external_id got mistranslated:

commit 7e6b99340b40a06ec938c9d12b96d103c4e4a5ac
Author: Jake Goulding <jake.goulding@gmail.com>
Date:   Wed Mar 7 19:31:17 2018 -0500

    DTD SystemLiterals can contain any character

diff --git a/src/parser.rs b/src/parser.rs
index 995ba0d..b645f9c 100644
--- a/src/parser.rs
+++ b/src/parser.rs
@@ -481,8 +481,10 @@ fn parse_external_id<'a>(pm: &mut XmlMaster<'a>, xml: StringPoint<'a>)
     let (xml, _) = try_parse!(xml.expect_literal("SYSTEM"));
     let (xml, _) = try_parse!(xml.expect_space());
     let (xml, external_id) = try_parse!(
-        parse_quoted_value(pm, xml, |_, xml, _| xml.consume_name().map_err(|_| SpecificError::ExpectedSystemLiteral))
-        );
+        parse_quoted_value(pm, xml, |_, xml, quote| {
+            xml.consume_attribute_value(quote).map_err(|_| SpecificError::ExpectedSystemLiteral)
+        })
+    );
 
     success(external_id, xml)
 }
@@ -1311,11 +1313,11 @@ mod test {
 
     #[test]
     fn a_prolog_with_a_document_type_declaration() {
-        let package = quick_parse("<?xml version='1.0'?><!DOCTYPE doc SYSTEM \"doc.dtd\"><hello/>");
+        let package = quick_parse(r#"<?xml version="1.0"?><!DOCTYPE cXML SYSTEM "http://xml.cxml.org/schemas/cXML/1.2.014/cXML.dtd"><cXML />"#);
         let doc = package.as_document();
         let top = top(&doc);
 
-        assert_qname_eq!(top.name(), "hello");
+        assert_qname_eq!(top.name(), "cXML");
     }
 
     #[test]

Oh, mm. 😊 I guess I was about to go on a wild goose chase.

Taking almost verbatim what was shown here, the test is passing.

In order to not be a total plagiarist, I'm going to try and work up a solution for the OP's case. Worth a shot at least!

I've added the example from @frp as a test (which is now passing). Not sure what else I can do outside of addressing any nitpicks.