Add at least rudimentary DTD handling
frp opened this issue · comments
sxd-document does not work if the document has internal DTD.
Trying to parse the example document (taken from here):
<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend</body>
</note>
I get the following error: (37, [Expected("SYSTEM")])
. 37 is the position of the first [
.
Adding full parsing of internal DTD might be quite a big deal, but, for my purposes (parsing JMdict) just ignoring the DTD content would be fine.
Mentoring Instructions
The DTD grammar and tokens are defined in the XML spec.
Work for this will probably largely live in parse_document_type_declaration
:
Line 492 in 5e2253d
For now, we aren't really worried about amazing performance or even collecting the data, all we need to make sure of is that the parser can accept the input data.
A small example to add to the tests:
<?xml version="1.0"?>
<!DOCTYPE cXML SYSTEM "http://xml.cxml.org/schemas/cXML/1.2.014/cXML.dtd">
<cXML />
Strangely, it kind of seems like this much of it should already be supported; I'm guessing that there's some small mistake in what the parser accepts.
OK, so I've been tinkering and learning. So far, I have found that the presence of a /
character is enough to cause breakage.
I traced the failure back to parse_one_quoted_value
Lines 385 to 387 in 5e2253d
I'm assuming that the 2nd try_parse!()
here is meant to consume all characters up until the token passed in (in this case either a '
or a "
), yet it's terminating prematurely at the /
.
I don't really see a reason for this, but at this superficial level I suspect the bug is actually in the peresil
crate which provides the try_parse!()
macro. I may pivot over to write some tests for peresil
and see if I can reproduce with a simple case.
It's actually because the grammar for parse_external_id
got mistranslated:
commit 7e6b99340b40a06ec938c9d12b96d103c4e4a5ac
Author: Jake Goulding <jake.goulding@gmail.com>
Date: Wed Mar 7 19:31:17 2018 -0500
DTD SystemLiterals can contain any character
diff --git a/src/parser.rs b/src/parser.rs
index 995ba0d..b645f9c 100644
--- a/src/parser.rs
+++ b/src/parser.rs
@@ -481,8 +481,10 @@ fn parse_external_id<'a>(pm: &mut XmlMaster<'a>, xml: StringPoint<'a>)
let (xml, _) = try_parse!(xml.expect_literal("SYSTEM"));
let (xml, _) = try_parse!(xml.expect_space());
let (xml, external_id) = try_parse!(
- parse_quoted_value(pm, xml, |_, xml, _| xml.consume_name().map_err(|_| SpecificError::ExpectedSystemLiteral))
- );
+ parse_quoted_value(pm, xml, |_, xml, quote| {
+ xml.consume_attribute_value(quote).map_err(|_| SpecificError::ExpectedSystemLiteral)
+ })
+ );
success(external_id, xml)
}
@@ -1311,11 +1313,11 @@ mod test {
#[test]
fn a_prolog_with_a_document_type_declaration() {
- let package = quick_parse("<?xml version='1.0'?><!DOCTYPE doc SYSTEM \"doc.dtd\"><hello/>");
+ let package = quick_parse(r#"<?xml version="1.0"?><!DOCTYPE cXML SYSTEM "http://xml.cxml.org/schemas/cXML/1.2.014/cXML.dtd"><cXML />"#);
let doc = package.as_document();
let top = top(&doc);
- assert_qname_eq!(top.name(), "hello");
+ assert_qname_eq!(top.name(), "cXML");
}
#[test]
Oh, mm. 😊 I guess I was about to go on a wild goose chase.
Taking almost verbatim what was shown here, the test is passing.
In order to not be a total plagiarist, I'm going to try and work up a solution for the OP's case. Worth a shot at least!
I've added the example from @frp as a test (which is now passing). Not sure what else I can do outside of addressing any nitpicks.