DTD elimination

Ygg01 opened this issue 9 years ago · comments

I think this is the safest way to be sure nothing weird happens. DTD support bogs down the parser implementation, and we could allow a .json file to be loaded for entity replacement instead. Basically, the whole DTD seems like a bag of security bugs waiting to happen. I'd rather support XML Schema and RELAX NG than continue to support DTD.

If there is a huge requirement for entity replacement being inlined into XML, I could see DTD entities being somewhat justified, with a provision to treat any references inside entities as plain strings and to limit their length to something like ten thousand characters. Example (from Wikipedia):

 <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">

when referenced in

<lolz>&lol9;</lolz>
 // Expands to:
<lolz>&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;</lolz>
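A minimal Python sketch of that idea, assuming the entity declarations have already been collected into a dictionary. The names `ENTITIES`, `expand_once`, and `MAX_ENTITY_LENGTH` are made up for illustration, and the regular expression only approximates XML name syntax. Each reference is replaced exactly once, and references inside the replacement text are left as plain strings, so the expansion cannot cascade:

```python
import re

# Cap taken from the "ten thousand characters" suggestion; the exact value is a guess.
MAX_ENTITY_LENGTH = 10_000

# Hypothetical entity table, as it might be collected from <!ENTITY ...> declarations.
ENTITIES = {
    "lol8": "lol" * 10,
    "lol9": "&lol8;" * 10,
}

# Rough approximation of an entity reference; real XML names allow more characters.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def expand_once(text, entities=ENTITIES, max_len=MAX_ENTITY_LENGTH):
    """Replace each entity reference with its literal replacement text, once.

    The replacement text is not rescanned, so nested references stay as strings.
    """
    def substitute(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)  # unknown reference: leave it untouched
        value = entities[name]
        if len(value) > max_len:
            raise ValueError(f"entity {name!r} exceeds {max_len} characters")
        return value

    return ENTITY_REF.sub(substitute, text)


print(expand_once("<lolz>&lol9;</lolz>"))
```

With this approach the output grows linearly with the number of references in the document, rather than exponentially with the nesting depth of the declarations.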

What do you think, @annevk?

Anne van Kesteren · Answer 1 · Wed Apr 08 2015 17:39:13 GMT+0800 (China Standard Time)

My original approach was to only support inline DTDs, because that matches what browsers support of XML today. That would include inline entity support.

I also had the idea of extending the built-in entities to match the set of HTML.

Ygg01 · Answer 2 · Wed Apr 08 2015 17:43:00 GMT+0800 (China Standard Time)

Perhaps, but how much do browsers use anything other than entity support?

Anne van Kesteren · Answer 3 · Wed Apr 08 2015 17:53:17 GMT+0800 (China Standard Time)

Oh, everything else would be ignored. But you still need to parse it properly, I think. Maybe there are some shortcuts I didn't see the first time around.

Ygg01 · Answer 4 · Wed Apr 08 2015 18:02:07 GMT+0800 (China Standard Time)

My plan of action from favorite to least favorite:
a) Simply ignore the DTD: don't parse it, and emit an error upon encountering one
b) Allow a really tiny subset of DTD: entities and basically whatever HTML5 supports.
c) Parse everything, return nothing
d) Parse everything, return DTD

As for shortcuts, I remember one hack in the Fantom XML parser that assumes you didn't miss a < or >. Basically, it reads until the number of opening and closing brackets is equal.

// skip the rest of the doctype
depth := 1
while (true)
{
  c = read()
  if (c == '<') depth++
  if (c == '>') depth--
  if (depth == 0) return
}

However, that's a hack, and I fear it's exploitable.
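To make the concern concrete, here is a rough Python port of the bracket-counting idea (the function name `skip_doctype` is made up, and this is not the actual Fantom code). It shows one way the shortcut can be fooled: a `>` inside a quoted entity value is counted like any other bracket, so the skip ends too early.

```python
def skip_doctype(text, start):
    """Return the index just past the DOCTYPE, using bracket counting only.

    `start` points just after the opening '<' of '<!DOCTYPE', so depth begins at 1.
    Quoted strings are NOT treated specially, which is the weakness shown below.
    """
    depth = 1
    for i in range(start, len(text)):
        if text[i] == "<":
            depth += 1
        elif text[i] == ">":
            depth -= 1
        if depth == 0:
            return i + 1
    raise ValueError("unterminated DOCTYPE")


# Well-behaved input: the whole DOCTYPE is skipped.
ok = '<!DOCTYPE r [<!ENTITY a "b">]><r/>'
print(ok[skip_doctype(ok, 1):])

# A '>' inside the quoted value ends the skip early, leaking ']>' into the content.
tricky = '<!DOCTYPE r [<!ENTITY a ">">]><r/>'
print(tricky[skip_doctype(tricky, 1):])
```

Whether this is actually exploitable depends on what the parser does with the leftover input, but it does show that the shortcut and a real DTD parser can disagree about where the DOCTYPE ends.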

Anne van Kesteren · Answer 5 · Wed Apr 08 2015 18:08:39 GMT+0800 (China Standard Time)

Since the focus is mainly browsers, and nobody else really likes DTDs anyway, I suggest we handle <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> (by ignoring it and emitting an error), and anything more complex will break (in what way is TBD, but ideally it falls out of the remaining states).
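As a rough illustration of that proposal, the following Python sketch recognizes only a "simple" DOCTYPE (a name plus optional PUBLIC or SYSTEM identifiers, with no internal subset), skips it, and records an error. The pattern, the function name `consume_doctype`, and the error message are all assumptions made for this example. How more complex input should break is still TBD in the discussion, so the sketch just leaves such input in place.

```python
import re

# Assumed shape of a "simple" DOCTYPE: a name plus optional PUBLIC/SYSTEM
# identifiers, and no internal subset ([...]).
SIMPLE_DOCTYPE = re.compile(
    r"""<!DOCTYPE\s+[^\s>\[]+
        (?:\s+PUBLIC\s+(?:"[^"]*"|'[^']*')\s+(?:"[^"]*"|'[^']*')
          |\s+SYSTEM\s+(?:"[^"]*"|'[^']*'))?
        \s*>""",
    re.VERBOSE,
)


def consume_doctype(text, errors):
    """Skip a simple DOCTYPE at the start of `text`, recording a parse error.

    Returns the remaining input. Anything more complex is returned unchanged.
    """
    match = SIMPLE_DOCTYPE.match(text)
    if match is None:
        return text
    errors.append("doctype ignored")  # hypothetical error message
    return text[match.end():]


errors = []
doc = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
       '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html/>')
print(consume_doctype(doc, errors), errors)
```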

We will support all HTML entities.

Ygg01 · Answer 6 · Sat Apr 11 2015 19:22:20 GMT+0800 (China Standard Time)

Sounds like a plan.

Although by all HTML entities, I assume you mean HTML/SVG/MathML entities?

Anne van Kesteren · Answer 7 · Sun Apr 12 2015 15:00:41 GMT+0800 (China Standard Time)

Yeah, those defined by HTML 😛
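For a feel of what "those defined by HTML" covers, Python's standard library ships the HTML named character reference table as `html.entities.html5`. The sketch below uses it as a stand-in for a parser's built-in entity table; the helper name `resolve_named_entity` is made up for illustration. The table includes XML's five predefined entities as well as names that originally came from the MathML entity sets.

```python
from html.entities import html5


def resolve_named_entity(name):
    """Look up a named entity (without '&' and ';') in the HTML table.

    Returns the replacement text, or None if the name is not defined by HTML.
    """
    return html5.get(name + ";")


# XML's predefined entities are all present in the HTML table.
for name in ("lt", "gt", "amp", "apos", "quot"):
    print(name, repr(resolve_named_entity(name)))

# Names beyond XML's built-in five, including one that originated in MathML.
print(repr(resolve_named_entity("nbsp")))
print(repr(resolve_named_entity("InvisibleTimes")))
print(resolve_named_entity("nosuchentity"))
```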