Ygg01 / xml5_draft

Draft for the XML5 proposal.

DTD elimination

Ygg01 opened this issue · comments

commented

I think this is the safest way to be sure nothing weird happens. DTD support bogs down parser implementation, and we could allow .json to be loaded for entity replacement instead. Basically, the whole DTD seems like a bag of security bugs waiting to happen. I'd rather support XML Schema and RELAX NG than continue to support DTD.

If there is a huge requirement for entity replacement being inlined into XML, I could see DTD entities being somewhat justified, with a provision to treat any references inside entity values as plain strings and to limit their length to something like ten thousand characters. Example (from Wikipedia):

 <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">

when referenced in

<lolz>&lol9;</lolz>
 // Expands to:
<lolz>&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;</lolz>
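A minimal Python sketch of that provision, assuming entities have already been collected into a plain dictionary. The names `ENTITIES`, `MAX_ENTITY_LENGTH`, and `expand_once` are made up for illustration and are not part of any proposal text. Each reference is replaced exactly once, and the replacement text is never rescanned, so nested references stay as literal strings:

```python
import re

MAX_ENTITY_LENGTH = 10_000  # the "like ten thousand characters" cap from the comment

# Hypothetical entity table, as it might be collected from an inline DTD.
ENTITIES = {
    "lol8": "&lol7;" * 10,
    "lol9": "&lol8;" * 10,
}

_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def expand_once(text, entities=ENTITIES, max_len=MAX_ENTITY_LENGTH):
    """Replace each known reference with its literal value, without rescanning."""

    def substitute(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)  # leave unknown references untouched
        value = entities[name]
        if len(value) > max_len:
            raise ValueError(f"entity {name!r} exceeds {max_len} characters")
        return value

    # re.sub does not re-examine replacement text, so "&lol8;" stays a string.
    return _REF.sub(substitute, text)
```

With this approach the output grows at most linearly with the number of references in the document, which is what defuses the exponential blow-up in the Wikipedia example.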

What do you think, @annevk?

My original approach was to only support inline DTDs because that matches what browsers support of XML today. That would include entity support.

I also had the idea of extending the built-in entities to match the set of HTML.

commented

Perhaps, but how much do browsers use anything other than entity support?

Oh, everything else would be ignored. But you still need to parse it properly, I think. Maybe there are some shortcuts I didn't see the first time around.

commented

My plan of action, from favorite to least favorite:
a) Simply ignore the DTD: don't parse it, and emit an error upon encountering it
b) Allow a really tiny subset of DTD: entities and whatever HTML5 supports, basically
c) Parse everything, return nothing
d) Parse everything, return the DTD

As for shortcuts, I remember one hack in the Fantom XML parser that assumes you didn't miss a < or >. Basically, it reads until the numbers of opening and closing brackets are equal.

// skip the rest of the doctype
depth := 1
while (true)
{
  c = read()
  if (c == '<') depth++
  if (c == '>') depth--
  if (depth == 0) return
}

However, that's a hack and I fear it's exploitable.
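To make the concern concrete, here is a rough Python port of the counting loop. The function name `skip_doctype` and its index-based interface are assumptions for illustration, not the Fantom parser's actual API. Because `>` is legal inside a quoted entity value, the counter can reach zero before the real end of the DOCTYPE:

```python
def skip_doctype(text, start):
    """Return the index just past the '>' that balances the opening '<'.

    `start` is the index right after the '<' of '<!DOCTYPE'.
    This is the bracket-counting hack: quotes and comments are not understood.
    """
    depth = 1
    for i in range(start, len(text)):
        c = text[i]
        if c == "<":
            depth += 1
        elif c == ">":
            depth -= 1
        if depth == 0:
            return i + 1
    raise ValueError("unterminated DOCTYPE")


# Well-behaved input: the counter stops exactly at the end of the DOCTYPE.
ok = '<!DOCTYPE r [<!ENTITY e "ab">]><r/>'
rest_ok = ok[skip_doctype(ok, 1):]

# A '>' inside the quoted entity value makes the counter stop too early,
# so the tail of the internal subset leaks into the document content.
bad = '<!DOCTYPE r [<!ENTITY e "a>b">]><r/>'
rest_bad = bad[skip_doctype(bad, 1):]
```

In the second case the parser would resume in the middle of the internal subset, which is the kind of desynchronisation an attacker could aim for.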

Since the focus is mainly browsers and nobody else really likes DTDs anyway, I suggest we handle <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> (by ignoring it and emitting an error), and anything more complex will break (in what way is TBD, but ideally it falls out of the remaining states).
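One way to picture the "ignore it and emit an error" behaviour is the Python sketch below. It is only an illustration under assumptions: a real parser would use tokenizer states rather than a regular expression, and `skip_simple_doctype`, the `errors` list, and the error message are all invented here. A DOCTYPE without an internal subset is skipped and reported; anything more complex is simply not matched, since how it should break is still TBD:

```python
import re

# Matches a DOCTYPE with an optional PUBLIC or SYSTEM identifier,
# but no internal subset ("[ ... ]").
_SIMPLE_DOCTYPE = re.compile(
    r"""<!DOCTYPE\s+[^\s\[>]+
        (?:\s+PUBLIC\s+(?:"[^"]*"|'[^']*')\s+(?:"[^"]*"|'[^']*')
          |\s+SYSTEM\s+(?:"[^"]*"|'[^']*'))?
        \s*>""",
    re.VERBOSE,
)


def skip_simple_doctype(text, pos, errors):
    """Skip a simple DOCTYPE at `pos`, recording an error; else return `pos`."""
    match = _SIMPLE_DOCTYPE.match(text, pos)
    if match is None:
        return pos  # not a simple DOCTYPE; left for whatever state comes next
    errors.append("doctype ignored")
    return match.end()
```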

We will support all HTML entities.

commented

Sounds like a plan.

Although by all HTML entities, I assume you mean HTML/SVG/MathML entities?

Yeah, those defined by HTML 😛
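As a rough illustration of what "those defined by HTML" could look like as a built-in table, the sketch below uses Python's standard `html.entities.html5` mapping, whose keys are HTML named character references such as `amp;`. The helper `resolve_named` is a made-up name, and treating unknown names as literal text is an assumption rather than something decided in this thread:

```python
import re
from html.entities import html5  # maps e.g. "amp;" -> "&"

_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def resolve_named(text):
    """Replace named references found in the HTML table; keep the rest as-is."""

    def substitute(match):
        # Keys in html5 include the trailing semicolon.
        return html5.get(match.group(1) + ";", match.group(0))

    return _NAMED_REF.sub(substitute, text)
```

With a fixed table like this, no entity declarations from the document are needed at all, which fits the plan of ignoring the DTD.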