jukka / tagsoup

Fork of the TagSoup library by John Cowan

Home Page: http://ccil.org/~cowan/XML/tagsoup/

Tags inside the title are not handled properly.

GerardBouchar opened this issue

Currently, when given the following HTML,

<html>
<head>
<title>title with a <b>tag</b> in it</title>
</head>
<body></body>
</html>

tagsoup creates a DOM tree corresponding to the following:

<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>title with a </title></head><body><b>tag</b> in it

</body><body></body></html>

The title is truncated at the first nested tag, and the rest of its content ends up in a spurious extra body element. This causes TIKA-2700.

The generated DOM should correspond to:

<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>title with a &lt;b&gt;tag&lt;/b&gt; in it</title></head><body></body></html>
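
For illustration, here is a minimal sketch of the expected behavior: the content of the title element is treated as plain text rather than markup (which is how HTML5 defines title content), so nested tags are escaped on output. This is not TagSoup code. The class and method names (TitleTextSketch, rawTitle, escapeXmlText) are hypothetical, and the regex assumes a title start tag without attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of TagSoup: shows the expected title handling only.
final class TitleTextSketch {
    // Assumption: the <title> start tag carries no attributes.
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns everything between <title> and the first </title> as raw text,
    // or null when the input has no title element.
    static String rawTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Escapes the characters that are significant in XML text content.
    static String escapeXmlText(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>title with a <b>tag</b> in it</title></head>"
                + "<body></body></html>";
        // Prints: <title>title with a &lt;b&gt;tag&lt;/b&gt; in it</title>
        System.out.println("<title>" + escapeXmlText(rawTitle(html)) + "</title>");
    }
}
```

The printed title element matches the one in the expected DOM above: the b tags survive as escaped text inside the title instead of being moved into the body.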