Tags inside the title are not handled properly.
GerardBouchar opened this issue
Currently, when given the following HTML,
<html>
<head>
<title>title with a <b>tag</b> in it</title>
</head>
<body></body>
</html>
tagsoup creates a DOM tree corresponding to the following:
<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>title with a </title></head><body><b>tag</b> in it
</body><body></body></html>
The title is truncated at the first nested tag, and the rest of its content ends up in a spurious extra body element. This causes TIKA-2700.
The generated DOM should correspond to:
<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>title with a <b>tag</b> in it</title></head><body></body></html>
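
To make the difference concrete, here is a minimal sketch that compares the title text a consumer would read from each of the two serializations quoted in this issue. It does not call tagsoup; it only re-parses the two XML outputs with the JDK's built-in DOM parser. The class name `TitleTextCheck` and the helper `titleText` are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class: not part of tagsoup or Tika.
class TitleTextCheck {

    // Serialization currently produced, copied from this issue.
    static final String ACTUAL =
        "<?xml version=\"1.0\" standalone=\"yes\"?>\n"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
        + "<title>title with a </title></head>"
        + "<body><b>tag</b> in it\n</body><body></body></html>";

    // Serialization requested in this issue.
    static final String EXPECTED =
        "<?xml version=\"1.0\" standalone=\"yes\"?>\n"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
        + "<title>title with a <b>tag</b> in it</title></head>"
        + "<body></body></html>";

    // Returns the concatenated text content of the first title element.
    static String titleText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList titles = doc.getElementsByTagName("title");
            return titles.item(0).getTextContent();
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the helper easy to call.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("actual:   [" + titleText(ACTUAL) + "]");
        System.out.println("expected: [" + titleText(EXPECTED) + "]");
    }
}
```

With the current output, the title text is only "title with a " (the rest is lost from the title); with the requested output, it is "title with a tag in it".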