xuwenhao / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Faulty XML encoding of characters in <script> tags in <head>

GoogleCodeExporter opened this issue

What steps will reproduce the problem?
1. Process an HTML document whose <head> contains <script> tags with characters
such as ampersands (&)

What is the expected output? What do you see instead?

I would expect the scripts to survive verbatim. Instead, characters inside them are XML-encoded.


What version of the product are you using? On what operating system?

1.2.0


Please provide any additional information below.

As a workaround, I changed the following in HTMLHighlighter.java:

//html.append(xmlEncode(String.valueOf(ch, start, length)));
html.append(String.valueOf(ch, start, length));
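The workaround above turns encoding off for all character data, including ordinary text. A narrower variant might skip encoding only while inside <script> or <style>. The following is a minimal, standalone sketch of that idea, not boilerpipe's actual code: the class RawTextAwareWriter, its methods, and this xmlEncode are hypothetical names made up for illustration, and a real fix would have to be wired into the highlighter's SAX callbacks.

```java
import java.util.Locale;

// Hypothetical sketch: NOT part of boilerpipe. It mimics SAX-style callbacks
// and encodes character data only outside raw-text elements.
final class RawTextAwareWriter {
    private final StringBuilder html = new StringBuilder();
    // Counts how many <script>/<style> elements are currently open.
    private int rawTextDepth = 0;

    private static boolean isRawText(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        return n.equals("script") || n.equals("style");
    }

    void startElement(String name) {
        html.append('<').append(name).append('>');
        if (isRawText(name)) {
            rawTextDepth++;
        }
    }

    void endElement(String name) {
        if (isRawText(name) && rawTextDepth > 0) {
            rawTextDepth--;
        }
        html.append("</").append(name).append('>');
    }

    void characters(char[] ch, int start, int length) {
        String text = String.valueOf(ch, start, length);
        // Inside <script>/<style>, emit the text verbatim; elsewhere encode it.
        html.append(rawTextDepth > 0 ? text : xmlEncode(text));
    }

    // Assumed stand-in for the highlighter's encoder: escapes &, < and >.
    static String xmlEncode(String in) {
        StringBuilder sb = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    @Override
    public String toString() {
        return html.toString();
    }
}
```

With this approach, script bodies containing && or < would pass through untouched, while an ampersand in a paragraph would still be written as an entity.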


NOTE: I also changed the TAG_ACTIONS map to be empty, since our goal is to get
a copy of the original HTML document that is as verbatim as possible, with just
small markers (class) on marked elements. Short of emptying that map, I could
not figure out how to get the original <head> out of the document.

Original issue reported on code.google.com by tapa...@gmail.com on 14 Jan 2013 at 1:20