karussell / snacktory

Readability clone in Java

ensure asian characters are not broken

karussell opened this issue

This is now fixed! But needs a unit test!

From email:

The issue is in Converter.streamToString(). There's a loop that reads the HTTP data in chunks. Each chunk is converted to a String separately, but a chunk may end with only the first half of a multi-byte character (or begin with the second half), which results in corrupted data. It happens sporadically, depending on timing.
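A minimal sketch of the failure mode described above, not the actual snacktory code: `ChunkDecodingDemo`, `decodePerChunk`, and `decodeAtEnd` are hypothetical names, and UTF-8 is assumed as the page encoding (the real page may use a different one). The first method decodes every chunk on its own; the second collects the raw bytes and decodes once at the end.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only illustrates the pattern from the email.
class ChunkDecodingDemo {

    // Problematic pattern: every chunk is decoded separately, so a multi-byte
    // character split across two chunks is replaced by U+FFFD junk.
    static String decodePerChunk(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Safer pattern: accumulate the raw bytes and decode a single time.
    static String decodeAtEnd(InputStream in, int chunkSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Three kanji ("nihongo"), each 3 bytes long in UTF-8.
        String text = "\u65E5\u672C\u8A9E";
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        // A chunk size of 4 guarantees that a character straddles two chunks.
        String perChunk = decodePerChunk(new ByteArrayInputStream(data), 4);
        String atEnd = decodeAtEnd(new ByteArrayInputStream(data), 4);
        System.out.println("per-chunk decode intact: " + text.equals(perChunk));
        System.out.println("decode-at-end intact:    " + text.equals(atEnd));
    }
}
```

On a real connection the chunk boundaries depend on how many bytes happen to be available at each read, which would explain why the corruption only shows up sporadically.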

Also, the counting of bytesRead was wrong, so on a slow connection there could be a "size exceeded" message with no justification.
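The email does not say how exactly the count went wrong. One plausible cause, stated here purely as an assumption, is adding the buffer size per iteration instead of the number of bytes that `read()` actually returned, which overcounts whenever a read comes back short. The sketch below shows the counting done against the return value; `BoundedRead`, `readBounded`, and `maxBytes` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper; the real Converter code may look different.
class BoundedRead {

    // Reads the whole stream, failing only if more than maxBytes really arrived.
    static byte[] readBounded(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            // Count the bytes actually returned by this read, not buf.length:
            // a short read must not be charged as a full buffer.
            total += n;
            if (total > maxBytes) {
                throw new IOException("size exceeded: " + total + " > " + maxBytes);
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // 100 bytes arrive in one short read; the limit of 1000 is not reached.
        byte[] small = readBounded(new ByteArrayInputStream(new byte[100]), 1000);
        System.out.println("bytes read: " + small.length);
    }
}
```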

To test this problem, I opened a Japanese article (URL below) in the browser and saved its content somewhere (e.g. to a file). Then I ran the streamToString() function in a loop (with some delay) and each time compared its output with the expected output from the file. Sometimes I saw dozens of successful runs followed by several failures, so the problem is not very persistent, but the errors occurred often enough.

The article I tested on is http://astand.asahi.com/magazine/wrscience/2012022900015.html, and the corruption was almost always visible in the string "300" (see the article), where some junk was displayed instead of the "3".
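Since the manual test above depends on network timing, a unit test could instead force the worst case on every run. The following is only a sketch under assumptions: `SlowStreamTest`, `OneByteAtATime`, and `readToString` are hypothetical names, `readToString` is a stand-in for the fixed streamToString() (whose real signature is not shown here), and the sample text is made up rather than taken from the article. The wrapper stream hands out a single byte per read, so every multi-byte character is split, and the stand-in decodes through an `InputStreamReader`, which carries incomplete byte sequences over to the next read.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical test harness, not part of snacktory.
class SlowStreamTest {

    // Simulates a very slow connection: at most one byte per bulk read.
    static class OneByteAtATime extends FilterInputStream {
        OneByteAtATime(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            return super.read(b, off, 1);
        }
    }

    // Stand-in for the method under test: the reader's decoder keeps state
    // between reads, so split characters are reassembled correctly.
    static String readToString(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] cbuf = new char[1024];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(cbuf)) != -1) {
                sb.append(cbuf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample: three kanji followed by ASCII digits.
        String expected = "\u65E5\u672C\u8A9E 300";
        byte[] data = expected.getBytes(StandardCharsets.UTF_8);
        InputStream slow = new OneByteAtATime(new ByteArrayInputStream(data));
        String actual = readToString(slow, StandardCharsets.UTF_8);
        if (!expected.equals(actual)) {
            throw new AssertionError("text was corrupted: " + actual);
        }
        System.out.println("slow stream decoded without corruption");
    }
}
```

In a real test, the saved article bytes and the charset of the page would take the place of the made-up sample, and the call to `readToString` would be replaced by the actual method.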