TextFormat::createImporter issues
kidnewthe opened this issue · comments
When I use FileSystem::native()->loadTextAutodetect to open a file, it returns an incorrectly decoded string.
The detected result is ply::TextFormat={encoding=Bytes (0) newLine=LF (0) bom=false }, which is correct.
I looked at this code in TextFormat::createImporter:
```cpp
OptionallyOwned importer;
if (this->encoding == TextFormat::Encoding::UTF8) {
    importer = std::move(ins);
}
else {
    importer = Owned<InStream>::create(Owned<InPipe_TextConverter>::create(
        std::move(ins), TextEncoding::get<UTF8>(), encodingFromEnum(this->encoding)));
}
```
Here, TextFormat::Encoding::Bytes also takes the else branch, which I think is causing the problem.
I wrote a test program that you can use to verify this.
Hello,
loadTextAutodetect is designed to automatically convert the input file to UTF-8.
Your input file contained: 66 6a 69 5f 31 32 33 09 c4 e3 ba c3 0a 66 6a 69 65 5f 32 33 09 b2 e2 ca d4 0a
As you saw, loadTextAutodetect detected this as TextFormat::Encoding::Bytes. Currently, that means each byte is interpreted as a Unicode code point according to the value of that byte.
The String returned from loadTextAutodetect contains: 66 6a 69 5f 31 32 33 09 c3 84 c3 a3 c2 ba c3 83 0a 66 6a 69 65 5f 32 33 09 c2 b2 c3 a2 c3 8a c3 94 0a
That's actually correct because, for example, c3 84 is the UTF-8 encoding of Unicode code point U+00C4, which is what the input byte c4 is interpreted as. I don't really see a bug here.
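To make the byte-to-code-point step concrete, here is a minimal standalone sketch of that conversion. It does not use Plywood; `bytesAsCodePointsToUtf8` and `toHex` are hypothetical helper names introduced only for illustration, and the sketch assumes the behavior described above (each byte becomes the code point of the same value).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not part of Plywood: treats each input byte as the
// Unicode code point of the same value (U+0000..U+00FF) and encodes it as UTF-8.
std::string bytesAsCodePointsToUtf8(const std::vector<std::uint8_t>& input) {
    std::string out;
    for (std::uint8_t b : input) {
        if (b < 0x80) {
            // Code points below U+0080 are a single UTF-8 byte.
            out.push_back(static_cast<char>(b));
        } else {
            // Code points U+0080..U+00FF take two bytes: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

// Formats a byte string as space-separated lowercase hex, like the dumps above.
std::string toHex(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char c : s) {
        if (!out.empty()) out.push_back(' ');
        out.push_back(digits[c >> 4]);
        out.push_back(digits[c & 0x0F]);
    }
    return out;
}
```

Calling `toHex(bytesAsCodePointsToUtf8({0xc4, 0xe3, 0xba, 0xc3}))` gives `c3 84 c3 a3 c2 ba c3 83`, which matches the corresponding bytes in the returned String.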
Perhaps you want to load the file using FileSystem::loadBinary instead. That will load the contents of your file into a String exactly as it is on disk.
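For comparison, reading a file's bytes without any conversion can be sketched with only the C++ standard library. This is an analogue for illustration, not Plywood's implementation of FileSystem::loadBinary; `readFileBytes` is a hypothetical name.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper using only the standard library, not Plywood.
// Opens the file in binary mode and copies every byte into a std::string
// unchanged. Returns an empty string if the file cannot be opened.
std::string readFileBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Because no decoding takes place, a byte such as c4 stays a single byte in the result instead of being expanded to c3 84.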
Thanks for your answer. I didn't realize what loadTextAutodetect did. The name gave me the impression that it automatically detects the character encoding and then simply returns the file content.
Maybe loadTextAutoconvertedToUTF8 would be a clearer name.