TextFormat::createImporter issues

Question

kidnewthe opened this issue 2 years ago · comments

When I use FileSystem::native()->loadTextAutodetect to open a file, it returns an incorrectly decoded string.
The detected result is ply::TextFormat={encoding=Bytes (0) newLine=LF (0) bom=false }, which is correct.
I saw this in TextFormat::createImporter:
` OptionallyOwned<InStream> importer;

if (this->encoding == TextFormat::Encoding::UTF8) {
    importer = std::move(ins);
} else {
    importer = Owned<InStream>::create(Owned<InPipe_TextConverter>::create(
        std::move(ins), TextEncoding::get<UTF8>(), encodingFromEnum(this->encoding)));
}

`
Here, TextFormat::Encoding::Bytes also takes the else branch, which I think is causing the problem.

I wrote a test program that you can use to verify this.

Jeff Preshing · Answer 1 · Thu Jun 30 2022 08:13:30 GMT+0800 (China Standard Time)

Hello,

loadTextAutodetect is designed to automatically convert the input file to UTF-8.

Your input file contained: 66 6a 69 5f 31 32 33 09 c4 e3 ba c3 0a 66 6a 69 65 5f 32 33 09 b2 e2 ca d4 0a

As you saw, loadTextAutodetect detected this as TextFormat::Encoding::Bytes. Currently, that means each byte is interpreted as a Unicode code point according to the value of that byte.

The String returned from loadTextAutodetect contains: 66 6a 69 5f 31 32 33 09 c3 84 c3 a3 c2 ba c3 83 0a 66 6a 69 65 5f 32 33 09 c2 b2 c3 a2 c3 8a c3 94 0a

That's actually correct because, for example, c3 84 is the UTF-8 encoding of Unicode code point c4. I don't really see a bug here.
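
To make the byte-to-code-point mapping concrete, here is a minimal standalone sketch in standard C++. It is not Plywood's implementation; bytesToUtf8 and checkThreadExample are hypothetical names used only for illustration. It treats every input byte as a code point in the range U+0000 to U+00FF and writes that code point out as UTF-8, which reproduces the bytes listed above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of Plywood: treat each input byte as a
// Unicode code point (U+0000..U+00FF) and encode that code point as UTF-8.
std::string bytesToUtf8(const std::string& bytes) {
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            // ASCII range: one byte, unchanged.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

// Checks the four non-ASCII bytes from the first line of the input file
// against the corresponding bytes in the returned String.
void checkThreadExample() {
    const std::string in = "\xC4\xE3\xBA\xC3";
    const std::string expected = "\xC3\x84\xC3\xA3\xC2\xBA\xC3\x83";
    assert(bytesToUtf8(in) == expected);
}
```

Each byte of 0x80 or above becomes two bytes, which is why the returned String is longer than the file on disk.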

Perhaps you want to load the file using FileSystem::loadBinary instead. That will load the contents of your file into a String exactly as it is on disk.
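
For comparison, this is roughly what "exactly as it is on disk" means, sketched with the C++ standard library rather than Plywood. readFileBytes is a hypothetical name, and this is only an analogy for the behavior described, not how FileSystem::loadBinary is implemented.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper, not part of Plywood: read a whole file into a string
// with no decoding and no newline translation. Returns an empty string if
// the file cannot be opened.
std::string readFileBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

With a read like this, the bytes c4 e3 ba c3 stay as they are, and it is up to the caller to decide which encoding they represent.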

kidnewthe · Answer 2 · Thu Jun 30 2022 17:23:17 GMT+0800 (China Standard Time)

Thanks for your answer. I didn't realize what loadTextAutodetect did. I had the impression that it automatically detects the character encoding and then returns the file content to me.

Jeff Preshing · Answer 3 · Sat Jul 02 2022 02:07:19 GMT+0800 (China Standard Time)

Maybe loadTextAutoconvertedToUTF8 would be a clearer name.