Support for iso-8859-1

Question

qLag opened this issue 4 years ago · comments

QUENTIN LAGARDE commented 4 years ago

I am trying to parse XML that comes from an ISO-8859-1 API (some strings contain French accents).

Unfortunately, tikXml seems to work only with UTF-8.

I tried to use a TypeConverter:

```kotlin
class StringUT8Converter : TypeConverter {

    override fun read(value: String): String {
        return String(value.toByteArray(Charsets.ISO_8859_1), Charsets.UTF_8)
    }

    override fun write(value: String): String {
        return String(value.toByteArray(Charsets.UTF_8), Charsets.ISO_8859_1)
    }
}
```
but it doesn't work.
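
A likely reason, sketched here in plain Java with only the standard library: if the parser has already decoded the bytes as UTF-8 before the converter runs (an assumption about where the converter sits in the pipeline), a byte such as 0xE9 (é in ISO-8859-1) is not valid UTF-8 on its own and is replaced by U+FFFD, so no later re-encoding can bring it back. The class and method names below are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Decodes ISO-8859-1 bytes as UTF-8, the way a UTF-8-only reader would.
    static String decodeAsUtf8(byte[] latin1Bytes) {
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    // Mirrors the converter's read(): re-encode as ISO-8859-1, then decode as UTF-8.
    static String converterRead(String value) {
        return new String(value.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" as it travels on the wire in ISO-8859-1: 63 61 66 E9
        byte[] wire = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        String misread = decodeAsUtf8(wire);

        // The lone 0xE9 is malformed UTF-8, so it became U+FFFD.
        System.out.println(misread.contains("\uFFFD"));                  // true
        // The original character is gone; the converter cannot restore it.
        System.out.println(converterRead(misread).equals("caf\u00E9")); // false
    }
}
```

Decoding the same bytes with `new String(wire, StandardCharsets.ISO_8859_1)` returns the expected text, which suggests the charset has to be applied where the bytes are first read rather than afterwards.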

Do you think you could support encodings other than UTF-8 (for poor old web services 😝)?

Thx

QUENTIN LAGARDE · Answer 1 · Thu Nov 26 2020 02:49:03 GMT+0800 (China Standard Time)

I haven't tried it, but wouldn't it be possible to use buffer.read(index, this.charset) instead of buffer.readUtf8(index) throughout XmlReader (with a default charset = Charsets.UTF_8)?

And give the possibility to define a custom Charset with TikXmlConfig, which would then be used in TikXml.java:
XmlReader reader = XmlReader.of(source, config.charset);

Nathan Reline · Answer 2 · Thu Nov 26 2020 06:37:55 GMT+0800 (China Standard Time)

@qLag It is possible. Okio's API allows you to provide a charset with Buffer#readString(), Buffer#writeString(), and ByteString#encodeString().

I think the only issue is skipping the leading BOM for each charset. This is the current implementation.

private int nextNonWhitespace(boolean throwOnEof, boolean isDocumentBeginning) throws IOException {
  // Look for UTF-8 BOM sequence 0xEFBBBF and skip it
  if (isDocumentBeginning && source.rangeEquals(0, UTF8_BOM)) {
    source.skip(3);
  }
  ...
}

Not sure if this is the best way to support skipping the BOM for each charset, but here's how OkHttp does it for several UTF charsets:
https://github.com/square/okhttp/blob/3f946d0b13534bcd1662e58624b0fc5816d1cc14/okhttp/src/main/java/okhttp3/internal/Util.kt#L255-L265

Edit:
FWIW, Moshi doesn't skip the BOM; you have to detect it and skip it yourself before handing the stream to Moshi. Perhaps that is another possible approach.
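
As a rough illustration of the "detect it and skip it yourself" approach, here is a minimal standard-library sketch in Java. It is not the OkHttp or tikXML implementation; the class `BomSniffer`, its `Result` type, and the `sniff` method are hypothetical names. It only handles the UTF-8 and UTF-16 BOMs and falls back to a caller-supplied charset (for example ISO-8859-1, which has no BOM).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {
    /** Charset implied by the BOM (or the fallback) and how many leading bytes to skip. */
    static final class Result {
        final Charset charset;
        final int bomLength;

        Result(Charset charset, int bomLength) {
            this.charset = charset;
            this.bomLength = bomLength;
        }
    }

    // True if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // UTF-32 BOMs are deliberately left out of this sketch; supporting them would
    // require checking FF FE 00 00 before the UTF-16LE prefix FF FE.
    static Result sniff(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return new Result(StandardCharsets.UTF_8, 3);
        if (startsWith(head, 0xFE, 0xFF)) return new Result(StandardCharsets.UTF_16BE, 2);
        if (startsWith(head, 0xFF, 0xFE)) return new Result(StandardCharsets.UTF_16LE, 2);
        return new Result(fallback, 0); // no BOM: trust the configured charset
    }
}
```

A caller could then decode the remainder with `new String(bytes, r.bomLength, bytes.length - r.bomLength, r.charset)`, where `r` is the value returned by `sniff`.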

Nathan Reline · Answer 3 · Fri Nov 27 2020 06:54:19 GMT+0800 (China Standard Time)

I made a draft in #150. It still needs unit tests, but I went ahead and started the legwork.

QUENTIN LAGARDE · Answer 4 · Mon Nov 30 2020 15:49:10 GMT+0800 (China Standard Time)

Hi reline,
Thanks for your support :) I will be pleased to test your feature when it is ready 👍

Nathan Reline · Answer 5 · Thu Dec 03 2020 08:33:22 GMT+0800 (China Standard Time)

@qLag In the meantime you can always build a snapshot off of that branch if it's urgent and meets your needs.
I'd like to get more feedback from the maintainers now.

QUENTIN LAGARDE · Answer 6 · Thu Dec 17 2020 04:20:37 GMT+0800 (China Standard Time)

Hi reline,

I tried your draft using this line in Gradle:
implementation 'com.github.reline:tikxml:iso-8859-1-SNAPSHOT'

And this in my code:
val tikXml = TikXml.Builder()
    .charset(Charsets.ISO_8859_1)
    .exceptionOnUnreadXml(false)
    .build()

And... it works great! 👍 😊 🎉
I also needed to add these lines to my build.gradle to make it work:
packagingOptions {
    exclude 'META-INF/gradle/incremental.annotation.processors'
}

That's really good news. How can we proceed now to get this included in Tickaroo/tikXML?
Thanks again :)

Qlag

Nathan Reline · Answer 7 · Wed Dec 23 2020 05:44:31 GMT+0800 (China Standard Time)

@qLag Glad that worked for you!

I updated the PR with some unit tests. The only significant change I made was fixing the XML declaration when writing in charsets other than UTF-8.

- XML_DECLARATION = ByteString.encodeUtf8("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
+ XML_DECLARATION = ByteString.encodeString("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>", charset);
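
The effect of that change can be illustrated with the standard library alone. The sketch below is not tikXML's code: the class and method names are hypothetical, and it uses String#getBytes instead of Okio's ByteString. It builds the declaration from the charset's canonical name and encodes it in that same charset, so the declared encoding matches the bytes actually written.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class XmlDeclarationSketch {
    // Builds the declaration text with the charset's canonical name.
    static String declarationFor(Charset charset) {
        return "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>";
    }

    // Encodes the declaration in the same charset it announces.
    static byte[] encodedDeclaration(Charset charset) {
        return declarationFor(charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // prints <?xml version="1.0" encoding="ISO-8859-1"?>
        System.out.println(declarationFor(StandardCharsets.ISO_8859_1));
    }
}
```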

Is anyone available to review it? @sockeqwe @Bodo1981