Tickaroo / tikxml

Modern XML Parser for Android

Support for iso-8859-1

qLag opened this issue · comments

I'm trying to parse XML that comes from an ISO-8859-1 API (some strings contain French accented characters).

Unfortunately, tikXml seems only to work with UTF-8.

I tried to use a TypeConverter:

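`class StringUtf8Converter : TypeConverter<String> {

    // Re-encode the parsed value as ISO-8859-1 bytes and decode them as UTF-8.
    override fun read(value: String): String {
        return String(value.toByteArray(Charsets.ISO_8859_1), Charsets.UTF_8)
    }

    // Inverse of read(): UTF-8 bytes decoded as ISO-8859-1.
    override fun write(value: String): String {
        return String(value.toByteArray(Charsets.UTF_8), Charsets.ISO_8859_1)
    }
}
`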
but it doesn't work.
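
For anyone wondering why a converter like the one above cannot repair the text: assuming the parser has already decoded the raw bytes as UTF-8 before the TypeConverter sees the value, the accented bytes have been replaced by U+FFFD and the original data is gone. A minimal JDK-only sketch of that effect (the class and method names are made up for illustration, this is not tikXml code):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of tikXml.
final class LossyDecodeDemo {

    // The French word "ete" with both e's accented, encoded the way an
    // ISO-8859-1 web service would send it (bytes E9 74 E9).
    static byte[] latin1Bytes() {
        return "\u00e9t\u00e9".getBytes(StandardCharsets.ISO_8859_1);
    }

    // What a UTF-8-only parser effectively does. The E9 bytes are not valid
    // UTF-8 in this position, so malformed input is replaced with U+FFFD.
    static String decodedAsUtf8() {
        return new String(latin1Bytes(), StandardCharsets.UTF_8);
    }

    // Same transformation as the converter's read(). It runs after the
    // damage is done, so it cannot bring the accents back.
    static String converterRead(String value) {
        return new String(value.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Decoding the raw bytes with the right charset is the lossless path.
    static String decodedAsLatin1() {
        return new String(latin1Bytes(), StandardCharsets.ISO_8859_1);
    }
}
```

In other words, the charset has to be applied where the bytes are first turned into characters, which is what the rest of this thread is about.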

Do you think you could support encodings other than UTF-8 (for poor old web services 😝)?

Thx

I haven't tried it, but wouldn't it be possible to use buffer.read(index, this.charset) instead of buffer.readUtf8(index) throughout XmlReader (with a default charset = Charsets.UTF_8)?

And add the possibility to define a custom Charset in TikXmlConfig, which would then be used in TikXml.java:
XmlReader reader = XmlReader.of(source, config.charset);

@qLag It is possible. Okio's API lets you provide a charset with Buffer#readString(), Buffer#writeString(), and ByteString#encodeString().

I think the only issue is skipping the leading BOM for each charset. The current implementation only handles the UTF-8 BOM:

private int nextNonWhitespace(boolean throwOnEof, boolean isDocumentBeginning) throws IOException {
  // Look for UTF-8 BOM sequence 0xEFBBBF and skip it
  if (isDocumentBeginning && source.rangeEquals(0, UTF8_BOM)) {
    source.skip(3);
  }
  ...
}

I'm not sure this is the best way to skip the BOM for each charset, but here's how OkHttp does it for several UTF charsets:
https://github.com/square/okhttp/blob/3f946d0b13534bcd1662e58624b0fc5816d1cc14/okhttp/src/main/java/okhttp3/internal/Util.kt#L255-L265
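
As a rough idea of what per-charset BOM handling involves, here is a JDK-only sketch. It is not OkHttp's or tikXml's code: `BomDetector`, `Result`, and `detect` are hypothetical names, and it assumes the first few bytes of the document are available as an array.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, written only to illustrate the idea.
final class BomDetector {

    // Charset implied by the BOM, plus how many BOM bytes to skip.
    static final class Result {
        final Charset charset;
        final int bomLength;

        Result(Charset charset, int bomLength) {
            this.charset = charset;
            this.bomLength = bomLength;
        }
    }

    // Returns the BOM's charset, or the fallback with length 0 if no BOM is found.
    static Result detect(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return new Result(StandardCharsets.UTF_8, 3);
        }
        // UTF-32 is checked before UTF-16 because FF FE 00 00 also starts with FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) {
            return new Result(Charset.forName("UTF-32BE"), 4);
        }
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) {
            return new Result(Charset.forName("UTF-32LE"), 4);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return new Result(StandardCharsets.UTF_16BE, 2);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return new Result(StandardCharsets.UTF_16LE, 2);
        }
        return new Result(fallback, 0);
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

ISO-8859-1 has no BOM, so for that charset the fallback branch is the one that matters and nothing needs to be skipped.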

Edit:
FWIW, Moshi doesn't skip the BOM; you have to detect it and skip it yourself before handing the stream to Moshi. Perhaps that is another avenue of approach.
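
If the BOM is stripped before the parser ever sees the stream, a small wrapper is enough. A sketch for the UTF-8 case only, using plain java.io (the class and method names are hypothetical, and nothing here comes from Moshi, Okio, or tikXml):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, written only to illustrate the idea.
final class BomSkipper {

    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    // Returns a stream positioned after a leading UTF-8 BOM, or at the
    // very first byte when no BOM is present.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];

        // read() may return fewer bytes than requested, so loop until full or EOF.
        int filled = 0;
        while (filled < head.length) {
            int n = pushback.read(head, filled, head.length - filled);
            if (n < 0) {
                break;
            }
            filled += n;
        }

        boolean isBom = filled == head.length;
        for (int i = 0; isBom && i < head.length; i++) {
            isBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }

        // Not a BOM: push the bytes back so the caller sees the stream unchanged.
        if (!isBom && filled > 0) {
            pushback.unread(head, 0, filled);
        }
        return pushback;
    }
}
```

The trade-off is that every caller has to remember to wrap the stream, whereas handling it inside the reader keeps the behavior in one place.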

I made a draft in #150. It still needs unit tests, but I went ahead and started the legwork.

Hi reline,
Thanks for your support :) I will be pleased to test your feature when it's ready 👍

@qLag In the meantime you can always build a snapshot off of that branch if it's urgent and meets your needs.
I'd like to get more feedback from the maintainers now.

Hi reline,

I tried your draft using this line in Gradle:
implementation 'com.github.reline:tikxml:iso-8859-1-SNAPSHOT'

And this in my code:
val tikXml = TikXml.Builder()
    .charset(Charsets.ISO_8859_1)
    .exceptionOnUnreadXml(false)
    .build()

And... it works great! 👍 😊 🎉
I also needed to add these lines to my build.gradle to make it work:
packagingOptions { exclude 'META-INF/gradle/incremental.annotation.processors' }

It's really good news. How can we proceed now to get this included in Tickaroo/tikXML?
Thanks again :)

Qlag

@qLag Glad that worked for you!

I updated the PR with some unit tests. The only significant change I made was fixing the XML declaration when writing in charsets other than UTF-8.

- XML_DECLARATION = ByteString.encodeUtf8("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
+ XML_DECLARATION = ByteString.encodeString("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>", charset);
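
To see why the declaration has to follow the charset, here is a small JDK-only sketch (not the PR's code; `DeclarationDemo` and its methods are hypothetical). It writes a document in a given charset with a matching declaration, then reads it back with the JDK's built-in DOM parser, which uses the declared encoding to decode the bytes. The text is assumed to contain no markup characters, since the sketch does no escaping.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, written only to illustrate the idea.
final class DeclarationDemo {

    // Builds a one-element document whose declaration names the same
    // charset that is used to encode the bytes.
    static byte[] write(String text, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
                + "<name>" + text + "</name>";
        return xml.getBytes(charset);
    }

    // Parses the bytes with the JDK DOM parser and returns the root element's text.
    static String read(byte[] document) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(document));
        return doc.getDocumentElement().getTextContent();
    }
}
```

With a hard-coded UTF-8 declaration and ISO-8859-1 bytes, a conforming parser would be told to decode the accented bytes as UTF-8, which is the mismatch the change above avoids.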

Is anyone available to review it? @sockeqwe @Bodo1981