Support for iso-8859-1
qLag opened this issue Β· comments
I try to parse an XML that comes from an iso-8859-1 API (somes strings have french accents).
Unfortunately, tikXml seems only to work with UTF-8.
I tried to use a TypeConverter :
`class StringUT8Converter : TypeConverter {
override fun read(value: String): String {
return String(value.toByteArray(Charsets.ISO_8859_1), Charsets.UTF_8)
}
override fun write(value: String): String {
return String(value.toByteArray(Charsets.UTF_8), Charsets.ISO_8859_1)
}
}
`
but it doesn't work.
Do you think you can include other encodings than UTF-8 (for poor old webservices π )?
Thx
I didn't try but is it not possible to use buffer.read(_index_, this.charset)
instead of buffer.readUtf!(_index_)
in all the file XmlReader (with a default charset = Charsets.UTF_8)
And give the the possibility to define a custom Charset with TikXmlConfig that will be used in TikXml.java :
XmlReader reader = XmlReader.of(source, config.charset);
@qLag It is possible, Okio's API allows you to provide a charset with Buffer#readString(), Buffer#writeString(), and ByteString#encodeString()
I think the only issue is skipping the leading BOM for each charset. This is the current implementation.
private int nextNonWhitespace(boolean throwOnEof, boolean isDocumentBeginning) throws IOException {
// Look for UTF-8 BOM sequence 0xEFBBBF and skip it
if (isDocumentBeginning && source.rangeEquals(0, UTF8_BOM)) {
source.skip(3);
}
...
}
Not sure if this is the most optimal way to support skipping the BOM for each charset, but here's how OkHttp does it for several UTF charsets.
https://github.com/square/okhttp/blob/3f946d0b13534bcd1662e58624b0fc5816d1cc14/okhttp/src/main/java/okhttp3/internal/Util.kt#L255-L265
Edit:
FWIW, Moshi doesn't skip the BOM, you have to detect it and skip it yourself before handing the stream to Moshi. Perhaps that is another avenue of approach.
I made a draft here #150, needs unit tests but I went ahead and started the leg work.
Hi reline,
Thank for your support :) I will please to test your feature when it gets ready π
@qLag In the meantime you can always build a snapshot off of that branch if it's urgent and meets your needs.
I'd like to get more feedback from the maintainers now.
Hi reline,
I tried your draft using this line in Gradle :
implementation 'com.github.reline:tikxml:iso-8859-1-SNAPSHOT'
And this in my code :
val tikXml = TikXml.Builder() .charset(Charsets.ISO_8859_1) .exceptionOnUnreadXml(false) .build()
And... it works great ! π π π
I needed to add these lines too in my build.gralde to make it work :
packagingOptions { exclude 'META-INF/gradle/incremental.annotation.processors' }
Its a really good new. How can we proceed now to be included in Tickaroo/tikXML ?
Thanks again :)
Qlag
@qLag Glad that worked for you!
I updated the PR with some unit tests, only significant difference I made was fixing the XML declaration when writing in charsets other than UTF-8.
- XML_DECLARATION = ByteString.encodeUtf8("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
+ XML_DECLARATION = ByteString.encodeString("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>", charset);