jlolling / talendcomp_tHashRow

This Talend component builds hash keys from various configurable columns. It is designed to support hash key generation for Data Vault scenarios.

Option to define String codepage

mattywausb opened this issue · comments

When hashing a string, the codepage used for its byte representation is essential. It should default to UTF-8 and, for advanced use, should be changeable. If strings in Java are always UTF-8, this behaviour must be added to the documentation for clarity.

The code page of Strings in Java is always UTF-16. Anything else is only relevant when we read or write files or streams. I do not think we have to declare the code page of Strings.

The current implementation converts the string used to calculate the hash to UTF-8 bytes by default:
final byte[] result = messageDigest.digest(content.getBytes(Charset.forName("UTF-8")));

I have just made this encoding configurable.
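
To illustrate why the charset passed to getBytes affects the resulting hash, here is a minimal sketch. It is not the component's actual code: the class HashEncodingDemo and the method hashHex are hypothetical names, and MD5 is only chosen as an example algorithm.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class, not part of tHashRow.
class HashEncodingDemo {

    // Encodes the String with the given charset, hashes the bytes,
    // and returns the digest as a lowercase hex string.
    static String hashHex(String content, String algorithm, Charset charset) {
        try {
            MessageDigest messageDigest = MessageDigest.getInstance(algorithm);
            byte[] digest = messageDigest.digest(content.getBytes(charset));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Unknown algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        // "M\u00fcller" contains a non-ASCII character (u with umlaut).
        String content = "M\u00fcller";
        // The same String yields different bytes per charset, hence different hashes.
        System.out.println("UTF-8:      " + hashHex(content, "MD5", StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + hashHex(content, "MD5", StandardCharsets.ISO_8859_1));
        System.out.println("UTF-16LE:   " + hashHex(content, "MD5", StandardCharsets.UTF_16LE));
    }
}
```

For pure ASCII content, UTF-8 and ISO-8859-1 produce identical bytes and therefore identical hashes; the difference only shows up once non-ASCII characters are involved. This is why two systems that should produce matching hash keys need to agree on the charset.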

Haven't checked it yet, but thank you already. That's great.

Sorry, I started to make it configurable, but after thinking about it I doubt this makes sense.
A Java String is always encoded in UTF-16, and the only part of a job where the encoding is relevant is where we read bytes and build a String from them. This is not the case in this component: we do not read bytes, so we do not need to know their encoding.
Let us talk about this, please.