Non-Crockford's Base32 letters converted differently in Java or Python implementations
eberbis opened this issue · comments
Hi Andrew,
first of all, thanks for the amazing library, we've been using a lot!
I have a doubt regarding how we fix the conversion of ULIDs which are not following Crockford's Base32 standard.
We are using Lua to generate some guids (https://github.com/Tieske/ulid.lua) and for some reason, we get from time to time letters outside the Crockford's Base32.
While trying to fix this on our side (we're not sure how this is happening to be honest), we realised that Java and Python implementations silently corrects this issue in different ways:
Java
ULID.Value ulidValueFromString = ULID.parseULID("01BX73KC0TNH409RTFD1JXKmO0")
--> "01BX73KC0TNH409RTFD1JXKM00"
mO
is silently converted into M0
Python
In [1]: import ulid
In [2]: u = ulid.from_str('01BX73KC0TNH409RTFD1JXKmO0')
In [3]: u
Out[3]: <ULID('01BX73KC0TNH409RTFD1JXKQZ0')>
In [4]: u.str
Out[4]: '01BX73KC0TNH409RTFD1JXKQZ0'
mO
is silently converted into QZ
Shouldn't the python library behave as the Java one as per the Crockford's Base32 spec, converting L
and I
to 1
and O
to 0
and only upper casing lower case letters instead of changing them?
Thanks a lot in advance!
Eddie
Eddie,
Thanks for reporting the issue. I've got a couple ideas but this will need some deeper investigation. No promises but I should have some free time to dive-in within the next couple of days.
Best,
Andrew
Eddie,
Thanks again for reporting this and I apologize for the issue!
I merged in a fix and pushed out version 0.0.5
that should address it.
Best,
Andrew
Thanks Andrew, that was super fast!
Andrew,
sorry to be a pain.
I was testing those cases and they now work perfectly (thanks again for being so fast!).
While doing so, I think, I may have found a related issue. I saw that when decoding u
/U
values, in our version we silently convert those to a new value. This is an example:
In [6]: u = ulid.from_str('01BX73KC0TNH409RTFD1UXKM00')
In [7]: u
Out[7]: <ULID('01BX73KC0TNH409RTFD3ZXKM00')>
In [8]: u.str
Out[8]: '01BX73KC0TNH409RTFD3ZXKM00'
1U
is silently converted to 3Z
.
I checked Crockford's Base32 and values u
/U
are regarded as (and I quote) an Accidental obscenity...
I'm guessing we should error out if we encounter those? I checked the Java implementation we're currently using and we do get an IllegalArgumentException
for those cases (this hilarious test reflects it as well):
java.lang.IllegalArgumentException: Illegal character 'U'!
Is this behaviour something we should also be compliant with in the Python version?
Thanks again!
Eddie
Eddie,
Yes, this observation is correct. I was working on changes for this last night as well but didn't have time to finish so I just got out a quick release to the initial issue.
The current implementation will validate that the input string is at least in ASCII but yes, there are certain characters that don't overlap between that and Base32, so those will sneak in.
The code to solve this is relatively straight forward but I want to run it though some performance benchmarks to see the impact before committing to an implementation design.
I've created #60 to track this.
Best,
Andrew
Thanks Andrew!
Eddie,
I merged in a fix tonight and closed #60. Version 0.0.6
on PyPI should contain all of these changes.
Let me know if it works or if you run into any regressions.
Thanks again for reporting this!
Best,
Andrew
Thanks Andrew,
just made a few examples on my local and now it raises a ValueError
if there's a non-base32 character found.
Thanks again for the swift response to this!
Eddie