Non-Crockford's Base32 letters converted differently in Java or Python implementations

Question

eberbis opened this issue 7 years ago

Hi Andrew,

first of all, thanks for the amazing library, we've been using it a lot!

I have a question about how the conversion handles ULIDs that do not follow Crockford's Base32 standard.

We are using Lua to generate some GUIDs (https://github.com/Tieske/ulid.lua) and, for some reason, we occasionally get letters outside Crockford's Base32 alphabet.
While trying to fix this on our side (to be honest, we're not sure how it happens), we realised that the Java and Python implementations silently correct the issue in different ways:

Java

ULID.Value ulidValueFromString = ULID.parseULID("01BX73KC0TNH409RTFD1JXKmO0")
--> "01BX73KC0TNH409RTFD1JXKM00"

mO is silently converted into M0

Python

In [1]: import ulid

In [2]: u = ulid.from_str('01BX73KC0TNH409RTFD1JXKmO0')

In [3]: u
Out[3]: <ULID('01BX73KC0TNH409RTFD1JXKQZ0')>

In [4]: u.str
Out[4]: '01BX73KC0TNH409RTFD1JXKQZ0'

mO is silently converted into QZ

Shouldn't the Python library behave like the Java one and follow the Crockford's Base32 spec, converting L and I to 1 and O to 0, and only upper-casing lower-case letters instead of changing them?
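
For reference, the decoding rules described in the question can be sketched in a few lines of Python. This is only a minimal illustration of Crockford's normalization step, not the code of either library; the names `normalize_crockford` and `_ALIASES` are hypothetical.

```python
# Hypothetical helper: applies Crockford's Base32 decode aliases to a string.
# It upper-cases the input, then maps I and L to 1 and O to 0.
_ALIASES = str.maketrans("ILO", "110")


def normalize_crockford(text: str) -> str:
    """Return the canonical upper-case form of a Crockford Base32 string."""
    return text.upper().translate(_ALIASES)


print(normalize_crockford("01BX73KC0TNH409RTFD1JXKmO0"))
# 01BX73KC0TNH409RTFD1JXKM00
```

With this rule, `mO` becomes `M0`, which matches the Java output shown above.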

Thanks a lot in advance!

Eddie

Andrew Hawker · Answer 1 · Thu Oct 26 2017 02:27:46 GMT+0800 (China Standard Time)

Eddie,

Thanks for reporting the issue. I've got a couple of ideas, but this will need some deeper investigation. No promises, but I should have some free time to dive in within the next couple of days.

Best,
Andrew

Andrew Hawker · Answer 2 · Thu Oct 26 2017 11:19:24 GMT+0800 (China Standard Time)

Eddie,

Thanks again for reporting this and I apologize for the issue!

I merged in a fix and pushed out version 0.0.5 that should address it.

Best,
Andrew

Eddie · Answer 3 · Thu Oct 26 2017 17:12:02 GMT+0800 (China Standard Time)

Thanks Andrew, that was super fast!

Eddie · Answer 4 · Thu Oct 26 2017 18:32:49 GMT+0800 (China Standard Time)

Andrew,

sorry to be a pain.

I was testing those cases and they now work perfectly (thanks again for being so fast!).
While doing so, I think I may have found a related issue. When decoding u/U values, the version we are using silently converts them to a different value. For example:

In [6]: u = ulid.from_str('01BX73KC0TNH409RTFD1UXKM00')

In [7]: u
Out[7]: <ULID('01BX73KC0TNH409RTFD3ZXKM00')>

In [8]: u.str
Out[8]: '01BX73KC0TNH409RTFD3ZXKM00'

1U is silently converted to 3Z.

I checked Crockford's Base32 spec, and u/U is excluded from the alphabet to avoid (and I quote) "accidental obscenity".

I'm guessing we should error out if we encounter those? I checked the Java implementation we're currently using and we do get an IllegalArgumentException for those cases (this hilarious test reflects it as well):

java.lang.IllegalArgumentException: Illegal character 'U'!

Should the Python version behave the same way?
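
A strict decoder along the lines suggested here could look like the sketch below. It is an assumption about how such a check might be written, not the implementation of the Python or Java library; `decode_strict`, `CROCKFORD_ALPHABET`, and `ALIASES` are hypothetical names, and hyphen separators and ULID length or overflow checks are deliberately left out.

```python
# Hypothetical strict decoder for Crockford's Base32.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
ALIASES = {"I": "1", "L": "1", "O": "0"}


def decode_strict(text: str) -> int:
    """Decode text to an integer, raising ValueError on any illegal character."""
    value = 0
    for position, char in enumerate(text):
        symbol = char.upper()
        symbol = ALIASES.get(symbol, symbol)
        index = CROCKFORD_ALPHABET.find(symbol)
        if index < 0:
            # U/u and anything else outside the alphabet ends up here.
            raise ValueError(f"Illegal character {char!r} at position {position}")
        value = value * 32 + index
    return value


try:
    decode_strict("01BX73KC0TNH409RTFD1UXKM00")
except ValueError as error:
    print(error)
```

Under this sketch, `mO` and `M0` decode to the same value, while any string containing `U` is rejected instead of being turned into a different ULID.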

Thanks again!

Eddie

Andrew Hawker · Answer 5 · Fri Oct 27 2017 00:11:01 GMT+0800 (China Standard Time)

Eddie,

Yes, this observation is correct. I was working on changes for this last night as well but didn't have time to finish, so I just got out a quick release for the initial issue.

The current implementation validates that the input string is at least ASCII, but some ASCII characters are not part of the Base32 alphabet, so those slip through.

The code to solve this is relatively straightforward, but I want to run it through some performance benchmarks to see the impact before committing to an implementation design.
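
To make the gap concrete, the sketch below lists the ASCII letters and digits that pass an ASCII-only check but have no meaning in Crockford's Base32. It is an illustration only and does not reproduce the library's validation code; `undecodable_alphanumerics` is a hypothetical name, and punctuation is ignored for brevity.

```python
import string

CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
DECODE_ALIASES = "ILO"  # accepted on input, decoded as 1, 1 and 0


def undecodable_alphanumerics() -> set:
    """ASCII letters and digits that Crockford's Base32 cannot decode."""
    accepted = set(CROCKFORD_ALPHABET + DECODE_ALIASES)
    accepted |= {char.lower() for char in accepted}
    candidates = set(string.ascii_letters + string.digits)
    return candidates - accepted


leftover = undecodable_alphanumerics()
print(sorted(leftover))  # ['U', 'u']
print(all(char.isascii() for char in leftover))  # True
```

Among alphanumerics, only `U` and `u` fall through, which is consistent with the `1U` example reported above.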

I've created #60 to track this.

Best,
Andrew

Eddie · Answer 6 · Fri Oct 27 2017 01:30:09 GMT+0800 (China Standard Time)

Thanks Andrew!

Andrew Hawker · Answer 7 · Sun Oct 29 2017 07:59:27 GMT+0800 (China Standard Time)

Eddie,

I merged in a fix tonight and closed #60. Version 0.0.6 on PyPI should contain all of these changes.

Let me know if it works or if you run into any regressions.

Thanks again for reporting this!

Best,
Andrew

Eddie · Answer 8 · Sun Oct 29 2017 18:20:23 GMT+0800 (China Standard Time)

Thanks Andrew,

I just ran a few examples locally, and it now raises a ValueError when it finds a non-Base32 character.

Thanks again for the swift response to this!

Eddie