sizes incorrectly parsed as powers of two by default, ignoring IEEE 1541

anarcat opened this issue 9 years ago

i believe this is incorrect:

    >>> parse_size('42')
    42
    >>> parse_size('1 KB')
    1024
    >>> parse_size('5 kilobyte')
    5120

It should rather be:

    >>> parse_size('1 KiB')
    1024
    >>> parse_size('5 kibibyte')
    5120
    >>> parse_size('1.5 GiB')
    1610612736
    >>> parse_size('1 KB')
    1000
    >>> parse_size('5 kilobyte')
    5000
    >>> parse_size('1.5 GB')
    1500000000

i know it will complicate parsing because you can't assume that the first letter is enough to tell them apart, but maybe check the first two letters? :)

see https://en.wikipedia.org/wiki/Kibibyte for more details.
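
A minimal sketch of what such parsing could look like, assuming the whole unit name is looked up rather than only its first letter. The function name `parse_size_strict` and the unit tables are hypothetical, written for illustration only, and are not the library's code.

```python
import re

# Hypothetical unit tables (not the library's own): SI units are powers of
# 1000, IEC units are powers of 1024. Keys are lowercase and singular.
DECIMAL_UNITS = {
    "kb": 1000, "kilobyte": 1000,
    "mb": 1000**2, "megabyte": 1000**2,
    "gb": 1000**3, "gigabyte": 1000**3,
}
BINARY_UNITS = {
    "kib": 1024, "kibibyte": 1024,
    "mib": 1024**2, "mebibyte": 1024**2,
    "gib": 1024**3, "gibibyte": 1024**3,
}
UNITS = {"": 1, "b": 1, "byte": 1, **DECIMAL_UNITS, **BINARY_UNITS}


def parse_size_strict(text):
    """Parse a size such as '1.5 GiB' into a whole number of bytes."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError("cannot parse size: %r" % text)
    number, unit = match.groups()
    unit = unit.lower()
    # Accept plural names such as 'kilobytes' by dropping the final 's'.
    if unit.endswith("bytes"):
        unit = unit[:-1]
    if unit not in UNITS:
        raise ValueError("unknown unit: %r" % unit)
    return int(float(number) * UNITS[unit])
```

With these tables, `parse_size_strict('1 KB')` gives 1000 and `parse_size_strict('1 KiB')` gives 1024, because the full unit string decides which table entry applies.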

Peter Odding · Answer 1 · Thu Jun 25 2015 05:18:47 GMT+0800 (China Standard Time)

Hi @anarcat and thanks for the feedback.

You are technically 100% correct, and to be honest I'm not worried about complicating the parsing. However, changing this now would break backwards compatibility, and I'm quite religious about that :-). There's also the question of doing what is technically correct versus the DWIM ("do what I mean") aspect.

The least I can do is clearly document the caveat you've pointed out, but I won't change the implementation as suggested. What I can do is provide a second implementation that provides the parsing you suggest (it could be a keyword argument that has to be given to the parse_size() function to switch it away from backwards compatible mode).

anarcat · Answer 2 · Fri Jun 26 2015 02:06:35 GMT+0800 (China Standard Time)

maybe having a parse_disk_size and parse_memory_size? or parse_size_si and parse_size_iec? with parse_size being an alias to parse_size_iec...
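
One way the split suggested above might be laid out, as a sketch only. It assumes the second name was meant to refer to IEC (binary) units, so it is spelled `parse_size_iec` here. The shared helper `_parse` and the `PREFIXES` string are hypothetical; the helper keeps the first-letter matching of the original behaviour and varies only the base.

```python
import re
from functools import partial

# First letters of kilo, mega, giga, tera, peta, exa (hypothetical table).
PREFIXES = "kmgtpe"


def _parse(text, base):
    """Parse text using only the first letter of the unit and the given base."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError("cannot parse size: %r" % text)
    number, unit = match.groups()
    first = unit[:1].lower()
    # No unit, or a unit starting with another letter (e.g. 'bytes'), means bytes.
    exponent = PREFIXES.index(first) + 1 if first and first in PREFIXES else 0
    return int(float(number) * base ** exponent)


parse_size_si = partial(_parse, base=1000)   # powers of ten
parse_size_iec = partial(_parse, base=1024)  # powers of two
parse_size = parse_size_iec                  # alias keeps the old default
```

Existing callers of `parse_size` would keep getting 1024 for `'1 KB'`, while new code could opt in to `parse_size_si` explicitly.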

Calum Lind · Answer 3 · Tue Apr 12 2016 05:18:11 GMT+0800 (China Standard Time)

I have to say I was discussing with another developer about using this module for our application, but we were astonished to find that it defaults to base 2 while using the SI base 10 prefixes, which makes it unusable for us. Your coding standards and documentation seem quite meticulous, but this is a glaring issue with respect to unit prefix standards, and it continues the confusion for end users.

Manuel Leonhardt · Answer 4 · Thu May 05 2016 04:53:22 GMT+0800 (China Standard Time)

@xolox: I totally agree with providing some sort of backwards compatibility. Let's keep in mind that even Windows does it wrong. Providing a second function to get true binary sizes would be great.

anarcat · Answer 5 · Thu May 05 2016 05:13:25 GMT+0800 (China Standard Time)

On 2016-05-04 16:53:24, Manuel Leonhardt wrote:

> @xolox: I'm totally agreeing on providing some sort of backwards compatibility. Let's keep in mind, that even Windows does it wrong. Providing a second function to get true binary sizes would be great.

"Windows does it wrong" should never be an argument against doing the right thing.

Otherwise everything would always be wrong and we would see no progress.

Manuel Leonhardt · Answer 6 · Thu May 05 2016 05:18:34 GMT+0800 (China Standard Time)

@anarcat: I totally agree. If it's wrong, it's wrong, period. Maybe I should have mentioned that I'm also in favour of correct SI formatting. I was just replying to @xolox's point on DWIM.

Daniel Standage · Answer 7 · Thu Jun 30 2016 07:21:52 GMT+0800 (China Standard Time)

Any chance of this happening? It should be possible to maintain backwards compatibility of the interface with something like this, shouldn't it?

    >>> parse_size('1 KB')
    1024
    >>> parse_size('1 KB', correctly=True)
    1000

David D Lowe · Answer 8 · Thu Jul 14 2016 05:17:10 GMT+0800 (China Standard Time)

I created a pull request making it clearer in the documentation which units are used: #9

Peter Odding · Answer 9 · Thu Sep 29 2016 08:24:25 GMT+0800 (China Standard Time)

Hi everyone, sorry for taking so long to respond, but thanks to everyone who chimed in! Reading through #4, #8 and #9 has convinced me to change format_size() and parse_size() to default to powers of ten instead of powers of two.

The logic in format_size() is almost as simple as it was before, but I've enhanced parse_size() to recognize symbols and names like KB, KiB, kilobyte and kibibyte. Both functions can revert to the old behavior by passing the keyword argument binary=True.
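
To illustrate the formatting side of this change, here is a rough sketch of a formatter with a `binary` switch. It is not the library's implementation: the name `format_size_sketch`, the symbol lists, and the two-decimal rounding are assumptions made for the example.

```python
def format_size_sketch(num_bytes, binary=False):
    """Format a byte count using powers of ten, or powers of two if binary=True."""
    base = 1024 if binary else 1000
    symbols = ["KiB", "MiB", "GiB", "TiB"] if binary else ["KB", "MB", "GB", "TB"]
    # Try the largest unit first so that the biggest fitting unit wins.
    for exponent in range(len(symbols), 0, -1):
        divider = base ** exponent
        if num_bytes >= divider:
            value = round(num_bytes / divider, 2)
            # Drop trailing zeros: '1.50' becomes '1.5' and '1.00' becomes '1'.
            text = ("%.2f" % value).rstrip("0").rstrip(".")
            return "%s %s" % (text, symbols[exponent - 1])
    return "%d %s" % (num_bytes, "byte" if num_bytes == 1 else "bytes")
```

In this sketch `format_size_sketch(1000)` returns `'1 KB'`, while `format_size_sketch(1024, binary=True)` returns `'1 KiB'`, so the same byte count is labelled with the unit family that matches the base used.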

Because these changes are backwards incompatible I bumped the major version number to 2.0. On the other hand I'm not sure what would actually break or how fatal that breakage would be, and the discussion here definitely convinced me that the default needed to change :-).

I hope the new implementation satisfies everyone's concerns! Feedback is welcome.

anarcat · Answer 10 · Thu Sep 29 2016 09:00:29 GMT+0800 (China Standard Time)

thank you so much for doing the right thing! i haven't reviewed the actual implementation in detail, but the logic is what i believe to be the correct way.

thanks again!