ilanschnell / bitarray

efficient arrays of booleans for Python

Constructor fed with iterable bytes starting with a leading 0x00-0x07 silently treats the bytes as a python pickle...

manor opened this issue

The documentation of the bitarray class claims that the initializer can be:

 |  `int`: Create a bitarray of given integer length.  The initial values are
 |  uninitialized.
 |
 |  `str`: Create bitarray from a string of `0` and `1`.
 |
 |  `iterable`: Create bitarray from iterable or sequence of integers 0 or 1.

Since bytes are iterable, it is natural to assume that passing bytes to the initializer is equivalent to a .frombytes() call. However, it appears that if the leading byte is in the range 0x00 through 0x07, the initializer treats the input as a Python pickle. The difference is counter-intuitive and confusing: bytes read from a binary file with a leading 0x00 yield an almost "equivalent" bitarray with exactly one byte missing, which can be hard to spot :-)

from bitarray import bitarray

b = bitarray(b'\x00\x0F', endian="big"); print(b)
# bitarray('00001111')

b = bitarray(endian="big"); b.frombytes(b'\x00\x0F'); print(b)
# bitarray('0000000000001111')

These are even more counter-intuitive:

b = bitarray(b'\x01\x0F', endian="big"); print(b)
# bitarray('0000111')

b = bitarray(b'\x02\x0F', endian="big"); print(b)
# bitarray('000011')

b = bitarray(b'\x03\x0F', endian="big"); print(b)
# bitarray('00001')

b = bitarray(b'\x04\x0F', endian="big"); print(b)
# bitarray('0000')

b = bitarray(b'\x05\x0F', endian="big"); print(b)
# bitarray('000')

b = bitarray(b'\x06\x0F', endian="big"); print(b)
# bitarray('00')

b = bitarray(b'\x07\x0F', endian="big"); print(b)
# bitarray('0')

Thank you for using bitarray and filing this issue! The described behavior is correct, i.e. when a bitarray is initialized with bytes and the leading byte is in the range 0x00 to 0x07, the remaining bytes are treated as the raw buffer (with the leading byte indicating how many unused bits are present).
This behavior is only used to support unpickling, and therefore is not documented.
I agree that this might be confusing as bytes are also iterable, e.g. the iterable b'\1\0\1' is not interpreted as bitarray('101').
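
To make the header-byte rule above concrete, here is a minimal pure-Python sketch that reproduces the big-endian outputs reported in this issue. The helper name `decode_legacy_header` is hypothetical and is not part of bitarray; it only models the described format (one leading byte giving the number of unused bits, followed by the raw buffer) and makes no claim about how the library implements it.

```python
def decode_legacy_header(data: bytes) -> str:
    """Decode `data` as <unused-bit count><raw big-endian buffer>.

    Hypothetical helper that models the behavior described in this issue;
    it is not bitarray's actual implementation.
    """
    if not data or data[0] > 7:
        raise ValueError("leading byte must be in the range 0x00 to 0x07")
    unused, buffer = data[0], data[1:]
    if unused and not buffer:
        raise ValueError("unused bits given for an empty buffer")
    # Expand every buffer byte into 8 bits, most significant bit first.
    bits = "".join(format(byte, "08b") for byte in buffer)
    # Drop the trailing bits that the header byte marks as unused.
    return bits[:len(bits) - unused]


# Same inputs as in the report: leading byte 0x00..0x07, then 0x0F.
for head in range(8):
    print(hex(head), decode_legacy_header(bytes([head, 0x0F])))
```

Under these assumptions, each increment of the leading byte removes one more trailing bit from the decoded buffer, which matches the shrinking outputs listed in the report.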

What would you suggest to fix this issue? Change the documentation?

Hi, another example of how this can surprise:

>>> bitarray(bytearray([0, 1]))
bitarray('01')
>>> bitarray(bytes([0, 1]))
Traceback (most recent call last):
ValueError: endianness missing for pickle

I don't really wish to weigh in on what should be happening, but I don't think that documenting it is the answer. I'm not very familiar with pickle, but I thought you could avoid using __init__ when unpickling? Other options seem messy to me.

@scott-griffiths I agree that it would be best to avoid __init__ in the unpickling process completely. When looking at the Python standard library array module, I see an internal function _array_reconstructor which does the unpickling:

>>> import array
>>> array._array_reconstructor(array.array, 'b', 1, bytes([11, 22, 33]))
array('b', [11, 22, 33])

I will try to use this approach in bitarray as well, while keeping things backwards compatible.

@scott-griffiths I just created #207, which uses a reconstructor function and also explains the history and technical details.

TL;DR: To allow a backwards compatibility transition, this issue (#206) can be closed a year from now!