BUG: Unable to open Stata 118 or 119 files saved in big-endian format that contain strL data

cmjcharlton opened this issue a month ago

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
# Both of the following lines fail (the data files are provided in the issue description)
df = pd.read_stata("stata12_be_118.dta")
df = pd.read_stata("stata12_be_119.dta")

Issue Description

If I attempt to open a 118 format file saved in big-endian format that contains strL data I get the following error:

>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_118.dta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1775, in read
    data = self._insert_strls(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1876, in _insert_strls
    data.isetitem(i, [self.GSO[str(k)] for k in data.iloc[:, i]])
                      ~~~~~~~~^^^^^^^^
KeyError: '844424930131969'

The same is true if I repeat this for a 119 format file:

>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_119.dta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1775, in read
    data = self._insert_strls(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1876, in _insert_strls
    data.isetitem(i, [self.GSO[str(k)] for k in data.iloc[:, i]])
                      ~~~~~~~~^^^^^^^^
KeyError: '3298534883329'

The equivalent 117 format file works fine:

>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_117.dta")
>>> df
      x    y                  z
0   1.0  abc          abcdefghi
1   3.0  cba  qwertywertyqwerty
2  93.0                    strl

This happens because the lookup of the strL value fails: the key built from the strL records does not match the key stored in the main data, for the following reason:

strL content is stored separately from the main data, in records identified by a (v, o) pair (variable, observation). The main data references this pair to associate a particular string value with a position in the data. In format 117, v and o were each stored in 4 bytes, so the value stored in the main data matched the strL record exactly. Format 118 and later store o in 8 bytes, allowing more observations to be held in the data, but the size of the reference in the main data was not changed. The reference therefore uses a packed form in which some of the high-order bytes of v and o are dropped so that both values fit in 8 bytes. Using letters to represent the bytes of v and numbers to represent the bytes of o, the (v, o) index:

(ABCD, 12345678)
would be referenced in 118 by:
AB123456
and in 119 by:
ABC12345

In big-endian format:
(DCBA, 87654321)
would be referenced in 118 by:
BA654321
and in 119 by:
CBA54321
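
As a cross-check of this layout, the keys reported in the two tracebacks can be split back into (v, o). This is a minimal sketch, not pandas code: `unpack_strl_key` is a hypothetical helper, and it assumes the 8-byte reference in the main data is read as a single unsigned integer in the file's byte order.

```python
def unpack_strl_key(key: int, v_size: int, byteorder: str) -> tuple[int, int]:
    """Hypothetical helper: split a packed 8-byte strL reference into (v, o).

    v_size is 2 for format 118 and 3 for format 119 (assumption taken
    from the layout described in this issue).
    """
    v_bits = 8 * v_size
    o_bits = 8 * (8 - v_size)
    if byteorder == "big":
        # v occupies the high-order bytes, o the low-order bytes
        return key >> o_bits, key & ((1 << o_bits) - 1)
    # little-endian: v occupies the low-order bytes, o the high-order bytes
    return key & ((1 << v_bits) - 1), key >> v_bits


print(unpack_strl_key(844424930131969, 2, "big"))  # key from the 118 traceback
print(unpack_strl_key(3298534883329, 3, "big"))    # key from the 119 traceback
```

Both calls print (3, 1), i.e. the third variable (z) and the first observation. That suggests the references read from the main data are valid and that it is the keys built from the strL records that fail to match.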

When looking up values, pandas converts (v, o) from the strL records into the packed form and treats it as an 8-byte integer, rather than expanding the values in the data into separate 4- and 8-byte integers. The current code branch for little-endian gives the expected result:

>>> buf = 'ABCD12345678'
>>> v_size = 2 # 118 format
>>> buf[0:v_size] + buf[4 : (12 - v_size)]
'AB123456'
>>> v_size = 3 # 119 format
>>> buf[0:v_size] + buf[4 : (12 - v_size)]
'ABC12345'

However, the big-endian path is incorrect:

>>> buf = 'DCBA87654321'
>>> v_size = 2 # 118 format
>>> buf[0:v_size] + buf[(4 + v_size) :]
'DC654321'
>>> v_size = 3 # 119 format
>>> buf[0:v_size] + buf[(4 + v_size) :]
'DCB54321'

It should instead be:

>>> buf = 'DCBA87654321'
>>> v_size = 2 # 118 format
>>> buf[4 - v_size:4] + buf[(4 + v_size) :]
'BA654321'
>>> v_size = 3 # 119 format
>>> buf[4 - v_size:4] + buf[(4 + v_size) :]
'CBA54321'

Once the packed value has been determined, the file's byte order should be applied when converting it to an integer, just as is done for the main data.
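
The corrected slicing and the byte-order step can be combined into one function. This is a minimal sketch rather than the pandas implementation: `pack_strl_key` is a hypothetical helper, and it uses `struct.pack` to build the 12 raw bytes, assuming a strL record stores a 4-byte v followed by an 8-byte o as described above.

```python
import struct


def pack_strl_key(v: int, o: int, v_size: int, byteorder: str) -> int:
    """Hypothetical helper: packed 8-byte key for a (v, o) strL record.

    v_size is 2 for format 118 and 3 for format 119;
    byteorder is "little" or "big".
    """
    prefix = "<" if byteorder == "little" else ">"
    # 12 raw bytes: 4-byte v followed by 8-byte o, in the file's byte order
    buf = struct.pack(prefix + "IQ", v, o)
    if byteorder == "little":
        # low-order bytes come first: keep the start of v and the start of o
        packed = buf[0:v_size] + buf[4 : 12 - v_size]
    else:
        # low-order bytes come last: keep the end of v and the end of o
        packed = buf[4 - v_size : 4] + buf[4 + v_size :]
    # interpret the packed bytes using the file's byte order
    return int.from_bytes(packed, byteorder)


print(pack_strl_key(3, 1, 2, "big"))  # 844424930131969, the 118 KeyError key
print(pack_strl_key(3, 1, 3, "big"))  # 3298534883329, the 119 KeyError key
```

With v = 3 and o = 1, the big-endian results equal the keys that were missing in the two tracebacks, which is consistent with the proposed slice being the one the main data expects.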

stata12_be.zip

Expected Behavior

I would expect the file to load successfully, as it does in Stata:

. use "stata12_be_118.dta"

. list

     +------------------------------+
     |  x     y                   z |
     |------------------------------|
  1. |  1   abc           abcdefghi |
  2. |  3   cba   qwertywertyqwerty |
  3. | 93                      strl |
     +------------------------------+

. use "stata12_be_119.dta"

. list

     +------------------------------+
     |  x     y                   z |
     |------------------------------|
  1. |  1   abc           abcdefghi |
  2. |  3   cba   qwertywertyqwerty |
  3. | 93                      strl |
     +------------------------------+

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : 2.10.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None