BUG: Unable to open Stata 118 or 119 files saved in big-endian format that contain strL data
cmjcharlton opened this issue
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
# Both of the following lines fail (the data files are provided in the issue description)
df = pd.read_stata("stata12_be_118.dta")
df = pd.read_stata("stata12_be_119.dta")
Issue Description
If I attempt to open a 118 format file saved in big-endian format that contains strL data I get the following error:
>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_118.dta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1775, in read
    data = self._insert_strls(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1876, in _insert_strls
    data.isetitem(i, [self.GSO[str(k)] for k in data.iloc[:, i]])
                      ~~~~~~~~^^^^^^^^
KeyError: '844424930131969'
The same is true if I repeat this for a 119 format file:
>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_119.dta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1775, in read
    data = self._insert_strls(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1876, in _insert_strls
    data.isetitem(i, [self.GSO[str(k)] for k in data.iloc[:, i]])
                      ~~~~~~~~^^^^^^^^
KeyError: '3298534883329'
The equivalent 117 format file works fine:
>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_117.dta")
>>> df
x y z
0 1.0 abc abcdefghi
1 3.0 cba qwertywertyqwerty
2 93.0 strl
The failure is a lookup miss: the key built from the strL records does not match the key stored in the main data. The reason is as follows.
strL content is stored separately from the main data, in records identified by a (v, o) pair (variable, observation). The main data references these records to associate a particular string value with a position in the data. In format 117, v and o were each stored in 4 bytes, and the value stored in the main data matched the identifier in the strL records exactly.
Format 118 and later widened o to 8 bytes in the strL records, allowing more observations to be held in the data, but did not change the storage size used in the main data for the reference. The main data therefore holds a packed value, in which some of the high bytes of v and o are dropped so that both fit in 8 bytes. Using letters to represent the bytes of v and digits to represent the bytes of o, in little-endian format the (v, o) index:
(ABCD, 12345678)
would be referenced in 118 by:
AB123456
and in 119 by:
ABC12345
In big-endian format:
(DCBA, 87654321)
would be referenced in 118 by:
BA654321
and in 119 by:
CBA54321
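As a rough check of this layout, the two keys from the tracebacks above can be split back into (v, o). This is only a sketch: `split_packed_key_be` is a hypothetical helper, not part of pandas, and it assumes the key was read from the main data as a big-endian 8-byte integer with v in the leading 2 bytes (118) or 3 bytes (119).

```python
def split_packed_key_be(key: int, format_version: int) -> tuple[int, int]:
    """Split a packed big-endian strL reference into (v, o).

    Hypothetical helper for illustration only; assumes v occupies the
    leading 2 bytes (format 118) or 3 bytes (format 119) of the 8-byte key.
    """
    v_bytes = 2 if format_version == 118 else 3
    o_bits = 8 * (8 - v_bytes)  # remaining bytes hold o
    v = key >> o_bits
    o = key & ((1 << o_bits) - 1)
    return v, o


# Keys taken from the two KeyError messages in this report
print(split_packed_key_be(844424930131969, 118))  # (3, 1)
print(split_packed_key_be(3298534883329, 119))    # (3, 1)
```

Under these assumptions both keys decode to variable 3, observation 1, which would be the first value of the third column (z) in the listings. That suggests the keys in the main data are well formed and the mismatch is on the strL record side.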
When looking up values, pandas converts the (v, o) of each strL record into the packed form and treats it as an 8-byte integer, rather than expanding the values in the main data into separate 4-byte and 8-byte integers. The current code branch for little-endian files gives the expected result:
>>> buf = 'ABCD12345678'
>>> v_size = 2 # 118 format
>>> buf[0:v_size] + buf[4 : (12 - v_size)]
'AB123456'
>>> v_size = 3 # 119 format
>>> buf[0:v_size] + buf[4 : (12 - v_size)]
'ABC12345'
However, the big-endian path is incorrect:
>>> buf = 'DCBA87654321'
>>> v_size = 2 # 118 format
>>> buf[0:v_size] + buf[(4 + v_size) :]
'DC654321'
>>> v_size = 3 # 119 format
>>> buf[0:v_size] + buf[(4 + v_size) :]
'DCB54321'
It should instead be:
>>> buf = 'DCBA87654321'
>>> v_size = 2 # 118 format
>>> buf[4 - v_size:4] + buf[(4 + v_size) :]
'BA654321'
>>> v_size = 3 # 119 format
>>> buf[4 - v_size:4] + buf[(4 + v_size) :]
'CBA54321'
Once the packed bytes have been determined, they should be converted to an integer using the file's byteorder, as is already done for the references in the main data.
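Putting the slicing and the byteorder step together, a minimal sketch of the proposed behaviour might look like the following. The function name `packed_strl_key` and its parameters are assumptions made for illustration and are not the pandas API; the input is assumed to be the 12 bytes of a strL record identifier, a 4-byte v followed by an 8-byte o.

```python
import struct


def packed_strl_key(vo_bytes: bytes, byteorder: str, format_version: int) -> int:
    """Build the 8-byte lookup key from a 12-byte (v, o) record identifier.

    Hypothetical helper, not pandas code. `byteorder` is "<" or ">",
    and `format_version` is 118 or 119.
    """
    v_size = 2 if format_version == 118 else 3
    if byteorder == "<":
        # Little-endian: low bytes come first in both v and o
        packed = vo_bytes[0:v_size] + vo_bytes[4 : (12 - v_size)]
    else:
        # Big-endian: low bytes come last in both v and o
        packed = vo_bytes[4 - v_size : 4] + vo_bytes[(4 + v_size) :]
    # Interpret the packed bytes with the file's byteorder
    return struct.unpack(f"{byteorder}Q", packed)[0]


# Big-endian identifier for v=3, o=1
vo_be = struct.pack(">IQ", 3, 1)
print(packed_strl_key(vo_be, ">", 118))  # 844424930131969
print(packed_strl_key(vo_be, ">", 119))  # 3298534883329
```

For v=3, o=1 this reproduces the two keys reported in the KeyError messages, so a lookup table built this way would contain the values that the main data asks for.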
Expected Behavior
I would expect the file to load successfully, as it does in Stata:
. use "stata12_be_118.dta"
. list
+------------------------------+
| x y z |
|------------------------------|
1. | 1 abc abcdefghi |
2. | 3 cba qwertywertyqwerty |
3. | 93 strl |
+------------------------------+
. use "stata12_be_119.dta"
. list
+------------------------------+
| x y z |
|------------------------------|
1. | 1 abc abcdefghi |
2. | 3 cba qwertywertyqwerty |
3. | 93 strl |
+------------------------------+
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.12.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : 2.10.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None