CI issues with newer version of pandas and existing parquet files in repo

Question

nmerket opened this issue a year ago

Describe the bug

The CI is returning errors on all runs of this test for Python > 3.8. The old parquet files store the column with an object dtype, while the newer ones have a string[python] dtype.

To Reproduce
Steps to reproduce the behavior:

Happens on any CI run.

Expected behavior

Tests pass

Logs

From the CI logs:

        # results parquet
        test_pq = pd.read_parquet(os.path.join(test_path, 'baseline', 'results_up00.parquet')).sort_values('building_id')\
            .reset_index().drop(columns=['index'])
        reference_pq = pd.read_parquet(os.path.join(reference_path, 'baseline', 'results_up00.parquet'))\
            .sort_values('building_id').reset_index().drop(columns=['index'])
>       pd.testing.assert_frame_equal(test_pq, reference_pq)
E       AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="completed_status") are different
E       
E       Attribute "dtype" are different
E       [left]:  string[python]
E       [right]: object
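
The mismatch in the log above can be reproduced in memory, without any parquet files. This is only a minimal sketch: the two dataframes and their values are hypothetical stand-ins for the reference and generated results, and only the column names `building_id` and `completed_status` come from the log.

```python
import pandas as pd

# Hypothetical stand-in for the reference file: strings stored as object dtype.
reference = pd.DataFrame({
    "building_id": [1, 2],
    "completed_status": pd.Series(["Success", "Fail"], dtype=object),
})

# Hypothetical stand-in for the newly generated file: same values, string dtype.
generated = reference.astype({"completed_status": "string[python]"})

try:
    pd.testing.assert_frame_equal(generated, reference)
    dtype_mismatch = False
except AssertionError as err:
    dtype_mismatch = True
    print(err)  # reports that the "dtype" attribute differs

# Casting back to object shows that only the dtype differs, not the values.
same_values = generated.astype({"completed_status": object}).equals(reference)
```

Here the comparison fails on the dtype check alone, which matches what the CI reports for `completed_status`.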

Platform (please complete the following information):

Simulation platform: ubuntu-latest on GitHub Actions
BuildStockBatch version, branch, or sha: develop
resstock or comstock repo version, branch, or sha: develop

Additional context

Two ideas for how to address this:

(easy but could break again) Open the testing parquet files in the repo with a newer version of pandas, convert the affected columns to string, and save them back. This should solve the error, but something like this may happen again.
(harder but more maintainable in the long run) Change these tests so that, instead of comparing an expected parquet file to a generated one, they use the newer integration test framework, where you can actually run ResStock and generate results. The tests would then check that those results have the expected columns and such, without comparing two dataframes directly.
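
The two ideas above could look roughly like the sketch below. It is not the BuildStockBatch implementation: the helper names `convert_columns_to_string`, `resave_with_string_columns`, and `check_results_structure` are hypothetical, the target dtype is assumed to be the `string[python]` shown in the log, and the duplicate `building_id` check is just one example of a structural check.

```python
import pandas as pd


def convert_columns_to_string(df, columns, dtype="string[python]"):
    """Return a copy of df with the given columns cast to a pandas string dtype."""
    return df.astype({col: dtype for col in columns})


def resave_with_string_columns(path, columns):
    """Idea 1: rewrite a reference parquet file in place with string columns.

    Needs a parquet engine (for example pyarrow) to be installed.
    """
    df = pd.read_parquet(path)
    convert_columns_to_string(df, columns).to_parquet(path)


def check_results_structure(df, required_columns):
    """Idea 2: check structure instead of comparing two dataframes.

    Returns a list of problems; an empty list means every check passed.
    """
    problems = []
    missing = sorted(set(required_columns) - set(df.columns))
    if missing:
        problems.append(f"missing columns: {missing}")
    # Example structural check (an assumption): each building appears once.
    if "building_id" in df.columns and df["building_id"].duplicated().any():
        problems.append("duplicate building_id values")
    return problems
```

For the first idea, `resave_with_string_columns` would be run once per reference file, with the file path and `["completed_status"]` as arguments. For the second, a test would assert that `check_results_structure` returns an empty list for the results of an actual ResStock run, so a future dtype change in pandas would not break it.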