NREL / buildstockbatch


Schedules are no longer being processed into timeseries output

nmerket opened this issue

Describe the bug

The timeseries output isn't including the schedules even though it should. I've tracked it down to this line of code:

if file.endswith('schedules.csv'):

The schedule file names have now changed and look like schedules20230221-10641-yoga5e.csv, so that check no longer picks them up.
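A sketch of what a relaxed check could look like; the pattern is my assumption about the new naming scheme, not necessarily the fix that ultimately landed:

import re

# Match both the old name (schedules.csv) and the new timestamped
# names like schedules20230221-10641-yoga5e.csv. The exact pattern
# here is an assumption about the naming scheme.
SCHEDULES_RE = re.compile(r'schedules.*\.csv$')

def is_schedules_file(filename):
    return bool(SCHEDULES_RE.search(filename))

When I changed that line to accept the new schedule file name format, I got schema inconsistencies in the output parquet files during the combining step in postprocessing: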

INFO:2023-02-23 13:19:13:buildstockbatch.postprocessing:Gathering all the parquet files in /Users/nmerket/projects/resstock/resstock/project_national/national_upgrades/parquet/timeseries/up*/*.parquet
INFO:2023-02-23 13:19:13:buildstockbatch.postprocessing:Gathered 14 files. Now writing _metadata
2023-02-23 13:19:13,814 - distributed.worker - WARNING - Compute Failed
Key:       gen-metadata-2ffd34d5b9f25259ded750abaa6a4533
Function:  aggregate_metadata
args:      ([<pyarrow._parquet.FileMetaData object at 0x11abafbf0>
  created_by: parquet-cpp-arrow version 11.0.0
  num_columns: 66
  num_rows: 2864520
  num_row_groups: 2
  format_version: 2.6
  serialized_size: 46791, <pyarrow._parquet.FileMetaData object at 0x118352ac0>
  created_by: parquet-cpp-arrow version 11.0.0
  num_columns: 66
  num_rows: 1427880
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 45075, <pyarrow._parquet.FileMetaData object at 0x11abae610>
  created_by: parquet-cpp-arrow version 11.0.0
  num_columns: 66
  num_rows: 1427880
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 46617, <pyarrow._parquet.FileMetaData object at 0x11a54f240>
  created_by: parquet-cpp-arrow version 11.0.0
  num_columns: 66
  num_rows: 1419120
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 45034, <pyarrow._parquet.FileMetaData object at 0x11abaf420>
  created_by: parquet-cpp-arrow version 11.0.0
  num_columns: 66
  num_rows: 1410360
  num_row_groups: 1
  format_ve
kwargs:    {}
Exception: 'RuntimeError(\'Schemas are inconsistent, try using `to_parquet(..., schema="infer")`, or pass an explicit pyarrow schema. Such as `to_parquet(..., schema={"column1": pa.string()})`\')'

Traceback (most recent call last):
  File "/Users/nmerket/mambaforge/envs/buildstock/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py", line 76, in _append_row_groups
    metadata.append_row_groups(md)
  ^^^^^^^^^^^^^^^^^
  File "pyarrow/_parquet.pyx", line 799, in pyarrow._parquet.FileMetaData.append_row_groups
RuntimeError: AppendRowGroups requires equal schemas.
The two columns with index 64 differ.
column descriptor = {
  name: schedules_vacancy,
  path: schedules_vacancy,
  physical_type: DOUBLE,
  converted_type: NONE,
  logical_type: None,
  max_definition_level: 1,
  max_repetition_level: 0,
}
column descriptor = {
  name: schedules_vacancy,
  path: schedules_vacancy,
  physical_type: INT64,
  converted_type: NONE,
  logical_type: None,
  max_definition_level: 1,
  max_repetition_level: 0,
}
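Per the suggestion in the error message, passing an explicit pyarrow schema (or casting the column before writing) should make every file agree. A minimal sketch, assuming schedules_vacancy is the only mismatched column and using toy data in place of the real timeseries:

import dask.dataframe as dd
import pandas as pd
import pyarrow as pa

# In the failing run, schedules_vacancy came out INT64 in some files
# and DOUBLE in others. Casting to float64 and pinning the type in an
# explicit schema forces a consistent parquet schema across files.
ddf = dd.from_pandas(
    pd.DataFrame({'schedules_vacancy': [0, 1, 0]}), npartitions=2
)
ddf = ddf.astype({'schedules_vacancy': 'float64'})
ddf.to_parquet(
    'timeseries_out/',
    engine='pyarrow',
    schema={'schedules_vacancy': pa.float64()},
)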

@rajeee It seems I was not using any partitioning on those outputs.
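For context, a hypothetical illustration of what hive-style partitioning with dask would look like; the column name is made up, and per the comment above the failing outputs were written without any partition_on:

import dask.dataframe as dd
import pandas as pd

# partition_on writes one directory per value of the column,
# e.g. upgrade=0/ and upgrade=1/, instead of a flat set of files.
ddf = dd.from_pandas(
    pd.DataFrame({'upgrade': [0, 0, 1], 'kwh': [1.0, 2.0, 3.0]}),
    npartitions=2,
)
ddf.to_parquet('timeseries_out/', engine='pyarrow', partition_on=['upgrade'])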

Addressed by #355