MIT-LCP / physionet-build

The new PhysioNet platform.

Home Page: https://physionet.org/


mimic-iv-demo on HuggingFace raises DatasetGenerationCastError

tompollard opened this issue

We have talked a little about trying to integrate the HuggingFace platform with PhysioNet (in particular, making it easier for the HuggingFace community to work with PhysioNet datasets).

A while back, Alistair uploaded a copy of the MIMIC-IV demo to: https://huggingface.co/datasets/physionet/mimic-iv-demo. I thought I'd have a quick play around with this.

When attempting to load the dataset using HuggingFace's load_dataset(), I receive a DatasetGenerationCastError:

# Running in Google Colab
!pip install datasets

from datasets import load_dataset

# Attempt to load the full demo dataset from the Hugging Face Hub
mimic = load_dataset('physionet/mimic-iv-demo')

Traceback:

---------------------------------------------------------------------------
CastError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1988                     try:
-> 1989                         writer.write_table(table)
   1990                     except CastError as cast_error:

8 frames
/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py in write_table(self, pa_table, writer_batch_size)
    589         pa_table = pa_table.combine_chunks()
--> 590         pa_table = table_cast(pa_table, self._schema)
    591         if self.embed_local_files:

/usr/local/lib/python3.10/dist-packages/datasets/table.py in table_cast(table, schema)
   2239     if table.schema != schema:
-> 2240         return cast_table_to_schema(table, schema)
   2241     elif table.schema.metadata != schema.metadata:

/usr/local/lib/python3.10/dist-packages/datasets/table.py in cast_table_to_schema(table, schema)
   2193     if sorted(table.column_names) != sorted(features):
-> 2194         raise CastError(
   2195             f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match",

CastError: Couldn't cast
subject_id: int64
hadm_id: int64
admittime: string
dischtime: string
deathtime: string
admission_type: string
admit_provider_id: string
admission_location: string
discharge_location: string
insurance: string
language: string
marital_status: string
race: string
edregtime: string
edouttime: string
hospital_expire_flag: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 2220
to
{'subject_id': Value(dtype='int64', id=None)}
because column names don't match

During handling of the above exception, another exception occurred:

DatasetGenerationCastError                Traceback (most recent call last)
<ipython-input-15-0345be2aa2fc> in <cell line: 1>()
----> 1 mimic = load_dataset('physionet/mimic-iv-demo')

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2572 
   2573     # Download and prepare data
-> 2574     builder_instance.download_and_prepare(
   2575         download_config=download_config,
   2576         download_mode=download_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
   1003                         if num_proc is not None:
   1004                             prepare_split_kwargs["num_proc"] = num_proc
-> 1005                         self._download_and_prepare(
   1006                             dl_manager=dl_manager,
   1007                             verification_mode=verification_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1098             try:
   1099                 # Prepare split will record examples associated to the split
-> 1100                 self._prepare_split(split_generator, **prepare_split_kwargs)
   1101             except OSError as e:
   1102                 raise OSError(

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1858             job_id = 0
   1859             with pbar:
-> 1860                 for job_id, done, content in self._prepare_split_single(
   1861                     gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1862                 ):

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1989                         writer.write_table(table)
   1990                     except CastError as cast_error:
-> 1991                         raise DatasetGenerationCastError.from_cast_error(
   1992                             cast_error=cast_error,
   1993                             builder_name=self.info.builder_name,

DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 15 new columns ({'admission_location', 'race', 'admittime', 'dischtime', 'hadm_id', 'language', 'discharge_location', 'admission_type', 'edregtime', 'edouttime', 'admit_provider_id', 'marital_status', 'insurance', 'hospital_expire_flag', 'deathtime'})

This happened while the csv dataset builder was generating data using

/root/.cache/huggingface/datasets/downloads/5a3898fd1af7dd22d0359508d82978ba6c36a780c8aba0b1b15a9437a90adedc

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
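The error itself is reasonable: the csv builder tries to cast every CSV in the repo to a single schema, but the MIMIC-IV demo is a collection of tables with different columns. One possible workaround is to load a single table at a time via data_files; a sketch below, where the relative path is an assumption based on the standard MIMIC-IV demo layout:

from datasets import load_dataset

# Restrict loading to one table so all files share a single schema.
# The path 'hosp/admissions.csv.gz' is an assumption based on the
# standard MIMIC-IV demo layout; adjust to match the actual repo.
admissions = load_dataset(
    'physionet/mimic-iv-demo',
    data_files='hosp/admissions.csv.gz',
)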

> A while back, Alistair uploaded a copy of the MIMIC-IV demo to: https://huggingface.co/datasets/physionet/mimic-iv-demo

Please, please, please don't use unversioned URLs :(

> Please, please, please don't use unversioned URLs :(

Yeah, that's a good point. I think this was really intended as a trial run. Clearly lots more thought is needed about how to integrate the PhysioNet and HuggingFace platforms in a sensible way.
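One small step in that direction: load_dataset can pin a specific git revision of a Hub repo, so if the dataset repo were tagged per release, users could load a fixed version (the '2.2' tag below is hypothetical):

from datasets import load_dataset

# 'revision' pins a specific tag/branch/commit of the Hub repo, so the
# data can't silently change underneath you. The '2.2' tag is
# hypothetical; the repo would need to be tagged for this to work.
mimic = load_dataset('physionet/mimic-iv-demo', revision='2.2')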

Without spending too much time thinking about this, as a start I like the idea of:

  1. Adding data/model loader scripts to a new PhysioNet Python package (a rough sketch follows this list).
  2. Providing guidance on how to use these scripts on HuggingFace (or, even better, incorporating them into HuggingFace tools).
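To make point 1 concrete, here is a minimal sketch of what such a loader might look like; the function and package layout are hypothetical, but the versioned URL pattern matches how PhysioNet serves open-access project files:

import pandas as pd

# Hypothetical helper for a future PhysioNet package: require an
# explicit project version so downloads stay reproducible.
PHYSIONET_FILES = 'https://physionet.org/files'

def load_table(project: str, version: str, path: str) -> pd.DataFrame:
    """Load one CSV table from a versioned PhysioNet project URL."""
    url = f'{PHYSIONET_FILES}/{project}/{version}/{path}'
    return pd.read_csv(url)  # compression inferred from .gz extension

# Example: the open-access MIMIC-IV demo, pinned to version 2.2
admissions = load_table('mimic-iv-demo', '2.2', 'hosp/admissions.csv.gz')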

Side note: I have just switched the https://huggingface.co/datasets/physionet/mimic-iv-demo dataset to "Private", which I think means that anyone who isn't part of the project will get a 404. @bemoody, if you have an account on HuggingFace, let me know and I'll add you to the PhysioNet project.
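For project members, the private repo should still load after authenticating with the Hub; a sketch, assuming a recent version of the datasets library:

from datasets import load_dataset

# After `huggingface-cli login` (or setting the HF_TOKEN environment
# variable), project members can still load the now-private repo;
# everyone else will get a 404.
mimic = load_dataset('physionet/mimic-iv-demo', token=True)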