hf-lin / ChatMusician

Issue with data_preprocess.py on macOS; filename not correct

petergreis opened this issue

Greetings

When trying to run the example from the README:

python model/train/data_preprocess.py -t m-a-p/ChatMusician-Base -i m-a-p/MusicPile-sft -o datasets --tokenize_fn sft

The script crashes out with the following:

Traceback (most recent call last):
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/data_preprocess.py", line 110, in <module>
    main(args)
  File "/Users/petergreis/Library/CloudStorage/Dropbox/Leeds/Project/ChatMusician/model/train/data_preprocess.py", line 69, in main
    raw_dataset = load_dataset(args.input_file, cache_dir=tmp_cache_dir, keep_in_memory=False, encoding="utf8")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/load.py", line 2265, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/petergreis/anaconda3/envs/ml/lib/python3.11/site-packages/datasets/builder.py", line 613, in _create_builder_config
    raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.")
ValueError: BuilderConfig ParquetConfig(name='default', version=0.0.0, data_dir=None, data_files={'train': ['data/train-*']}, description=None, batch_size=None, columns=None, features=None) doesn't have a 'encoding' key.

I have chased this down to this line:

filename = '.'.join(args.input_file.split("/")[-1].split(".")[:-1])

Under macOS this yields an empty string, which causes the problem. The value is derived from the argument "m-a-p/MusicPile-sft" in the original call, so which part of that argument is intended for use? Also, since this appears to be used to set the cache directory, should it not respect the environment variable HF_DATASETS_CACHE if it is set?
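
For reference, the behaviour of that expression can be reproduced in isolation. The sketch below wraps the quoted line in a function and compares it with one possible alternative; the names `dataset_stem` and `dataset_stem_safe` are hypothetical and are not taken from the repository, and the alternative is only an assumption about what a fix could look like, not the change that was actually made.

```python
def dataset_stem(input_file: str) -> str:
    """The expression quoted above, wrapped in a function for testing."""
    return ".".join(input_file.split("/")[-1].split(".")[:-1])


def dataset_stem_safe(input_file: str) -> str:
    """Hypothetical alternative: drop the extension only when one exists."""
    last = input_file.rstrip("/").split("/")[-1]
    stem, dot, _ext = last.rpartition(".")
    # With no dot in the last path component, keep the component unchanged.
    return stem if dot and stem else last


if __name__ == "__main__":
    for name in ("m-a-p/MusicPile-sft", "data/train.json"):
        print(name, repr(dataset_stem(name)), repr(dataset_stem_safe(name)))
```

The original expression returns an empty string for any input whose last path component contains no dot, which is the case for a Hub dataset ID such as "m-a-p/MusicPile-sft".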

I've updated the script. You can pull the change or modify the filename manually. This variable only holds the name of the dataset you are using.
And yes, the cache_dir argument of load_dataset() takes priority over HF_DATASETS_CACHE.
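
The precedence described here can be illustrated with a small stand-alone sketch. `resolve_cache_dir` is a hypothetical helper written for this thread, not a function from the datasets library, and the fallback path is an assumption about the usual default location.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(
    cache_dir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: an explicit argument wins over the environment."""
    env = os.environ if env is None else env
    if cache_dir:
        # An explicitly passed cache_dir takes priority.
        return cache_dir
    if env.get("HF_DATASETS_CACHE"):
        # Otherwise fall back to the environment variable, if set.
        return env["HF_DATASETS_CACHE"]
    # Assumed default location; the real default may differ per installation.
    return str(Path.home() / ".cache" / "huggingface" / "datasets")


if __name__ == "__main__":
    print(resolve_cache_dir("datasets/tmp_cache", {"HF_DATASETS_CACHE": "/data/hf"}))
    print(resolve_cache_dir(None, {"HF_DATASETS_CACHE": "/data/hf"}))
```

Under this model, a script that should respect HF_DATASETS_CACHE would need to avoid passing its own cache_dir whenever the variable is set.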

Confirmed that this fixes the filename issue.