Natooz / MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶

Home Page: https://miditok.readthedocs.io/


Handling Incompatible Files During `split_files_for_training`

Kinyugo opened this issue · comments

I'm using miditok to split my files for training purposes. However, I've encountered some incompatible files that prevent split_files_for_training from functioning correctly.

To address this, I implemented a pre-processing step to filter out files that couldn't be tokenized by miditok. Despite this, some incompatible files are still causing the whole process to fail.

I'd like to request guidance on how to effectively skip these incompatible files during the split_files_for_training stage.
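
For reference, here is roughly how I am calling it; a minimal sketch, assuming a REMI tokenizer and placeholder paths (my actual setup differs):

from pathlib import Path

from miditok import REMI
from miditok.pytorch_data import split_files_for_training

tokenizer = REMI()  # any MidiTok tokenizer; REMI is just an example
midi_paths = list(Path("dataset").glob("**/*.mid"))  # placeholder dataset dir
split_files_for_training(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    save_dir=Path("dataset_chunks"),  # placeholder output dir
    max_seq_len=1024,  # placeholder sequence length
)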

Hi 👋
Thanks for the insight!
Could you provide the files that cause the method to crash? Corrupted files should be skipped when loading them fails, so I assume the issue comes from the code itself. That would allow me to detect and fix what's wrong.

Also, I'll move this method (along with the associated `split_score_per_note_density`, `get_average_num_tokens_per_note` and `split_dataset_to_subsequences` methods) from the `pytorch_data` module to the `utils` module (or maybe a dedicated `split_utils` module) of the lib in the next update (to be released soon), as it doesn't have to rely on PyTorch and should be usable with any DL framework.

Here are some examples.
error_files.tar.gz

Here is the error that I am getting:

  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py", line 122, in split_files_for_training
    score_chunks = split_score_per_note_density(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py", line 210, in split_score_per_note_density
    bar_ticks = get_bars_ticks(score)
                ^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/utils/utils.py", line 669, in get_bars_ticks
    if time_sigs[-1].time != max_tick:
       ~~~~~~~~~^^^^
IndexError

I think I may have found a temporary fix for the above error. Instead of copying the file, I dump the loaded file using the Score.dump_midi method. This seems to ensure that the required metadata is present.

from pathlib import Path

from symusic import Score


def copy_if_valid(src_path: Path, dest_dir: Path, tokenizer) -> None:
    try:
        # Attempt to load and tokenize the MIDI file
        score = Score(src_path)
        tokenizer(score)

        # Copy the file, maintaining the directory structure
        dest_path = dest_dir / src_path.relative_to(src_path.parts[0])
        dest_path.parent.mkdir(parents=True, exist_ok=True)
        # Dump the loaded Score instead of using something like shutil.copy2,
        # so that default metadata is written to the new file
        score.dump_midi(dest_path)
    except Exception as e:
        print(f"Error processing {src_path}: {e}")

However, there are other errors that should also be handled. Here is an example:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[16], line 7
      5 except Exception as e:
      6     print(f"Failed to process {file}: {e}")
----> 7     raise e
      8     break

Cell In[16], line 4
      2 for file in filepaths:
      3     try:
----> 4         split_files_for_training([file], tokenizer, Path(tmpdir), max_seq_len=16384)
      5     except Exception as e:
      6         print(f"Failed to process {file}: {e}")

File ~/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py:94, in split_files_for_training(files_paths, tokenizer, save_dir, max_seq_len, average_num_tokens_per_note, num_overlap_bars, min_seq_len)
     88     return [
     89         path
     90         for path in save_dir.glob("**/*")
     91         if path.suffix in SUPPORTED_MUSIC_FILE_EXTENSIONS
     92     ]
     93 if not average_num_tokens_per_note:
---> 94     average_num_tokens_per_note = get_average_num_tokens_per_note(
     95         tokenizer, files_paths[:MAX_NUM_FILES_NUM_TOKENS_PER_NOTE]
     96     )
     98 # Determine the deepest common subdirectory to replicate file tree
     99 root_dir = get_deepest_common_subdir(files_paths)

File ~/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py:304, in get_average_num_tokens_per_note(tokenizer, files_paths)
    302 if tokenizer.one_token_stream:
    303     num_notes = score.note_num()
--> 304     num_tokens_per_note.append(len(tok_seq) / num_notes)
    305 else:
    306     for track, seq in zip(score.tracks, tok_seq):

ZeroDivisionError: division by zero

Thank you! The issue comes from the fact that these MIDIs do not have default time signatures. These are rare cases, and I assumed symusic would automatically attribute the default 4/4 time signature. MidiTok does this only when tokenizing; here the method didn't handle this case, which it now does in #175. It will be merged soon, and you'll be able to get the fix by installing MidiTok and symusic (needed for both until symusic v0.5.0 is released) from git.
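
Until then, a workaround sketch on the user side; it assumes symusic's TimeSignature(time, numerator, denominator) constructor:

from symusic import Score, TimeSignature

score = Score("file.mid")  # placeholder path
if len(score.time_signatures) == 0:
    # Manually attribute the default 4/4 time signature at tick 0,
    # as the tokenizer would do during tokenization
    score.time_signatures.append(TimeSignature(0, 4, 4))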

I just read your last comment. I think it did work because, when dumping, a default time signature is written to the file too.

Thank you for the report!

(Working on your last error, which occurs when no notes are present.)

Thanks for your time. Will you be including an option to skip erroneous files during splitting?

I prefer not to, as these errors just shouldn't happen and should be fixed/handled by MidiTok. Skipping them would be flying blind :)

That makes sense. Perhaps a warning to the user would be good; then one can inspect the data or remove the files if too many of them trigger warnings.
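
In the meantime, I am catching errors per file myself; a stopgap sketch on my side (midi_paths and tokenizer as defined earlier), not MidiTok behavior:

import warnings
from pathlib import Path

from miditok.pytorch_data import split_files_for_training

save_dir = Path("dataset_chunks")  # placeholder output directory
for midi_path in midi_paths:
    try:
        # Splitting one file at a time keeps a single bad file
        # from aborting the whole run
        split_files_for_training([midi_path], tokenizer, save_dir, max_seq_len=16384)
    except Exception as err:
        warnings.warn(f"Skipping {midi_path}: {err}")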

Ok, I fixed the second issue, which was caused by some methods not handling empty MIDIs (no tracks and/or no notes), in #175.

I'm currently testing locally with a large number of files from the Lakh dataset; I caught some other issues that I'll fix before merging the branch.
Edit: that was a silly error with Octuple and the maximum number of bars to tokenize. Everything should pass now.

The fixes are merged into the main branch!

You can install from git to get them locally:

pip uninstall miditok symusic
pip install git+https://github.com/Yikai-Liao/symusic
pip install git+https://github.com/Natooz/MidiTok

Did the location of split_files_for_training change?

ImportError: cannot import name 'split_files_for_training' from 'miditok.pytorch_data' (/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/__init__.py)

Yes, I moved them to the utils module as mentioned above :)
miditok.utils.split_files_for_training
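
In code, the import simply becomes:

from miditok.utils import split_files_for_training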

It seems that empty tracks are still not handled.

  File "/home/kinyugo/learning/ml/audio_generation/melo_mamba_mmm/melo/scripts/data_preprocessing.py", line 58, in cli_main
    split_files_for_training(
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/utils/split_utils.py", line 123, in split_files_for_training
    score_chunks = split_score_per_note_density(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/utils/split_utils.py", line 221, in split_score_per_note_density
    tpb = num_tokens_per_bar[bi]
          ~~~~~~~~~~~~~~~~~~^^^^
IndexError: list index out of range

Could you provide the file causing the issue?
I tried with empty files (no tracks, and tracks with no notes/controls) without being able to reproduce it.

This one causes a division-by-zero error. I haven't tracked down the other one yet.
error_files_2.tar.gz

Thank you! I tried to reproduce the error, without success. Could you also share the tokenizer configuration you are working with?
In the meantime, I am testing with a larger number of files from the Lakh dataset, hoping to catch erroneous files.

Here it is:

from miditok import MMM, TokenizerConfig


def make_tokenizer() -> MMM:
    tokenizer_config = TokenizerConfig(
        use_tempos=True,
        use_programs=True,
        use_time_signatures=True,
        use_chords=True,
        use_rests=True,
        base_tokenizer="REMI",  # MMM wraps a base tokenizer, REMI here
        special_tokens=["PAD", "BOS", "EOS"],
    )

    return MMM(tokenizer_config)
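
And I invoke the split the same way as in the earlier traceback (with filepaths and tmpdir as shown there):

from pathlib import Path

from miditok.utils import split_files_for_training

tokenizer = make_tokenizer()
split_files_for_training(filepaths, tokenizer, Path(tmpdir), max_seq_len=16384)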

Thank you. I managed to reproduce the error with several files from the Lakh dataset. I'll continue to work on it tomorrow and push the fixes.
Apologies for the inconvenience.

@Kinyugo this time it should work; I tested with multiple combinations and about 40k files without any error. 🙌
You can reinstall it from git.

Thanks. I have tried it and now it works.