Natooz / MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶

Home Page: https://miditok.readthedocs.io/


Handling Incompatible Files During `split_files_for_training`

Kinyugo opened this issue · comments

I'm using miditok to split my files for training purposes. However, I've encountered some incompatible files that prevent split_files_for_training from functioning correctly.

To address this, I implemented a pre-processing step to filter out files that couldn't be tokenized by miditok. Despite this, some incompatible files are still causing the whole process to fail.

I'd like to request guidance on how to effectively skip these incompatible files during the split_files_for_training stage.
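
For reference, here is roughly how I am calling it; a minimal sketch, assuming a REMI tokenizer and placeholder paths (my actual setup differs):

from pathlib import Path

from miditok import REMI
from miditok.pytorch_data import split_files_for_training

tokenizer = REMI()  # any MidiTok tokenizer; REMI is just an example
midi_paths = list(Path("dataset").glob("**/*.mid"))  # placeholder dataset dir
split_files_for_training(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    save_dir=Path("dataset_chunks"),  # placeholder output dir
    max_seq_len=1024,  # placeholder sequence length
)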

Hi 👋
Thanks for the insight!
Could you provide the files that cause the method to crash? Corrupted files should be skipped when loading them fails, so I assume the issue comes from the code itself. That would allow me to detect and fix what's wrong.

Also, I'll move this method (along with the associated `split_score_per_note_density`, `get_average_num_tokens_per_note` and `split_dataset_to_subsequences` methods) from the `pytorch_data` module to the `utils` module (or maybe a dedicated `split_utils` module) of the lib in the next update (to be released soon), as it doesn't have to rely on PyTorch and should be usable with any DL framework.

Here are some examples.
error_files.tar.gz

Here is the error that I am getting:

  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py", line 122, in split_files_for_training
    score_chunks = split_score_per_note_density(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py", line 210, in split_score_per_note_density
    bar_ticks = get_bars_ticks(score)
                ^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/utils/utils.py", line 669, in get_bars_ticks
    if time_sigs[-1].time != max_tick:
       ~~~~~~~~~^^^^
IndexError

I think I may have found a temporary fix for the above error. Instead of copying the file, I dump the loaded file using the Score.dump_midi method. This seems to ensure that the required metadata is present.

from pathlib import Path

from symusic import Score


def copy_if_valid(src_path: Path, dest_dir: Path, tokenizer) -> None:
    try:
        # Attempt to load and tokenize the MIDI file
        score = Score(src_path)
        tokenizer(score)

        # Copy the file, maintaining the directory structure
        dest_path = dest_dir / src_path.relative_to(src_path.parts[0])
        dest_path.parent.mkdir(parents=True, exist_ok=True)
        # Dump the loaded Score instead of using something like shutil.copy2,
        # so that default metadata is written to the new file
        score.dump_midi(dest_path)
    except Exception as e:
        print(f"Error processing {src_path}: {e}")

However, there are other errors that should also be handled. Here is an example:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[16], line 7
      5 except Exception as e:
      6     print(f"Failed to process {file}: {e}")
----> 7     raise e
      8     break

Cell In[16], line 4
      2 for file in filepaths:
      3     try:
----> 4         split_files_for_training([file], tokenizer, Path(tmpdir), max_seq_len=16384)
      5     except Exception as e:
      6         print(f"Failed to process {file}: {e}")

File ~/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py:94, in split_files_for_training(files_paths, tokenizer, save_dir, max_seq_len, average_num_tokens_per_note, num_overlap_bars, min_seq_len)
     88     return [
     89         path
     90         for path in save_dir.glob("**/*")
     91         if path.suffix in SUPPORTED_MUSIC_FILE_EXTENSIONS
     92     ]
     93 if not average_num_tokens_per_note:
---> 94     average_num_tokens_per_note = get_average_num_tokens_per_note(
     95         tokenizer, files_paths[:MAX_NUM_FILES_NUM_TOKENS_PER_NOTE]
     96     )
     98 # Determine the deepest common subdirectory to replicate file tree
     99 root_dir = get_deepest_common_subdir(files_paths)

File ~/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py:304, in get_average_num_tokens_per_note(tokenizer, files_paths)
    302 if tokenizer.one_token_stream:
    303     num_notes = score.note_num()
--> 304     num_tokens_per_note.append(len(tok_seq) / num_notes)
    305 else:
    306     for track, seq in zip(score.tracks, tok_seq):

ZeroDivisionError: division by zero

Thank you! The issue comes from the fact that these MIDIs do not have default time signatures. These are rare cases, and I assumed symusic would automatically attribute the default 4/4 time signature. MidiTok does this only when tokenizing; here the method didn't handle this case, which it now does in #175. It will be merged soon, and you'll be able to get the fix by installing MidiTok and symusic (needed for both until symusic v0.5.0 is released) from git.
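
Until then, a workaround sketch on the user side; it assumes symusic's TimeSignature(time, numerator, denominator) constructor:

from symusic import Score, TimeSignature

score = Score("file.mid")  # placeholder path
if len(score.time_signatures) == 0:
    # Manually attribute the default 4/4 time signature at tick 0,
    # as the tokenizer would do during tokenization
    score.time_signatures.append(TimeSignature(0, 4, 4))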

I just read your last comment. I think it did work because, when dumping, a default time signature is written to the file too.

Thank you for the report!

(Working on your last error, which occurs when no notes are present.)

Thanks for your time. Will you be including an option to skip erroneous files during splitting?

I prefer not to, as these errors just shouldn't happen and should be fixed/handled by MidiTok. Skipping them would be flying blind :)

That makes sense. Perhaps a warning to the user would be good; then one can inspect the data or remove the files if too many of them trigger warnings.
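
In the meantime, I am catching errors per file myself; a stopgap sketch on my side (midi_paths and tokenizer as defined earlier), not MidiTok behavior:

import warnings
from pathlib import Path

from miditok.pytorch_data import split_files_for_training

save_dir = Path("dataset_chunks")  # placeholder output directory
for midi_path in midi_paths:
    try:
        # Splitting one file at a time keeps a single bad file
        # from aborting the whole run
        split_files_for_training([midi_path], tokenizer, save_dir, max_seq_len=16384)
    except Exception as err:
        warnings.warn(f"Skipping {midi_path}: {err}")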

Ok, I fixed the second issue, which was caused by some methods not handling empty MIDIs (no tracks and/or no notes), in #175.

I'm currently testing locally with a large number of files from the Lakh dataset; I caught some other issues that I'll fix before merging the branch.
Edit: that was a silly error with Octuple and the maximum number of bars to tokenize. Everything should pass now.

The fixes are merged into the main branch!

You can install from git to get them locally:

pip uninstall miditok symusic
pip install git+https://github.com/Yikai-Liao/symusic
pip install git+https://github.com/Natooz/MidiTok

Did the location of split_files_for_training change?

ImportError: cannot import name 'split_files_for_training' from 'miditok.pytorch_data' (/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/__init__.py)

Yes, I moved them to the utils module as mentioned above :)
miditok.utils.split_files_for_training
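
In code, the import simply becomes:

from miditok.utils import split_files_for_training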

It seems that empty tracks are still not handled.

  File "/home/kinyugo/learning/ml/audio_generation/melo_mamba_mmm/melo/scripts/data_preprocessing.py", line 58, in cli_main
    split_files_for_training(
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/utils/split_utils.py", line 123, in split_files_for_training
    score_chunks = split_score_per_note_density(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/utils/split_utils.py", line 221, in split_score_per_note_density
    tpb = num_tokens_per_bar[bi]
          ~~~~~~~~~~~~~~~~~~^^^^
IndexError: list index out of range

Could you provide the file causing the issue?
I tried with empty files (no tracks, and tracks with no notes/controls) without being able to reproduce it.

This one causes a division-by-zero error. I haven't tracked down the other one yet.
error_files_2.tar.gz

Thank you! I tried to reproduce the error, without success. Could you also share the tokenizer configuration you are working with?
In the meantime, I am testing with a larger number of files from the Lakh dataset, hoping to catch erroneous files.

Here it is:

from miditok import MMM, TokenizerConfig


def make_tokenizer() -> MMM:
    tokenizer_config = TokenizerConfig(
        use_tempos=True,
        use_programs=True,
        use_time_signatures=True,
        use_chords=True,
        use_rests=True,
        base_tokenizer="REMI",  # MMM wraps a base tokenizer, REMI here
        special_tokens=["PAD", "BOS", "EOS"],
    )

    return MMM(tokenizer_config)
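
And I invoke the split the same way as in the earlier traceback (with filepaths and tmpdir as shown there):

from pathlib import Path

from miditok.utils import split_files_for_training

tokenizer = make_tokenizer()
split_files_for_training(filepaths, tokenizer, Path(tmpdir), max_seq_len=16384)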

Thank you. I managed to reproduce the error with several files from the Lakh dataset. I'll continue to work on it tomorrow and push the fixes.
Apologies for the inconvenience.

@Kinyugo this time it should work; I tested with multiple combinations and about 40k files without any error. 🙌
You can reinstall it from git.

Thanks. I have tried it and now it works.