Natooz / MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶

Home Page: https://miditok.readthedocs.io/

ValueError: chr() arg not in range(0x110000)

juancopi81 opened this issue · comments

Hi,

I am trying to tokenize the LMD clean dataset, but it's giving me an error message.

This is my code:

from pathlib import Path

from miditok import REMIPlus
# Our parameters
pitch_range = range(21, 109)
beat_res = {(0, 4): 16, (4, 12): 4}
nb_velocities = 32

special_tokens = ["PAD", "BOS", "EOS", "MASK"]

# Create the tokenizer to convert MIDIs to tokens
tokens_path = Path('/content/drive/MyDrive/8 Beats/AI/Datasets/Symbolic/lmd/miditok/no_bpe')
tokenizer = REMIPlus(pitch_range, beat_res, nb_velocities, special_tokens=special_tokens)
midi_dir = Path('/content/drive/MyDrive/8 Beats/AI/Datasets/Symbolic/lmd/clean_midi')
midi_paths = list(midi_dir.glob('**/*.mid')) + list(midi_dir.glob('**/*.midi'))
tokenizer.tokenize_midi_dataset(midi_paths, tokens_path, validation_fn=midi_valid)  # midi_valid: user-defined validation function

# Learn and apply BPE to data we just tokenized
tokens_bpe_path = Path('/content/drive/MyDrive/8 Beats/AI/Datasets/Symbolic/lmd/miditok/bpe')
tokens_bpe_path.mkdir(exist_ok=True, parents=True)
tokenizer.learn_bpe(
    vocab_size=1000,
    tokens_paths=list(tokens_path.glob("**/*.json")),
    start_from_empty_voc=False,
)
tokenizer.apply_bpe_to_dataset(
    tokens_path,
    tokens_bpe_path,
)

And this is the error I get:

/usr/local/lib/python3.10/dist-packages/miditok/midi_tokenizer.py in tokenize_midi_dataset(self, midi_paths, out_dir, validation_fn, data_augment_offsets, apply_bpe, save_programs, logging)
   1580 
   1581             # Converting the MIDI to tokens and saving them as json
-> 1582             tokens = self(
   1583                 midi, apply_bpe_if_possible=False
   1584             )  # BPE will be applied after if ordered

/usr/local/lib/python3.10/dist-packages/miditok/midi_tokenizer.py in __call__(self, obj, *args, **kwargs)
   1831         # Tokenize MIDI
   1832         if isinstance(obj, MidiFile):
-> 1833             return self.midi_to_tokens(obj, *args, **kwargs)
   1834 
   1835         # Loads a file (.mid or .json)

/usr/local/lib/python3.10/dist-packages/miditok/midi_tokenizer.py in midi_to_tokens(self, midi, apply_bpe_if_possible, *args, **kwargs)
    475         }
    476 
--> 477         tokens = self._midi_to_tokens(midi, *args, **kwargs)
    478 
    479         if apply_bpe_if_possible and self.has_bpe:

/usr/local/lib/python3.10/dist-packages/miditok/tokenizations/remi_plus.py in _midi_to_tokens(self, midi, *args, **kwargs)
    350         """
    351         # Convert each track to tokens
--> 352         events = self.__notes_to_events(midi.instruments)
    353         tok_sequence = TokSequence(events=cast(List[Union[Event, List[Event]]], events))
    354         self.complete_sequence(tok_sequence)

/usr/local/lib/python3.10/dist-packages/miditok/tokenizations/remi_plus.py in __notes_to_events(self, tracks)
    129             if self.max_bar_embedding < nb_bars:
    130                 for i in range(self.max_bar_embedding, nb_bars):
--> 131                     self.add_to_vocab(f"Bar_{i}")
    132                 self.max_bar_embedding = nb_bars
    133         current_bar = -1

/usr/local/lib/python3.10/dist-packages/miditok/midi_tokenizer.py in add_to_vocab(self, token, vocab_idx, byte_, add_to_bpe_model)
    768             # For BPE
    769             if byte_ is None:
--> 770                 byte_ = chr(id_ + CHR_ID_START)
    771             self._vocab_base_id_to_byte[
    772                 id_

ValueError: chr() arg not in range(0x110000)

Thanks again for the great library!

Hi,

This error happens because the tokenizer is trying to add a new character / token to the vocabulary. This happens for Bar tokens, when the tokenizer processes a MIDI whose overall duration is longer than the number of bars covered by the vocabulary.
But here 0x110000 (1,114,112 in decimal, the first code point that chr() rejects) is abnormally high: reaching it would take a MIDI roughly a million bars long. I think the MIDI file might be corrupted. I suggest you preprocess the MIDI files and cut each of them into chunks of $N$ bars.
Alternatively, you can set max_bar_embedding=False when creating the tokenizer to get rid of this issue.
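
To illustrate the second option, here is a minimal sketch, assuming your MidiTok version exposes max_bar_embedding as a REMIPlus constructor argument (the other parameters are the ones from your snippet):

from miditok import REMIPlus

pitch_range = range(21, 109)
beat_res = {(0, 4): 16, (4, 12): 4}
nb_velocities = 32
special_tokens = ["PAD", "BOS", "EOS", "MASK"]

# Without per-bar embeddings, the vocabulary no longer grows with the
# length of the MIDI, so the byte ids stay below the Unicode ceiling
# (chr() only accepts code points up to 0x10FFFF)
tokenizer = REMIPlus(
    pitch_range,
    beat_res,
    nb_velocities,
    special_tokens=special_tokens,
    max_bar_embedding=False,
)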

Hi @Natooz! As always, thanks for your quick response. Sounds great, I'll try what you suggested. Would you know if there's a ready-to-use tool for creating the chunks? Otherwise, I'll write the code for this myself.

Thanks again.

Yes, you can use this snippet I wrote some time ago:

from pathlib import Path
from copy import deepcopy
from math import ceil

from miditoolkit import MidiFile
from tqdm import tqdm


MAX_NB_BAR = 20
MIN_NB_NOTES = 20
dataset = "MMD"

merged_out_dir = Path("data", f"{dataset}-chunked")
merged_out_dir.mkdir(parents=True, exist_ok=True)
midi_paths = list(Path("data", dataset).glob("**/*.mid"))

for midi_path in tqdm(midi_paths, desc="CHUNKING MIDIS"):
    # Load the MIDI and compute how many chunks it spans
    midi = MidiFile(midi_path)
    ticks_per_cut = MAX_NB_BAR * midi.ticks_per_beat * 4  # assumes 4 beats per bar (4/4 time)
    nb_cuts = ceil(midi.max_tick / ticks_per_cut)
    if nb_cuts < 2:
        continue
    # One full copy per chunk, so tempo / time signature events are kept in each
    midis = [deepcopy(midi) for _ in range(nb_cuts)]

    for j, track in enumerate(midi.instruments):  # sort notes, as they are not always stored in order
        track.notes.sort(key=lambda x: x.start)
        for midi_short in midis:  # clear the notes of every copy before redistributing them
            midi_short.instruments[j].notes = []
        for note in track.notes:
            cut_id = note.start // ticks_per_cut  # index of the chunk this note belongs to
            note_copy = deepcopy(note)
            note_copy.start -= cut_id * ticks_per_cut
            note_copy.end -= cut_id * ticks_per_cut
            midis[cut_id].instruments[j].notes.append(note_copy)

    # Save the chunks, skipping those with too few notes
    for j, midi_short in enumerate(midis):
        if sum(len(track.notes) for track in midi_short.instruments) < MIN_NB_NOTES:
            continue
        midi_short.dump(merged_out_dir / f"{midi_path.stem}_{j}.mid")
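
Note that the snippet assumes 4/4 time when converting MAX_NB_BAR to ticks, and each chunk keeps the original tempo and time-signature events since the whole MidiFile is deep-copied. Once it has run, you can point the tokenization code at the chunks instead of the raw dataset, reusing the names from the earlier comments:

# Tokenize the chunks produced above instead of the original files
midi_paths = list(merged_out_dir.glob("*.mid"))
tokenizer.tokenize_midi_dataset(midi_paths, tokens_path, validation_fn=midi_valid)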
