Natooz / MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶

Home Page: https://miditok.readthedocs.io/


Multi voc tokenizers decoding

Kapitan11 opened this issue · comments

Hey!

I've encountered a problem and would like a second opinion on it. Note that this bug report concerns the earlier MidiTok version 2.0.6.
EDIT: Just checked, on the newest version (2.1.1) the problem persists.

Right now I'm doing experiments with Transformer architecture and multi voc tokenizers (CPWord, Octuple and MuMIDI).

On every training epoch end, I'm calling a PyTorch Lightning Callback to generate some samples. Now, when I'm calling tokens_to_midi() I've encountered numerous problems.

These issues occurred while using MuMIDI tokenizer:
(Note that, when processing the data, I add a PAD_None event to every time step whose token list is shorter than the maximum length.)
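The padding I apply can be sketched roughly like this (`pad_time_steps` is a hypothetical helper written for illustration, and `PAD_ID = 0` is an assumption mirroring miditok's default of placing the PAD_None token at index 0, not library API):

```python
# Illustrative only: right-pad every time step's subtoken list to a fixed
# length. PAD_ID = 0 is an assumption (miditok places PAD_None at index 0
# of each vocabulary by default).
PAD_ID = 0

def pad_time_steps(ids, num_subtokens):
    """Right-pad each time step (a list of subtoken ids) with PAD_ID."""
    return [step + [PAD_ID] * (num_subtokens - len(step)) for step in ids]

padded = pad_time_steps([[5, 12], [5, 12, 7], [3]], 3)
```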

1. If I transform the sequence to a TokSequence, run complete_sequence(), and then pass it to tokens_to_midi(), I get this error:

    [...]
    full_midi = self.tokenizer.tokens_to_midi(tok_seq_sequence, time_division=384) 
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/python3.11/site-packages/miditok/midi_tokenizer.py", line 79, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/python3.11/site-packages/miditok/tokenizations/mumidi.py", line 413, in tokens_to_midi
    tracks[current_track].append(
    ~~~~~~^^^^^^^^^^^^^^^
KeyError: 0

2. If I transform the sequence to a list of lists, the error is the same (KeyError: 0).

3. I wrote a function to filter out the padding (0) after generation so that the output resembles the way the MuMIDI tokenizer originally emits tokens (different lengths depending on the token type).
3a. I transform such a list of lists to a TokSequence, run complete_sequence(), and then pass it to tokens_to_midi() as in 1. I get the same error (the error output below is from the notebook, though):

----> 1 tokenizer.tokens_to_midi(t)

File [...]/python3.11/site-packages/miditok/midi_tokenizer.py:79, in _in_as_seq..decorator..wrapper(*args, **kwargs)
     77 args = list(args)
     78 args[1] = seq
---> 79 return function(*args, **kwargs)

File [...]/python3.11/site-packages/miditok/tokenizations/mumidi.py:413, in MuMIDI.tokens_to_midi(self, tokens, _, output_path, time_division)
    410         vel = int(vel)
    411         duration = self._token_duration_to_ticks(duration, time_division)
--> 413         tracks[current_track].append(
    414             Note(vel, pitch, current_tick, current_tick + duration)
    415         )
    417 # Appends created notes to MIDI object
    418 for program, notes in tracks.items():

KeyError: 0

3b. I directly pass the list of lists (with different lengths) to the tokens_to_midi() function, and I get the following error:

----> 1 tokenizer.tokens_to_midi(removed)

File [...]/python3.11/site-packages/miditok/midi_tokenizer.py:78, in _in_as_seq..decorator..wrapper(*args, **kwargs)
     76     else:
     77         for seq_ in seq:
---> 78             self.complete_sequence(seq_)
     80 args = list(args)
     81 args[1] = seq

File [...]/python3.11/site-packages/miditok/midi_tokenizer.py:500, in MIDITokenizer.complete_sequence(self, seq)
    498     seq.tokens = [str(event) for event in seq.events]
    499 elif seq.ids is not None:
--> 500     seq.tokens = self._ids_to_tokens(seq.ids)
    501 elif seq.bytes is not None:
    502     seq.tokens = self._bytes_to_tokens(seq.bytes)

File [...]/python3.11/site-packages/miditok/midi_tokenizer.py:552, in MIDITokenizer._ids_to_tokens(self, ids, as_str)
    549     return tokens
    551 for id_ in ids:
--> 552     event_str = self[id_]
    553     tokens.append(event_str if as_str else Event(*event_str.split("_")))
    554 return tokens
...
   1632         else self.__vocab_base_inv
   1633     )
   1634 return voc[item]

TypeError: list indices must be integers or slices, not NoneType

Edit: Here is the example_sequence.zip with the tokenizer parameters to reproduce the detokenization error.

Hi, thanks for reporting this bug! 😃
Do you have a minimal code snippet that raises the issue, so that I can debug?

I'll look into it in a few days when I have a computer.

I would greatly appreciate any help in solving this issue!

The attached zip file detokenize_multivocab_error.zip contains a .ipynb file with examples of errors for each multi-vocab tokenizer, as well as the respective params and (generated) sequences that cause the errors.

Hi 👋,

As promised I took a look around.
It allowed me to fix a few decoding bugs.

It also seemed that some of the calls in the notebook didn't pass tokens in the expected format, but I must admit that this was not very well documented.

Here is a new section that will be added to the documentation in the next commit (still in progress):

Depending on the tokenizer in use, the format of the tokens returned by the midi_to_tokens method may vary, as well as the format expected by the tokens_to_midi method. For any tokenizer, the format is the same for both methods.

The format is deduced from the is_multi_voc and unique_track tokenizer properties. In short: unique_track being True means that the tokenizer converts a MIDI file into a single stream of tokens for all instrument tracks, otherwise it converts each track into a distinct token stream; is_multi_voc being True means that each token stream is a list of lists of tokens, of shape (T,C) for T time steps and C subtokens per time step.

This results in four situations, where I is the number of tracks, T is the number of tokens (or time steps) and C the number of subtokens per time step:

  • is_multi_voc and unique_track are both False: (I,T)
  • is_multi_voc is False and unique_track is True: (T)
  • is_multi_voc is True and unique_track is False: (I,T,C)
  • is_multi_voc and unique_track are both True: (T,C)

Note that for midi_to_tokens outputs, each element along the I dimension is a TokSequence object.
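The four cases above can be expressed as a small helper (a sketch for illustration; `expected_token_shape` is not part of miditok):

```python
def expected_token_shape(is_multi_voc: bool, unique_track: bool) -> str:
    """Return the nesting expected by tokens_to_midi, where I = tracks,
    T = time steps and C = subtokens per time step."""
    if unique_track:
        # Single token stream for all tracks.
        return "(T,C)" if is_multi_voc else "(T)"
    # One distinct token stream per track.
    return "(I,T,C)" if is_multi_voc else "(I,T)"
```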

Some tokenizer examples to illustrate:

  • TSD without config.use_programs will not have multiple vocabularies and will treat each MIDI track as a unique stream of tokens, hence it will convert MIDI files to a list of TokSequence objects, (I,T) format.
  • TSD with config.use_programs being True will convert all MIDI tracks to a single stream of tokens, hence one TokSequence object, (T) format.
  • CPWord is a multi-voc tokenizer and treats each MIDI track as a distinct stream of tokens, hence it will convert MIDI files to a list of TokSequence objects with (I,T,C) format.
  • Octuple is a multi-voc tokenizer and converts all MIDI tracks to a single stream of tokens, hence it will convert MIDI files to a TokSequence object, (T,C) format.

Here is the code from the notebook with some fixes / tweaks, as the JSON files seem not to have been saved in the expected formats:

from pathlib import Path
from copy import deepcopy

import miditok


mumidi = miditok.MuMIDI(params='2/mumidi_params.json')
print('MuMIDI: ', mumidi)
octuplemono = miditok.OctupleMono(params='2/octuple_params.json')
print('Octuple Mono: ', octuplemono)
cpword = miditok.CPWord(params='2/cpword_params.json')
print('CP Word: ', cpword)


"""from miditoolkit import MidiFile
midi = MidiFile("tests/Multitrack_MIDIs/All The Small Things.mid")
toto = octuplemono.midi_to_tokens(midi)
octuplemono.save_tokens(toto, "toto.json")
data_octuplemono = octuplemono.load_tokens("toto.json")"""

data_mumidi = mumidi.load_tokens(path=Path('2/mumidi_error_detokenizing_sequence.json'))
print('MuMIDI: ')
print(data_mumidi['ids'][:5])
print()
data_octuplemono = octuplemono.load_tokens(path=Path('2/octuple_error_detokenizing_sequence.json'))
print('Octuple Mono: ')
print(data_octuplemono['ids'][:5])
print()
data_cpword = cpword.load_tokens(path=Path('2/cpword_error_detokenizing_sequence.json'))
print('CP Word: ')
print(data_cpword['ids'][:5])
print()


tokSeq_mumidi = miditok.TokSequence(ids=data_mumidi['ids'])
tokSeq_completed_mumidi = deepcopy(tokSeq_mumidi)
mumidi.complete_sequence(tokSeq_completed_mumidi)

tokSeq_octuplemono = [miditok.TokSequence(ids=data_octuplemono['ids'])]
tokSeq_completed_octuplemono = deepcopy(tokSeq_octuplemono)
for seq in tokSeq_completed_octuplemono:
    octuplemono.complete_sequence(seq)
data_octuplemono = [data_octuplemono['ids']]

tokSeq_cpword = [miditok.TokSequence(ids=data_cpword['ids'])]
tokSeq_completed_cpword = deepcopy(tokSeq_cpword)
for seq in tokSeq_completed_cpword:
    cpword.complete_sequence(seq)
data_cpword = [data_cpword['ids']]


# List of lists
mumidi.tokens_to_midi(data_mumidi['ids'])
octuplemono.tokens_to_midi(data_octuplemono)
cpword.tokens_to_midi(data_cpword)

# TokSequence
mumidi.tokens_to_midi(tokSeq_mumidi)
octuplemono.tokens_to_midi(tokSeq_octuplemono)
cpword.tokens_to_midi(tokSeq_cpword)

# Completed TokSequence
mumidi.tokens_to_midi(tokSeq_completed_mumidi)
octuplemono.tokens_to_midi(tokSeq_completed_octuplemono)
cpword.tokens_to_midi(tokSeq_completed_cpword)

I'm still working on it

The last commit (089fa74) brings a few modifications to the _in_as_seq decorator (especially handling MuMIDI ids as lists of ints), as well as a few I/O fixes for the tokens_to_midi and tokens_to_track methods.

The code from my message above still produces errors:

  • for MuMIDI, because a Pitch / Velocity / Duration token appears in the tokens before any Program token;
  • for CPWord, because the tokens don't seem to be well generated: some time steps have 2 or 3 subtokens whereas 5 are expected.

For MuMIDI, we can set a default Program in case none is found for the first note. For CPWord I guess you should check again how the tokens are generated in your code.
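Such a default-Program workaround could look roughly like this on a flat list of token strings ("Type_Value" follows miditok's token naming convention, but `ensure_leading_program` and `DEFAULT_PROGRAM` are hypothetical names, and real MuMIDI time steps carry several subtokens, so this is a simplification):

```python
# Hypothetical post-processing sketch: make sure a Program token appears
# before the first note, so decoding has a track to append notes to.
# DEFAULT_PROGRAM = 0 (piano) is an assumption, not a miditok default.
DEFAULT_PROGRAM = 0

def ensure_leading_program(tokens):
    """Prepend a default Program token if a note precedes any Program."""
    for tok in tokens:
        if tok.startswith("Program_"):
            return tokens  # a Program token already precedes the notes
        if tok.startswith("Pitch_"):
            break  # a note arrives before any Program token
    return [f"Program_{DEFAULT_PROGRAM}"] + tokens
```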

This issue is stale because it has been open for 30 days with no activity.

This issue was closed because it has been inactive for 14 days since being marked as stale.