Natooz / MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶

Home Page: https://miditok.readthedocs.io/


Vocab saving problem

Kapitan11 opened this issue · comments

Hey!

I'm wondering what the most proper way is to save the vocabulary as well as the parameters, but let me give you a little background first:

Instead of using the tokenize_midi_dataset function, which requires passing the whole dataset, I've been tokenizing files one by one. I soon realized that when saving the parameters and loading them again, I do get a vocabulary, but a different one from the one produced during tokenization. To accommodate the later use of MIDITokenizer.vocab, when saving the parameters I added _vocab_base (assigned from MIDITokenizer.vocab) to additional_attributes.
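For context, here is roughly the pattern I've been following (the file paths and the loop are placeholders, just to illustrate):

import miditok
from miditoolkit import MidiFile

tokenizer = miditok.Octuple()  # same workflow for MuMIDI

# Tokenize files one by one instead of calling tokenize_midi_dataset
for path in ["song1.mid", "song2.mid"]:  # placeholder paths
    midi = MidiFile(path)
    tokens = tokenizer.midi_to_tokens(midi)
    # ... store the tokens somewhere ...

# Save the parameters, stashing the vocabulary alongside them
tokenizer.save_params(
    "params.json",
    additional_attributes={"_vocab_base": tokenizer.vocab},
)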

Now, there is no problem with single-vocabulary tokenizers; however, when I try to load the parameters of multi-vocabulary tokenizers (Octuple & MuMIDI), I receive the following error:

File [.../miditok/midi_tokenizer.py:1738], in MIDITokenizer.load_params(self, config_file_path)
   1736 elif key == "_vocab_base":
   1737     self._vocab_base = value
-> 1738     self.__vocab_base_inv = {v: k for k, v in value.items()}
   1739     continue
   1740 elif key == "_bpe_model":

AttributeError: 'list' object has no attribute 'items'

I see that when I add the key _vocab_base to additional_attributes, in the case of MuMIDI and Octuple the assigned value is stored as a list of dicts, which then breaks the assignment of self.__vocab_base_inv.

Looks like a rather quick fix, but it makes me wonder whether there isn't a better way to save and retrieve the tokenizer's vocabulary.
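Something along these lines in load_params would probably handle both cases (just a sketch of the idea, not tested):

elif key == "_vocab_base":
    self._vocab_base = value
    if isinstance(value, list):  # multi-voc (Octuple, MuMIDI): list of dicts
        self.__vocab_base_inv = [{v: k for k, v in voc.items()} for voc in value]
    else:  # single vocabulary: one dict
        self.__vocab_base_inv = {v: k for k, v in value.items()}
    continue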

Hello 👋,

Thank you for the bug report!

When saving a tokenizer, the original goal was to save only the parameters (beat_res, etc.), as they are all that is needed to rebuild the exact same vocabulary. I guess there might be some bug here, as you said you ended up with different vocabularies. :/
Do you have a simple code snippet so that I can reproduce the bug and find a fix?

Alright, I made a small .ipynb notebook to show the issue. What would be the best way to pass it to you?

Edit: I see I can attach the zipped file here. Nevermind :)

git_issue.ipynb.zip

Perfect, thank you!
I'm quite busy and focused on other work right now; I'll look into it by next Monday/Tuesday at the latest! :)

Hey @Kapitan11 👋

As promised, I took a look at your code snippet.
The good news is that there is no "bug" within MidiTok, as the params loading is intended to be done when creating the tokenizer:

new_tokenizer = miditok.REMI(params='params.json')

With this, the new tokenizer is the same as the original one, and the vocabulary is rebuilt from the saved parameters as it should be.
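To make it explicit, a full round trip looks like this (a minimal sketch, assuming a REMI tokenizer and a local params.json):

import miditok

tokenizer = miditok.REMI()
tokenizer.save_params("params.json")

# Rebuild the tokenizer from the saved parameters at creation time
new_tokenizer = miditok.REMI(params="params.json")
assert new_tokenizer.vocab == tokenizer.vocab  # identical vocabulary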

Now, I must admit that the way to use this functionality is not really clear, and not well documented either.
I'll make sure the load_params method can also be used properly on its own, and I'll update the docs to mention the current way of loading parameters when creating the tokenizer!

Great! Thank you for the clarification! I see that was a rookie mistake 😄

By the way, is there any way to support you? I'm using the package heavily, and the master's thesis I should soon deliver is based on your and other contributors' efforts. I'd gladly support the project to express my gratitude.

Thank you, that's very nice of you! 😍

If you want, you can pick items from the to-do list in the README.
Another big improvement I thought of but didn't write there would be to flexibly allow adding Program tokens before note tokens, so as to handle multitrack intuitively for any tokenization and form a single token sequence / stream, instead of the "one track = one token sequence" approach of most tokenizations here (see the sketch below). But that might be a big step, which I might implement later when I have more time.
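Conceptually it would produce something like this (purely a hypothetical sketch, none of it is implemented yet; token names follow the usual MidiTok conventions):

# One single stream for all tracks: each note's tokens are preceded
# by a Program token identifying its instrument
tokens = [
    "Program_0", "Pitch_60", "Velocity_90", "Duration_1.0.8",
    "Program_32", "Pitch_36", "Velocity_100", "Duration_0.4.8",
    # ... and so on, tracks interleaved in time order
]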

So if you are up to it, take anything you want; there is no priority, and no pressure.
And if you'd like, you can also become a maintainer. Right now I still find time to answer issues and fix some bugs, but there will probably be a time when I won't be able to.