wzk1015 / video-bgm-generation

[ACM MM 2021 Best Paper Award] Video Background Music Generation with Controllable Music Transformer

Home Page: https://wzk1015.github.io/cmt/

Bugs encountered while using the inference code "gen_midi_conditional.py" in "src/" folder

shansongliu opened this issue · comments

Hi, I encountered some bugs while using the "gen_midi_conditional.py" code to generate MIDI files for a given video. I set up the Python 2 environment from the requirements file "py2_requirements.txt" and then used "video2npz.sh" to produce an "xxx.npz" file for the given video. But I ran into problems when running "gen_midi_conditional.py"; the program output and error report are pasted below:

Command I used:
python3 gen_midi_conditional.py -f ../inference/LGpwmBqJF1Q_HarryPotter2ChamberOfSecrets.npz -c ../exp/train_exp/loss_70_params.pt

Standard output:
inference
D_MODEL 512 N_LAYER 12 N_HEAD 8 DECODER ATTN causal-linear
[18, 3, 18, 129, 18, 6, 27, 102, 5025]
[*] load model from: ../exp/train_exp/loss_70_params.pt
new song
[vlog_npz matrix print here]
------ initiate ------
tensor([[[17, 1, 10, 0, 0, 0, 0, 1, 0]]])

Error output:
Traceback (most recent call last):
File "gen_midi_conditional.py", line 104, in
generate()
File "gen_midi_conditional.py", line 85, in generate
res, err_note_number_list, err_beat_number_list = net(is_train=False, vlog=vlog_npz, C=0.7)
File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs, **kwargs)
File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in call_impl
return forward_call(*input, **kwargs)
File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate/src_mm21_py2/model.py", line 483, in forward
return self.inference_from_scratch(**kwargs)
File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate/src_mm21_py2/model.py", line 341, in inference_from_scratch
h, y_type = self.forward_hidden(input, is_training=False, init_token=pre_init)
File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate/src_mm21_py2/model.py", line 216, in forward_hidden
init_emb_linear = self.forward_init_token(init_token)
File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate/src_mm21_py2/model.py", line 162, in forward_init_token
emb_genre = self.init_emb_genre(x[..., 0])
File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate/src_mm21_py2/utils.py", line 80, in forward
return self.lut(x) * math.sqrt(self.d_model)
File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

The inference code, trained model, and data (including the original video and the processed .npz file) are shared on Google Drive. Here is the link:
https://drive.google.com/drive/folders/1Ch3jjxZrztKAtEvuEhGjxPk2-G0NSYe0?usp=sharing

Could you help me check this? Really appreciate it.

Best regards,

pre_init in model.py holds the init tokens for genre (first column), key (unused, second column), and instrument (third column). In your gen_midi_conditional.py you define their embedding sizes as init_n_token = [1, 1, 1] on line 48, so pre_init is out of range.

You can fix it by:

  • setting init_n_token = [7, 1, 6] in gen_midi_conditional.py (if the init_n_token of your trained model is [7, 1, 6])
  • or changing pre_init in model.py to an empty array, np.array([]) (if the init_n_token of your trained model is [1, 1, 1])
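
To illustrate the failure mode, here is a minimal sketch (not the repository's code): with init_n_token = [1, 1, 1] each init embedding table has only one row, so looking up genre token 5 from pre_init produces exactly the IndexError above.

import torch
import torch.nn as nn

# Minimal sketch: an embedding table sized for init_n_token = 1 (a single genre class)
emb_genre = nn.Embedding(num_embeddings=1, embedding_dim=8)

emb_genre(torch.tensor([0]))      # index 0 is the only valid genre token
try:
    emb_genre(torch.tensor([5]))  # genre token 5 from pre_init -> out of range
except IndexError as e:
    print(e)                      # "index out of range in self"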

Do you mean I should set the pre_init variable ( pre_init = np.array([[5, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3], [0, 0, 4], [0, 0, 5]]) ) to pre_init = np.array([])? I see that in train.py the value of init_n_token is [1, 1, 1].

Yes, set pre_init to np.array([]). You can also try pre_init = np.array([[0, 0, 0]]) if np.array([]) doesn't work well.

init_n_token is not the token itself, but the number of embedding classes for genre, key and instrument.
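
As a hedged sanity check (using the values quoted in this thread, not necessarily your exact configuration), every column of pre_init must be a valid index into an embedding table of the corresponding size in init_n_token:

import numpy as np

init_n_token = [7, 1, 6]   # number of embedding classes for genre, key, instrument
pre_init = np.array([[5, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 2],
                     [0, 0, 3], [0, 0, 4], [0, 0, 5]])

# Each init token must be a valid index into its embedding table
for col, n_class in enumerate(init_n_token):
    assert pre_init[:, col].max() < n_class, \
        f"column {col}: token {pre_init[:, col].max()} >= {n_class} classes"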

Thanks for your quick reply. After I set pre_init to np.array([[0, 0, 0]]), the inference program runs without any further error messages (setting pre_init to np.array([]) still triggers an error). But what seems strange is that the inference program does not stop: it has been running for about 8 hours for the 2-minute input video. Is this normal? By the way, I haven't seen a MIDI output yet. Will the MIDI file be generated in the src/ folder? Thanks again for your patience.

That seems weird. Normally it runs for several minutes for a short video and stops generating automatically via Beat Timing Encoding, or it breaks out of the loop if the music length exceeds the video length (see this).

I am not quite sure about your model setting, but I guess the video2npz pipeline has some problem. You can check the npz file (or vlog in model.py) to see whether its length matches the video length.
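
A rough sketch of that check (the key name inside the .npz is an assumption here; list npz.files to see what video2npz.sh actually stored):

import numpy as np

npz = np.load("../inference/xxx.npz", allow_pickle=True)
print(npz.files)                 # inspect which arrays video2npz.sh stored
vlog = npz[npz.files[0]]         # assumed: the visual token sequence used as vlog
print("len(vlog) =", len(vlog))  # should be consistent with the video's duration in beats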

For Beat Timing Encoding you can also check the pbeat attribute (see the output when running inference; pbeat is the second column from the right). It should increase monotonically from 0 to 99.

As stated in README / Directory Structure, the generated MIDI files will be stored in the inference/ folder.

I followed the README instruction and it runs normally. Here are some of the generated and intermediate files: https://drive.google.com/drive/folders/1UtZXXLiY9PNFo-p3lIslQxlEKCQzIcnU?usp=sharing.

Hi, Shangzhe, it seems that the link needs access permission; I have already sent an access request. BTW, I did follow the detailed instructions in README.md. But as I said, the inference program would not stop (it seems to run into an infinite loop) after I corrected the pre_init variable as Zhaokai advised. Did you use the video data, trained model, and inference code in this link https://drive.google.com/drive/folders/1Ch3jjxZrztKAtEvuEhGjxPk2-G0NSYe0?usp=sharing and successfully generate MIDI files?

I used your video, our model, and inference code in this repo without any modification.
Perhaps your inference code or model has problems.

Hi, Zhaokai, could you be more specific about what might be going wrong in my video2npz pipeline? I followed the inference instructions in README.md. There are three sub-steps in the video2npz.sh script: the first sub-step, optical_flow.py, generated the optical flow npz file; the second sub-step, video2metadata.py, generated a JSON file; and the last sub-step, metadata2numpy_mix.py, generated an npz data file from that JSON file.

Then I used this npz data file together with my self-trained model and gen_midi_conditional.py, in which the decoder_n_class and init_n_token variables were changed to match the training data (as output by train.py). After all this, gen_midi_conditional.py does run; the only problem is that it seems to fall into an infinite loop.

Regarding the points you mentioned:

  1. I am not quite sure about your model setting, but I guess the video2npz pipeline has some problem. You can check the npz file (or vlog in model.py) to see whether its length matches the video length.

I am not quite sure what video length you mean. Do you mean the number of video frames, or the dimensions of the vlog_npz variable in gen_midi_conditional.py?

  2. For Beat Timing Encoding you can also check the pbeat attribute (see the output when running inference; pbeat is the second column from the right). It should increase monotonically from 0 to 99.

Could you clarify which line (or which variable) in the source code you are referring to?

Again, many thanks for your patience and kindness. I really appreciate it.

Thanks for your clarification.

  1. You can check the values of n_beat and len(vlog), and also trace the value of cur_vlog to see why this break condition is never triggered.

  2. See the output when running inference; it should look like this:

[   9   1   6   0   0   3   4  35 216]
[   3   1  10   0   0   5   1  36 226]
[   0   2   0  74  16   5   0  36 226]

the second column from the right (35, 36, 36) indicates pbeat
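
A small sketch of this check on the printed rows (rows copied from the sample output above; in a healthy run the pbeat column climbs from 0 toward 99):

import numpy as np

# Sample generated tokens from the output above; pbeat is the second column from the right
tokens = np.array([
    [9, 1,  6,  0,  0, 3, 4, 35, 216],
    [3, 1, 10,  0,  0, 5, 1, 36, 226],
    [0, 2,  0, 74, 16, 5, 0, 36, 226],
])
pbeat = tokens[:, -2]
print(pbeat)  # [35 36 36] -- if this value stops changing for many steps,
              # the generation loop is stuck (the infinite-loop symptom above)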

Thanks for your detailed explanation, I will continue to check.

I checked the values of n_beat and len(vlog). They are not equal: n_beat = 940 > len(vlog) = 166. And the value of cur_vlog gets stuck at 14 and never advances. Does this mean the input npz file for the inference code gen_midi_conditional.py is corrupted?

n_beat > len(vlog) is normal; the former is the total number of beats, while the latter is the number of Bar and Beat tokens. Can you provide the standard output of the inference run?

I put the newly generated standard output (stdout_new.txt) in this link https://drive.google.com/drive/folders/1Ch3jjxZrztKAtEvuEhGjxPk2-G0NSYe0

Hi, Zhaokai, I observe that during inference the pbeat attribute gets stuck at a certain number (say 5 or 14) and no longer increases. I think this is why the loop never stops. Do you have any idea why this might happen?

It seems that this is due to an inconsistency of the init tokens between training and generation, which appears when using a different training set. This should be fixed by 8f79229.
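
The commit itself should be consulted for the actual fix; as a generic illustration only, one way to avoid this kind of train/inference mismatch is to save the token configuration alongside the checkpoint and reload it before building the model for inference:

import torch
import torch.nn as nn

# Toy stand-ins; in the real project these come from train.py and the model definition
net = nn.Linear(4, 4)
decoder_n_class = [18, 3, 18, 129, 18, 6, 27, 102, 5025]
init_n_token = [7, 1, 6]

# Training side: keep the vocabulary sizes next to the weights
torch.save({"state_dict": net.state_dict(),
            "decoder_n_class": decoder_n_class,
            "init_n_token": init_n_token}, "checkpoint.pt")

# Inference side: rebuild the model from exactly the same configuration
ckpt = torch.load("checkpoint.pt", map_location="cpu")
print(ckpt["init_n_token"])  # [7, 1, 6] -- use these sizes when constructing the model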

Thanks, will try it.

I tried the modified version; now it gives the following error. It seems there is still a dimension problem.

Traceback (most recent call last):
File "train.py", line 226, in
train_dp()
File "train.py", line 169, in train_dp
losses = net(is_train=True, x=batch_x, target=batch_y, loss_mask=batch_mask, init_token=batch_init)
File "/data/miniconda3/envs/mm21_py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/data/miniconda3/envs/mm21_py3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/data/miniconda3/envs/mm21_py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate_new3/src/model.py", line 482, in forward
return self.train_forward(**kwargs)
File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate_new3/src/model.py", line 450, in train_forward
h, y_type = self.forward_hidden(x, memory=None, is_training=True, init_token=init_token)
File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate_new3/src/model.py", line 213, in forward_hidden
encoder_pos_emb = torch.cat([init_emb_linear, encoder_pos_emb], dim=1)
RuntimeError: Tensors must have same number of dimensions: got 2 and 3
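
For reference, a minimal sketch (with made-up shapes, not the repository's tensors) reproducing this class of error: torch.cat requires both tensors to have the same number of dimensions, so a 2-D init embedding needs a batch dimension before it can be concatenated with a 3-D positional embedding.

import torch

init_emb_linear = torch.randn(7, 512)        # 2-D: (n_init_tokens, d_model)
encoder_pos_emb = torch.randn(4, 100, 512)   # 3-D: (batch, seq_len, d_model)

try:
    torch.cat([init_emb_linear, encoder_pos_emb], dim=1)
except RuntimeError as e:
    print(e)  # Tensors must have same number of dimensions: got 2 and 3

# Adding and broadcasting the missing batch dimension makes the shapes compatible
fixed = torch.cat([init_emb_linear.unsqueeze(0).expand(4, -1, -1), encoder_pos_emb], dim=1)
print(fixed.shape)  # torch.Size([4, 107, 512])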

I just downloaded the newest version of this repo and directly used the train.py there without further modification.

A typo, just fixed by d4a6c33. You can try the latest version.

It can run now. Thanks. Will check the inference later.