YapengTian / AVE-ECCV18

Audio-Visual Event Localization in Unconstrained Videos, ECCV 2018

Home Page: https://sites.google.com/view/audiovisualresearch

Failed to download audio_feature.h5

asker-github opened this issue

Does anyone have a Baidu Netdisk or Thunder (Xunlei) link for audio_feature.h5? I can only download it through Chrome, and because the file is so big, the download fails every time.

I tried to generate audio_feature.h5 myself, but I don't know whether that will cause any problems.

I uploaded it to Dropbox. Here is the link: https://www.dropbox.com/s/djweo9ew9pqv8xi/audio_feature.h5?dl=0.

Oops, my mistake. It should be visual_feature.h5. I was so excited that I typed the wrong file name.

I tried your link. The speed seems about the same as the links in the README. Since I'm downloading with Chrome, the download may still fail even from Dropbox; it fails about halfway through every time. Maybe my network connection isn't very good.
Haha, maybe I'll have to generate the file myself. Thank you.

The file I generated is 8.3 GB. It was generated following the video name in each line of Annotations.txt. But the file you provide is 7.7 GB, so I don't know what the difference is.

If you used the provided scripts and followed the order of Annotations.txt, it should be correct.
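
To see what actually differs between the two files, a quick inspection of the HDF5 contents is often enough. Here is a minimal sketch using h5py (the file paths are placeholders, and the dataset key names are discovered rather than assumed):

```python
# Minimal sketch: print the datasets, shapes, and dtypes of two HDF5 feature
# files so they can be compared. The file paths below are placeholders.
import h5py

def summarize(path):
    with h5py.File(path, "r") as f:
        for key in f.keys():
            dset = f[key]
            print(f"{path}: {key} shape={dset.shape} dtype={dset.dtype}")

summarize("audio_feature.h5")            # the file provided in the README
summarize("audio_feature_generated.h5")  # hypothetical name for the self-generated file
```

A difference in dtype, row count, or HDF5 compression settings would show up here and could account for the 8.3 GB vs. 7.7 GB gap.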

Hello, my torch version is 1.5.1. When I ran the test (python supervised_main.py --model_name AV_att), this error occurred:
```
Traceback (most recent call last):
  File "supervised_main.py", line 159, in <module>
    test(args)
  File "supervised_main.py", line 148, in test
    x_labels = model(audio_inputs, video_inputs)
  File "/home/zhu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhu/zhu_tf/audio_visual/AVE-ECCV18-master/models.py", line 66, in forward
    self.lstm_video.flatten_parameters()
  File "/home/zhu/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 106, in flatten_parameters
    if len(self._flat_weights) != len(self._flat_weights_names):
  File "/home/zhu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
AttributeError: 'LSTM' object has no attribute '_flat_weights'
```
I'm trying to fix this error now. I want to know whether I can still train or test with the model you provided once I solve it.

I was using PyTorch 0.3.0. If you run it with 1.5.1, I think you need to modify the code accordingly.
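
For reference, a common workaround when a checkpoint that was saved as a whole module under an old PyTorch version is loaded in a much newer one is to rebuild the model under the new version and copy over only the state_dict. A minimal sketch is below; the class name `att_Model` and the file paths are placeholders, so substitute the actual model class from models.py and the checkpoint you downloaded:

```python
# Minimal sketch, assuming the released checkpoint was saved as a full module
# object under PyTorch 0.3.x. Rebuilding the module under the new version and
# loading only the weights avoids stale internal attributes such as the missing
# LSTM '_flat_weights'. The class name `att_Model` and the paths are placeholders.
import torch
from models import att_Model  # hypothetical import; use the real class and constructor arguments

old_model = torch.load("AV_att.pt", map_location="cpu")    # old full-module checkpoint
new_model = att_Model()                                    # fresh module under PyTorch 1.5.1
new_model.load_state_dict(old_model.state_dict())          # copy only the learned weights
torch.save(new_model.state_dict(), "AV_att_state_dict.pt") # re-save in the portable format
```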

Hello, first of all, thank you for your kind reply. I have two questions for you. ^_^

weak_supervised_main.py: visual_feature_noisy.h5, audio_feature_noisy.h5, mil_labels.h5, labels_noisy.h5.
I don't know whether these files correspond to Annotations.txt, because I want to study several other classes.

cmm_train.py: labels_closs.h5, visual_feature_vec.h5, train_order_match.h5, val_order_match.h5, test_order_match.h5.
I have no idea what these files are. Also, visual_feature_vec.h5 doesn't seem to be available for download, which is frustrating. Looking forward to your reply, thank you!

The noisy features are from some randomly selected videos in the background class; they do not correspond to Annotations.txt. The videos can be found at https://drive.google.com/file/d/1Iqba9lk_KOxxf5CFV33_XVoC5nuG8wiu/. mil_labels.h5 contains video-level labels.

As noted in the README, visual_feature_vec.h5 can be downloaded from https://drive.google.com/file/d/1l-c8Kpr5SZ37h-NpL7o9u8YXBNVlX_Si/view. labels_closs.h5 contains labels for the contrastive loss, visual_feature_vec.h5 contains visual features, and the other three files are data splitting orders.
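
The contents of these files can be inspected the same way as above. Below is a minimal sketch of how a split-order file might be applied, under the assumption that it simply stores indices into the Annotations.txt ordering (the dataset keys are discovered rather than assumed):

```python
# Minimal sketch (assumption: each *_order_match.h5 stores a 1-D index array
# into the Annotations.txt ordering; inspect f.keys() to confirm the layout).
import h5py
import numpy as np

with h5py.File("train_order_match.h5", "r") as f:
    key = list(f.keys())[0]               # discover the dataset name
    train_order = np.asarray(f[key])      # assumed: indices of the training videos

with h5py.File("visual_feature_vec.h5", "r") as f:
    key = list(f.keys())[0]
    features = np.asarray(f[key])         # one row of features per video

train_features = features[train_order]    # select the training subset
print(train_order.shape, features.shape, train_features.shape)
```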

I retrained on the Male speech, Female speech, and Background categories for your supervised task. The accuracy is about the same as yours, but the localization of the sounding region in the frames is very poor (python attention_visualization.py). How can I achieve the results shown in your paper?

I used data from the different categories to train the model before. Since you only use limited speech data, it is reasonable that the model fails to find the sounding parts for objects in other categories.

If you only want to explore face-speech data, you might train the model on a large set of human-talking videos, such as an active speaker detection dataset: https://arxiv.org/abs/1901.01342.

I'm only training on these three categories because I only want to recognize these three categories, but the results are not good. Maybe it's because there are too few categories in training. Thank you for your recommendation.
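
For reference, one way to select only a few categories from Annotations.txt before generating features is sketched below; the '&' delimiter, the presence of a header line, and the exact category strings are assumptions that should be checked against the actual file.

```python
# Minimal sketch: collect the line indices in Annotations.txt that belong to a
# chosen subset of categories. The delimiter, header handling, and category
# strings below are assumptions -- verify them against the actual file.
wanted = {"Male speech, man speaking", "Female speech, woman speaking"}  # illustrative names

keep_indices = []
with open("Annotations.txt") as f:
    next(f)  # skip the header line, if the file has one
    for i, line in enumerate(f):
        category = line.strip().split("&")[0]  # assumed: category is the first field
        if category in wanted:
            keep_indices.append(i)

print(len(keep_indices), "videos selected")
```

The resulting indices can then be used to slice the corresponding rows out of the feature and label HDF5 files.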