pannous / caffe-speech-recognition

Speech Recognition with the Caffe deep learning framework, migrating to

Home Page:https://github.com/pannous/tensorflow-speech-recognition


Does not work "spoken numbers" example

aspats opened this issue · comments

Hi @pannous ,

I am happy to have found an example like yours for audio classification. However, the code needs updating because it has some problems.

For now I am trying to use the "training spoken numbers" example, and I ran into the following doubts/problems:

  1. In "numbers_solver.prototxt" you use net: "numbers_net.autoencoder.prototxt". That file references the training and testing list files "train_index_256x256.txt" and "test_index_256x256.txt", but those files do not exist. I fixed this by setting net: "numbers_net.prototxt" in "numbers_solver.prototxt". After that step I could start creating the caffe model.

  2. When I tried to run the backend server with "recognition-server.py", I got this error:
    ... net = caffe.Net(model, weights)
    Traceback (most recent call last):
    File "", line 2, in
    Boost.Python.ArgumentError: Python argument types in
    Net.__init__(Net, str, str)
    did not match C++ signature:
    __init__(boost::python::api::object, std::string, std::string, int)
    __init__(boost::python::api::object, std::string, int)

  3. It is also unclear which image size is intended: some of the code uses the original 512x512 images, while other code reduces them to 256x256. I used the original images to create the model, but "recognition-server.py" and "record.py" transform the image.

  4. I would also like to get the original audio files for "spoken numbers", and I would like to know how you converted the wav files to png.
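
The signature mismatch in item 2 suggests that the installed pycaffe expects a phase argument after the weights path. Below is a minimal sketch of the adjusted call, assuming the server only needs the net for inference, so that caffe.TEST is the appropriate phase. The helper name load_net and its caffe_module parameter are illustrative and not part of the repository.

```python
def load_net(model_path, weights_path, caffe_module=None):
    """Load a trained net in TEST phase (hypothetical helper, not from the repo)."""
    if caffe_module is None:
        # Imported lazily so this file can be read without pycaffe installed.
        import caffe as caffe_module
    # Matches the C++ signature from the traceback:
    #   __init__(object, std::string, std::string, int)
    # where the trailing int is the phase constant (caffe.TRAIN or caffe.TEST).
    return caffe_module.Net(model_path, weights_path, caffe_module.TEST)
```

In "recognition-server.py" this would correspond to replacing caffe.Net(model, weights) with caffe.Net(model, weights, caffe.TEST).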

I would be happy to get an answer from you. I really like your audio classification example; I just think it needs an update.

Thanks!

Hi,

I do have the same/similar issue.
Yesterday I

  • freshly cloned caffe and caffe-speech-recognition from git,
  • built caffe,
  • downloaded http://pannous.net/spoken_numbers.tar and extracted into the caffe-speech-recognition root directory
  • started ./train.sh and stumbled across issue 1) from the previous poster.

After implementing the above fix, I now get the issue from this thread: #1:

[...]
I0816 14:41:04.538826 3856 layer_factory.hpp:77] Creating layer alpha
I0816 14:41:04.538861 3856 net.cpp:100] Creating Layer alpha
I0816 14:41:04.538871 3856 net.cpp:408] alpha -> data
I0816 14:41:04.538889 3856 net.cpp:408] alpha -> label
I0816 14:41:04.538908 3856 image_data_layer.cpp:38] Opening file train_index.txt
I0816 14:41:04.539526 3856 image_data_layer.cpp:58] A total of 2049 images.
E0816 14:41:04.539546 3856 io.cpp:80] Could not open or find file spoken_numbers/3_Princess_220.wav.png 3
F0816 14:41:04.539655 3856 image_data_layer.cpp:72] Check failed: cv_img.data Could not load spoken_numbers/3_Princess_220.wav.png 3
[...]

Looks to me as if the data/label info line is not split properly. The file is definitely there.
Is this an issue with the version of caffe being too recent / handling the index file differently? If this is the case: Which version of caffe would be known to work with your setup?
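
One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the installed caffe version splits each index line at the last space character instead of at any whitespace, a tab-separated line has no split point and the whole line, label included, is treated as the file name. That would match the log above, where the path ends in ".png 3". The function below is a hypothetical Python imitation of that behavior, not caffe's actual code.

```python
def split_at_last_space(line):
    """Split an index line at its last space character only.

    Returns (path, label). If the line contains no space, the whole line
    is returned as the path and the label is None.
    """
    line = line.rstrip("\r\n")
    pos = line.rfind(" ")
    if pos == -1:
        # A tab is not a space, so nothing is split off here.
        return line, None
    return line[:pos], int(line[pos + 1:])


tab_line = "spoken_numbers/3_Princess_220.wav.png\t3"
space_line = "spoken_numbers/3_Princess_220.wav.png 3"

print(repr(split_at_last_space(tab_line)))
print(repr(split_at_last_space(space_line)))
```

With the tab-separated line the returned path still contains the tab and the label, which no file on disk would match; with the space-separated line the path and label come apart as intended.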

Cheers,
Sebastian

Hi, this demo code is two years old. Updating the code or data to the current caffe version/requirements shouldn't be too hard, though.

Hi pannous,

first let me thank you for your swift reply yesterday.

I went (for now) the lazy way by running caffe-rc2 from https://github.com/BVLC/caffe/archive/rc2.zip
and modifying numbers_solver.prototxt such that numbers_net.prototxt is used (just switch comment/uncomment in lines 2 and 3).
The autoencoder variant (numbers_net.autoencoder.prototxt) is missing its training data and index files.

This seems to work (it is training).

I also found another way around the "3_Princess_220.wav.png file not found" error. I did what Sebastian did and edited numbers_solver.prototxt by uncommenting/commenting lines 2 and 3 so that numbers_net.prototxt is used.

I also edited train_index.txt and test_index.txt, replacing every tab with a single space. So the first line of train_index.txt becomes "/spoken_numbers/3_Princess_220.wav.png 3", the next line becomes "/spoken_numbers/6_Allison_60.wav.png 6", and so on.

After that everything seems to be working.
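
The tab replacement described above can be scripted instead of done by hand. This is a minimal sketch, assuming each index line holds an image path and a numeric label separated by whitespace; the function names are illustrative. It only changes the separator and leaves the path itself as it is.

```python
from pathlib import Path


def normalize_index_line(line):
    """Join the image path and the label with exactly one space."""
    # rsplit(None, 1) splits at the last run of whitespace (tabs or spaces).
    path, label = line.rstrip("\r\n").rsplit(None, 1)
    return f"{path} {label}"


def normalize_index_file(filename):
    """Rewrite an index file in place, skipping blank lines."""
    index = Path(filename)
    lines = index.read_text().splitlines()
    fixed = [normalize_index_line(l) for l in lines if l.strip()]
    index.write_text("\n".join(fixed) + "\n")


if __name__ == "__main__":
    # File names taken from the thread; run from the repository root.
    for name in ("train_index.txt", "test_index.txt"):
        if Path(name).exists():
            normalize_index_file(name)
```

Since the script overwrites the index files, keeping a backup copy of both before running it is advisable.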