Pretrained model which can be trained further

Question

NagabhushanSN95 opened this issue 5 years ago · comments

Hi,
I'm trying to load the pretrained model you've provided (trained on the S1M dataset) and train it further on another dataset (PENN) instead of starting from scratch. But when I create the MCNET model with is_train=True, I get an error saying that the checkpoint doesn't have all the variables:
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint.

Can you kindly provide a pretrained model which can be loaded and trained further? Or can I make some changes to the code to achieve that?

Ruben Villegas · Answer 1 · Tue Sep 03 2019 01:15:20 GMT+0800 (China Standard Time)

Are you using the version of tensorflow I used? Tensorflow changed the way they name variables in their later versions and this is probably the case here. If you want to use a new version you have to change the variable names yourself as I don’t have a model trained with the newer versions. Regards, Ruben
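To check whether variable names really are the problem, it can help to compare the names the graph expects with the names stored in the checkpoint. The sketch below is a minimal, TensorFlow-free illustration of that comparison; the variable names are made up, and in TF 1.x the two lists would presumably come from `tf.global_variables()` (each variable's `.name`) and `tf.train.list_variables(checkpoint_path)`.

```python
def diff_variable_names(graph_names, ckpt_names):
    """Return (missing_from_ckpt, unused_in_ckpt) as sorted lists."""
    # Graph variable names carry an output suffix such as ":0"; checkpoint keys do not.
    graph_keys = {name.split(":")[0] for name in graph_names}
    ckpt_keys = set(ckpt_names)
    missing = sorted(graph_keys - ckpt_keys)  # would trigger NotFoundError on restore
    unused = sorted(ckpt_keys - graph_keys)   # stored weights that nothing asks for
    return missing, unused


# Hypothetical names, for illustration only (not taken from the MCnet checkpoint)
graph_names = ["enc/conv1/w:0", "enc/conv1/w/Adam:0", "beta1_power:0"]
ckpt_names = ["enc/conv1/w", "enc/conv1/weights_old"]

missing, unused = diff_variable_names(graph_names, ckpt_names)
print("In graph but not in checkpoint:", missing)
print("In checkpoint but not in graph:", unused)
```

If the first list contains only training-time variables (for example optimizer state), the checkpoint itself may be fine and only those extra variables need separate initialization; that is an assumption to verify against the actual lists.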


Nagabhushan S N · Answer 2 · Tue Sep 03 2019 10:12:40 GMT+0800 (China Standard Time)

With the version of TensorFlow I'm using, I was able to load your model for testing, and that works perfectly fine. The problems arise only when I load it for training.

Anyway, I'll double-check the TensorFlow version.

Ruben Villegas · Answer 3 · Tue Sep 03 2019 10:19:19 GMT+0800 (China Standard Time)

Ok cool. Thanks!


Nagabhushan S N · Answer 4 · Fri Sep 06 2019 13:13:38 GMT+0800 (China Standard Time)

Hi, I tried with tensorflow_gpu-1.1.0 and I'm still getting a similar error.
For some reason, the TensorFlow installed with pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp27-none-linux_x86_64.whl raised an error on import tensorflow:

Python 2.7.16 |Anaconda, Inc.| (default, Aug 22 2019, 16:00:36) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import *
  File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 51, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 52, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 41, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcublas.so.8.0: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/install_sources#common_installation_problems

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.
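The last ImportError in the traceback says the dynamic loader could not open libcublas.so.8.0, the cuBLAS runtime that this wheel looks for. A quick way to confirm whether a given shared library is visible to the loader, independent of TensorFlow, is sketched below; `can_load` is a hypothetical helper name, not part of any library.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the file is missing or is not on the loader's search path
        return False


# libcublas.so.8.0 is the file named in the ImportError above
for name in ["libcublas.so.8.0"]:
    print(name, "found" if can_load(name) else "NOT found")
```

If this reports the library as not found, the fix lies in the CUDA installation or the loader search path rather than in the model code.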

So I installed TensorFlow 1.1.0 with conda instead, using conda install tensorflow-gpu=1.1.0, which installed tensorflow-gpu=1.1.0=np111py27_0. Is this TensorFlow build fine?
With this, the import worked, but restoring the model still failed with the same error.

Here is a list of packages installed for reference.

name: MCnet2
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - ca-certificates=2019.6.16=hecc5488_0
  - certifi=2019.6.16=py27_1
  - cudatoolkit=7.5=2
  - cudnn=5.1=0
  - funcsigs=1.0.2=py_3
  - libblas=3.8.0=12_openblas
  - libcblas=3.8.0=12_openblas
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - liblapack=3.8.0=12_openblas
  - libopenblas=0.3.7=h6e990d7_1
  - libprotobuf=3.9.1=h8b12597_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - mock=3.0.5=py27_0
  - ncurses=6.1=he6710b0_1
  - openssl=1.1.1c=h516909a_0
  - pip=19.2.2=py27_0
  - protobuf=3.9.1=py27he1b5a44_0
  - python=2.7.16=h8b3fad2_5
  - readline=7.0=h7b6447c_5
  - setuptools=41.0.1=py27_0
  - sqlite=3.29.0=h7b6447c_0
  - tensorflow-gpu=1.1.0=np111py27_0
  - tk=8.6.8=hbc83047_0
  - werkzeug=0.15.5=py_0
  - wheel=0.33.4=py27_0
  - zlib=1.2.11=h7b6447c_3
  - pip:
    - backports-functools-lru-cache==1.5
    - cloudpickle==1.2.1
    - cycler==0.10.0
    - decorator==4.4.0
    - enum34==1.1.6
    - futures==3.3.0
    - imageio==2.5.0
    - joblib==0.13.2
    - kiwisolver==1.1.0
    - matplotlib==2.2.4
    - networkx==2.2
    - numpy==1.16.5
    - opencv-python==4.1.1.26
    - pillow==6.1.0
    - pyparsing==2.4.2
    - pyssim==0.4
    - python-dateutil==2.8.0
    - pytube==9.5.1
    - pytz==2019.2
    - pywavelets==1.0.3
    - scikit-image==0.14.4
    - scikit-video==1.1.11
    - scipy==1.2.2
    - six==1.12.0
    - subprocess32==3.5.4
prefix: /media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet2

After this, I tried upgrading the saved model with a script I found in a TensorFlow GitHub issue (tensorflow/tensorflow#11964). The file is https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/tools/checkpoint_convert.py

Even after converting, restore_model didn't work. Do you know if this is the right conversion script, or am I using the wrong one?

Can you please provide a list of all package and system requirements for restoring the model (to continue training), or point me to a script or documentation on how to convert the models you've provided to the latest TensorFlow version?

Ruben Villegas · Answer 5 · Sat Sep 07 2019 01:55:24 GMT+0800 (China Standard Time)

I wish I could walk you through fixing this, but I am a bit busy at the moment. What I would do if I was in your situation is: Look at the exact source of error (i.e. dig into Tensorflow). If it's something like variables with different names, then I would write a script to load the model, change the variable names, and save it again. Then I would load the newly saved model. Again, pinpoint the source of error and adapt your code for it. The weights are there, you just need to figure out how to use them. Hope this helps. Regards, Ruben
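The "load, rename, save again" idea can be sketched as a pure renaming step over a name-to-value mapping. The snippet below is only an illustration under assumptions: the rename rules and variable names are hypothetical, and in TF 1.x the input mapping would presumably be filled with `tf.train.list_variables` and `tf.train.load_variable`, then written back out with a `tf.train.Saver`.

```python
def rename_variables(ckpt_values, rename_rules):
    """Apply substring rename rules to checkpoint keys.

    ckpt_values:  dict mapping old variable name -> value
    rename_rules: list of (old_substring, new_substring) pairs, applied in order
    """
    renamed = {}
    for old_name, value in ckpt_values.items():
        new_name = old_name
        for old, new in rename_rules:
            new_name = new_name.replace(old, new)
        if new_name in renamed:
            # Two old names collapsing into one would silently drop weights
            raise ValueError("Name collision after renaming: " + new_name)
        renamed[new_name] = value
    return renamed


# Hypothetical rules and names; the real ones depend on the source of the error
rules = [("/weights", "/kernel"), ("/biases", "/bias")]
old_values = {"cell/weights": 1, "cell/biases": 2}
new_values = rename_variables(old_values, rules)
print(new_values)
```

Keeping the rules as data makes it easy to add one rule per mismatch found while digging into the error.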


Nagabhushan S N · Answer 6 · Sat Sep 07 2019 09:37:27 GMT+0800 (China Standard Time)

Yeah. Thank you so much. I'll try that :)

Sharath · Answer 7 · Fri Dec 06 2019 16:35:10 GMT+0800 (China Standard Time)

Hi @NagabhushanSN95, were you able to train the provided S1M model further? I am facing the same issue.

Regards
Sharath

Nagabhushan S N · Answer 8 · Fri Dec 06 2019 21:08:48 GMT+0800 (China Standard Time)

@sharathyadav1993 I tried for a bit but couldn't figure it out, and then got busy with other work. I'll update here if I manage to solve it.

Nagabhushan S N · Answer 9 · Sat Feb 01 2020 20:41:24 GMT+0800 (China Standard Time)

@sharathyadav1993 I tried the approach suggested in this StackOverflow answer, and it worked like a charm. Posting the code here:

# To port paper models to new tensorflow version
# Author: Nagabhushan S N
# Last Modified: 01/02/2020

from pathlib import Path

import tensorflow as tf

# Based on https://stackoverflow.com/a/57818431/3337089
from mcnet import MCNET


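def port_model(model_path: Path, out_dir: Path):
    # exist_ok=True lets the script be re-run without failing on an existing directory
    out_dir.mkdir(parents=True, exist_ok=True)
    save_path = out_dir / model_path.name

    with tf.Session() as sess:
        # Build the training graph so that every variable needed for training exists
        _ = MCNET(image_size=[240, 320], batch_size=8, K=4, T=7, c_dim=3, checkpoint_dir=None, is_train=True)
        tf.global_variables_initializer().run(session=sess)

        # (name, shape) pairs stored in the old checkpoint
        ckpt_vars = tf.train.list_variables(model_path.as_posix())
        assign_ops = []
        for dst_var in tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES):
            for (ckpt_var, ckpt_shape) in ckpt_vars:
                # Copy a value only when both the name and the shape match
                if dst_var.name.split(":")[0] == ckpt_var and dst_var.shape == ckpt_shape:
                    value = tf.train.load_variable(model_path.as_posix(), ckpt_var)
                    assign_ops.append(tf.assign(dst_var, value))

        # Assign the variables; unmatched variables keep their initial values
        sess.run(assign_ops)
        # Save a new checkpoint that contains every variable of the training graph
        saver = tf.train.Saver()
        saver.save(sess, save_path.as_posix())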


def main():
    model_path = Path('../../PretrainedModels/PaperModels/S1M/MCNET.model-102502')
    out_dir = model_path.parent.parent / 'S1M_v1.13.1'
    port_model(model_path, out_dir)


if __name__ == '__main__':
    main()
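One thing the script above does quietly is leave any graph variable without a name-and-shape match at its freshly initialized value. A small, TensorFlow-free sketch of the same matching rule that also reports what was skipped is given below; the names and shapes are invented for illustration, and `match_variables` is a hypothetical helper.

```python
def match_variables(graph_vars, ckpt_vars):
    """Split graph variables into those restorable from the checkpoint and the rest.

    graph_vars: list of (name, shape), names as in the graph (e.g. "scope/w:0")
    ckpt_vars:  list of (name, shape), as listed from a checkpoint
    """
    ckpt_shapes = {name: tuple(shape) for name, shape in ckpt_vars}
    matched, skipped = [], []
    for name, shape in graph_vars:
        key = name.split(":")[0]  # drop the ":0" output suffix
        if ckpt_shapes.get(key) == tuple(shape):
            matched.append(key)
        else:
            skipped.append(key)  # missing from the checkpoint, or shape differs
    return matched, skipped


# Invented names and shapes, for illustration only
graph_vars = [("gen/w:0", [3, 3, 64]), ("gen/w/Adam:0", [3, 3, 64]), ("gen/b:0", [64])]
ckpt_vars = [("gen/w", [3, 3, 64]), ("gen/b", [32])]

matched, skipped = match_variables(graph_vars, ckpt_vars)
print("Restored from checkpoint:", matched)
print("Left at initial values:", skipped)
```

Printing the skipped list before training makes it clear which parts of the model start from scratch.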

My environment details are as follows:

name: MCnet
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - absl-py=0.7.1=py37_0
  - astor=0.7.1=py_0
  - bzip2=1.0.8=h7b6447c_0
  - c-ares=1.15.0=h516909a_1001
  - ca-certificates=2019.11.28=hecc5488_0
  - cairo=1.14.12=h8948797_3
  - certifi=2019.11.28=py37_0
  - cloudpickle=1.2.1=py_0
  - cycler=0.10.0=py_1
  - cytoolz=0.10.0=py37h516909a_0
  - dask-core=2.2.0=py_0
  - decorator=4.4.0=py_0
  - fontconfig=2.13.0=h9420a91_0
  - freeglut=3.0.0=hf484d3e_5
  - freetype=2.9.1=h8a8886c_1
  - gast=0.2.2=py_0
  - glib=2.56.2=hd408876_0
  - graphite2=1.3.13=h23475e2_0
  - grpcio=1.16.1=py37hf8bcb03_1
  - h5py=2.8.0=py37h989c5e5_3
  - harfbuzz=1.8.8=hffaf4a1_0
  - hdf5=1.10.2=hba1933b_1
  - icu=58.2=h9c2bf20_1
  - imageio=2.5.0=py37_0
  - jasper=2.0.14=h07fcdf6_1
  - joblib=0.13.2=py_0
  - jpeg=9b=h024ee3a_2
  - keras-applications=1.0.7=py_1
  - keras-preprocessing=1.0.9=py_1
  - kiwisolver=1.1.0=py37hc9558a2_0
  - libblas=3.8.0=11_openblas
  - libcblas=3.8.0=11_openblas
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libglu=9.0.0=hf484d3e_1
  - liblapack=3.8.0=11_openblas
  - libopenblas=0.3.6=h6e990d7_6
  - libopus=1.3=h7b6447c_0
  - libpng=1.6.37=hbc83047_0
  - libprotobuf=3.9.1=h8b12597_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libtiff=4.0.10=h2733197_2
  - libuuid=1.0.3=h1bed415_2
  - libvpx=1.7.0=h439df22_0
  - libxcb=1.13=h1bed415_1
  - libxml2=2.9.9=hea5a465_1
  - markdown=3.1.1=py_0
  - matplotlib-base=3.1.1=py37hfd891ef_0
  - mock=3.0.5=py37_0
  - ncurses=6.1=he6710b0_1
  - networkx=2.3=py_0
  - numpy=1.17.0=py37h95a1406_0
  - olefile=0.46=py_0
  - openssl=1.1.1d=h516909a_0
  - pandas=0.25.3=py37hb3f55d8_0
  - pcre=8.43=he6710b0_0
  - pillow=6.1.0=py37h34e0f95_0
  - pip=19.1.1=py37_0
  - pixman=0.38.0=h7b6447c_0
  - protobuf=3.9.1=py37he1b5a44_0
  - pyparsing=2.4.2=py_0
  - python=3.7.3=h0371630_0
  - python-dateutil=2.8.0=py_0
  - pytz=2019.3=py_0
  - pywavelets=1.0.3=py37hd352d35_1
  - readline=7.0=h7b6447c_5
  - scikit-image=0.15.0=py37hb3f55d8_2
  - scikit-learn=0.21.3=py37hcdab131_0
  - scikit-video=1.1.11=pyh24bf2e0_0
  - scipy=1.3.0=py37h921218d_1
  - setuptools=41.0.1=py37_0
  - six=1.12.0=py37_1000
  - sqlite=3.29.0=h7b6447c_0
  - tensorboard=1.13.1=py37_0
  - tensorflow=1.13.1=py37_0
  - tensorflow-estimator=1.13.0=py_0
  - termcolor=1.1.0=py_2
  - tk=8.6.9=hed695b0_1002
  - toolz=0.10.0=py_0
  - tornado=6.0.3=py37h516909a_0
  - werkzeug=0.15.5=py_0
  - wheel=0.33.4=py37_0
  - xz=5.2.4=h14c3975_4
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.3.7=h0b5b093_0
  - pip:
    - imageio-ffmpeg==0.3.0
    - opencv-python==4.1.0.25
    - python-vlc==3.0.7110
    - ssim==0.3.0

Sharath · Answer 10 · Mon Feb 03 2020 16:52:28 GMT+0800 (China Standard Time)

@NagabhushanSN95 Thank you. I will check it.