Pretrained model which can be trained further
NagabhushanSN95 opened this issue
Hi,
I'm trying to load the pretrained model you've provided (trained on the S1M dataset) and train it further on another dataset (PENN) instead of starting from scratch. But when I create the MCNET model with is_train=True, I get an error saying that the checkpoint doesn't have all the variables:
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint.
Can you kindly provide a pretrained model which can be loaded and trained further? Or can I make some changes to the code to achieve that?
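The NotFoundError above suggests that the graph built with is_train=True declares variables that the checkpoint does not store. One way to narrow this down is to diff the two sets of names. The following is a minimal sketch, not the repository's own tooling: it assumes you have already collected the graph's variable names (for example from tf.global_variables()) and the checkpoint's names (for example from tf.train.list_variables, which returns (name, shape) pairs). The helper name diff_variable_names and the sample variable names are hypothetical.

```python
def diff_variable_names(graph_names, ckpt_names):
    """Compare variable names declared by the graph with those in a checkpoint.

    graph_names: iterable of names such as "conv1/w:0" (the ":0" suffix is optional).
    ckpt_names: iterable of names as stored in the checkpoint, e.g. "conv1/w".
    Returns (missing_from_ckpt, unused_in_ckpt), both as sorted lists.
    """
    # Graph variable names carry an output suffix (":0"); checkpoint keys do not.
    graph = {name.split(":")[0] for name in graph_names}
    ckpt = set(ckpt_names)
    return sorted(graph - ckpt), sorted(ckpt - graph)


# Hypothetical names, for illustration only (not taken from the MCnet checkpoint).
graph_names = ["gen/conv1/w:0", "gen/conv1/b:0", "disc/fc/w:0"]
ckpt_names = ["gen/conv1/w", "gen/conv1/b"]
missing, unused = diff_variable_names(graph_names, ckpt_names)
print("missing from checkpoint:", missing)
print("unused in checkpoint:", unused)
```

Any name reported as missing is one that a plain Saver restore would fail on; names reported as unused may point to variables that were renamed between versions.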
With the version of TensorFlow I'm using, I was able to load your model for testing, and that works perfectly fine. The problems arise only when I load it for training.
Anyway, I'll cross-check the TensorFlow version.
Hi, I tried with tensorflow_gpu-1.1.0. I still get a similar error.
For some reason, TensorFlow installed with pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp27-none-linux_x86_64.whl gave an error on import tensorflow:
Python 2.7.16 |Anaconda, Inc.| (default, Aug 22 2019, 16:00:36)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, in <module>
from tensorflow.python import *
File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 51, in <module>
from tensorflow.python import pywrap_tensorflow
File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 52, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 41, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet3/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcublas.so.8.0: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/install_sources#common_installation_problems
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
So I installed tensorflow-1.1.0 with conda: conda install tensorflow-gpu=1.1.0. It installed tensorflow-gpu=1.1.0=np111py27_0. Is this TensorFlow version fine?
With this, the import worked, but restoring the model still failed with the same error.
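The ImportError in the traceback says the dynamic loader could not open libcublas.so.8.0, the cuBLAS library shipped with the CUDA 8.0 toolkit. A quick way to confirm whether a given shared library is visible to the loader, before installing or importing TensorFlow, is to try opening it with ctypes. This is a minimal sketch using only the standard library; the helper name can_load_shared_library is hypothetical.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Raised when the library is not found or cannot be loaded.
        return False


# libcublas.so.8.0 is the library named in the ImportError above.
for lib in ["libcublas.so.8.0"]:
    status = "found" if can_load_shared_library(lib) else "NOT found"
    print(lib, status)
```

If the library is reported as not found, the usual suspects are a CUDA toolkit version that does not match the one the wheel was built against, or a library directory missing from LD_LIBRARY_PATH.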
Here is a list of packages installed for reference.
name: MCnet2
channels:
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- ca-certificates=2019.6.16=hecc5488_0
- certifi=2019.6.16=py27_1
- cudatoolkit=7.5=2
- cudnn=5.1=0
- funcsigs=1.0.2=py_3
- libblas=3.8.0=12_openblas
- libcblas=3.8.0=12_openblas
- libedit=3.1.20181209=hc058e9b_0
- libffi=3.2.1=hd88cf55_4
- libgcc-ng=9.1.0=hdf63c60_0
- libgfortran-ng=7.3.0=hdf63c60_0
- liblapack=3.8.0=12_openblas
- libopenblas=0.3.7=h6e990d7_1
- libprotobuf=3.9.1=h8b12597_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- mock=3.0.5=py27_0
- ncurses=6.1=he6710b0_1
- openssl=1.1.1c=h516909a_0
- pip=19.2.2=py27_0
- protobuf=3.9.1=py27he1b5a44_0
- python=2.7.16=h8b3fad2_5
- readline=7.0=h7b6447c_5
- setuptools=41.0.1=py27_0
- sqlite=3.29.0=h7b6447c_0
- tensorflow-gpu=1.1.0=np111py27_0
- tk=8.6.8=hbc83047_0
- werkzeug=0.15.5=py_0
- wheel=0.33.4=py27_0
- zlib=1.2.11=h7b6447c_3
- pip:
  - backports-functools-lru-cache==1.5
  - cloudpickle==1.2.1
  - cycler==0.10.0
  - decorator==4.4.0
  - enum34==1.1.6
  - futures==3.3.0
  - imageio==2.5.0
  - joblib==0.13.2
  - kiwisolver==1.1.0
  - matplotlib==2.2.4
  - networkx==2.2
  - numpy==1.16.5
  - opencv-python==4.1.1.26
  - pillow==6.1.0
  - pyparsing==2.4.2
  - pyssim==0.4
  - python-dateutil==2.8.0
  - pytube==9.5.1
  - pytz==2019.2
  - pywavelets==1.0.3
  - scikit-image==0.14.4
  - scikit-video==1.1.11
  - scipy==1.2.2
  - six==1.12.0
  - subprocess32==3.5.4
prefix: /media/nagabhushan/Data02/SoftwareFiles/Anaconda/anaconda3/envs/MCnet2
After this, I tried upgrading the saved model with a script I found in a TensorFlow GitHub issue. The file is https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/tools/checkpoint_convert.py
Even after converting, restore_model didn't work. Do you know if this is the right conversion script, or am I using the wrong one?
Can you please provide a list of all package and system requirements for restoring the model (to continue training), or point me to a script or documentation on how to convert the models you've provided to the latest TensorFlow version?
Yeah. Thank you so much. I'll try that :)
Hi @NagabhushanSN95, were you able to train the given S1M model further? I am facing the same issue.
Regards
Sharath
@sharathyadav1993 I tried a bit. Couldn't figure it out. Got busy with other work. Will update here if I'm able to solve it.
@sharathyadav1993 I tried what this StackOverflow answer suggests, and it worked like a charm. Posting the code here:
# To port paper models to new tensorflow version
# Author: Nagabhushan S N
# Last Modified: 01/02/2020

from pathlib import Path

import tensorflow as tf

# Based on https://stackoverflow.com/a/57818431/3337089
from mcnet import MCNET


def port_model(model_path: Path, out_dir: Path):
    out_dir.mkdir(parents=True)
    save_path = out_dir / model_path.name

    with tf.Session() as sess:
        # Build the training graph so that every variable exists, then initialize it
        _ = MCNET(image_size=[240, 320], batch_size=8, K=4, T=7, c_dim=3, checkpoint_dir=None, is_train=True)
        tf.global_variables_initializer().run(session=sess)

        # (name, shape) pairs stored in the old checkpoint
        ckpt_vars = tf.train.list_variables(model_path.as_posix())
        ass_ops = []
        for dst_var in tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES):
            for (ckpt_var, ckpt_shape) in ckpt_vars:
                # Copy a value only when both the name and the shape match
                if dst_var.name.split(":")[0] == ckpt_var and dst_var.shape == ckpt_shape:
                    value = tf.train.load_variable(model_path.as_posix(), ckpt_var)
                    ass_ops.append(tf.assign(dst_var, value))

        # Assign the variables
        sess.run(ass_ops)

        # Save all variables, including those not found in the old checkpoint
        saver = tf.train.Saver()
        saver.save(sess, save_path.as_posix())


def main():
    model_path = Path('../../PretrainedModels/PaperModels/S1M/MCNET.model-102502')
    out_dir = model_path.parent.parent / 'S1M_v1.13.1'
    port_model(model_path, out_dir)


if __name__ == '__main__':
    main()
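Note that the script above copies a checkpoint value only when both the name and the shape match; any other graph variable silently keeps its freshly initialized value. The sketch below isolates that matching rule in plain Python so the outcome can be inspected before running the port. It is an illustration under stated assumptions, not part of the posted script: the helper name plan_assignments and the sample names and shapes are hypothetical, and the (name, shape) pairs are assumed to come from the graph and from tf.train.list_variables respectively.

```python
def plan_assignments(graph_vars, ckpt_vars):
    """Decide which graph variables can be filled from the checkpoint.

    graph_vars: list of (name, shape) pairs; names may carry a ":0" suffix.
    ckpt_vars: list of (name, shape) pairs describing the checkpoint contents.
    Returns (matched, skipped): names whose values would be copied, and
    names that would be left at their initial values.
    """
    # Dictionary lookup replaces the nested loop over checkpoint entries.
    ckpt_shapes = {name: list(shape) for name, shape in ckpt_vars}
    matched, skipped = [], []
    for name, shape in graph_vars:
        base = name.split(":")[0]
        if ckpt_shapes.get(base) == list(shape):
            matched.append(base)
        else:
            # Either absent from the checkpoint or stored with another shape.
            skipped.append(base)
    return matched, skipped


# Hypothetical variables, for illustration only.
graph_vars = [
    ("gen/conv1/w:0", [5, 5, 3, 64]),
    ("gen/fc/w:0", [128, 10]),
    ("disc/fc/w:0", [256, 1]),
]
ckpt_vars = [
    ("gen/conv1/w", [5, 5, 3, 64]),
    ("gen/fc/w", [128, 20]),
]
matched, skipped = plan_assignments(graph_vars, ckpt_vars)
print("copied from checkpoint:", matched)
print("left at initial value:", skipped)
```

Printing the skipped list before porting helps confirm that only the variables you expect to train from scratch are missing from the pretrained weights.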
My environment details are as follows:
name: MCnet
channels:
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- absl-py=0.7.1=py37_0
- astor=0.7.1=py_0
- bzip2=1.0.8=h7b6447c_0
- c-ares=1.15.0=h516909a_1001
- ca-certificates=2019.11.28=hecc5488_0
- cairo=1.14.12=h8948797_3
- certifi=2019.11.28=py37_0
- cloudpickle=1.2.1=py_0
- cycler=0.10.0=py_1
- cytoolz=0.10.0=py37h516909a_0
- dask-core=2.2.0=py_0
- decorator=4.4.0=py_0
- fontconfig=2.13.0=h9420a91_0
- freeglut=3.0.0=hf484d3e_5
- freetype=2.9.1=h8a8886c_1
- gast=0.2.2=py_0
- glib=2.56.2=hd408876_0
- graphite2=1.3.13=h23475e2_0
- grpcio=1.16.1=py37hf8bcb03_1
- h5py=2.8.0=py37h989c5e5_3
- harfbuzz=1.8.8=hffaf4a1_0
- hdf5=1.10.2=hba1933b_1
- icu=58.2=h9c2bf20_1
- imageio=2.5.0=py37_0
- jasper=2.0.14=h07fcdf6_1
- joblib=0.13.2=py_0
- jpeg=9b=h024ee3a_2
- keras-applications=1.0.7=py_1
- keras-preprocessing=1.0.9=py_1
- kiwisolver=1.1.0=py37hc9558a2_0
- libblas=3.8.0=11_openblas
- libcblas=3.8.0=11_openblas
- libedit=3.1.20181209=hc058e9b_0
- libffi=3.2.1=hd88cf55_4
- libgcc-ng=9.1.0=hdf63c60_0
- libgfortran-ng=7.3.0=hdf63c60_0
- libglu=9.0.0=hf484d3e_1
- liblapack=3.8.0=11_openblas
- libopenblas=0.3.6=h6e990d7_6
- libopus=1.3=h7b6447c_0
- libpng=1.6.37=hbc83047_0
- libprotobuf=3.9.1=h8b12597_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- libtiff=4.0.10=h2733197_2
- libuuid=1.0.3=h1bed415_2
- libvpx=1.7.0=h439df22_0
- libxcb=1.13=h1bed415_1
- libxml2=2.9.9=hea5a465_1
- markdown=3.1.1=py_0
- matplotlib-base=3.1.1=py37hfd891ef_0
- mock=3.0.5=py37_0
- ncurses=6.1=he6710b0_1
- networkx=2.3=py_0
- numpy=1.17.0=py37h95a1406_0
- olefile=0.46=py_0
- openssl=1.1.1d=h516909a_0
- pandas=0.25.3=py37hb3f55d8_0
- pcre=8.43=he6710b0_0
- pillow=6.1.0=py37h34e0f95_0
- pip=19.1.1=py37_0
- pixman=0.38.0=h7b6447c_0
- protobuf=3.9.1=py37he1b5a44_0
- pyparsing=2.4.2=py_0
- python=3.7.3=h0371630_0
- python-dateutil=2.8.0=py_0
- pytz=2019.3=py_0
- pywavelets=1.0.3=py37hd352d35_1
- readline=7.0=h7b6447c_5
- scikit-image=0.15.0=py37hb3f55d8_2
- scikit-learn=0.21.3=py37hcdab131_0
- scikit-video=1.1.11=pyh24bf2e0_0
- scipy=1.3.0=py37h921218d_1
- setuptools=41.0.1=py37_0
- six=1.12.0=py37_1000
- sqlite=3.29.0=h7b6447c_0
- tensorboard=1.13.1=py37_0
- tensorflow=1.13.1=py37_0
- tensorflow-estimator=1.13.0=py_0
- termcolor=1.1.0=py_2
- tk=8.6.9=hed695b0_1002
- toolz=0.10.0=py_0
- tornado=6.0.3=py37h516909a_0
- werkzeug=0.15.5=py_0
- wheel=0.33.4=py37_0
- xz=5.2.4=h14c3975_4
- zlib=1.2.11=h7b6447c_3
- zstd=1.3.7=h0b5b093_0
- pip:
  - imageio-ffmpeg==0.3.0
  - opencv-python==4.1.0.25
  - python-vlc==3.0.7110
  - ssim==0.3.0
@NagabhushanSN95 Thank you. I will check it.