tensorpack / tensorpack

A Neural Net Training Interface on TensorFlow, with focus on speed + flexibility

Debug error using TFLocalCLIDebugHook()

ck6698000 opened this issue

I was trying to debug my session. After I added TFLocalCLIDebugHook() to the callbacks list in my training file (following the official reply in #631 (comment)), it reported the errors below.

1. What you did:

Added 'TFLocalCLIDebugHook()' to train_xxx.py around line 120:
# Create callbacks
callbacks = [
    PeriodicCallback(
        ModelSaver(max_to_keep=10, keep_checkpoint_every_n_hours=1),
        every_k_epochs=cfg.TRAIN.CHECKPOINT_PERIOD),
    # ......
    PathLog(args.logdir),
    TFLocalCLIDebugHook()  # added part
]

Then ran the file with the usual configs.

2. What you observed:

(1) Include the ENTIRE logs here:

[0716 18:24:32 @eval.py:314] [EvalCallback] Will evaluate every 1 epochs
[0716 18:24:32 @base.py:275] Start Epoch 1 ...
  0%|                                                                                                                                           |0/500[00:00<?,?it/s]
Traceback (most recent call last):
  File "/home/mist/wrs/project/ssl_detection/third_party/tensorpack/tensorpack/train/base.py", line 279, in main_loop
    if self.hooked_sess.should_stop():
AttributeError: 'LocalCLIDebugHook' object has no attribute 'should_stop'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_stg1.py", line 199, in <module>
    launch_train_with_config(traincfg, trainer)
  File "/home/mist/wrs/project/ssl_detection/third_party/tensorpack/tensorpack/train/interface.py", line 101, in launch_train_with_config
    extra_callbacks=config.extra_callbacks)
  File "/home/mist/wrs/project/ssl_detection/third_party/tensorpack/tensorpack/train/base.py", line 344, in train_with_defaults
    steps_per_epoch, starting_epoch, max_epoch)
  File "/home/mist/wrs/project/ssl_detection/third_party/tensorpack/tensorpack/train/base.py", line 316, in train
    self.main_loop(steps_per_epoch, starting_epoch, max_epoch)
  File "/home/mist/wrs/project/ssl_detection/third_party/tensorpack/tensorpack/utils/argtools.py", line 168, in wrapper
    return func(*args, **kwargs)
  File "/home/mist/wrs/project/ssl_detection/third_party/tensorpack/tensorpack/train/base.py", line 297, in main_loop
    self.hooked_sess.close()
AttributeError: 'LocalCLIDebugHook' object has no attribute 'close'
MultiProcessMapDataZMQ successfully cleaned-up.
(tfenv)
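
The log above contains two tracebacks joined by "During handling of the above exception, another exception occurred". A minimal sketch of how Python produces that pattern is below. It is not tensorpack code: `FakeHook` and this `main_loop` are hypothetical stand-ins, and the try/finally layout is an assumption made only to reproduce the same shape of output (a failing `should_stop()` call followed by a failing `close()` call during cleanup).

```python
class FakeHook:
    """Hypothetical stand-in for an object that lacks session methods."""


def main_loop(hooked_sess):
    # Assumed layout: the loop checks should_stop() and always
    # calls close() during cleanup.
    try:
        if hooked_sess.should_stop():  # raises the first AttributeError
            return
    finally:
        # Raised while the first error is still propagating, so Python
        # records the first error as __context__ of this one.
        hooked_sess.close()


try:
    main_loop(FakeHook())
except AttributeError as err:
    last = err                # the error reported at the bottom of the log
    first = err.__context__   # the error that actually happened first

print("raised first: ", first)
print("reported last:", last)
```

When reading such a log, the upper traceback (here the missing 'should_stop') is usually the one to investigate; the lower one is a side effect of cleanup running on the same object.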

3. What you expected, if not obvious.

4. Your environment:

/mistgpu/miniconda3/envs/tfenv/lib/python3.6/runpy.py:125: RuntimeWarning: 'tensorpack.tfutils.collect_env' found in sys.modules after import of package 'tensorpack.tfutils', but prior to execution of 'tensorpack.tfutils.collect_env'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))


sys.platform linux
Python 3.6.13 |Anaconda, Inc.| (default, Feb 23 2021, 21:15:04) [GCC 7.3.0]
Tensorpack v0.11-5-ge7c32ae4-dirty @/mistgpu/miniconda3/envs/tfenv/lib/python3.6/site-packages/tensorpack
Numpy 1.16.4
TensorFlow 1.14.0/v1.14.0-rc1-22-gaf24dc91b5 @/mistgpu/miniconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow
TF Compiler Version 4.8.5
TF CUDA support True
TF MKL support False
TF XLA support False
Nvidia Driver /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.39
CUDA libs /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0.130
CUDNN libs /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
TF built with CUDA 10
TF built with CUDNN 7
NCCL libs /usr/lib/x86_64-linux-gnu/libnccl.so.2.6.4
CUDA_VISIBLE_DEVICES Unspecified
GPU 0,1 GeForce RTX 2080 Ti
Free RAM 61.64/62.00 GB
CPU Count 24
cv2 4.5.2
msgpack 1.0.2
python-prctl False


Detecting GPUs using TensorFlow:
2021-07-16 19:16:59.181386: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2021-07-16 19:16:59.211541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.755
pciBusID: 0000:01:00.0
2021-07-16 19:16:59.212600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.755
pciBusID: 0000:03:00.0
2021-07-16 19:16:59.212867: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2021-07-16 19:16:59.214748: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2021-07-16 19:16:59.216237: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2021-07-16 19:16:59.216623: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2021-07-16 19:16:59.218687: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2021-07-16 19:16:59.219889: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2021-07-16 19:16:59.223385: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2021-07-16 19:16:59.226440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
GPUs: /physical_device:GPU:0, /physical_device:GPU:1

Never mind, there is no problem now.