run train.py error

warmingkkk opened this issue a year ago

Hello author,
When I run train.py, it prints "Ground truth poses are not available for sequence xx." and then fails with:
ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'pykitti.odometry.CalibData'>: attribute lookup CalibData on pykitti.odometry failed
I processed the dataset following the steps in the README and could not find a solution online. Can you help me?

Rui Li · Answer 1 · Tue Jul 25 2023 17:18:56 GMT+0800 (China Standard Time)

Hi,

It seems that something goes wrong in the GT loading process.

Can you double-check that the calib.txt file and poses_dvso are placed in the right directories?
Which PyKitti version do you use? 0.3.1 is suitable for our implementation.
It would help if you could post which line raises the error so I can take a look at it.

warmingkkk · Answer 2 · Tue Jul 25 2023 18:52:01 GMT+0800 (China Standard Time)

I double-checked the directories and they do not seem to be misplaced. My PyKitti version is also 0.3.1.
The GT data I downloaded was 14 GB, and then I ran preprocess_kitti_transfer_gtdepth_to_odom.py.
My directory layout is as follows:

data
 └─dataset
     ├─poses_dvso
     │  ├─00.txt
     │  ├─......
     │  └─21.txt
     └─sequences
         ├─00
         │  ├─calib.txt
         │  ├─times.txt
         │  ├─image_2
         │  ├─image_depth_annotated
         │  └─mvobj_mask
         ├─......
         ├─10
         │  ├─calib.txt
         │  ├─times.txt
         │  ├─image_2
         │  ├─image_depth_annotated
         │  └─mvobj_mask
         ├─11
         │  ├─calib.txt
         │  ├─times.txt
         │  └─image_2
         ├─......
         └─21
             ├─calib.txt
             ├─times.txt
             └─image_2

poses_dvso is downloaded from the link in the README;
the calib.txt and times.txt files in each sequence come from data_odometry_calib (1 MB);
image_2 comes from data_odometry_color (65 GB);
image_depth_annotated is generated by running preprocess_kitti_transfer_gtdepth_to_odom.py.
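
A layout like the one above can be checked mechanically before training. The following is only a minimal sketch: the helper name missing_entries and its parameters are made up for illustration and are not part of the repository, and the expected entries are taken from the directory tree in this post.

```python
from pathlib import Path


def missing_entries(root, sequences, with_depth=()):
    """Return the expected files/folders (relative to root) that do not exist.

    Hypothetical helper; the expected entries mirror the tree posted above.
    """
    root = Path(root)
    expected = []
    for seq in sequences:
        seq_dir = root / "sequences" / seq
        expected += [
            root / "poses_dvso" / f"{seq}.txt",
            seq_dir / "calib.txt",
            seq_dir / "times.txt",
            seq_dir / "image_2",
        ]
        # In the posted tree only sequences 00..10 carry these two folders.
        if seq in with_depth:
            expected += [seq_dir / "image_depth_annotated", seq_dir / "mvobj_mask"]
    return [str(p.relative_to(root)) for p in expected if not p.exists()]


if __name__ == "__main__":
    all_seqs = [f"{i:02d}" for i in range(22)]  # 00 .. 21
    depth_seqs = all_seqs[:11]                  # 00 .. 10
    for path in missing_entries("data/dataset", all_seqs, depth_seqs):
        print("missing:", path)
```

An empty result only means the names are in place. It says nothing about whether the files themselves are complete.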

warmingkkk · Answer 3 · Thu Jul 27 2023 19:26:45 GMT+0800 (China Standard Time)

Sorry, in my previous reply I forgot to say which line raised the error. Here is the full error message:

Traceback (most recent call last):
  File "E:\PycharmProjects\dynamic-multiframe-depth-main\train.py", line 75, in <module>
    main(config, config.args.options)
  File "E:\PycharmProjects\dynamic-multiframe-depth-main\train.py", line 54, in main
    trainer.train()
  File "E:\PycharmProjects\dynamic-multiframe-depth-main\base\base_trainer.py", line 73, in train
    result = self._train_epoch(epoch)
  File "E:\PycharmProjects\dynamic-multiframe-depth-main\trainer\trainer.py", line 84, in _train_epoch
    for batch_idx, (data, target) in enumerate(self.data_loader):
  File "C:\Users\4501\.conda\envs\dymultidepth\lib\site-packages\torch\utils\data\dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "C:\Users\4501\.conda\envs\dymultidepth\lib\site-packages\torch\utils\data\dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\4501\.conda\envs\dymultidepth\lib\site-packages\torch\utils\data\dataloader.py", line 801, in __init__
    w.start()
  File "C:\Users\4501\.conda\envs\dymultidepth\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\4501\.conda\envs\dymultidepth\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\4501\.conda\envs\dymultidepth\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\4501\.conda\envs\dymultidepth\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\4501\.conda\envs\dymultidepth\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'pykitti.odometry.CalibData'>: attribute lookup CalibData on pykitti.odometry failed
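
The last frames of this traceback (popen_spawn_win32.py, reduction.dump) show the DataLoader worker being started by pickling the process object, and pickle can only serialize a class it can look up by name in its module. The sketch below reproduces the same kind of failure with the standard library alone. It assumes CalibData is a namedtuple type created at load time, which the error message suggests but the thread does not confirm; load_calib and its field names are hypothetical.

```python
import pickle
from collections import namedtuple


def load_calib():
    # Hypothetical stand-in for a loader that builds its result type at call time.
    # The class exists only inside this function, not as a module attribute.
    CalibData = namedtuple("CalibData", ["P_rect_20", "b_rgb"])
    return CalibData(P_rect_20=[1.0, 0.0, 0.0], b_rgb=0.54)


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return None
    except (pickle.PicklingError, AttributeError) as exc:
        return str(exc)


calib = load_calib()
error = try_pickle(calib)       # fails: pickle cannot find CalibData in the module
as_dict = calib._asdict()       # a plain mapping holds the same values
restored = pickle.loads(pickle.dumps(as_dict))

print("pickling the namedtuple:", error)
print("pickling the dict copy :", restored)
```

When workers are forked instead of spawned, the dataset object is inherited by the child process rather than pickled, so this lookup never happens.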

Rui Li · Answer 4 · Thu Jul 27 2023 23:44:30 GMT+0800 (China Standard Time)

Thanks for sharing. The data structure looks fine, but I noticed you are using Windows. The error may be caused by the different multiprocessing behavior on Linux and Windows: on Windows the DataLoader worker processes are spawned, so the dataset object has to be pickled. You can try solving the issue by:

Using Linux instead
Setting num_workers=0 in the DataLoader to disable multiprocessing
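
For the second option, the config printed later in this thread shows num_workers under the args of both data_loader and val_data_loader. A minimal sketch of overriding both, assuming the config is available as a nested dict of that shape (disable_workers is a hypothetical helper, and the dict below is abbreviated to the relevant keys):

```python
import copy


def disable_workers(config):
    """Return a copy of config with num_workers=0 in every data loader section."""
    patched = copy.deepcopy(config)
    for section in ("data_loader", "val_data_loader"):
        if section in patched:
            patched[section]["args"]["num_workers"] = 0
    return patched


# Abbreviated config with the values from the dump in this thread.
config = {
    "data_loader": {
        "type": "KittiOdometryDataloader",
        "args": {"batch_size": 8, "num_workers": 16},
    },
    "val_data_loader": {
        "type": "KittiOdometryDataloader",
        "args": {"batch_size": 16, "num_workers": 2},
    },
}

patched = disable_workers(config)
print(patched["data_loader"]["args"], patched["val_data_loader"]["args"])
```

Editing the two values directly in the config file has the same effect. With num_workers=0 the data is loaded in the main process, which avoids pickling the dataset but is slower.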

Feel free to give additional feedback if you encounter further issues.

warmingkkk · Answer 5 · Sat Aug 05 2023 19:52:58 GMT+0800 (China Standard Time)

This is tough. This part still reports an error after I run it on Linux:

OrderedDict([('name', 'dy_multi_depth'), ('n_gpu', 8), ('arch', OrderedDict([('type', 'DyMultiDepthModel'), ('args', OrderedDict([('pretrain_mode', 1), ('pretrain_dropout', 0.0), ('augmentation', 'depth'), ('use_mono', True), ('use_stereo', False), ('checkpoint_location', []), ('fusion_type', 'ccf_fusion'), ('input_size', [256, 512]), ('freeze_backbone', False), ('backbone_type', 'efficientnetb5')]))])), ('data_loader', OrderedDict([('type', 'KittiOdometryDataloader'), ('args', OrderedDict([('dataset_dir', '../data/dataset/'), ('depth_folder', 'image_depth_annotated'), ('batch_size', 8), ('frame_count', 2), ('shuffle', True), ('validation_split', 0), ('num_workers', 16), ('sequences', ['01', '02', '06', '08', '09', '10']), ('target_image_size', [256, 512]), ('use_color', True), ('use_color_augmentation', True), ('use_dso_poses', True), ('lidar_depth', True), ('dso_depth', False), ('return_stereo', False), ('return_mvobj_mask', True)]))])), ('val_data_loader', OrderedDict([('type', 'KittiOdometryDataloader'), ('args', OrderedDict([('dataset_dir', '../data/dataset/'), ('depth_folder', 'image_depth_annotated'), ('batch_size', 16), ('frame_count', 2), ('shuffle', False), ('validation_split', 0), ('num_workers', 2), ('sequences', ['00', '04', '05', '07']), ('target_image_size', [256, 512]), ('max_length', 32), ('use_color', True), ('use_color_augmentation', True), ('use_dso_poses', True), ('lidar_depth', True), ('dso_depth', False), ('return_stereo', False), ('return_mvobj_mask', True)]))])), ('optimizer', OrderedDict([('type', 'Adam'), ('args', OrderedDict([('lr', 0.0001), ('weight_decay', 0), ('amsgrad', True)]))])), ('loss', 'abs_silog_loss_virtualnormal'), ('metrics', ['a1_sparse_metric', 'abs_rel_sparse_metric', 'rmse_sparse_metric']), ('lr_scheduler', OrderedDict([('type', 'StepLR'), ('args', OrderedDict([('step_size', 65), ('gamma', 0.1)]))])), ('trainer', OrderedDict([('compute_mask', False), ('compute_stereo_pred', False), ('epochs', 80), ('save_dir', 
'../saved_model/'), ('save_period', 1), ('verbosity', 2), ('log_step', 4800), ('val_log_step', 40), ('alpha', 0.5), ('max_distance', 80), ('monitor', 'min abs_rel_sparse_metric'), ('timestamp_replacement', '00'), ('tensorboard', True)]))])
Ground truth poses are not avaialble for sequence 01.
Ground truth poses are not avaialble for sequence 02.
Ground truth poses are not avaialble for sequence 06.
Ground truth poses are not avaialble for sequence 08.
Ground truth poses are not avaialble for sequence 09.
Ground truth poses are not avaialble for sequence 10.
Ground truth poses are not avaialble for sequence 00.
Ground truth poses are not avaialble for sequence 04.
Ground truth poses are not avaialble for sequence 05.
Ground truth poses are not avaialble for sequence 07.

Rui Li · Answer 6 · Sat Aug 05 2023 21:20:13 GMT+0800 (China Standard Time)

Does the training/testing procedure go well? Actually, "Ground truth poses are not available for sequence xx" is not an error but a warning: you are not using the KITTI ground-truth poses but the poses downloaded from DVSO. The messages you pasted will not interrupt the training/inference process.

warmingkkk · Answer 7 · Sat Aug 05 2023 21:48:04 GMT+0800 (China Standard Time)

Thank you very much for your reply.
Training does not run yet. I now get a CUDA error that I am trying to solve. I am using CUDA 10.2 and the corresponding cuDNN version. Can you tell me which version you are using?
I originally installed CUDA through conda, but nvcc -V showed nothing, so I installed CUDA and cuDNN in my user directory and then configured the environment.
Here is the environment I'm running in:

conda list

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
absl-py                   1.4.0                    pypi_0    pypi
backcall                  0.2.0              pyhd3eb1b0_0  
ca-certificates           2023.05.30           h06a4308_0  
cachetools                5.3.1                    pypi_0    pypi
certifi                   2022.12.7        py37h06a4308_0  
charset-normalizer        3.2.0                    pypi_0    pypi
colour-demosaicing        0.1.6                    pypi_0    pypi
colour-science            0.3.16                   pypi_0    pypi
cudatoolkit               10.2.89              hfd86e86_1  
cudnn                     7.6.5                cuda10.2_0  
cycler                    0.11.0                   pypi_0    pypi
decorator                 5.1.1              pyhd3eb1b0_0  
fonttools                 4.38.0                   pypi_0    pypi
google-auth               2.22.0                   pypi_0    pypi
google-auth-oauthlib      0.4.6                    pypi_0    pypi
grpcio                    1.56.2                   pypi_0    pypi
idna                      3.4                      pypi_0    pypi
imageio                   2.31.1                   pypi_0    pypi
importlib-metadata        6.7.0                    pypi_0    pypi
ipython                   7.31.1           py37h06a4308_1  
jedi                      0.18.1           py37h06a4308_1  
kiwisolver                1.4.4                    pypi_0    pypi
kornia                    0.5.11                   pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1  
libffi                    3.4.4                h6a678d5_0  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libstdcxx-ng              11.2.0               h1234567_1  
markdown                  3.4.4                    pypi_0    pypi
markupsafe                2.1.3                    pypi_0    pypi
matplotlib                3.5.3                    pypi_0    pypi
matplotlib-inline         0.1.6            py37h06a4308_0  
ncurses                   6.4                  h6a678d5_0  
networkx                  2.6.3                    pypi_0    pypi
numpy                     1.21.6                   pypi_0    pypi
oauthlib                  3.2.2                    pypi_0    pypi
opencv-python             4.8.0.74                 pypi_0    pypi
openssl                   1.1.1v               h7f8727e_0  
packaging                 23.1                     pypi_0    pypi
pandas                    1.3.5                    pypi_0    pypi
parso                     0.8.3              pyhd3eb1b0_0  
pexpect                   4.8.0              pyhd3eb1b0_3  
pickleshare               0.7.5           pyhd3eb1b0_1003  
pillow                    9.5.0                    pypi_0    pypi
pip                       22.3.1           py37h06a4308_0  
prompt-toolkit            3.0.36           py37h06a4308_0  
protobuf                  3.20.3                   pypi_0    pypi
ptyprocess                0.7.0              pyhd3eb1b0_2  
pyasn1                    0.5.0                    pypi_0    pypi
pyasn1-modules            0.3.0                    pypi_0    pypi
pygments                  2.11.2             pyhd3eb1b0_0  
pykitti                   0.3.1                    pypi_0    pypi
pyparsing                 3.1.1                    pypi_0    pypi
python                    3.7.16               h7a1cb2a_0  
python-dateutil           2.8.2                    pypi_0    pypi
pytz                      2023.3                   pypi_0    pypi
pywavelets                1.3.0                    pypi_0    pypi
readline                  8.2                  h5eee18b_0  
requests                  2.31.0                   pypi_0    pypi
requests-oauthlib         1.3.1                    pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
scikit-image              0.19.3                   pypi_0    pypi
scipy                     1.7.3                    pypi_0    pypi
setuptools                65.6.3           py37h06a4308_0  
six                       1.16.0                   pypi_0    pypi
sqlite                    3.41.2               h5eee18b_0  
tensorboard               2.11.2                   pypi_0    pypi
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
tensorboardx              2.6.2                    pypi_0    pypi
tifffile                  2021.11.2                pypi_0    pypi
tk                        8.6.12               h1ccaba5_0  
torch                     1.7.1                    pypi_0    pypi
torchvision               0.8.2                    pypi_0    pypi
tqdm                      4.65.0                   pypi_0    pypi
traitlets                 5.7.1            py37h06a4308_0  
typing-extensions         4.7.1                    pypi_0    pypi
urllib3                   1.26.16                  pypi_0    pypi
wcwidth                   0.2.5              pyhd3eb1b0_0  
werkzeug                  2.2.3                    pypi_0    pypi
wheel                     0.38.4           py37h06a4308_0  
xz                        5.4.2                h5eee18b_0  
zipp                      3.15.0                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_0

GPU information:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:25:00.0 Off |                    0 |
| N/A   37C    P0    38W / 250W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:5B:00.0 Off |                    0 |
| N/A   40C    P0    25W / 250W |      2MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  Off  | 00000000:C8:00.0 Off |                    0 |
| N/A   37C    P0    38W / 250W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

My CUDA version information:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

The error message is very long, so I copied only the bottom part:

Traceback (most recent call last):
  File "train.py", line 75, in <module>
    main(config, config.args.options)
  File "train.py", line 54, in main
    trainer.train()
  File "/home/xxx/project/dynamic-multiframe-depth/base/base_trainer.py", line 73, in train
    result = self._train_epoch(epoch)
  File "/home/xxx/project/dynamic-multiframe-depth/trainer/trainer.py", line 95, in _train_epoch
    data = self.model(data)
  File "/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 160, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 22, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/nn/parallel/comm.py", line 56, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: CUDA error: invalid device function

It looks like a version compatibility problem, and I am trying to solve it. This warning appears earlier in the output:

8976777 trainable parameters
20666289 total parameters
/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
NVIDIA A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
If you want to use the NVIDIA A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

Thanks again for your reply!!!

Rui Li · Answer 8 · Sun Aug 06 2023 01:52:52 GMT+0800 (China Standard Time)

Yes, it is a compatibility issue. You may need a newer CUDA version. You can also consider installing a cudatoolkit with a suitable version in your virtual environment, which is more convenient.
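
The mismatch can be read directly from the warning: the installed PyTorch build lists sm_37 through sm_75, while the A100 reports sm_80. The sketch below compares the two using the values from this thread. The rule it applies (same major version, build minor not higher than the device minor) is a simplified heuristic and not PyTorch's own check; is_supported and parse_arch are hypothetical helpers.

```python
def parse_arch(arch):
    """Turn an arch string such as 'sm_75' into a (major, minor) tuple."""
    digits = arch.split("_")[1]
    return int(digits[:-1]), int(digits[-1])


def is_supported(capability, arch_list):
    """Simplified heuristic: a build arch matches if it has the same major
    version and a minor version not higher than the device's."""
    major, minor = capability
    build = [parse_arch(a) for a in arch_list if a.startswith("sm_")]
    return any(b_major == major and b_minor <= minor for b_major, b_minor in build)


# Arch list copied from the UserWarning in this thread.
build_archs = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]

# Device capabilities: sm_80 for the A100 is from the warning,
# 6.0 for the Tesla P100 is its published compute capability.
gpus = {"A100": (8, 0), "P100": (6, 0)}

for name, cap in gpus.items():
    print(name, "supported by this build:", is_supported(cap, build_archs))
```

In a live environment the same inputs would come from torch.cuda.get_arch_list() and torch.cuda.get_device_capability(). A PyTorch build that can run on the A100 would include sm_80 in that list.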

warmingkkk · Answer 9 · Sun Aug 06 2023 14:56:45 GMT+0800 (China Standard Time)

The code is working, thanks for your help!!!

8976777 trainable parameters
20666289 total parameters
/home/xxxg/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:3: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if not hasattr(tensorboard, '__version__') or LooseVersion(tensorboard.__version__) < LooseVersion('1.15'):
/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:3: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if not hasattr(tensorboard, '__version__') or LooseVersion(tensorboard.__version__) < LooseVersion('1.15'):
/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/nn/functional.py:3385: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
  warnings.warn("Default grid_sample and affine_grid behavior has changed "
/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/nn/functional.py:1628: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
  warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/nn/functional.py:2952: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
/home/xxx/miniconda3/envs/dymultidepth/lib/python3.7/site-packages/torch/cuda/nccl.py:48: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
  if not isinstance(inputs, collections.Container) or isinstance(inputs, torch.Tensor):
Train Epoch: 1 [0/13666 (0%)] Loss: 343.826538 Loss_dict: {'sdl_0': tensor(10.8141, device='cuda:0'), 'vnl_0': tensor(2.1626, device='cuda:0'), 'sdl_mono_0': tensor(7.3844, device='cuda:0'), 'vnl_mono_0': tensor(2.1568, device='cuda:0'), 'sdl_1': tensor(11.3660, device='cuda:0'), 'vnl_1': tensor(1.9663, device='cuda:0'), 'sdl_mono_1': tensor(10.1666, device='cuda:0'), 'vnl_mono_1': tensor(1.9820, device='cuda:0'), 'sdl_2': tensor(10.4781, device='cuda:0'), 'vnl_2': tensor(2.0398, device='cuda:0'), 'sdl_mono_2': tensor(8.7291, device='cuda:0'), 'vnl_mono_2': tensor(2.1686, device='cuda:0'), 'sdl_3': tensor(10.4102, device='cuda:0'), 'vnl_3': tensor(2.1297, device='cuda:0'), 'sdl_mono_3': tensor(12.4303, device='cuda:0'), 'vnl_mono_3': tensor(2.1056, device='cuda:0'), 'loss': tensor(343.8265, device='cuda:0')}
    epoch          : 1
    loss           : 84.4102581151401
    a1_sparse_metric: 0.8321580328881052
    abs_rel_sparse_metric: 0.19655859551456256
    rmse_sparse_metric: 5.316089861570149
    loss_vnl_mono_3: 0.4962688555418922
    loss_vnl_mono_2: 0.39543731314456554
    loss_sdl_mono_3: 3.794278452311293
    loss_loss      : 84.41030939145699
    loss_sdl_2     : 1.780937145717525
    loss_sdl_3     : 1.896988513247879
    loss_vnl_2     : 0.2154485480418465
    loss_sdl_mono_0: 3.127899687774283
    loss_vnl_0     : 0.2292159400256912
    loss_sdl_mono_2: 3.1041853331626683
    loss_sdl_0     : 1.8299175550394968
    loss_sdl_1     : 1.8029068019949532
    loss_vnl_1     : 0.22366136581036974
    loss_sdl_mono_1: 3.1103964192327385
    loss_vnl_mono_0: 0.42073565327402723
    loss_vnl_3     : 0.23219703776994038
    loss_vnl_mono_1: 0.4072599907657987
    val_loss       : 64.58046007156372
    val_a1_sparse_metric: 0.8827192336320877
    val_abs_rel_sparse_metric: 0.10185734834522009
    val_rmse_sparse_metric: 3.7820256650447845
    val_loss_vnl_mono_3: 0.4175424575805664
    val_loss_vnl_mono_2: 0.3316922187805176
    val_loss_sdl_mono_3: 2.88517427444458
    val_loss_loss  : 64.58045959472656
    val_loss_sdl_2 : 1.3693907260894775
    val_loss_sdl_3 : 1.5029700994491577
    val_loss_vnl_2 : 0.1644209921360016
    val_loss_sdl_mono_0: 2.355807065963745
    val_loss_vnl_0 : 0.1604299396276474
    val_loss_sdl_mono_2: 2.413699150085449
    val_loss_sdl_0 : 1.3518812656402588
    val_loss_sdl_1 : 1.3781380653381348
    val_loss_vnl_1 : 0.16939757764339447
    val_loss_sdl_mono_1: 2.371560573577881
    val_loss_vnl_mono_0: 0.31747955083847046
    val_loss_vnl_3 : 0.19623839855194092
    val_loss_vnl_mono_1: 0.3087731599807739
Saving checkpoint: ../saved_model/models/dy_multi_depth/00/checkpoint.pth ...
Saving current best: model_best.pth ...

Rui Li · Answer 10 · Mon Aug 28 2023 21:30:57 GMT+0800 (China Standard Time)

Glad to hear that. I will close this issue. Feel free to reopen it if you have further questions :)