woodfrog / maptracker

Code for paper "MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping", ECCV 2024

Home Page: https://map-tracker.github.io/


av2 stage3 crash during infer

billbliss3 opened this issue · comments

This is awesome work.

However, I recently hit a crash during AV2 stage 3 inference:

2024-04-20 07:13:02,507 - mmdet - INFO - Iter [3400/34040] lr: 4.878e-05, eta: 2 days, 21:50:59, time: 7.116, data_time: 0.236, memory: 18159, cls: 0.3527, reg: 0.7917, d0.cls: 0.4578, d0.reg: 1.4255, d1.cls: 0.3876, d1.reg: 1.0831, d2.cls: 0.3705, d2.reg: 0.9497, d3.cls: 0.3587, d3.reg: 0.8702, d4.cls: 0.3446, d4.reg: 0.8254, seg: 0.4308, seg_dice: 0.1130, cls_t0: 0.3730, reg_t0: 0.8893, d0.cls_t0: 0.8766, d0.reg_t0: 2.3198, d1.cls_t0: 0.6213, d1.reg_t0: 1.4020, d2.cls_t0: 0.4872, d2.reg_t0: 1.1538, d3.cls_t0: 0.4217, d3.reg_t0: 1.0486, d4.cls_t0: 0.3839, d4.reg_t0: 0.9481, seg_t0: 0.6679, seg_dice_t0: 0.1954, cls_t1: 0.3272, reg_t1: 0.7694, d0.cls_t1: 0.4739, d0.reg_t1: 1.4077, d1.cls_t1: 0.4143, d1.reg_t1: 1.1017, d2.cls_t1: 0.3704, d2.reg_t1: 0.9676, d3.cls_t1: 0.3430, d3.reg_t1: 0.8944, d4.cls_t1: 0.3325, d4.reg_t1: 0.8050, seg_t1: 0.4923, seg_dice_t1: 0.1315, cls_t2: 0.2955, reg_t2: 0.7101, d0.cls_t2: 0.4093, d0.reg_t2: 1.2690, d1.cls_t2: 0.3406, d1.reg_t2: 0.9811, d2.cls_t2: 0.3195, d2.reg_t2: 0.8782, d3.cls_t2: 0.3069, d3.reg_t2: 0.7965, d4.cls_t2: 0.2930, d4.reg_t2: 0.7422, seg_t2: 0.4511, seg_dice_t2: 0.1157, cls_t3: 0.3088, reg_t3: 0.8126, d0.cls_t3: 0.4122, d0.reg_t3: 1.3771, d1.cls_t3: 0.3490, d1.reg_t3: 1.0567, d2.cls_t3: 0.3231, d2.reg_t3: 0.9688, d3.cls_t3: 0.3187, d3.reg_t3: 0.8932, d4.cls_t3: 0.3129, d4.reg_t3: 0.8366, seg_t3: 0.4328, seg_dice_t3: 0.1138, total_t0: 11.7887, total_t1: 8.8308, total_t2: 7.9087, total_t3: 8.5163, total_t4: 8.7612, f_trans_t0: 0.1254, b_trans_t0: 0.0976, f_trans_t1: 0.1138, b_trans_t1: 0.0859, f_trans_t2: 0.1246, b_trans_t2: 0.0827, f_trans_t3: 0.1257, b_trans_t3: 0.0897, total: 46.6509, grad_norm: 145.7686
2024-04-20 07:13:23,480 - mmdet - INFO - Saving checkpoint at 3404 iterations
[ ] 0/23519, elapsed: 0s, ETA:
Traceback (most recent call last):
  File "tools/train.py", line 280, in <module>
    main()
  File "tools/train.py", line 269, in main
    custom_train_model(
  File "/data/wk/Project/maptracker/plugin/core/apis/train.py", line 30, in custom_train_model
    custom_train_detector(
  File "/data/wk/Project/maptracker/plugin/core/apis/mmdet_train.py", line 228, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 138, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/data/wk/Project/maptracker/plugin/core/apis/mmdet_train.py", line 49, in train
    self.call_hook('after_train_iter')
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 262, in after_train_iter
    self._do_evaluate(runner)
  File "/data/wk/Project/maptracker/plugin/core/evaluation/eval_hooks.py", line 78, in _do_evaluate
    results = custom_multi_gpu_test(
  File "/data/wk/Project/maptracker/plugin/core/apis/test.py", line 72, in custom_multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/wk/Project/maptracker/plugin/models/mapers/base_mapper.py", line 95, in forward
    return self.forward_test(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/data/wk/Project/maptracker/plugin/models/mapers/MapTracker.py", line 638, in forward_test
    self.temporal_propagate(bev_feats, img_metas, all_history_curr2prev,
  File "/data/wk/Project/maptracker/plugin/models/mapers/MapTracker.py", line 158, in temporal_propagate
    self.memory_bank.trans_memory_bank(self.query_propagate, b_i, img_metas[b_i])
  File "/data/wk/Project/maptracker/plugin/models/mapers/vector_memory.py", line 236, in trans_memory_bank
    relative_seq_pe = self.cached_pe[relative_seq_idx].to(mem_embeds.device)
IndexError: index 100 is out of bounds for dimension 0 with size 100

Hi, thanks for your interest. The relative_seq_idx is the relative frame interval between the current frame and the past frame from the memory. In normal cases, this value should never exceed 100. I can try to diagnose if you provide more information:

(1) Are you running the old split or the new split?
(2) Did stage 2's training and inference both go well, i.e., are the losses normal and the testing results reasonable?

@woodfrog I am using the official av2 stage3 old_split train config, and it seems stage2 works well.

Thanks. The log shows [ ] 0/23519, elapsed: 0s, suggesting that you changed some settings (maybe by accident), such as the data interval. With the default setting, the number of test frames is slightly less than 6000 -- AV2 frames are uniformly sub-sampled to keep the same frame rate as nuScenes.

The model should work well with a higher frame rate, but it might trigger some underlying tiny bugs. Can you confirm the test settings you are using? Then I will see if I can reproduce the error and fix it.
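The sub-sampling arithmetic above can be sketched as follows. This is a hypothetical illustration, not the project's actual loader code: the interval value of 4 is an assumption chosen because it turns the 23519 frames from the log into roughly the "slightly less than 6000" test frames mentioned above.

```python
# Hypothetical sketch: sub-sampling AV2 frames with a fixed interval so the
# effective frame rate matches nuScenes. Names (`subsample`, `interval`) are
# illustrative, not the exact variables in the MapTracker dataset code.

def subsample(samples, interval):
    """Keep every `interval`-th sample of a sequence."""
    return samples[::interval]

all_frames = list(range(23519))  # full AV2 val frames, as in the log above
kept = subsample(all_frames, 4)  # interval=4 -> 5880 test frames,
print(len(kept))                 # i.e. "slightly less than 6000"
```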

You are right. I found the reason: the AV2 old split loads maptr_info.pkl, and the samples built at line 65 of argo_dataset.py do not contain self.interval.

https://github.com/woodfrog/maptracker/blob/main/plugin/datasets/argo_dataset.py#L65C17-L65C81

Yes, for the old split, the test samples are "hard coded" to ensure they are the same as those used in the MapTR codebase (the most popular codebase for this task), so self.interval is not used there.

I'm sorry I forgot to commit the "maptr_info.pkl" file. It contains the metadata directly exported from the MapTR codebase. I just added it, can you try again?

For that "frame interval out of index" issue: relative_seq_idx takes the value of seq_id for invalid memory entries, and those values are masked out during memory fusion, so they are never actually used. But when seq_id grows beyond 100 in a very long sequence, the indexing itself already fails.

My original assumption was that all sequences would be shorter than 100 frames, so I set 100 as the length of the pre-computed positional encodings. I will change it to 1000 so that inference won't break on longer test sequences.
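The fix described above can be sketched in plain Python (the real code indexes a torch tensor, self.cached_pe). MAX_SEQ_LEN, EMBED_DIM, and the defensive clamp are illustrative assumptions, not MapTracker's exact implementation:

```python
# Hedged sketch: pre-compute positional encodings for a larger maximum
# sequence length, and defensively clamp the relative frame index so that
# placeholder indices of invalid (later-masked) memory entries can never
# index out of bounds. All names and sizes here are assumptions.
MAX_SEQ_LEN = 1000   # enlarged from 100 so long AV2 sequences fit
EMBED_DIM = 4        # toy embedding width for illustration

# Pre-computed positional encodings, one row per relative frame interval.
cached_pe = [[float(i)] * EMBED_DIM for i in range(MAX_SEQ_LEN)]

def lookup_pe(relative_seq_idx):
    """Fetch PEs for a list of relative frame intervals.

    Invalid memory entries carry a placeholder index (the raw seq_id);
    they are masked out later in memory fusion, so clamping them here
    only prevents the IndexError without changing the fused result.
    """
    safe_idx = [min(i, MAX_SEQ_LEN - 1) for i in relative_seq_idx]
    return [cached_pe[i] for i in safe_idx]

pes = lookup_pe([0, 5, 150, 2000])  # 2000 would previously raise IndexError
```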

It seems the AV2 dataset is missing some timestamps.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/data/wk/Project/maptracker/plugin/datasets/argo_dataset.py", line 31, in __init__
    super().__init__(**kwargs)
  File "/data/wk/Project/maptracker/plugin/datasets/base_dataset.py", line 62, in __init__
    self.load_annotations(self.ann_file)
  File "/data/wk/Project/maptracker/plugin/datasets/argo_dataset.py", line 65, in load_annotations
    samples = [unique_token2samples[x] for x in maptr_unique_tokens]
  File "/data/wk/Project/maptracker/plugin/datasets/argo_dataset.py", line 65, in <listcomp>
    samples = [unique_token2samples[x] for x in maptr_unique_tokens]
KeyError: '15ec0778-826e-3ed7-9775-54fbf66997f4_315970274060083000'

That's weird. It seems our AV2 datasets are a bit different. Can you check how many MapTR test samples are available in your current AV2 dataset? Something like samples = [unique_token2samples[x] for x in maptr_unique_tokens if x in unique_token2samples] -- if the resulting samples list is empty, there are probably some naming inconsistencies.
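The suggested diagnostic can be expanded into a runnable sketch. The variable names mirror argo_dataset.py, but the dictionary contents here are a toy example (the token strings are made up), not real AV2 metadata:

```python
# Hedged sketch: report which MapTR tokens are absent from the locally
# prepared AV2 info file before building the sample list, instead of
# letting the list comprehension raise a KeyError.
unique_token2samples = {
    "log-a_315970274060083000": {"frame": 0},
    "log-a_315970274560083000": {"frame": 1},
}
maptr_unique_tokens = [
    "log-a_315970274060083000",
    "log-a_315970274560083000",
    "log-b_315970275060083000",  # missing locally -> would raise KeyError
]

# List the tokens with no matching local sample.
missing = [x for x in maptr_unique_tokens if x not in unique_token2samples]
for token in missing:
    print(f"Missing sample for token: {token}")

# Keep only the tokens that exist locally (the "skip the sample" option).
samples = [unique_token2samples[x] for x in maptr_unique_tokens
           if x in unique_token2samples]
print(f"{len(samples)} of {len(maptr_unique_tokens)} tokens resolved")
```

If samples comes back empty, the token naming schemes differ; if only a few entries are missing, the underlying data was likely filtered or incompletely downloaded.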

I have printed all the missing timestamps. Only one is missing.

The log is as follows:
Prepare sequence information for ./datasets/av2/av2_map_infos_train.pkl
15ec0778-826e-3ed7-9775-54fbf66997f4_315970274060083000

Total length of val is 23519

The total length of val in my AV2 is 23522, so the difference comes from the metadata generated by the data converter. That converter is borrowed from StreamMapNet's codebase without modification and applies some filtering to discard invalid data. The download probably failed for part of your data, so those samples were filtered out, leading to the different sample counts.

In your case, I can think of two potential solutions:
(1) Skip that single sample, although it slightly changes the test set.
(2) Check the downloaded data and re-download the broken samples.

Thanks for your advice and help!