exiawsh / StreamPETR

[ICCV 2023] StreamPETR: Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection

How does frequency influence the model?

zjr-bit opened this issue

Hi, thanks for your great work!
I'm trying to train and test StreamPETR on a custom dataset. The custom dataset runs at 10 Hz, but the key frames of nuScenes are at 2 Hz. I noticed that others have asked this question, but I didn't get the key points. So I want to ask how the frame rate influences the model, and how I should modify the code.
If you could give me some hints, I would be very grateful!

I have successfully loaded my custom dataset for training and testing, but the metrics are very low. Could you please give me some hints? Below is my config file; I trained for 120 epochs on a small dataset (only 1792 frames), and the AP of car is less than 0.1.

_base_ = [
    '../../../mmdetection3d/configs/_base_/datasets/nus-3d.py',
    '../../../mmdetection3d/configs/_base_/default_runtime.py'
]
backbone_norm_cfg = dict(type='LN', requires_grad=True)
plugin=True
plugin_dir='projects/mmdet3d_plugin/'

# If point cloud range is changed, the models should also change their point
# cloud range accordingly
point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]
voxel_size = [0.2, 0.2, 8]
img_norm_cfg = dict(
    mean=[103.530, 116.280, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)
# For nuScenes we usually do 10-class detection
class_names = [
    'car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier',
    'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
]

num_gpus = 4
batch_size = 2
num_iters_per_epoch = 1792 // (num_gpus * batch_size)
num_epochs = 150
pretrained = "ckpts/r101_dcn_backbone.pth"
queue_length = 1
num_frame_losses = 1
collect_keys=['lidar2img', 'intrinsics', 'extrinsics','timestamp', 'img_timestamp', 'ego_pose', 'ego_pose_inv']
input_modality = dict(
    use_lidar=False,
    use_camera=True,
    use_radar=False,
    use_map=False,
    use_external=True)
model = dict(
    type='myPetr3D',
    num_frame_head_grads=num_frame_losses,
    num_frame_backbone_grads=num_frame_losses,
    num_frame_losses=num_frame_losses,
    use_grid_mask=True,
    img_backbone=dict(
        type='ResNet',
        depth=101,
        num_stages=4,
        out_indices=(2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN2d', requires_grad=False),
        norm_eval=True,
        style='caffe',
        dcn=dict(type='DCNv2', deform_groups=1, fallback_on_stride=False), # original DCNv2 will print log when perform load_state_dict
        stage_with_dcn=(False, False, True, True),
        with_cp=False),
    img_neck=dict(
        type='CPFPN',  ###remove unused parameters 
        in_channels=[1024, 2048],
        out_channels=256,
        num_outs=2),
    pts_bbox_head=dict(
        type='StreamPETRHead',
        num_classes=10,
        in_channels=256,
        num_query=644,
        memory_len=1024,
        topk_proposals=256,
        num_propagated=256,
        with_ego_pos=True,
        match_with_velo=False,
        scalar=10, ##noise groups
        noise_scale = 1.0, 
        dn_weight= 1.0, ##dn loss weight
        split = 0.75, ###positive rate
        LID=True,
        with_position=True,
        position_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0],
        code_weights = [2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
        transformer=dict(
            type='PETRTemporalTransformer',
            decoder=dict(
                type='PETRTransformerDecoder',
                return_intermediate=True,
                num_layers=6,
                transformerlayers=dict(
                    type='PETRTemporalDecoderLayer',
                    attn_cfgs=[
                        dict(
                            type='MultiheadAttention',
                            embed_dims=256,
                            num_heads=8,
                            dropout=0.1),
                        dict(
                            type='PETRMultiheadFlashAttention',
                            embed_dims=256,
                            num_heads=8,
                            dropout=0.1),
                        ],
                    feedforward_channels=2048,
                    ffn_dropout=0.1,
                    with_cp=True,  ###use checkpoint to save memory
                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
                                     'ffn', 'norm')),
            )),
        bbox_coder=dict(
            type='NMSFreeCoder',
            post_center_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0],
            pc_range=point_cloud_range,
            max_num=300,
            voxel_size=voxel_size,
            num_classes=10), 
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=2.0),
        loss_bbox=dict(type='L1Loss', loss_weight=0.25),
        loss_iou=dict(type='GIoULoss', loss_weight=0.0),),
    # model training and testing settings
    train_cfg=dict(pts=dict(
        grid_size=[512, 512, 1],
        voxel_size=voxel_size,
        point_cloud_range=point_cloud_range,
        out_size_factor=4,
        assigner=dict(
            type='HungarianAssigner3D',
            cls_cost=dict(type='FocalLossCost', weight=2.0),
            reg_cost=dict(type='BBox3DL1Cost', weight=0.25),
            iou_cost=dict(type='IoUCost', weight=0.0), # Fake cost. This is just to make it compatible with DETR head. 
            pc_range=point_cloud_range),)))


dataset_type = 'MyDataset'
data_root = 'data/mydataset/'
train_info_file = data_root + 'my_infos_temporal_train.pkl'
val_info_file = data_root + 'my_infos_temporal_val.pkl'
split_datas_file = 'data/split-data-spd.json'

file_client_args = dict(backend='disk')


ida_aug_conf = {
        "resize_lim": (0.8, 1.0),
        "final_dim": (512, 1408),
        "bot_pct_lim": (0.0, 0.0),
        "rot_lim": (0.0, 0.0),
        "H": 1080,
        "W": 1920,
        "rand_flip": True,
    }
train_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, with_bbox=False,
        with_label=False, with_bbox_depth=False),
    dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range),
    dict(type='ObjectNameFilter', classes=class_names),
    dict(type='ResizeCropFlipRotImage', with_2d=False, data_aug_conf = ida_aug_conf, training=True),
    dict(type='GlobalRotScaleTransImage',
            rot_range=[-0.3925, 0.3925],
            translation_std=[0, 0, 0],
            scale_ratio_range=[0.95, 1.05],
            reverse_angle=True,
            training=True,
            ),
    dict(type='NormalizeMultiviewImage', **img_norm_cfg),
    dict(type='PadMultiViewImage', size_divisor=32),
    dict(type='PETRFormatBundle3D', class_names=class_names, collect_keys=collect_keys + ['prev_exists']),
    dict(type='Collect3D', keys=['gt_bboxes_3d', 'gt_labels_3d', 'img', 'prev_exists'] + collect_keys,
             meta_keys=('filename', 'ori_shape', 'img_shape', 'pad_shape', 'scale_factor', 'flip', 'box_mode_3d', 'box_type_3d', 'img_norm_cfg', 'scene_token', 'gt_bboxes_3d','gt_labels_3d'))
]
test_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(type='ResizeCropFlipRotImage', with_2d=False, data_aug_conf = ida_aug_conf, training=False),
    dict(type='NormalizeMultiviewImage', **img_norm_cfg),
    dict(type='PadMultiViewImage', size_divisor=32),
    dict(
        type='MultiScaleFlipAug3D',
        img_scale=(1333, 800),
        pts_scale_ratio=1,
        flip=False,
        transforms=[
            dict(
                type='PETRFormatBundle3D',
                collect_keys=collect_keys,
                class_names=class_names,
                with_label=False),
            dict(type='Collect3D', keys=['img'] + collect_keys,
            meta_keys=('filename', 'ori_shape', 'img_shape','pad_shape', 'scale_factor', 'flip', 'box_mode_3d', 'box_type_3d', 'img_norm_cfg', 'scene_token'))
        ])
]

data = dict(
    samples_per_gpu=batch_size,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file=train_info_file,
        num_frame_losses=num_frame_losses,
        seq_split_num=2, # streaming video training
        seq_mode=True, # streaming video training
        pipeline=train_pipeline,
        classes=class_names,
        modality=input_modality,
        collect_keys=collect_keys + ['img', 'prev_exists', 'img_metas'],
        queue_length=queue_length,
        test_mode=False,
        use_valid_flag=True,
        filter_empty_gt=False,
        box_type_3d='LiDAR'),
    val=dict(type=dataset_type, 
             data_root=data_root,
             pipeline=test_pipeline, 
             collect_keys=collect_keys + ['img', 'img_metas'], 
             queue_length=queue_length, 
             ann_file=val_info_file, 
             classes=class_names, 
             modality=input_modality,
             split_datas_file=split_datas_file),
    test=dict(type=dataset_type, 
              data_root=data_root,
              pipeline=test_pipeline, 
              collect_keys=collect_keys + ['img', 'img_metas'], 
              queue_length=queue_length, 
              ann_file=val_info_file, 
              classes=class_names, 
              modality=input_modality,
              split_datas_file=split_datas_file),
    shuffler_sampler=dict(type='InfiniteGroupEachSampleInBatchSampler'),
    nonshuffler_sampler=dict(type='DistributedSampler')
    )


optimizer = dict(
    type='AdamW', 
    lr=2e-4, # bs 8: 2e-4 || bs 16: 4e-4
    paramwise_cfg=dict(
        custom_keys={
            'img_backbone': dict(lr_mult=0.25), # 0.25 only for Focal-PETR with R50-in1k pretrained weights
        }),
    weight_decay=0.01)

optimizer_config = dict(type='Fp16OptimizerHook', loss_scale='dynamic', grad_clip=dict(max_norm=35, norm_type=2))
# learning policy
lr_config = dict(
    policy='CosineAnnealing',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
    min_lr_ratio=1e-3,
    )

evaluation = dict(interval=num_iters_per_epoch*8, pipeline=test_pipeline)
find_unused_parameters=False #### when use checkpoint, find_unused_parameters must be False
checkpoint_config = dict(interval=num_iters_per_epoch*4, max_keep_ckpts=3)
runner = dict(
    type='IterBasedRunner', max_iters=num_epochs * num_iters_per_epoch)
load_from=pretrained
resume_from="work_dirs/stream_petr_r101_flash_1408_bs2_seq_60e/iter_22400.pth"

I removed the img_roi_head and changed the image size. The loss stopped decreasing at about 20. Should I remove ResizeCropFlipRotImage?
By the way, I have visualized the results and the ground truth; the ground truth looks fine, and the model can successfully detect nearby objects.
Looking forward to your reply! Thanks!

1. ResizeCropFlipRotImage

You need to modify "resize_lim". The range 0.8 - 1.0 is for the nuScenes resolution (1600x900). Since your image resolution is 1920x1080 and 1920 / 1600 = 1.2, you should change "resize_lim" to 0.8/1.2 - 1.0/1.2 so that the resized images cover the same pixel range.
ida_aug_conf = {
    "resize_lim": (0.8/1.2, 1.0/1.2),
    "final_dim": (512, 1408),
    "bot_pct_lim": (0.0, 0.0),
    "rot_lim": (0.0, 0.0),
    "H": 1080,
    "W": 1920,
    "rand_flip": True,
}
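The same logic generalizes to other camera resolutions: scale the nuScenes limits by 1600 / your image width so the resized pixel range stays unchanged. A minimal sketch of that rule (the helper name and the hard-coded nuScenes width are my own, not from the repo):

# Hypothetical helper: rescale the nuScenes resize limits for a custom
# image width, following the divide-by-1.2 rule above (1920 / 1600 = 1.2).
NUSCENES_WIDTH = 1600  # width the default (0.8, 1.0) limits assume

def scale_resize_lim(custom_width, nus_lim=(0.8, 1.0)):
    ratio = custom_width / NUSCENES_WIDTH
    return (nus_lim[0] / ratio, nus_lim[1] / ratio)

print(scale_resize_lim(1920))  # (0.666..., 0.833...) == (0.8/1.2, 1.0/1.2)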

2. Dataset size

The dataset only has 1792 frames, which makes it very difficult for the model to converge. I did this experiment in another paper: when using only 25% of the data (28130/4 frames), the performance drops a lot; when using only 10% of the data (2813 frames), the model does not converge well.
[image: data-scaling experiment results]

Pre-trained weights can alleviate this problem, but your pre-training is on nuScenes, which may bring less improvement due to the domain gap. In addition, with only 1792 frames, don't train for too many epochs; training too long will make the model collapse, and 24 epochs is enough.

Given the size of the dataset, the low performance may be normal.

Thanks for your kind reply.
I have trained on the whole dataset (about 7000 training frames from about 50 sequences at 10 Hz) for 90 epochs. Comparing my dataset with nuScenes, I find there are about 150 frames per sequence in my dataset, while nuScenes has about 40 frames per sequence, so I set seq_split_num=4 (150 / 4 ≈ 37 frames per chunk, close to nuScenes). The AP of car is 0.2, which I think is abnormal; I expected an AP of at least 0.3.
Following your advice, I will pretrain the model on nuScenes, then modify ida_aug_conf in my config and finetune on my dataset. Could you please give me some advice on the number of epochs and the learning rate? Thanks!

By the way, the training loss on my whole dataset is shown below:
[image: training loss curve]
It seems the model hasn't converged? So I will continue training and wait to see the results.

1. Epochs and learning rate
24 or 36 epochs is enough, with a learning rate of 2e-4.
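A rough sketch of how the schedule in the config above might be adapted for such a finetune (the pretrained checkpoint path is a placeholder, not a file from the repo):

# Hypothetical 24-epoch finetune schedule, reusing the setup from the config above.
num_gpus = 4
batch_size = 2
num_epochs = 24  # 24 or 36, per the advice above
num_iters_per_epoch = 7000 // (num_gpus * batch_size)  # ~7000 training frames
runner = dict(type='IterBasedRunner', max_iters=num_epochs * num_iters_per_epoch)
optimizer = dict(type='AdamW', lr=2e-4, weight_decay=0.01)
load_from = 'ckpts/streampetr_nuscenes_pretrained.pth'  # placeholder checkpoint path
resume_from = None  # start the finetune from the pretrained weights only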

2. Things you need to check
(1) seq_split_num=4. You need to check that the sequences are processed as you expected. For example, set bs=1 and gpu=1 and print the timestamps (see the sketch after this list).
(2) Enable ResizeCropFlipRotImage and GlobalRotScaleTransImage, then project the 3D bboxes onto the 2D images and check whether the augmentation is normal.
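For check (1), a minimal sketch of such a sanity check, assuming the dataset is built from the train config above and that timestamp and prev_exists come back as DataContainers from the format bundle (the variable names are illustrative, not from the repo):

# Hypothetical check: iterate the dataset with bs=1 on a single GPU and print
# timestamps, so you can see where each seq_split_num chunk starts and confirm
# that frames inside a chunk are contiguous and increasing in time.
prev_t = None
for i in range(len(dataset)):
    sample = dataset[i]
    t = float(sample['timestamp'].data)             # per-frame timestamp
    new_seq = int(sample['prev_exists'].data) == 0  # first frame of a (sub)sequence
    if new_seq:
        print(f'--- new sub-sequence at index {i} ---')
    elif prev_t is not None and t <= prev_t:
        print(f'WARNING: non-increasing timestamp at index {i}')
    print(i, t)
    prev_t = t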

3. As for the training loss: training the model longer may lead to a lower loss, but also to poorer performance due to the limited dataset. In fact, 36 epochs is enough to verify whether the model works or not; 90 epochs will not change the conclusion, and at most slightly improves the performance.

Thanks for your detailed advice. I will check and give a try.

@zjr-bit could you please provide the file structure and code for the custom dataset?

Thanks for your detailed advice. I will check and give a try.

ida_aug_conf = {
    "resize_lim": (0.8/1.2, 1.0/1.2),
    "final_dim": (512, 1408),
    "bot_pct_lim": (0.0, 0.0),
    "rot_lim": (0.0, 0.0),
    "H": 1080,
    "W": 1920,
    "rand_flip": True,
}

Does it work with resize_lim set to (0.8/1.2, 1.0/1.2)? Could you share the updated config file?

@Tony-Hou
ida_aug_conf = {
    "resize_lim": (0.8, 1.0),
    "final_dim": (512, 1408),
    "bot_pct_lim": (0.0, 0.0),
    "rot_lim": (0.0, 0.0),
    "H": 1080,
    "W": 1920,
    "rand_flip": True,
}

ida_aug_conf = {
    "resize_lim": (, ),
    "final_dim": (540, 960),
    "bot_pct_lim": (0.0, 0.0),
    "rot_lim": (0.0, 0.0),
    "H": 1080,
    "W": 1920,
    "rand_flip": True,
}
How should the resize_lim parameter be configured when final_dim is set to (540, 960)?
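This isn't answered in the thread, but following the maintainer's scaling logic above, the resize factor presumably has to keep the resized image at least as large as the final crop; a sketch of that arithmetic (the chosen upper bound is my guess, not from the thread):

# Hypothetical derivation: the resized image must cover the final crop.
H, W = 1080, 1920
fH, fW = 540, 960
min_resize = max(fW / W, fH / H)             # = 0.5 here
resize_lim = (min_resize, 1.1 * min_resize)  # e.g. (0.5, 0.55); range is a guess
print(resize_lim)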

Does it work with resize_lim set to (0.8/1.2, 1.0/1.2)? Could you share the updated config file?

Yes, it works.