evaluation results on BADJA don't match the paper.

Question

AssafSinger94 opened this issue 8 months ago

Hi,
When evaluating the model on BADJA, I get results that differ from those reported in the paper.
The results are as follows (I added the avg. results at the end of the dictionary):

{
    "bear": 88.57142639160156,
    "bear_accuracy": 20.357141494750977,
    "camel": 90.35369873046875,
    "camel_accuracy": 22.186494827270508,
    "cows": 86.89839935302734,
    "cows_accuracy": 31.283422470092773,
    "dog": 54.59769821166992,
    "dog-agility": 6.896551609039307,
    "dog-agility_accuracy": 0.0,
    "dog_accuracy": 4.597701072692871,
    "horsejump-high": 62.25165557861328,
    "horsejump-high_accuracy": 17.218544006347656,
    "horsejump-low": 62.30366897583008,
    "horsejump-low_accuracy": 27.74869155883789,
    "avg": 64.55329983575004,
    "avg acc 3px": 17.627427918570383,
    "time": 576.1103093624115
}
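
For reference, here is a minimal sketch of how the two averages can be recomputed from the per-video entries. It assumes, based on the key names, that keys ending in `_accuracy` hold the 3px accuracy and that the remaining per-video keys hold the seg-based accuracy; the variable names are only illustrative.

```python
# Per-video BADJA results copied from the dictionary above.
results = {
    "bear": 88.57142639160156,
    "bear_accuracy": 20.357141494750977,
    "camel": 90.35369873046875,
    "camel_accuracy": 22.186494827270508,
    "cows": 86.89839935302734,
    "cows_accuracy": 31.283422470092773,
    "dog": 54.59769821166992,
    "dog-agility": 6.896551609039307,
    "dog-agility_accuracy": 0.0,
    "dog_accuracy": 4.597701072692871,
    "horsejump-high": 62.25165557861328,
    "horsejump-high_accuracy": 17.218544006347656,
    "horsejump-low": 62.30366897583008,
    "horsejump-low_accuracy": 27.74869155883789,
}

# Assumption: "<video>_accuracy" is the 3px accuracy, "<video>" is the seg-based accuracy.
seg_based = [v for k, v in results.items() if not k.endswith("_accuracy")]
acc_3px = [v for k, v in results.items() if k.endswith("_accuracy")]

# Unweighted mean over the videos, one value per video.
avg = sum(seg_based) / len(seg_based)
avg_acc_3px = sum(acc_3px) / len(acc_3px)

print(f"videos: {len(seg_based)}")
print(f"avg: {avg:.4f}")
print(f"avg acc 3px: {avg_acc_3px:.4f}")
```

Both values are unweighted means over the videos, so a single hard video such as dog-agility pulls the average down noticeably.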

I was able to evaluate the model on TAP-Vid DAVIS properly.
I ran the following command:

python ./cotracker/evaluation/evaluate.py \
--config-name eval_badja \
exp_dir=./eval_outputs_badja \
dataset_root=<path_to_BADJA_dir>

Could you please help me with this?
In addition, I see that the "extra_videos" referred to in BADJA are not evaluated, and that they are explicitly ignored during dataset creation. Could you please explain why they are not evaluated?

Thank you for your help!
Assaf

Nikita Karaev · Answer 1 · Mon Nov 13 2023 01:19:22 GMT+0800 (China Standard Time)

Hi @AssafSinger94, the numbers reported in the paper are 63.6 and 18.0, whereas you got 64.6 and 17.6: your seg-based accuracy is higher and your 3px accuracy is lower. Such a small difference could be due to different versions of some libraries, especially since BADJA is just a small set of 7 short videos. As for the extra videos: in this evaluation we follow PIPs, so those videos are left out to keep the numbers consistent.

Assaf Singer · Answer 2 · Mon Nov 13 2023 01:54:49 GMT+0800 (China Standard Time)

Thank you for your reply! @nikitakaraevv
One more thing I wanted to ask. I see that you always sample the query points at frame 0, although a few of the trajectories (not many) are occluded at frame 0.
Is that how the query points are supposed to be sampled on BADJA? Isn't it supposed to work like TAP-Vid with query-mode='first', where each query is sampled at the first frame in which the trajectory is not occluded? Perhaps I misunderstood something in the paper.

Thank you very much for your assistance and responsiveness!
Assaf

Nikita Karaev · Answer 3 · Mon Nov 13 2023 06:55:34 GMT+0800 (China Standard Time)

@AssafSinger94 You're right, it is supposed to function like in TAP-Vid. The results for this benchmark should be slightly better after fixing this bug.
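
The behaviour discussed here, querying each trajectory at its first non-occluded frame instead of always at frame 0, can be sketched as follows. This is only an illustration and not the CoTracker or TAP-Vid code: the function name `first_visible_queries`, the array shapes, and the `(t, x, y)` query layout are all assumptions.

```python
import numpy as np


def first_visible_queries(trajs, occluded):
    """Build one query per trajectory at its first non-occluded frame.

    Hypothetical helper, with assumed shapes:
      trajs:    (N, T, 2) float array of (x, y) positions
      occluded: (N, T) bool array, True where the point is occluded
    Returns (queries, keep): queries is (M, 3) with rows (t, x, y), and keep
    holds the indices of the M trajectories that are visible at least once.
    """
    visible = ~occluded
    # argmax over a boolean row returns the index of the first True entry.
    first_t = visible.argmax(axis=1)
    # Rows with no visible frame would wrongly get index 0, so drop them.
    keep = np.flatnonzero(visible.any(axis=1))
    t = first_t[keep]
    xy = trajs[keep, t]
    queries = np.concatenate([t[:, None].astype(trajs.dtype), xy], axis=1)
    return queries, keep


# Tiny example: track 0 is visible from frame 0, track 1 only from frame 2,
# and track 2 is never visible.
trajs = np.arange(24, dtype=np.float64).reshape(3, 4, 2)
occluded = np.array(
    [
        [False, False, False, False],
        [True, True, False, False],
        [True, True, True, True],
    ]
)
queries, keep = first_visible_queries(trajs, occluded)
print(queries)
print(keep)
```

Sampling at frame 0 regardless of visibility would instead use the annotated position of an occluded point as the query for track 1.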

Nikita Karaev · Answer 4 · Fri Dec 29 2023 01:39:14 GMT+0800 (China Standard Time)

We do not evaluate CoTracker on BADJA in the new version of the paper because BADJA is only a subset of DAVIS.

LHY-HongyangLi · Answer 5 · Thu Feb 15 2024 21:50:40 GMT+0800 (China Standard Time)

Hi @nikitakaraevv,
I tried to reproduce the performance of cotrackerv2 on DAVIS-First in "glob. 5×5" mode, using the checkpoint you provided. However, I got the following results, which differ from those you posted in Table 3. Is there anything wrong?

evaluate_result {
    'occlusion_accuracy': 0.8830991955685764,
    'pts_within_1': 0.41818242347516404,
    'jaccard_1': 0.27439414118807914,
    'pts_within_2': 0.6585306168489217,
    'jaccard_2': 0.4944025070918661,
    'pts_within_4': 0.8213521434366008,
    'jaccard_4': 0.67601993042582,
    'pts_within_8': 0.900469386477676,
    'jaccard_8': 0.7759953203865924,
    'pts_within_16': 0.9396396614828214,
    'jaccard_16': 0.8167342923773883,
    'average_jaccard': 0.607509238293949,
    'average_pts_within_thresh': 0.7476348463442369
}
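
As a sanity check on the output above, the two summary values appear to be plain means of the per-threshold entries (thresholds 1, 2, 4, 8 and 16). A minimal sketch using the posted numbers, with illustrative variable names:

```python
# Per-threshold values copied from the evaluate_result output above.
metrics = {
    "pts_within_1": 0.41818242347516404,
    "jaccard_1": 0.27439414118807914,
    "pts_within_2": 0.6585306168489217,
    "jaccard_2": 0.4944025070918661,
    "pts_within_4": 0.8213521434366008,
    "jaccard_4": 0.67601993042582,
    "pts_within_8": 0.900469386477676,
    "jaccard_8": 0.7759953203865924,
    "pts_within_16": 0.9396396614828214,
    "jaccard_16": 0.8167342923773883,
}

# Threshold suffixes taken from the key names.
thresholds = [1, 2, 4, 8, 16]

# Unweighted mean over the five thresholds.
average_jaccard = sum(metrics[f"jaccard_{t}"] for t in thresholds) / len(thresholds)
average_pts_within_thresh = sum(
    metrics[f"pts_within_{t}"] for t in thresholds
) / len(thresholds)

print(f"average_jaccard: {average_jaccard:.6f}")
print(f"average_pts_within_thresh: {average_pts_within_thresh:.6f}")
```

The recomputed means agree with the reported `average_jaccard` and `average_pts_within_thresh`, so the summary values are at least internally consistent with the per-threshold entries.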