hustvl / VAD

[ICCV 2023] VAD: Vectorized Scene Representation for Efficient Autonomous Driving

Home Page: https://arxiv.org/abs/2303.12077


Question about L2 computation

wljungbergh opened this issue

Hi, and thank you for your work.

When reviewing your evaluation code, I found that your L2 displacement error (here) is computed as the average displacement error up to and including that particular timestep. This differs from how previous works (e.g., UniAD and ST-P3) have defined the metric: they compute it as the L2 norm at that particular timestep alone (see here and here).

I might have misunderstood your code; if so, please let me know. If not, could you provide the numbers using the metric definition from ST-P3 and UniAD? That would make the results more directly comparable.

Can you please shed some light on this? Which of the two definitions is considered correct? (It may well be that UniAD and ST-P3 have defined the metric incorrectly.)
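To make the difference concrete, here is a minimal NumPy sketch of the two definitions (the trajectories below are made up for illustration): the "pointwise" definition takes the norm of the error exactly at timestep t, while the "averaged" definition takes the mean of those norms over all steps up to and including t.

```python
import numpy as np

# Toy planned and ground-truth ego trajectories, shape (T, 2) = (x, y) per
# future timestep. Values are illustrative only.
pred = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
gt   = np.array([[0.0, 0.1], [1.0, 0.3], [2.0, 0.8]])

# Per-timestep L2 norm: the error exactly at timestep t.
per_step = np.linalg.norm(pred - gt, axis=1)

# Averaged-up-to-t L2: mean of the norms for steps 1..t (the more lenient
# definition, since early timesteps usually have smaller errors).
avg_up_to_t = np.cumsum(per_step) / np.arange(1, len(per_step) + 1)
```

With these toy values, `per_step` is `[0.1, 0.3, 0.8]` while `avg_up_to_t` is `[0.1, 0.2, 0.4]`, so the averaged variant reports a noticeably smaller error at longer horizons.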

Thanks,

Please refer to this issue.

Thanks a lot for the clarification. I don't know how I missed that issue... sorry about that. I now see that you define the metric similarly to ST-P3.

However, upon digging into the UniAD code, I found that it does not conform to the definition from ST-P3, which they have acknowledged here.

planning_results_computed = results["planning_results_computed"]
planning_tab = PrettyTable()
planning_tab.field_names = [
    "metrics",
    "0.5s",
    "1.0s",
    "1.5s",
    "2.0s",
    "2.5s",
    "3.0s",
]
for key in planning_results_computed.keys():
    value = planning_results_computed[key]
    row_value = []
    row_value.append(key)
    for i in range(len(value)):
        row_value.append("%.4f" % float(value[i]))
    planning_tab.add_row(row_value)

Here, planning_results_computed holds the results of a single PlanningMetric.compute() call (with n_future=6), meaning that they compute the L2 distance as the pointwise norm at each timestep rather than the mean of the norms up to that timestep.

Because of this, the comparison between your method and UniAD is misleading: VAD's numbers use the more lenient metric definition, while UniAD's numbers are presented in the same table but computed under a different definition.

It would reduce the confusion if you added their performance under your (and ST-P3's original) metric definition.

Here are their displacement values when using your (and ST-P3) metric definition:

| Method | L2 (m) 1s | L2 (m) 2s | L2 (m) 3s |
| --- | --- | --- | --- |
| ST-P3 | 1.33 | 2.11 | 2.90 |
| UniAD (their metric) | 0.48 | 0.96 | 1.65 |
| UniAD (your metric) | 0.42 | 0.64 | 0.91 |
| VAD-Tiny | 0.46 | 0.76 | 1.12 |
| VAD-Base | 0.41 | 0.70 | 1.05 |

I will post these results on their GitHub as well, in case they want to update their numbers (or show both sets side by side).

FYI, to comply with your metric definition we simply changed the code above to:

for i in range(len(value)):
    # average of the per-timestep L2 norms up to and including step i
    row_value.append("%.4f" % float(value[:i + 1].mean()))
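For reference, a standalone sketch of that running-average transformation, assuming `value` is a NumPy array of per-timestep L2 norms (the numbers below are made up, not actual UniAD results):

```python
import numpy as np

# Hypothetical per-timestep L2 norms at 0.5 s horizon steps (illustrative only).
value = np.array([0.20, 0.35, 0.48, 0.70, 0.83, 0.96])

# Running average up to and including each timestep (VAD / ST-P3 style):
# entry i is the mean of value[0..i].
running_avg = [float(value[:i + 1].mean()) for i in range(len(value))]
```

Each reported horizon thus mixes in the (typically smaller) early-timestep errors, which is why the averaged numbers come out lower than the pointwise ones.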

PS. Please let us know if you think we've missed something and computed UniAD's performance incorrectly under your metric.