mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks

Home Page: https://mlcommons.org/en/groups/inference


Llama2 LoadGen server mode: TPS not reported properly

nvzhihanj opened this issue · comments

In the v4.0 submission, we found in the server log that "result_token_throughput" is not reported properly; most of the values are on the order of 1e-09 (@pgmpablo157321 feel free to check this link and this link). There is another metric, "result_token_throughput_with_loadgen_overhead", but I am not sure whether it shows the right value.
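To see what is actually being emitted, here is a minimal sketch that scans a LoadGen detail log for the keys mentioned above. It assumes the standard `:::MLLOG`-prefixed JSON-line format of the detail log; the file path and the exact key set are taken from this issue and may differ across LoadGen versions.

```python
import json

# Hypothetical path; point this at your run's output directory.
DETAIL_LOG = "mlperf_log_detail.txt"

# Keys discussed in this issue; the set present in a given
# LoadGen version may differ.
KEYS = {
    "result_token_throughput",
    "result_token_throughput_with_loadgen_overhead",
    "result_completed_token_per_second",
}

with open(DETAIL_LOG) as f:
    for line in f:
        # Detail-log lines carry a ":::MLLOG" prefix followed by a JSON object.
        if not line.startswith(":::MLLOG"):
            continue
        entry = json.loads(line[len(":::MLLOG"):])
        if entry.get("key") in KEYS:
            print(entry["key"], entry.get("value"))
```

Running this against the v4.0 server logs should make the anomalous 1e-09-scale values easy to spot.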

Since we are shifting from QPS to TPS, we need to fix this and report "result_completed_token_per_second" correctly.
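For reference, a sketch of the intended semantics of a completed-token throughput metric, i.e. total output tokens across completed queries divided by the wall-clock run duration. This is illustrative only; the actual computation lives in the LoadGen C++ source, and the function name here is hypothetical.

```python
def completed_tokens_per_second(token_counts, run_duration_s):
    """Total output tokens across all completed queries divided by
    the wall-clock duration of the run in seconds (illustrative)."""
    return sum(token_counts) / run_duration_s

# Example: 10,000 completed queries averaging 300 output tokens over a
# 600 s run should report 5,000 tokens/s, nowhere near the 1e-09 scale
# seen in the logs.
print(completed_tokens_per_second([300] * 10_000, 600.0))  # -> 5000.0
```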

It seems like result_token_per_second is in the summary.txt; not sure why it is not in the details.txt.