mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks

Home Page: https://mlcommons.org/en/groups/inference

RetinaNet TEST05 is not passing for a Singlestream run

arjunsuresh opened this issue · comments

We are trying to reproduce a QAIC submission, but for RetinaNet the SingleStream run never passes TEST05. The performance numbers are consistent across runs, so reruns or even longer runs do not help.

TEST05 log

================================================
MLPerf Results Summary
================================================
SUT name : KILT_SERVER
Scenario : SingleStream
Mode     : PerformanceOnly
90th percentile latency (ns) : 19641101
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes
Early Stopping Result:
 * Processed at least 64 queries (56356).
 * Would discard 5469 highest latency queries.
 * Early stopping 90th percentile estimate: 19666047
 * Early stopping 99th percentile estimate: 21004798

Actual performance run

================================================
MLPerf Results Summary
================================================
SUT name : KILT_SERVER
Scenario : SingleStream
Mode     : PerformanceOnly
90th percentile latency (ns) : 18503321
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes
Early Stopping Result:
 * Processed at least 64 queries (35227).
 * Would discard 3390 highest latency queries.
 * Early stopping 90th percentile estimate: 18526238
 * Early stopping 99th percentile estimate: 20238324
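For reference, the relative gap between the two 90th-percentile latencies above is easy to compute. This is just an illustrative sketch of the arithmetic, not the actual TEST05 check (the submission checker applies its own tolerance to these numbers):

```python
# 90th percentile latencies (ns) taken from the two summaries above
test05_p90 = 19641101   # TEST05 run
perf_p90 = 18503321     # actual performance run

# Relative difference of the TEST05 run against the performance run, in percent
delta_pct = (test05_p90 - perf_p90) / perf_p90 * 100
print(f"TEST05 is {delta_pct:.2f}% slower")  # about 6.15%
```

A gap of this size between seeds is consistent with the observation below that per-input performance varies, since TEST05 changes which samples LoadGen draws.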

@arjunsuresh Why is the number of samples different between the two runs? (10-11 minutes vs 16-17 minutes?) Maybe the system you are running on (AWS?) has insufficient cooling, so longer runs become slower?

@arjunsuresh Please provide full summary logs including settings like performance_sample_count.

Thank you @psyhtest for replying. The number of samples differs because on a repeated run we automatically make the run longer. But we tried matching the durations of the TEST05 and performance runs - no difference. We even ran each for 20 minutes - still no difference. However, performance_sample_count = 128 worked. I think we should revisit TEST05 for RetinaNet, considering the variable performance across different inputs.
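A sketch of how the higher count can be set via a LoadGen user.conf override; the key name follows the `performance_sample_count_override` convention used by recent LoadGen config parsing, so treat it as an assumption and check against your LoadGen version:

```
# user.conf (sketch; key name assumed from LoadGen config conventions)
retinanet.SingleStream.performance_sample_count_override = 128
```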

We recommend using a high performance_sample_count for RetinaNet. We will discuss this issue further for v4.1. Closing for now.