mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks

Home Page: https://mlcommons.org/en/groups/inference

Strange behavior of load_gen for LLaMA Server

szutenberg opened this issue · comments

I'm running into unexpected behavior of the Server scenario.

  1. TTFT definition
    My understanding is that TTFT for sample x is defined as the time between LoadGen issuing the query and the SUT calling lg.FirstTokenComplete for sample x.
    The LoadGen code appears to apply the same definition; see the sketch below.
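    A minimal sketch of the SUT side, assuming the Python LoadGen bindings (mlperf_loadgen) and the pattern used by the llama2-70b reference SUT; the helper names are illustrative:

    ```python
    import array
    import numpy as np
    import mlperf_loadgen as lg

    # LoadGen measures TTFT from the moment it issues the query to the moment
    # the SUT calls lg.FirstTokenComplete for that sample, and the end-to-end
    # sample latency up to the lg.QuerySamplesComplete call.

    def report_first_token(query_id, first_token_id):
        # Call as soon as the first output token is available.
        buf = array.array("B", np.array([first_token_id], np.int32).tobytes())
        ptr, nbytes = buf.buffer_info()  # "B" items are 1 byte each
        lg.FirstTokenComplete([lg.QuerySampleResponse(query_id, ptr, nbytes)])

    def report_completion(query_id, output_token_ids):
        # Call when generation for the sample has finished.
        buf = array.array("B", np.array(output_token_ids, np.int32).tobytes())
        ptr, nbytes = buf.buffer_info()
        lg.QuerySamplesComplete([lg.QuerySampleResponse(query_id, ptr, nbytes)])
    ```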

  2. TPOT definition
    My understanding is that TPOT for sample x is defined as the time between the SUT calling lg.FirstTokenComplete and lg.QuerySamplesComplete for sample x, divided by the number of generated tokens minus one (see the helper sketched below).
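    In code form, the per-sample quantity I have in mind (a hypothetical helper, not what LoadGen currently computes, see point 6):

    ```python
    def sample_tpot_ns(first_token_time_ns, completion_time_ns, n_generated_tokens):
        # Per-sample TPOT: time between the FirstTokenComplete and
        # QuerySamplesComplete calls, divided by (generated tokens - 1),
        # since the first token is already covered by TTFT.
        return (completion_time_ns - first_token_time_ns) / (n_generated_tokens - 1)
    ```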

  3. TPOT and TTFT latency constraint definition
    My understanding is that LoadGen should print percentile statistics for TTFT and TPOT and compare their 99th percentile latencies with
    llama2-70b.Server.ttft_latency = 2000 [ms] and llama2-70b.Server.tpot_latency = 200 [ms], respectively (roughly as sketched below).
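    Roughly along these lines (a sketch only; per-sample values in ms, limits taken from the config values above):

    ```python
    import numpy as np

    def check_server_constraints(ttft_ms, tpot_ms,
                                 ttft_limit_ms=2000.0, tpot_limit_ms=200.0):
        # ttft_ms / tpot_ms: arrays of per-sample TTFT and TPOT for the run.
        p99_ttft = np.percentile(ttft_ms, 99)
        p99_tpot = np.percentile(tpot_ms, 99)
        print(f"99.00 percentile TTFT (ms): {p99_ttft:.2f}")
        print(f"99.00 percentile TPOT (ms): {p99_tpot:.2f}")
        return p99_ttft <= ttft_limit_ms and p99_tpot <= tpot_limit_ms
    ```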

  4. Current code: target latency is introduced: llama2-70b.Server.target_latency = 2000
    I do not understand this setting, and AFAIK it was not discussed in the LLM task force meetings.
    According to the rules, TPOT = 200 ms and TTFT = 2000 ms, so my understanding is that the maximum theoretical per-sample latency is t = 2 s + 1023 * 0.2 s = 206.6 s (worked numbers below), but the value of target_latency should not be used at all in the Server scenario with use_token_latencies.
    GPT-J example:
    gptj.Server.target_latency = 20000, so a performance run is valid when the "99.00 percentile latency" is below 20 seconds.
    Example: Habana GPT-J submission in 3.1:
    99.00 percentile latency (ns): 19,598,835,571 = 19.6 s < 20 s => Result is VALID
    Proposed solution: remove this parameter, or use it to track TTFT (which requires changes so that the first token is logged as the sample latency and TPOT is tracked separately).
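    For reference, the worked numbers behind the 206.6 s bound mentioned above (assuming a 1024-token output cap):

    ```python
    ttft_s, tpot_s, max_new_tokens = 2.0, 0.2, 1024
    # First token within the TTFT limit, every further token within the TPOT limit.
    max_sample_latency_s = ttft_s + (max_new_tokens - 1) * tpot_s
    print(max_sample_latency_s)  # 206.6
    ```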

  5. Current code: TTFT criteria
    The condition if (lp.percentile == .999) & (lp.sample_latency > settings.server_ttft_latency) compares the 99.9th percentile TTFT latency with llama2-70b.Server.ttft_latency = 2000 [ms].
    Proposed solution: use the 99th percentile, or even the 97th percentile as in the MLPerf paper for the translation task:
    Server. The server scenario represents online applications where query arrival is random and latency is important. Almost every consumer-facing website is a good example, including services such as online translation from Baidu, Google, and Microsoft. For this scenario, queries have one sample each, in accordance with a Poisson distribution. The system under test responds to each query within a benchmark-specific latency bound that varies from 15 to 250 milliseconds. No more than 1% of queries may exceed the latency bound for the vision tasks and no more than 3% may do so for translation. The server scenario’s performance metric is the Poisson parameter that indicates the queries-per-second (QPS) achievable while meeting the QoS requirement.

  6. Current code: TPOT criteria
    The TPOT value is calculated from averages: tpot = sample_count * (sample_latency_mean - first_token_latency_mean) / token_count, and compared with llama2-70b.Server.tpot_latency = 200 [ms]. No per-sample statistics are provided and the 99th percentile is not used; the definition from point 2 does not match the code, because TPOT is never computed individually for each sample. This issue is also reported in #1592.
    Proposed solution: discuss in the Inference WG; I don't understand why we don't use the 97th/99th percentile. In my opinion, an average cannot verify the QoS requirement (see the illustration below).
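    A small illustration of why an average can hide tail behaviour (made-up numbers, not from a real run):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    # Most samples around 150 ms/token, a 3% tail around 400 ms/token.
    tpot_ms = np.concatenate([rng.normal(150, 10, 970), rng.normal(400, 20, 30)])

    mean_tpot = tpot_ms.mean()              # roughly what an average-based check sees
    p99_tpot = np.percentile(tpot_ms, 99)   # what a QoS-style check would look at

    print(f"mean TPOT: {mean_tpot:.1f} ms  -> passes a 200 ms limit")
    print(f"p99  TPOT: {p99_tpot:.1f} ms  -> fails a 200 ms limit")
    ```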

  7. Current code: differences between Scheduled samples, Completed samples and target_qps
    In our experiment we set target_latency to a very large number, 10000000000000000 (see point 4), and target_qps to 100 (our dummy SUT cannot achieve such performance).
    Scheduled samples per second : 100.59
    Completed samples per second : x.xx (a value much lower than 100, close to the real performance).
    Result is : VALID
    Note that "scheduled samples per second" is being reported to the final results table.
    Proposed solution: in use_token_latencies mode LoadGen should schedule samples according to target_qps value (turn off early stopping? I'm not familiar with LoadGen internals but due to specifics of the LLaMA benchmark (inference time depending on input and output length) I don't think that qps should be even slighly modified. I'd expect that LoadGen should check if latency constraints are met for the given QPS.
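    A toy simulation of the effect (pure Python, no LoadGen): Poisson arrivals at target_qps = 100 against a SUT that can only sustain about 20 samples/s.

    ```python
    import random

    random.seed(0)
    target_qps = 100.0      # requested arrival rate
    service_time_s = 0.05   # toy SUT capacity: ~20 samples/s
    n_samples = 20000

    # Poisson arrivals: exponential inter-arrival times with mean 1 / target_qps.
    t, arrivals = 0.0, []
    for _ in range(n_samples):
        t += random.expovariate(target_qps)
        arrivals.append(t)

    # Single server processing queries in FIFO order.
    busy_until, completions = 0.0, []
    for a in arrivals:
        start = max(a, busy_until)
        busy_until = start + service_time_s
        completions.append(busy_until)

    print(f"Scheduled samples per second : {n_samples / arrivals[-1]:.2f}")     # ~100
    print(f"Completed samples per second : {n_samples / completions[-1]:.2f}")  # ~20
    ```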

I think this new scenario should be documented somewhere. The current flow is difficult to understand and differs from previous models (percentiles are not used).

cc @nvzhihanj @attafosu @pgmpablo157321