Strange behavior of load_gen for LLaMA Server
I'm seeing unexpected behavior in the Server scenario.
**1. TTFT definition**

My understanding is that TTFT for sample x is defined as the time between LoadGen issuing the query and the SUT calling `lg.FirstTokenComplete` for sample x. The same definition appears to be applied in LoadGen.
**2. TPOT definition**

My understanding is that TPOT for sample x is defined as the time between the SUT calling `lg.FirstTokenComplete` and `lg.QuerySamplesComplete` for sample x, divided by the number of generated tokens minus one.
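To make points 1 and 2 concrete, here is a minimal Python sketch of the per-sample quantities I have in mind; the timestamp names are mine, not LoadGen's:

```python
# Hypothetical per-sample timestamps, in seconds:
#   t_issue - LoadGen issues the query
#   t_first - SUT calls lg.FirstTokenComplete for the sample
#   t_done  - SUT calls lg.QuerySamplesComplete for the sample

def ttft(t_issue, t_first):
    """Time to first token for one sample."""
    return t_first - t_issue

def tpot(t_first, t_done, n_generated_tokens):
    """Time per output token: decode time averaged over the tokens after the first."""
    return (t_done - t_first) / (n_generated_tokens - 1)
```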
**3. TPOT and TTFT latency constraint definition**

My understanding is that LoadGen should print percentile statistics for TTFT and TPOT and compare their 99th percentile latencies with `llama2-70b.Server.ttft_latency = 2000` [ms] and `llama2-70b.Server.tpot_latency = 200` [ms], respectively.
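A sketch of the constraint check I would expect, written out in Python; this is my reading of the rules, not existing LoadGen code, and the function name and use of `numpy` are my own:

```python
import numpy as np

TTFT_BOUND_MS = 2000  # llama2-70b.Server.ttft_latency
TPOT_BOUND_MS = 200   # llama2-70b.Server.tpot_latency

def constraints_met(ttft_ms, tpot_ms):
    """ttft_ms / tpot_ms: arrays of per-sample latencies in milliseconds."""
    return (np.percentile(ttft_ms, 99) <= TTFT_BOUND_MS
            and np.percentile(tpot_ms, 99) <= TPOT_BOUND_MS)
```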
**4. Current code: a target latency is introduced**

`llama2-70b.Server.target_latency = 2000`

I do not understand this setting, and AFAIK it was not discussed at the LLM task force meetings. According to the rules, TPOT = 200 ms and TTFT = 2000 ms, so my understanding is that the maximum theoretical latency is t = 2 s + 1023 * 0.2 s = 206.6 s (worked check below), but the value of `target_latency` must not be used in the token-latencies Server scenario.
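For reference, the worked check behind the 206.6 s figure, assuming the benchmark's maximum of 1024 output tokens (1023 decode steps after the first token):

```python
ttft_ms, tpot_ms, max_output_tokens = 2000, 200, 1024
max_theoretical_latency_ms = ttft_ms + (max_output_tokens - 1) * tpot_ms
print(max_theoretical_latency_ms / 1000)  # 206.6 seconds
```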
GPT-J example: `gptj.Server.target_latency = 20000`, so a performance run is valid when the "99.00 percentile latency" is below 20 seconds. For example, the Habana GPT-J submission in 3.1 reported `99.00 percentile latency (ns) : 19 598 835 571`, i.e. 19.6 s < 20 s => Result is VALID.
Proposed solution: remove this parameter, or use it to track TTFT (this requires changes so that the first-token latency is logged as the sample latency and TPOT is tracked separately).
**5. Current code: TTFT criteria**

The condition `if (lp.percentile == .999) & (lp.sample_latency > settings.server_ttft_latency)` compares the 99.9th percentile TTFT latency with `llama2-70b.Server.ttft_latency = 2000` [ms].

Proposed solution: use the 99th percentile, or even the 97th percentile as in the MLPerf paper for the translation task:

> Server. The server scenario represents online applications where query arrival is random and latency is important. Almost every consumer-facing website is a good example, including services such as online translation from Baidu, Google, and Microsoft. For this scenario, queries have one sample each, in accordance with a Poisson distribution. The system under test responds to each query within a benchmark-specific latency bound that varies from 15 to 250 milliseconds. No more than 1% of queries may exceed the latency bound for the vision tasks and no more than 3% may do so for translation. The server scenario's performance metric is the Poisson parameter that indicates the queries-per-second (QPS) achievable while meeting the QoS requirement.
**6. Current code: TPOT criteria**

The TPOT value is calculated from averages, `tpot = sample_count * (sample_latency_mean - first_token_latency_mean) / token_count`, and compared with `llama2-70b.Server.tpot_latency = 200` [ms]. No statistics are provided and the 99th percentile is not used; my TPOT definition above turns out to be wrong, because TPOT values are not calculated individually for each sample. This issue is also reported in #1592.

Proposed solution: discuss in the Inference WG; I don't understand why we don't use the 97th/99th percentile. In my opinion, by calculating a mean we are not able to check the QoS requirement (see the toy example below).
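A toy illustration (with made-up numbers) of why a mean-based check cannot enforce the QoS requirement: the mean can sit comfortably under the 200 ms bound while the tail clearly violates it.

```python
import numpy as np

# 97% of samples decode at 150 ms/token, 3% are badly delayed.
tpot_ms = np.array([150.0] * 97 + [1000.0] * 3)
print(tpot_ms.mean())              # 175.5  -> passes a mean-based check
print(np.percentile(tpot_ms, 99))  # 1000.0 -> fails a p99 check
```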
**7. Current code: differences between scheduled samples, completed samples, and `target_qps`**

In our experiment we set `target_latency` to a very large number (10000000000000000, see point 4) and `target_qps` to 100 (the dummy SUT is not able to achieve such performance). The run reports:

- Scheduled samples per second: 100.59
- Completed samples per second: x.xx (much lower than 100, closer to the real performance)
- Result is: VALID

Note that "Scheduled samples per second" is the value reported in the final results table.

Proposed solution: in `use_token_latencies` mode, LoadGen should schedule samples according to the `target_qps` value (turn off early stopping?). I'm not familiar with LoadGen internals, but due to the specifics of the LLaMA benchmark (inference time depends on input and output length), I don't think the QPS should be modified even slightly. I'd expect LoadGen to check whether the latency constraints are met for the given QPS; a sketch of how I understand the current scheduling follows below.
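For context, this is how I understand Server-scenario scheduling (a simplified sketch, not LoadGen source; the paper excerpt in point 5 states that arrivals follow a Poisson distribution): queries are issued at a rate governed by `target_qps` regardless of how quickly the SUT completes them, which is why "Scheduled samples per second" stays near `target_qps` while "Completed samples per second" can be far lower.

```python
import random

def poisson_schedule(target_qps, n_queries, seed=0):
    """Query issue times with exponential inter-arrival gaps (rate = target_qps)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_queries):
        t += rng.expovariate(target_qps)  # Poisson process inter-arrival time
        times.append(t)
    return times

times = poisson_schedule(target_qps=100, n_queries=10_000)
print(len(times) / times[-1])  # scheduled rate stays ~100 qps, whatever the SUT does
```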
I think this new scenario should be explained somewhere. The current flow is difficult to understand, and it differs from previous models (percentiles are not used).