triton-inference-server / fastertransformer_backend

Why processing requests of batch size=1 is much slower than batch size>1

mapcan opened this issue · comments

I did some performance tests of a 3.5B BLOOM 1-GPU model using perf_analyzer. The results are:

| batch size | avg latency (us) |
| --- | --- |
| 1 | 6533769 |
| 2 | 2819328 |
| 4 | 2953732 |

Then I tested the 2-GPU version of the same model. The results are:

| batch size | avg latency (us) |
| --- | --- |
| 1 | 1769113 |
| 2 | 3032188 |
| 4 | 3461972 |

The 1-GPU model is much slower at batch size 1 than at batch size 2 or 4. The 2-GPU model is faster than the 1-GPU model at batch size 1, but slower when batch size > 1. How can these results be explained? Please help.
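One way to read the numbers above: dividing the measured batch latency by the batch size gives the amortized cost per request. This is a small sketch using only the latency values from the tables:

```python
# Average end-to-end latencies (microseconds) measured with perf_analyzer,
# copied from the tables above, keyed by batch size.
lat_1gpu_us = {1: 6533769, 2: 2819328, 4: 2953732}
lat_2gpu_us = {1: 1769113, 2: 3032188, 4: 3461972}

def per_request_us(latencies):
    # Amortized cost per request: batch latency divided by batch size.
    return {b: t / b for b, t in latencies.items()}

print(per_request_us(lat_1gpu_us))
# → {1: 6533769.0, 2: 1409664.0, 4: 738433.0}
print(per_request_us(lat_2gpu_us))
# → {1: 1769113.0, 2: 1516094.0, 4: 865493.0}
```

On one GPU, the amortized cost per request falls from ~6.5 s at batch 1 to ~0.74 s at batch 4, which is the batching effect the question is asking about.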

GPU: Tesla T4
CUDA Version: 11.8
model config: almost the same as all_models/bloom
input data:

{
    "data": [
        {
            "INPUT_0": {
                "content": [
                    "话说大宋仁宗天子在位,嘉祐三年三月三日五更三点,天子驾坐紫宸殿,受百官朝贺。但见:祥云迷凤阁,瑞气罩龙楼。含烟御柳拂旌旗,带露宫花迎剑戟。天香影里,玉簪朱履聚丹墀;仙乐声中,绣袄锦衣扶御驾。珍珠帘卷,黄金殿上现金轝,凤羽扇开,白玉阶前停宝辇。隐隐净鞭三下响,层层文武两班齐。当有殿头官喝道:“有事出班早奏,无事卷帘退朝。”只见班部丛中,宰相赵哲、参政文彦博出班奏曰:“目今京师瘟疫盛行,伤损军民甚多。伏望陛下释罪宽恩,省刑薄税,祈禳天灾,救济万民。”天子听奏,急敕翰林院随即草诏,一面降赦天下罪囚,应有民间税赋,悉皆赦免;一面命在京宫观寺院,修设好事禳灾。不料其年瘟疫转盛,仁宗天子闻知,龙体不安,复会百官计议。向那班部中,有一大臣,越班启奏。天子看时,乃是参知政事范仲淹,拜罢起居,奏曰:“目今天灾盛行,军民涂炭,日夕不能聊生。以臣愚意,要禳此灾,可宣嗣汉天师星夜临朝,就京师禁院,修设三千六百分罗天大醮,奏闻上帝,可以禳保民间瘟疫。”仁宗天子准奏,急令翰林学士草诏一道,天子御笔亲书,并降御香一炷,钦差内外提点殿前太尉洪信为天使,前往江西信州龙虎山,宣请嗣汉天师张真人星夜来朝,祈禳瘟疫。就金殿上焚起御香,亲将丹诏付与洪太尉,即便登程前去。洪信领了圣敕,辞别天子,背了诏书,盛了御香,带了数十人,上了铺马,一行部队,离了东京,取路径投信州贵溪县来。但见:遥山叠翠,远水"
                ],
                "shape": [1]
            },
            "INPUT_1": {
                "content": [
                    200
                ],
                "shape": [1]
            },
            "INPUT_2": {
                "content": [
                    ""
                ],
                "shape": [1]
            },
            "INPUT_3": {
                "content": [
                    ""
                ],
                "shape": [1]
            },
            "temperature": {
                "content": [
                    1.0
                ],
                "shape": [1]
            },
            "repetition_penalty": {
                "content": [
                    1.03
                ],
                "shape": [1]
            },
            "random_seed": {
                "content": [
                    3
                ],
                "shape": [1]
            },
            "is_return_log_probs": {
                "content": [
                    true
                ],
                "shape": [1]
            },
            "is_return_context_embeddings": {
                "content": [
                    true
                ],
                "shape": [1]
            },
            "beam_width": {
                "content": [
                    1
                ],
                "shape": [1]
            },
            "start_id": {
                "content": [
                    1
                ],
                "shape": [1]
            },
            "end_id": {
                "content": [
                    2
                ],
                "shape": [1]
            }
        }
    ]
}
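Malformed entries in a perf_analyzer `--input-data` file (for example, a tensor whose declared shape does not match its content) can be caught with a quick programmatic check. This is a sketch against a reduced, hypothetical stand-in for the file above:

```python
import json
from math import prod

# Reduced, hypothetical stand-in for the input-data file shown above; the
# real file also carries the long prompt string and the other parameters.
input_data = json.loads("""
{
    "data": [
        {
            "INPUT_1": {"content": [200], "shape": [1]},
            "beam_width": {"content": [1], "shape": [1]}
        }
    ]
}
""")

def check_shapes(doc):
    # Every tensor's element count must equal the product of its shape dims.
    for entry in doc["data"]:
        for name, tensor in entry.items():
            expected = prod(tensor["shape"])
            if len(tensor["content"]) != expected:
                raise ValueError(
                    f"{name}: {len(tensor['content'])} elements, "
                    f"but shape implies {expected}"
                )
    return True

check_shapes(input_data)  # raises ValueError if any tensor is inconsistent
```

Running this over the full file before benchmarking avoids chasing latency anomalies that are really input errors.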

perf_analyzer command

perf_analyzer -m ensemble --input-data input_data --measurement-mode=count_windows --concurrency-range 1 -b {batch size}
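For reference, the sweep over batch sizes can be scripted. This sketch only assembles the command line used above for each batch size (it does not require perf_analyzer to be installed):

```python
# Base perf_analyzer invocation from above; only the batch size varies.
BASE_CMD = [
    "perf_analyzer",
    "-m", "ensemble",
    "--input-data", "input_data",
    "--measurement-mode=count_windows",
    "--concurrency-range", "1",
]

def command_for_batch(batch_size):
    # Append the -b flag for the requested batch size.
    return BASE_CMD + ["-b", str(batch_size)]

for b in (1, 2, 4):
    print(" ".join(command_for_batch(b)))
```

Each printed line can be run in a shell (or via `subprocess.run`) to reproduce the three measurements in the tables.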