triton-inference-server / fastertransformer_backend

Why processing requests of batch size=1 is much slower than batch size>1

mapcan opened this issue · comments

I did some performance tests of a 3.5B BLOOM 1-GPU model using perf_analyzer. The results are:

| batch size | avg latency (us) |
| --- | --- |
| 1 | 6533769 |
| 2 | 2819328 |
| 4 | 2953732 |

Then I tested the 2-GPU version of the same model. The results are:

| batch size | avg latency (us) |
| --- | --- |
| 1 | 1769113 |
| 2 | 3032188 |
| 4 | 3461972 |

The 1-GPU model is much slower at batch size 1 than at batch size 2 or 4. The 2-GPU model is faster than the 1-GPU model at batch size 1, but slower when batch size > 1. How can these results be explained? Please help.
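One way to read the numbers above: dividing the measured batch latency by the batch size gives the amortized cost per request. This is a small sketch using only the latency values from the tables:

```python
# Average end-to-end latencies (microseconds) measured with perf_analyzer,
# copied from the tables above, keyed by batch size.
lat_1gpu_us = {1: 6533769, 2: 2819328, 4: 2953732}
lat_2gpu_us = {1: 1769113, 2: 3032188, 4: 3461972}

def per_request_us(latencies):
    # Amortized cost per request: batch latency divided by batch size.
    return {b: t / b for b, t in latencies.items()}

print(per_request_us(lat_1gpu_us))
# → {1: 6533769.0, 2: 1409664.0, 4: 738433.0}
print(per_request_us(lat_2gpu_us))
# → {1: 1769113.0, 2: 1516094.0, 4: 865493.0}
```

On one GPU, the amortized cost per request falls from ~6.5 s at batch 1 to ~0.74 s at batch 4, which is the batching effect the question is asking about.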

GPU: Tesla T4
CUDA Version: 11.8
model config: almost the same as all_models/bloom
input data:

{
    "data": [
        {
            "INPUT_0": {
                "content": [
                    "话说大宋仁宗天子在位,嘉祐三年三月三日五更三点,天子驾坐紫宸殿,受百官朝贺。但见:祥云迷凤阁,瑞气罩龙楼。含烟御柳拂旌旗,带露宫花迎剑戟。天香影里,玉簪朱履聚丹墀;仙乐声中,绣袄锦衣扶御驾。珍珠帘卷,黄金殿上现金轝,凤羽扇开,白玉阶前停宝辇。隐隐净鞭三下响,层层文武两班齐。当有殿头官喝道:“有事出班早奏,无事卷帘退朝。”只见班部丛中,宰相赵哲、参政文彦博出班奏曰:“目今京师瘟疫盛行,伤损军民甚多。伏望陛下释罪宽恩,省刑薄税,祈禳天灾,救济万民。”天子听奏,急敕翰林院随即草诏,一面降赦天下罪囚,应有民间税赋,悉皆赦免;一面命在京宫观寺院,修设好事禳灾。不料其年瘟疫转盛,仁宗天子闻知,龙体不安,复会百官计议。向那班部中,有一大臣,越班启奏。天子看时,乃是参知政事范仲淹,拜罢起居,奏曰:“目今天灾盛行,军民涂炭,日夕不能聊生。以臣愚意,要禳此灾,可宣嗣汉天师星夜临朝,就京师禁院,修设三千六百分罗天大醮,奏闻上帝,可以禳保民间瘟疫。”仁宗天子准奏,急令翰林学士草诏一道,天子御笔亲书,并降御香一炷,钦差内外提点殿前太尉洪信为天使,前往江西信州龙虎山,宣请嗣汉天师张真人星夜来朝,祈禳瘟疫。就金殿上焚起御香,亲将丹诏付与洪太尉,即便登程前去。洪信领了圣敕,辞别天子,背了诏书,盛了御香,带了数十人,上了铺马,一行部队,离了东京,取路径投信州贵溪县来。但见:遥山叠翠,远水"
                ],
                "shape": [1]
            },
            "INPUT_1": {
                "content": [
                    200
                ],
                "shape": [1]
            },
            "INPUT_2": {
                "content": [
                    ""
                ],
                "shape": [1]
            },
            "INPUT_3": {
                "content": [
                    ""
                ],
                "shape": [1]
            },
            "temperature": {
                "content": [
                    1.0
                ],
                "shape": [1]
            },
            "repetition_penalty": {
                "content": [
                    1.03
                ],
                "shape": [1]
            },
            "random_seed": {
                "content": [
                    3
                ],
                "shape": [1]
            },
            "is_return_log_probs": {
                "content": [
                    true
                ],
                "shape": [1]
            },
            "is_return_context_embeddings": {
                "content": [
                    true
                ],
                "shape": [1]
            },
            "beam_width": {
                "content": [
                    1
                ],
                "shape": [1]
            },
            "start_id": {
                "content": [
                    1
                ],
                "shape": [1]
            },
            "end_id": {
                "content": [
                    2
                ],
                "shape": [1]
            }
        }
    ]
}
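Malformed entries in a perf_analyzer `--input-data` file (for example, a tensor whose declared shape does not match its content) can be caught with a quick programmatic check. This is a sketch against a reduced, hypothetical stand-in for the file above:

```python
import json
from math import prod

# Reduced, hypothetical stand-in for the input-data file shown above; the
# real file also carries the long prompt string and the other parameters.
input_data = json.loads("""
{
    "data": [
        {
            "INPUT_1": {"content": [200], "shape": [1]},
            "beam_width": {"content": [1], "shape": [1]}
        }
    ]
}
""")

def check_shapes(doc):
    # Every tensor's element count must equal the product of its shape dims.
    for entry in doc["data"]:
        for name, tensor in entry.items():
            expected = prod(tensor["shape"])
            if len(tensor["content"]) != expected:
                raise ValueError(
                    f"{name}: {len(tensor['content'])} elements, "
                    f"but shape implies {expected}"
                )
    return True

check_shapes(input_data)  # raises ValueError if any tensor is inconsistent
```

Running this over the full file before benchmarking avoids chasing latency anomalies that are really input errors.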

perf_analyzer command

perf_analyzer -m ensemble --input-data input_data --measurement-mode=count_windows --concurrency-range 1 -b {batch size}
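For reference, the sweep over batch sizes can be scripted. This sketch only assembles the command line used above for each batch size (it does not require perf_analyzer to be installed):

```python
# Base perf_analyzer invocation from above; only the batch size varies.
BASE_CMD = [
    "perf_analyzer",
    "-m", "ensemble",
    "--input-data", "input_data",
    "--measurement-mode=count_windows",
    "--concurrency-range", "1",
]

def command_for_batch(batch_size):
    # Append the -b flag for the requested batch size.
    return BASE_CMD + ["-b", str(batch_size)]

for b in (1, 2, 4):
    print(" ".join(command_for_batch(b)))
```

Each printed line can be run in a shell (or via `subprocess.run`) to reproduce the three measurements in the tables.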