$ python bench_softmax.py --help
usage: bench_softmax.py [-h] [--nwarmup NWARMUP] [--nloop NLOOP] [--batchsize BATCHSIZE] [--maxlen MAXLEN] [--input_length N [N ...]] [--nhead NHEAD] [--device {cpu,cuda}] [--datatype {fp32,fp16,bf16}] [--label LABEL]
Benchmarking softmax.
options:
-h, --help show this help message and exit
--nwarmup NWARMUP Number of warm-up cycles to run before the actual benchmark. Default is 10.
--nloop NLOOP Number of benchmark cycles to run. Default is 100.
--batchsize BATCHSIZE
Batch size of the benchmark input. Default is 1.
--maxlen MAXLEN Softmax input length limit; benchmarks powers of two from 2**3 up to 2**x < MAXLEN. Cannot be lower than 8.
--input_length N [N ...]
An optional list of target lengths to benchmark, combined with the lengths derived from --maxlen.
--nhead NHEAD Number of self-attention heads. Default is 16 (bert-large).
--device {cpu,cuda} Device to run the benchmark on; any of ['cpu', 'cuda'].
--datatype {fp32,fp16,bf16}
Tensor data type for the benchmark; any of ['fp32', 'fp16', 'bf16'].
--label LABEL Optional label for the process. Defaults to None.
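For reference, the core measurement presumably looks something like the sketch below: build the sweep of lengths (powers of two from 2**3 up to --maxlen, merged with any --input_length values), run --nwarmup untimed cycles, then time --nloop softmax calls over an attention-score-shaped tensor of shape (batchsize, nhead, length, length). This is a minimal illustration only; the helper names (sweep_lengths, bench_softmax) and tensor shapes are assumptions, not the script's actual internals.

import time
import torch

def sweep_lengths(maxlen, extra=()):
    # Powers of two from 2**3 up to 2**x < maxlen (exclusive bound,
    # per the help text), merged with explicit --input_length targets.
    lengths = []
    n = 8
    while n < maxlen:
        lengths.append(n)
        n *= 2
    return sorted(set(lengths) | set(extra))

def bench_softmax(length, batchsize=1, nhead=16, nwarmup=10, nloop=100,
                  device="cuda", dtype=torch.float16):
    # Attention-score-shaped input; softmax runs over the last dim.
    x = torch.randn(batchsize, nhead, length, length,
                    device=device, dtype=dtype)
    for _ in range(nwarmup):                  # untimed warm-up cycles
        torch.softmax(x, dim=-1)
    if device == "cuda":
        torch.cuda.synchronize()              # drain pending kernels
    start = time.perf_counter()
    for _ in range(nloop):
        torch.softmax(x, dim=-1)
    if device == "cuda":
        torch.cuda.synchronize()              # wait for timed kernels
    return (time.perf_counter() - start) / nloop

for n in sweep_lengths(1024, extra=(384,)):
    print(f"len={n:5d}  {bench_softmax(n) * 1e3:.3f} ms")

Note the explicit torch.cuda.synchronize() calls: CUDA kernel launches are asynchronous, so timing on the host without synchronizing measures launch overhead rather than kernel runtime.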
# Install dependencies
pip install torch pandas matplotlib tqdm
# Single GPU
CUDA_VISIBLE_DEVICES=0 python bench_softmax.py --label bert-large --maxlen 1024 --device cuda --datatype fp16
# CPU (pinned to one NUMA node for stable measurements)
numactl --cpunodebind=0 --membind=0 python bench_softmax.py --label bert-large --maxlen 1024 --device cpu --datatype bf16
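The pandas and matplotlib dependencies suggest the script collects per-length results for tabulation and plotting. To chart a run's output yourself, a sketch along these lines works against a CSV of results; the file name and the length/latency_ms columns are hypothetical, so adapt them to the script's actual output format.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical output: one CSV per run, named after --label.
df = pd.read_csv("bert-large.csv")   # assumed columns: length, latency_ms
plt.plot(df["length"], df["latency_ms"], marker="o")
plt.xscale("log", base=2)            # lengths are powers of two
plt.xlabel("softmax input length")
plt.ylabel("latency (ms)")
plt.savefig("softmax_latency.png")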