Only a slight performance improvement (ㄒoㄒ)

Question

480284856 opened this issue 6 months ago

I only got a small improvement over the non-compiled code. Is there anything I missed?

Commands

cli 1:
time python generate.py --compile --compile_prefill --checkpoint_path /root/gpt-fast/codellama-34b-python/model_int8.pth --prompt "def quicksort(arr):" --max_new_tokens 32 --num_samples 50

cli 2:
time python generate.py --checkpoint_path /root/gpt-fast/codellama-34b-python/model_int8.pth --prompt "def quicksort(arr):" --max_new_tokens 32 --num_samples 50

Results

result of cli 1: 4.45 tokens/sec, 151.52 GB/s memory bandwidth
result of cli 2: 4.24 tokens/sec, 144.55 GB/s memory bandwidth
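
As a sanity check on these numbers, here is a small sketch that divides the reported bandwidth by the reported token rate. It assumes the bandwidth figure means "bytes read per generated token times tokens per second", which is my reading and not something stated in this thread. The labels "compiled" and "eager" are only names I use for cli 1 and cli 2.

```python
# Figures copied from the results above: (tokens/sec, GB/s).
runs = {
    "cli 1 (compiled)": (4.45, 151.52),
    "cli 2 (eager)": (4.24, 144.55),
}

# Assumption: bandwidth = GB read per token * tokens per second,
# so dividing the two gives the implied GB read per generated token.
implied_gb_per_token = {
    name: bandwidth / tokens_per_sec
    for name, (tokens_per_sec, bandwidth) in runs.items()
}

for name, gb in implied_gb_per_token.items():
    print(f"{name}: about {gb:.1f} GB read per token")
```

Under that assumption both runs come out at roughly 34 GB per token, which is about what reading the int8 weights of a 34B-parameter model once per token would cost. So the two runs differ only in how fast that same data is streamed.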

relative improvement (compiled vs. not compiled):
speed: 4.9%
memory bandwidth: 4.8%
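
For reference, the percentages can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; `relative_gain` is a helper name made up for this example.

```python
def relative_gain(compiled: float, eager: float) -> float:
    """Percentage improvement of the compiled run over the non-compiled run."""
    return (compiled - eager) / eager * 100.0

# Figures copied from the results above.
speed_gain = relative_gain(4.45, 4.24)          # tokens/sec
bandwidth_gain = relative_gain(151.52, 144.55)  # GB/s

print(f"speed: {speed_gain:.2f}%")
print(f"memory bandwidth: {bandwidth_gain:.2f}%")
```

To two decimals this gives about 4.95% for speed and 4.82% for bandwidth, in line with the rounded figures quoted above.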

Env

gpu: 1 x L40S
docker: python:3.9
pytorch installation: pip install torch

Horace He · Answer 1 · Fri Dec 15 2023 09:51:35 GMT+0800 (China Standard Time)

Are you using PyTorch nightly? This perf seems much worse than I would expect.
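
One quick way to answer the nightly question is to look at the installed version string. A minimal sketch, assuming the usual convention that nightly wheels carry a `.dev<YYYYMMDD>` segment in `torch.__version__` (for example `2.3.0.dev20240101+cu121`); `looks_like_nightly` is a hypothetical helper, not part of PyTorch.

```python
import re

def looks_like_nightly(version: str) -> bool:
    """Return True if the version string contains a '.dev<YYYYMMDD>' segment."""
    return re.search(r"\.dev\d{8}", version) is not None

try:
    import torch
    kind = "nightly" if looks_like_nightly(torch.__version__) else "release or other build"
    print(f"torch {torch.__version__}: {kind}")
except ImportError:
    print("torch is not installed in this environment")
```

A plain `pip install torch`, as listed under Env, normally installs the latest stable release, so this check would be expected to report a release build there.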