Only a slight performance improvement (ㄒoㄒ)
480284856 opened this issue · comments
I only got a small improvement over the non-compiled (eager) run. Is there anything I missed?
Commands
cli 1:
time python generate.py --compile --compile_prefill --checkpoint_path /root/gpt-fast/codellama-34b-python/model_int8.pth --prompt "def quicksort(arr):" --max_new_tokens 32 --num_samples 50
cli 2:
time python generate.py --checkpoint_path /root/gpt-fast/codellama-34b-python/model_int8.pth --prompt "def quicksort(arr):" --max_new_tokens 32 --num_samples 50
Results
result of cli 1: 4.45 tokens/sec, 151.52 GB/s memory bandwidth
result of cli 2: 4.24 tokens/sec, 144.55 GB/s memory bandwidth
relative improvement (compiled vs. not compiled):
speed: 4.9%
memory bandwidth: 4.8%
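For anyone who wants to double-check the arithmetic, here is a minimal sketch that recomputes the relative gains from the figures reported above. The last step rests on an assumption that is not stated in this thread: that the reported bandwidth is roughly the amount of data read per generated token multiplied by tokens/sec. The helper name `relative_improvement` is made up for illustration.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline`."""
    return (new / baseline - 1.0) * 100.0


# Figures reported above, as (tokens/sec, GB/s).
compiled = (4.45, 151.52)  # cli 1: --compile --compile_prefill
eager = (4.24, 144.55)     # cli 2: no compilation

speed_gain = relative_improvement(compiled[0], eager[0])
bandwidth_gain = relative_improvement(compiled[1], eager[1])
print(f"speed: {speed_gain:.2f}%")
print(f"memory bandwidth: {bandwidth_gain:.2f}%")

# Assumption: bandwidth ~= data read per token * tokens/sec,
# so dividing the two gives the implied data read per token.
implied_gb_per_token = compiled[1] / compiled[0]
print(f"implied data read per token: {implied_gb_per_token:.1f} GB")
```

Both gains come out just under 5%. Under the stated assumption, the implied read per token is about 34 GB, which is what one would expect if every token reads the int8 weights of a 34B-parameter model once, so the two reported metrics look consistent with each other.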
Env
gpu: 1x L40S
docker: python:3.9
pytorch installation: pip install torch
Are you using a PyTorch nightly build? This performance seems much worse than I would expect.
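One quick way to answer this is to print the installed version string. The sketch below assumes that nightly builds carry a `.dev` segment in `torch.__version__` (for example `2.3.0.dev20240101+cu121`), while stable releases do not. The helper `is_nightly` is a hypothetical name, not part of PyTorch.

```python
def is_nightly(version: str) -> bool:
    """Heuristic: nightly version strings contain a '.dev' date segment."""
    return ".dev" in version


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        kind = "nightly" if is_nightly(torch.__version__) else "stable release"
        print(f"torch {torch.__version__} ({kind})")
```

A plain `pip install torch`, as listed under Env, would normally install the latest stable release rather than a nightly, so this check would likely report a stable build here.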