pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch

Home Page: https://pytorch.org/executorch/


Converting llama2-based LiteLlama and TinyLlama models produces incoherent outputs

benjamintli opened this issue

I'm trying to follow this README: https://github.com/pytorch/executorch/tree/main/examples/models/llama2 for other llama2-based models, such as TinyLlama 1.1B: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0.

Here's the params.json I wrote (based on the config):

{
    "dim": 2048,
    "multiple_of": 256,
    "hidden_dim": 5632,
    "n_heads": 32,
    "n_kv_heads": 4,
    "n_layers": 22,
    "vocab_size": 32000,
    "norm_eps": 1e-05
}
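
For reference, these values line up with the fields in the model's config.json on Hugging Face (hidden_size, intermediate_size, num_key_value_heads, and so on). Here's a rough sketch of how I derived params.json from that config; the field mapping is my own guess at the correspondence, not an official converter, and multiple_of isn't in the HF config so I just used the conventional 256:

import json

# Sketch: derive a llama2-style params.json from a Hugging Face config.json.
# The field mapping below is an assumption, not an official ExecuTorch utility.
with open("/home/<user>/TinyLlama-1.1B-Chat-v1.0/config.json") as f:
    cfg = json.load(f)

params = {
    "dim": cfg["hidden_size"],                 # 2048 for TinyLlama 1.1B
    "multiple_of": 256,                        # not in the HF config; conventional default
    "hidden_dim": cfg["intermediate_size"],    # 5632
    "n_heads": cfg["num_attention_heads"],     # 32
    "n_kv_heads": cfg["num_key_value_heads"],  # 4 (grouped-query attention)
    "n_layers": cfg["num_hidden_layers"],      # 22
    "vocab_size": cfg["vocab_size"],           # 32000
    "norm_eps": cfg["rms_norm_eps"],           # 1e-05
}

with open("/home/<user>/tiny_llama_output/params.json", "w") as f:
    json.dump(params, f, indent=4)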

I used the torchtune Python script from the README to convert the safetensors in the Hugging Face repo to a state_dict:

from torchtune.utils import FullModelHFCheckpointer
from torchtune.models import convert_weights
import torch

# Convert from safetensors to TorchTune. Suppose the model has been downloaded from Hugging Face
checkpointer = FullModelHFCheckpointer(
    checkpoint_dir='/home/<user>/LiteLlama-460M-1T',
    checkpoint_files=['model.safetensors'],
    output_dir='/home/<user>/TinyLlama-1.1B-Chat-v1.0',
    model_type='LLAMA2' # or other types that TorchTune supports
)

print("loading checkpoint")
sd = checkpointer.load_checkpoint()
sd = convert_weights.tune_to_meta(sd['model'])

print("saving checkpoint")
torch.save(sd, "/home/<user>/tiny_llama_output/checkpoint.pth")
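
As a sanity check (not something from the README), I also dumped a few keys and shapes from the converted checkpoint, assuming convert_weights.tune_to_meta produces Meta's llama2 naming scheme (tok_embeddings.weight, layers.N.attention.wq.weight, ...), and compared them against params.json. Roughly:

import json
import torch

# Sketch of a post-conversion sanity check (my own addition): confirm the
# checkpoint uses Meta-style llama2 key names and that shapes match params.json.
sd = torch.load("/home/<user>/tiny_llama_output/checkpoint.pth", map_location="cpu")
with open("/home/<user>/tiny_llama_output/params.json") as f:
    params = json.load(f)

# Expect keys like 'tok_embeddings.weight', 'layers.0.attention.wq.weight', ...
print(list(sd.keys())[:5])

emb = sd["tok_embeddings.weight"]
assert emb.shape == (params["vocab_size"], params["dim"]), emb.shape

n_layers = len({k.split(".")[1] for k in sd if k.startswith("layers.")})
assert n_layers == params["n_layers"], n_layers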

And here's the command I ran to create the model:

python -m examples.models.llama2.export_llama --checkpoint ~/tiny_llama_output/checkpoint.pth --params ~/tiny_llama_output/params.json -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32

When I try to run this on Android, it gives me this:

./llama_main_xnn -model_path xnnpack_llama2.pte --prompt "hello"
I 00:00:00.001603 executorch:cpuinfo_utils.cpp:61] Reading file /sys/devices/soc0/image_version
I 00:00:00.001723 executorch:cpuinfo_utils.cpp:77] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.001742 executorch:cpuinfo_utils.cpp:157] Number of efficient cores 4
I 00:00:00.001764 executorch:main.cpp:65] Resetting threadpool with num threads = 4
I 00:00:00.004587 executorch:runner.cpp:49] Creating LLaMa runner: model_path=xnnpack_llama2.pte, tokenizer_path=tokenizer.bin
I 00:00:00.872148 executorch:runner.cpp:64] Reading metadata from model
I 00:00:00.872192 executorch:runner.cpp:123] get_vocab_size: 32000
I 00:00:00.872201 executorch:runner.cpp:123] get_bos_id: 1
I 00:00:00.872205 executorch:runner.cpp:123] get_eos_id: 2
I 00:00:00.872209 executorch:runner.cpp:123] get_n_bos: 1
I 00:00:00.872213 executorch:runner.cpp:123] get_n_eos: 1
I 00:00:00.872216 executorch:runner.cpp:123] get_max_seq_len: 128
I 00:00:00.872219 executorch:runner.cpp:123] use_kv_cache: 1
I 00:00:00.872222 executorch:runner.cpp:123] use_sdpa_with_kv_cache: 1
I 00:00:00.872224 executorch:runner.cpp:123] append_eos_to_prompt: 0
hello of the off of does with times- that
 runких经чнеChild prof HauptIg만agi [],oning alleruss barssubsetalo erenlangleich extremauss[-llendenciaingers orIg trifllLL房 систеphaprogrammingingersжданionarioingersauss unusualprogrammingivendent RingчинCR nau Baseball recordinger Er lip chez reactioncolo;& announcedrlнародohlوran Gan изуvl Ligaberend larberryッ straloprogramminglungeniacwissenschaftingersллbazほseinohl davon Dickingers investigateohlingersزDSrayлеijoRecognray Nasensa Braduestanningändohl; Lawrence himsci
chr doctor" developxidelposes
PyTorchObserver {"prompt_tokens":2,"generated_tokens":125,"model_load_start_ms":1716690408662,"model_load_end_ms":1716690409542,"inference_start_ms":1716690409542,"inference_end_ms":1716690414674,"prompt_eval_end_ms":1716690409627,"first_token_ms":1716690409664,"aggregate_sampling_time_ms":236,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:06.016120 executorch:runner.cpp:411] 	Prompt Tokens: 2    Generated Tokens: 125
I 00:00:06.016129 executorch:runner.cpp:417] 	Model Load Time:		0.880000 (seconds)
I 00:00:06.016135 executorch:runner.cpp:427] 	Total inference time:		5.132000 (seconds)		 Rate: 24.356976 (tokens/second)
I 00:00:06.016139 executorch:runner.cpp:435] 		Prompt evaluation:	0.085000 (seconds)		 Rate: 23.529412 (tokens/second)
I 00:00:06.016142 executorch:runner.cpp:446] 		Generated 125 tokens:	5.047000 (seconds)		 Rate: 24.767188 (tokens/second)
I 00:00:06.016145 executorch:runner.cpp:454] 	Time to first generated token:	0.122000 (seconds)
I 00:00:06.016147 executorch:runner.cpp:461] 	Sampling time over 127 tokens:	0.236000 (seconds)

Which is a bit strange; it's definitely not the right output (the same model in GGUF format gives decently coherent responses).

Are the instructions in the README for converting llama 7B models "supposed" to be applicable to any llama2-architecture model? Does anyone know what's wrong with my setup here? Is there going to be a README/guide on how to convert Hugging Face-formatted LLMs into something runnable in ExecuTorch?

@benjamintli, thanks for reporting the issue! It's likely that the quantization regresses the model quality. That can be expected for smaller models, whose weights pack information more densely.

To verify, could you try removing -X -qmode 8da4w --group_size 128 -d fp32 and see if the results are better? If so, try reducing the group_size from 128 to 64 or 32 and see if the quality improves.
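
Concretely, that would be your export command with the quantization flags dropped, and then (if the float output looks coherent) with a smaller group size, e.g.:

# Float (un-quantized, un-delegated) export:
python -m examples.models.llama2.export_llama --checkpoint ~/tiny_llama_output/checkpoint.pth --params ~/tiny_llama_output/params.json -kv --use_sdpa_with_kv_cache

# If the float output is coherent, retry quantization with a smaller group size:
python -m examples.models.llama2.export_llama --checkpoint ~/tiny_llama_output/checkpoint.pth --params ~/tiny_llama_output/params.json -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 64 -d fp32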