huggingface / nanotron

Minimalistic large language model 3D-parallelism training

Continued Pretraining on Llama 7b.

wiseyy opened this issue · comments

Continuing from #78 (comment):

I converted the weights as you described, but unfortunately I cannot get the same sane outputs from the pretrained Llama weights that I get through the HF API, and I am trying to figure out why. The conversion is straightforward except for nanotron's gate_up and qkv weights, since their layout is not documented. I assumed that concatenating the HF weights along dim 0, in the order (gate, up) and (q, k, v), produces equivalent nanotron weights (see the sketch below).
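
For concreteness, this is roughly the mapping I used, as a minimal sketch. The HF keys follow the standard LlamaForCausalLM naming; the nanotron-side key names (gate_up_proj, qkv_proj) are placeholders for wherever the fused weights live in the nanotron checkpoint, not the exact paths.

```python
import torch

def fuse_hf_layer(hf_state: dict, layer_idx: int) -> dict:
    """Fuse one decoder layer's HF Llama weights into the fused layout I assume nanotron uses."""
    p = f"model.layers.{layer_idx}"
    return {
        # (gate, up) concatenated along dim 0 -> gate_up_proj
        "gate_up_proj.weight": torch.cat(
            [hf_state[f"{p}.mlp.gate_proj.weight"],
             hf_state[f"{p}.mlp.up_proj.weight"]],
            dim=0,
        ),
        # (q, k, v) concatenated along dim 0 -> qkv_proj
        "qkv_proj.weight": torch.cat(
            [hf_state[f"{p}.self_attn.q_proj.weight"],
             hf_state[f"{p}.self_attn.k_proj.weight"],
             hf_state[f"{p}.self_attn.v_proj.weight"]],
            dim=0,
        ),
    }
```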

The sources of error I can think of (assuming there is no bug in run_generate.py) are:

  1. Order of qkv matrices in the nanotron format.
  2. Storing the transpose of qkv matrices?
  3. A difference in rotary embeddings compared to the HF implementation (see the permutation sketch below).
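
On point 3: one known gotcha when converting Llama checkpoints is the rotary-embedding layout. The conversion script in transformers (convert_llama_weights_to_hf.py) permutes the q and k projections to go from the original interleaved rotary layout to HF's half-rotated one. A sketch of that permutation and its inverse is below (reproduced from memory; whether nanotron expects the HF layout or the original interleaved one is exactly what I am unsure about):

```python
import torch

def permute_for_hf_rope(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    """Permutation applied to q_proj/k_proj when converting Meta Llama weights to HF's rotary layout."""
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

def unpermute_from_hf_rope(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    """Inverse permutation, in case nanotron expects the original interleaved layout."""
    return w.view(n_heads, 2, dim1 // n_heads // 2, dim2).transpose(1, 2).reshape(dim1, dim2)
```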

Could you please help me out?

Update:
The outputs look somewhat sane. However, they are far from acceptable.
[Screenshot: sample nanotron output, 2024-02-24 at 1:26 AM]
Here, for example, the model starts out coherent but then degenerates into gibberish. This leads me to believe that the weight mapping is correct and that the error is somewhere in the generation code.

I also want to point out that the decode_text function in generation/decode.py does not pass the sampling arguments through to the sampler.

[Screenshot, 2024-02-24 at 1:29 AM]
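
To make concrete what I mean by the sampling arguments, here is a generic sketch of the sampling step they are supposed to drive. The temperature/top_k/top_p names and defaults are illustrative assumptions, not nanotron's actual sampler API or the values I used:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.7,
                      top_k: int = 50, top_p: float = 0.9) -> torch.Tensor:
    """Generic temperature + top-k + top-p sampling; `logits` is (batch, vocab)."""
    logits = logits / max(temperature, 1e-5)
    if top_k > 0:
        # keep only the top_k highest logits
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        # nucleus filtering: drop tokens once cumulative probability exceeds top_p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        mask = probs.cumsum(dim=-1) - probs > top_p  # always keeps at least one token
        sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
```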

The outputs above were generated using decode_tokenized(), which does pass them. The GenerationArgs were as follows:
[Screenshot of the GenerationArgs, 2024-02-24 at 1:30 AM]

The output that the HF API generates for the same weights and input tokens is as follows:
[Screenshot of the HF API output, 2024-02-24 at 1:32 AM]

The quality is a lot better than the text generated by nanotron.

Also, when I prompt the 7b-chat version with a system prompt and user input formatted the default way (see the template below), the nanotron output breaks down altogether.
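
For reference, by "the default way" I mean the documented Llama-2 chat prompt format; the system and user strings below are placeholders:

```python
# Llama-2 chat template as documented for the 7b-chat checkpoints.
# Depending on whether the tokenizer already prepends BOS, the leading "<s>" may need to be dropped.
system_prompt = "You are a helpful assistant."             # placeholder
user_message = "Explain 3D parallelism in one sentence."   # placeholder

prompt = (
    "<s>[INST] <<SYS>>\n"
    f"{system_prompt}\n"
    "<</SYS>>\n\n"
    f"{user_message} [/INST]"
)
```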

This is the HF output:
[Screenshot 2024-02-24 at 1:35 AM]
This is the nanotron output:
[Screenshot 2024-02-24 at 1:36 AM]

  1. Can you suggest reasonable GenerationArgs values that reproduce text generation of similar quality?
  2. Is the generation code doing what it is supposed to do?

@NouamaneTazi do we have a conversion script from transformers to nanotron checkpoints?

Any updates? @xrsrke

@wiseyy I'm facing a similar challenge. Any way we can join forces on this and try to make it work? :)

Glad to know I'm not alone :)

I already took the easier route and am using Megatron-LLM and Meditron. The training throughput, however, is about 2/3 of what nanotron provides. Also, you have to convert the weights back to HF format after training and run inference with HF/vLLM.

I hope that helps you.

@wiseyy unfortunately I can't go the Megatron route (I'm part of a group and we have already committed to nanotron).

Conversion is straightforward

Can you help me get started with this? Maybe if I can reproduce your errors, I'll be able to dig deeper into the issue.