huggingface / nanotron

Minimalistic large language model 3D-parallelism training

Continued Pretraining on Llama 7b.

wiseyy opened this issue · comments

Continuing from #78 (comment):

I converted the weights as you described, but unfortunately I cannot get the same sane outputs from the pretrained Llama weights that I get through the HF API, and I am trying to figure out why. The conversion is straightforward except for nanotron's gate_up and qkv weights, since their layout is not documented. I assumed that concatenating the HF weights along dim 0, in the order (gate, up) and (q, k, v), produces equivalent nanotron weights (see the sketch below).
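
For concreteness, this is roughly the mapping I used, as a minimal sketch. The HF keys follow the standard LlamaForCausalLM naming; the nanotron-side key names (gate_up_proj, qkv_proj) are placeholders for wherever the fused weights live in the nanotron checkpoint, not the exact paths.

```python
import torch

def fuse_hf_layer(hf_state: dict, layer_idx: int) -> dict:
    """Fuse one decoder layer's HF Llama weights into the fused layout I assume nanotron uses."""
    p = f"model.layers.{layer_idx}"
    return {
        # (gate, up) concatenated along dim 0 -> gate_up_proj
        "gate_up_proj.weight": torch.cat(
            [hf_state[f"{p}.mlp.gate_proj.weight"],
             hf_state[f"{p}.mlp.up_proj.weight"]],
            dim=0,
        ),
        # (q, k, v) concatenated along dim 0 -> qkv_proj
        "qkv_proj.weight": torch.cat(
            [hf_state[f"{p}.self_attn.q_proj.weight"],
             hf_state[f"{p}.self_attn.k_proj.weight"],
             hf_state[f"{p}.self_attn.v_proj.weight"]],
            dim=0,
        ),
    }
```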

The sources of error I can think of (assuming there is no bug in run_generate.py) are:

  1. Order of qkv matrices in the nanotron format.
  2. Storing the transpose of qkv matrices?
  3. A difference in rotary embeddings compared to the HF implementation (see the permutation sketch below).
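
On point 3: one known gotcha when converting Llama checkpoints is the rotary-embedding layout. The conversion script in transformers (convert_llama_weights_to_hf.py) permutes the q and k projections to go from the original interleaved rotary layout to HF's half-rotated one. A sketch of that permutation and its inverse is below (reproduced from memory; whether nanotron expects the HF layout or the original interleaved one is exactly what I am unsure about):

```python
import torch

def permute_for_hf_rope(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    """Permutation applied to q_proj/k_proj when converting Meta Llama weights to HF's rotary layout."""
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

def unpermute_from_hf_rope(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    """Inverse permutation, in case nanotron expects the original interleaved layout."""
    return w.view(n_heads, 2, dim1 // n_heads // 2, dim2).transpose(1, 2).reshape(dim1, dim2)
```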

Could you please help me out?

Update:
The outputs look somewhat sane. However, they are far from acceptable.
[Screenshot: sample nanotron output, 2024-02-24 at 1:26 AM]
Here, for example, the model starts out coherent but then degenerates into gibberish. This leads me to believe that the weight mapping is correct and that the error is somewhere in the generation code.

I also want to point out that the decode_text function in generation/decode.py does not pass the sampling arguments through to the sampler.

[Screenshot, 2024-02-24 at 1:29 AM]
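
To make concrete what I mean by the sampling arguments, here is a generic sketch of the sampling step they are supposed to drive. The temperature/top_k/top_p names and defaults are illustrative assumptions, not nanotron's actual sampler API or the values I used:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.7,
                      top_k: int = 50, top_p: float = 0.9) -> torch.Tensor:
    """Generic temperature + top-k + top-p sampling; `logits` is (batch, vocab)."""
    logits = logits / max(temperature, 1e-5)
    if top_k > 0:
        # keep only the top_k highest logits
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        # nucleus filtering: drop tokens once cumulative probability exceeds top_p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        mask = probs.cumsum(dim=-1) - probs > top_p  # always keeps at least one token
        sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
```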

The outputs above were generated using decode_tokenized(), which does pass them. The GenerationArgs were as follows:
[Screenshot of the GenerationArgs, 2024-02-24 at 1:30 AM]

The output that the HF API generates for the same weights and input tokens is as follows:
[Screenshot of the HF API output, 2024-02-24 at 1:32 AM]

The quality is a lot better than the text generated by nanotron.

Also, when I prompt the 7b-chat version with a system prompt and user input formatted the default way (see the template below), the nanotron output breaks down altogether.
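
For reference, by "the default way" I mean the documented Llama-2 chat prompt format; the system and user strings below are placeholders:

```python
# Llama-2 chat template as documented for the 7b-chat checkpoints.
# Depending on whether the tokenizer already prepends BOS, the leading "<s>" may need to be dropped.
system_prompt = "You are a helpful assistant."             # placeholder
user_message = "Explain 3D parallelism in one sentence."   # placeholder

prompt = (
    "<s>[INST] <<SYS>>\n"
    f"{system_prompt}\n"
    "<</SYS>>\n\n"
    f"{user_message} [/INST]"
)
```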

This is the HF output:
[Screenshot 2024-02-24 at 1:35 AM]
This is the nanotron output:
[Screenshot 2024-02-24 at 1:36 AM]

  1. Can you suggest reasonable GenerationArgs values that reproduce text generation of similar quality?
  2. Is the generation code doing what it is supposed to do?

@NouamaneTazi do we have a conversion script from transformers to nanotron checkpoints?

Any updates? @xrsrke

@wiseyy I'm facing a similar challenge. Any way we can join forces on this and try to make it work? :)

Glad to know I'm not alone :)

I already took the easier route and am using Megatron-LLM and Meditron. The training throughput, however, is about 2/3 of what nanotron provides. Also, you have to convert the weights back to HF format after training and run inference with HF/vLLM.

I hope that helps you.

@wiseyy unfortunately I can't go the Megatron route (I'm part of a group and we have already committed to nanotron).

Conversion is straightforward

Can you help me get started with this? Maybe if I can reproduce your errors, I'll be able to dig deeper into the issue.