OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

Support for "mistralai/Mistral-7B-Instruct-v0.1" model

Matthieu-Tinycoaching opened this issue

Hi,

Would it be possible to add support for "mistralai/Mistral-7B-Instruct-v0.1" model?

Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. It may break with much longer inputs, though.

Most people using Mistral will be using it for RAG, meaning it'll probably break without the sliding window attention.

Retrieval-augmented generation: you create a vector database, query it for relevant results, then append those results to the user's query and send both to an LLM for an answer. It lets you ask an LLM about specific information, for example information from after the model's knowledge cutoff date. Very powerful.
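For readers who want to see the shape of that flow, here is a minimal, self-contained sketch; the embed() placeholder and the tiny corpus are hypothetical stand-ins, not part of CTranslate2 or any specific embedding library:

# Minimal RAG sketch: build a small "vector database", retrieve the closest
# chunks for a query, and append them to the prompt sent to the LLM.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; swap in a real model (e.g. a sentence-transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vector = rng.normal(size=384)
    return vector / np.linalg.norm(vector)

corpus = [
    "Mistral 7B uses sliding window attention.",
    "CTranslate2 is a fast inference engine for Transformer models.",
]
index = np.stack([embed(chunk) for chunk in corpus])  # the "vector database"

query = "What attention mechanism does Mistral 7B use?"
scores = index @ embed(query)                         # similarity (vectors are normalized)
top_chunks = [corpus[i] for i in np.argsort(-scores)[:1]]

prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)  # this combined prompt is what gets sent to the LLM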

And what is the common usage of this with a sequence length higher than 4096?

You can certainly do RAG decently under 4096, but typically the point of RAG is to make use of as much context as possible.

But again, the sliding window only affects the attention mask; it does not mean that it will "break".
If something breaks, it is just because the sequence length is way too long and it will OOM by itself.
It does not mean the results will be bad.
Anyway, I am implementing the sliding mask in OpenNMT-py and will check how easy it is to replicate in CTranslate2.
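To illustrate the point that the sliding window only changes the attention mask, here is a small standalone sketch (not code from OpenNMT-py or CTranslate2) of a causal mask restricted to a fixed window:

# Sliding-window causal mask: position i may attend to positions j with
# i - window < j <= i. With window >= seq_len this is a plain causal mask.
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    positions = torch.arange(seq_len)
    causal = positions[None, :] <= positions[:, None]              # j <= i
    in_window = positions[:, None] - positions[None, :] < window   # i - j < window
    return causal & in_window

print(sliding_window_causal_mask(seq_len=6, window=3).int())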

You are right, I misunderstood their article. My apologies.

Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. It may break with much longer inputs, though.

What would be the command to use the llama converter for Mistral?

I've uploaded the converted model to Hugging Face. See here.

When I do this

ct2-transformers-converter --model mistralai/Mistral-7B-v0.1 --quantization int8 --output_dir ./models/ctranslate2 --low_cpu_mem_usage

It outputs

ValueError: No conversion is registered for the model configuration MistralConfig

Maybe I need to change the model type too?

Did you try changing the config here: https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197 to MistralConfig?
If this is not enough, we'll need to add the config; otherwise you can download the converted file directly from @winstxnhdw.
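If you'd rather patch it locally before official support lands, a minimal sketch of that change could look like the following; the LlamaLoader class name is an assumption, so reuse whatever the existing llama loader class is called in your copy of transformers.py:

# Hypothetical minimal patch inside python/ctranslate2/converters/transformers.py:
# reuse the existing llama loader, but register it under MistralConfig and
# report the Mistral architecture name.
@register_loader("MistralConfig")
class MistralLoader(LlamaLoader):  # LlamaLoader: assumed name of the existing loader class
    @property
    def architecture_name(self):
        return "MistralForCausalLM"

The longer, standalone loader posted further down in this thread achieves the same thing without subclassing.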

@winstxnhdw is it possible to share how you did the conversion? I am getting the same error:

ValueError: No conversion is registered for the model configuration MistralConfig

I solved it. I went to https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197

copied llama_loader, created a new function from it, and registered MistralConfig with that new function. Basically: copy the llama loader and register MistralConfig.
[screenshot of the modified loader]

Just a nice reminder: this will behave 100% like Mistral as long as the sequence length is <= 4096 tokens.
It would be interesting to see how it behaves with longer sequences.

When will ctranslate2 support SWA?

Can you please post your code for me instead of a picture of it??

@register_loader("MistralConfig")
class MistralLoader(ModelLoader):
    @property
    def architecture_name(self):
        return "MistralForCausalLM"

    def get_model_spec(self, model):
        num_layers = model.config.num_hidden_layers

        num_heads = model.config.num_attention_heads
        num_heads_kv = getattr(model.config, "num_key_value_heads", num_heads)
        if num_heads_kv == num_heads:
            num_heads_kv = None

        spec = transformer_spec.TransformerDecoderModelSpec.from_config(
            num_layers,
            num_heads,
            activation=common_spec.Activation.SWISH,
            pre_norm=True,
            ffn_glu=True,
            rms_norm=True,
            rotary_dim=0,
            rotary_interleave=False,
            num_heads_kv=num_heads_kv,
        )

        self.set_decoder(spec.decoder, model.model)
        self.set_linear(spec.decoder.projection, model.lm_head)
        return spec

    def get_vocabulary(self, model, tokenizer):
        tokens = super().get_vocabulary(model, tokenizer)

        extra_ids = model.config.vocab_size - len(tokens)
        for i in range(extra_ids):
            tokens.append("<extra_id_%d>" % i)

        return tokens

    def set_vocabulary(self, spec, tokens):
        spec.register_vocabulary(tokens)

    def set_config(self, config, model, tokenizer):
        config.bos_token = tokenizer.bos_token
        config.eos_token = tokenizer.eos_token
        config.unk_token = tokenizer.unk_token
        config.layer_norm_epsilon = model.config.rms_norm_eps

    def set_layer_norm(self, spec, layer_norm):
        spec.gamma = layer_norm.weight

    def set_decoder(self, spec, module):
        spec.scale_embeddings = False
        self.set_embeddings(spec.embeddings, module.embed_tokens)
        self.set_layer_norm(spec.layer_norm, module.norm)

        for layer_spec, layer in zip(spec.layer, module.layers):
            self.set_layer_norm(
                layer_spec.self_attention.layer_norm, layer.input_layernorm
            )
            self.set_layer_norm(
                layer_spec.ffn.layer_norm, layer.post_attention_layernorm
            )

            wq = layer.self_attn.q_proj.weight
            wk = layer.self_attn.k_proj.weight
            wv = layer.self_attn.v_proj.weight
            wo = layer.self_attn.o_proj.weight

            layer_spec.self_attention.linear[0].weight = torch.cat([wq, wk, wv])
            layer_spec.self_attention.linear[1].weight = wo

            self.set_linear(layer_spec.ffn.linear_0, layer.mlp.gate_proj)
            self.set_linear(layer_spec.ffn.linear_0_noact, layer.mlp.up_proj)
            self.set_linear(layer_spec.ffn.linear_1, layer.mlp.down_proj)

            delattr(layer, "self_attn")
            delattr(layer, "mlp")
            gc.collect()

Here's the snippet with which I successfully did the conversion. Not sure if it's worth sending a PR, given that sliding window support is not there yet.
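For completeness, with the loader above registered, the conversion can also be driven from the Python API instead of the CLI; a sketch, where the output directory name is arbitrary:

# Python-API equivalent of the ct2-transformers-converter command used earlier.
import ctranslate2

converter = ctranslate2.converters.TransformersConverter(
    "mistralai/Mistral-7B-Instruct-v0.1",
    low_cpu_mem_usage=True,  # mirrors --low_cpu_mem_usage on the CLI
)
converter.convert("mistral-7b-instruct-ct2", quantization="int8", force=True)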

@register_loader("MistralConfig")
class MistralLoader(ModelLoader):
    @property
    def architecture_name(self):
        return "MistralForCausalLM"

    def get_model_spec(self, model):
        num_layers = model.config.num_hidden_layers

        num_heads = model.config.num_attention_heads
        num_heads_kv = getattr(model.config, "num_key_value_heads", num_heads)
        if num_heads_kv == num_heads:
            num_heads_kv = None

        spec = transformer_spec.TransformerDecoderModelSpec.from_config(
            num_layers,
            num_heads,
            activation=common_spec.Activation.SWISH,
            pre_norm=True,
            ffn_glu=True,
            rms_norm=True,
            rotary_dim=0,
            rotary_interleave=False,
            num_heads_kv=num_heads_kv,
        )

        self.set_decoder(spec.decoder, model.model)
        self.set_linear(spec.decoder.projection, model.lm_head)
        return spec

    def get_vocabulary(self, model, tokenizer):
        tokens = super().get_vocabulary(model, tokenizer)

        extra_ids = model.config.vocab_size - len(tokens)
        for i in range(extra_ids):
            tokens.append("<extra_id_%d>" % i)

        return tokens

    def set_vocabulary(self, spec, tokens):
        spec.register_vocabulary(tokens)

    def set_config(self, config, model, tokenizer):
        config.bos_token = tokenizer.bos_token
        config.eos_token = tokenizer.eos_token
        config.unk_token = tokenizer.unk_token
        config.layer_norm_epsilon = model.config.rms_norm_eps

    def set_layer_norm(self, spec, layer_norm):
        spec.gamma = layer_norm.weight

    def set_decoder(self, spec, module):
        spec.scale_embeddings = False
        self.set_embeddings(spec.embeddings, module.embed_tokens)
        self.set_layer_norm(spec.layer_norm, module.norm)

        for layer_spec, layer in zip(spec.layer, module.layers):
            self.set_layer_norm(
                layer_spec.self_attention.layer_norm, layer.input_layernorm
            )
            self.set_layer_norm(
                layer_spec.ffn.layer_norm, layer.post_attention_layernorm
            )

            wq = layer.self_attn.q_proj.weight
            wk = layer.self_attn.k_proj.weight
            wv = layer.self_attn.v_proj.weight
            wo = layer.self_attn.o_proj.weight

            layer_spec.self_attention.linear[0].weight = torch.cat([wq, wk, wv])
            layer_spec.self_attention.linear[1].weight = wo

            self.set_linear(layer_spec.ffn.linear_0, layer.mlp.gate_proj)
            self.set_linear(layer_spec.ffn.linear_0_noact, layer.mlp.up_proj)
            self.set_linear(layer_spec.ffn.linear_1, layer.mlp.down_proj)

            delattr(layer, "self_attn")
            delattr(layer, "mlp")
            gc.collect()

Here's a snippet which I succesfully conducted the convertion. Not sure if it's good to send out a PR - given the sliding window support is not there yet.

Awesome, any chance we can get a bfloat16 CTranslate2 edition, since the model is originally in bfloat16? That way we can use quantizations at run time other than int8.
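As a side note, the runtime already lets you pick a compute type when loading a converted model, which may cover part of this; a sketch, with the model path being a placeholder:

# Selecting the compute type at load time for a converted model.
import ctranslate2

generator = ctranslate2.Generator(
    "mistral-7b-instruct-ct2",    # placeholder path to the converted model
    device="cuda",
    compute_type="int8_float16",  # or e.g. "float16", "int8", depending on hardware and version
)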

Most people using Mistral will be using it for RAG, meaning it'll probably break without the sliding window attention.

Speaking of RAG. My other posts have been inquiring about getting ctranslate2 to work with the "instructor" class of embedding models like instructor-xl, for example. I'm being serious here: since you successfully converted Mistral by modifying the ctranslate2 scripts, I will actually pay you (or anyone) to either modify the ctranslate2 codebase or customize the scripts for me personally. This is very important to me, so hit me up if you want to discuss. I'd be happy to share my credentials, law firm website, or whatever it takes so we can do this and make payment remotely. Thanks.

I will actually pay you (or anyone) to either modify the ctranslate2 codebase or customize the scripts for me personally.

We second this, i.e. the pay part, although we are focused on healthcare. CTranslate2 is awesome.

I will actually pay you (or anyone) if they either modify the ctranslate2 codebase or customize the scripts for me personally.

we second this, althought we are focussed on healthcare i..e the pay part. ctranslate2 is awesome.

Let's do this, we'll split the cost 50/50 for whichever freelance programmer actually does it. We'll need to discuss the amount of time and money first, of course. ;-)

Confirmed. We are also looking into fine-tuning this model, although it does not need very much.

From our tests, this model works best out of the box, vanilla, across the variety of tests we have for our use case.

I agree, and even though it's a resource hog (relative to other embedding models) it's worth it IMHO.

Speaking of RAG. My other posts have been inquiring about getting ctranslate2 to work with the "instructor" class of embedding models like instructor-xl

Can I ask why you've been so insistent on using instructor-xl over bge-large-en, when bge-large-en has been shown to be more performant and efficient than instructor-xl embeddings in every metric on the leaderboards?

I've just noticed that it performs significantly better when I use it. Not sure why exactly; I know that different models perform differently depending on the type of text being fed to them, but that's just what I've noticed. Any interest?

Will check out the leaderboard and run some tests, thanks.

I've just noticed that it performs significantly better when I use it.

Are you certain that you've prefixed your queries with the following instruction when using bge-large-en-v1.5?

Represent this sentence for searching relevant passages:
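For reference, a sketch of how that prefix is typically applied with sentence-transformers; the model name and example texts are just placeholders, and the prefix goes on queries only, not on the passages being indexed:

# BGE-style retrieval: prepend the fixed instruction to the query,
# encode passages without it.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
instruction = "Represent this sentence for searching relevant passages: "

query_embedding = model.encode(instruction + "How does sliding window attention work?")
passage_embeddings = model.encode([
    "Sliding window attention restricts each token to a fixed-size window of past tokens.",
    "CTranslate2 is a fast inference engine for Transformer models.",
])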

I'm sorry, are you saying that bge-large-en-v1.5 allows you to enter instructions like instructor-xl does?

@winstxnhdw do you have a use case to test #1528? It would require passing a very long prompt (> 4096 tokens, maybe double that) and checking whether it outputs a consistent completion.

Yeah, easily but I am really busy this week. I can maybe test something this weekend. Will update.

Hey guys,

I am facing a problem as I am shifting one of my codebases from GPT to Mistral.

def get_embedding(text, model="sentence-transformers/all-MiniLM-L6-v2"):
  text = text.replace("\n", " ")
  if not text: 
    text = "this is blank"
  return openai.Embedding.create(
          input=[text], model=model)['data'][0]['embedding']


if __name__ == '__main__':
#   gpt_parameter = {"engine": "text-davinci-003", "max_tokens": 50, 
#                    "temperature": 0, "top_p": 1, "stream": False,
#                    "frequency_penalty": 0, "presence_penalty": 0, 
#                    "stop": ['"']}
  gpt_parameter = {"max_tokens": 50, 
                   "temperature": 0, "top_p": 1, "stream": False,
                   "frequency_penalty": 0, "presence_penalty": 0, 
                   "stop": ['"']}
  
  curr_input = ["driving to a friend's house"]
  prompt_lib_file = "prompt_template/test_prompt_July5.txt"
  prompt = generate_prompt(curr_input, prompt_lib_file)

  def __func_validate(gpt_response): 
    if len(gpt_response.strip()) <= 1:
      return False
    if len(gpt_response.strip().split(" ")) > 1: 
      return False
    return True
  def __func_clean_up(gpt_response):
    cleaned_response = gpt_response.strip()
    return cleaned_response

I wanted to know which "Engine" and "Embedding Model" should be used for Mistral.

Looking forward to your help 🙂

That's not remotely how you should be using any open-source model, and let's not pollute this issue any further with irrelevant topics. You can create a new issue for this. Also, it might be useful for you to learn what an API client library is first.

Ideally, there should be a discussion tab for such matters. Maybe @guillaumekln can help enable the tab?

I closed #1528 and worked with @minhthuc2502 on #1524.

Still WIP, not good so far.

We just merged #1524, great teamwork with @minhthuc2502.
Mistral should now run fine with very long inputs. I just recommend using int8_float16 when converting; plain float16 may go OOM quite easily on a 24GB GPU.
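To put the recommendation together, here is a hedged end-to-end sketch; the output directory, prompt, and generation parameters are placeholders:

# Convert with int8_float16 as recommended, then generate from a (possibly very long) prompt.
import ctranslate2
import transformers

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
output_dir = "mistral-7b-instruct-ct2"  # placeholder path

converter = ctranslate2.converters.TransformersConverter(model_id, low_cpu_mem_usage=True)
converter.convert(output_dir, quantization="int8_float16", force=True)

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
generator = ctranslate2.Generator(output_dir, device="cuda")

prompt = "[INST] Summarize the following document: ... [/INST]"  # "..." stands in for a long document
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch(
    [tokens],
    max_length=512,
    sampling_temperature=0.7,
    include_prompt_in_result=False,
)
print(tokenizer.decode(results[0].sequences_ids[0]))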