OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

Support for "mistralai/Mistral-7B-Instruct-v0.1" model

Matthieu-Tinycoaching opened this issue

Hi,

Would it be possible to add support for "mistralai/Mistral-7B-Instruct-v0.1" model?

Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. It may break with much longer inputs, though.

Most people using Mistral will be using it for RAG, meaning it'll probably break without the sliding window attention.

Retrieval-augmented generation: you create a vector database, query it for relevant results, then append those results to the user's query and send both to an LLM for an answer. It lets you ask an LLM about specific information, for example information from after the model's knowledge cutoff date. Very powerful.
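For readers who want to see the shape of that flow, here is a minimal, self-contained sketch; the embed() placeholder and the tiny corpus are hypothetical stand-ins, not part of CTranslate2 or any specific embedding library:

# Minimal RAG sketch: build a small "vector database", retrieve the closest
# chunks for a query, and append them to the prompt sent to the LLM.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; swap in a real model (e.g. a sentence-transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vector = rng.normal(size=384)
    return vector / np.linalg.norm(vector)

corpus = [
    "Mistral 7B uses sliding window attention.",
    "CTranslate2 is a fast inference engine for Transformer models.",
]
index = np.stack([embed(chunk) for chunk in corpus])  # the "vector database"

query = "What attention mechanism does Mistral 7B use?"
scores = index @ embed(query)                         # similarity (vectors are normalized)
top_chunks = [corpus[i] for i in np.argsort(-scores)[:1]]

prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)  # this combined prompt is what gets sent to the LLM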

And what is the common usage of this with a sequence length higher than 4096?

You can certainly do RAG decently under 4096, but typically the point of RAG is to make use of as much context as possible.

But again, the sliding window only affects the attention mask; it does not mean that it will "break".
If something breaks, it is just because the sequence length is way too long and it will OOM by itself.
It does not mean the results will be bad.
Anyway, I am implementing the sliding mask in OpenNMT-py and will check how easy it is to replicate in CTranslate2.
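To illustrate the point that the sliding window only changes the attention mask, here is a small standalone sketch (not code from OpenNMT-py or CTranslate2) of a causal mask restricted to a fixed window:

# Sliding-window causal mask: position i may attend to positions j with
# i - window < j <= i. With window >= seq_len this is a plain causal mask.
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    positions = torch.arange(seq_len)
    causal = positions[None, :] <= positions[:, None]              # j <= i
    in_window = positions[:, None] - positions[None, :] < window   # i - j < window
    return causal & in_window

print(sliding_window_causal_mask(seq_len=6, window=3).int())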

You are right, I misunderstood their article. My apologies.

Just use the llama converter. It works fine, at least for MMLU evaluation, even without the sliding window attention implementation. It may break with much longer inputs, though.

What would be the command to use the llama converter for Mistral?

I've uploaded the converted model to Hugging Face. See here.

When I do this

ct2-transformers-converter --model mistralai/Mistral-7B-v0.1 --quantization int8 --output_dir ./models/ctranslate2 --low_cpu_mem_usage

It outputs

ValueError: No conversion is registered for the model configuration MistralConfig

Maybe I need to change the model type too?

Did you try changing the config here: https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197 to MistralConfig?
If this is not enough, we'll need to add the config; otherwise you can download the converted file directly from @winstxnhdw.
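If you'd rather patch it locally before official support lands, a minimal sketch of that change could look like the following; the LlamaLoader class name is an assumption, so reuse whatever the existing llama loader class is called in your copy of transformers.py:

# Hypothetical minimal patch inside python/ctranslate2/converters/transformers.py:
# reuse the existing llama loader, but register it under MistralConfig and
# report the Mistral architecture name.
@register_loader("MistralConfig")
class MistralLoader(LlamaLoader):  # LlamaLoader: assumed name of the existing loader class
    @property
    def architecture_name(self):
        return "MistralForCausalLM"

The longer, standalone loader posted further down in this thread achieves the same thing without subclassing.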

@winstxnhdw is it possible to share how you did the conversion? I am getting the same error:

ValueError: No conversion is registered for the model configuration MistralConfig

I solved it. I went to https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1197

copied llama_loader, created a new function from it, and registered MistralConfig with that new function. Basically: copy the llama loader and register MistralConfig.
[screenshot of the modified loader]

Just a nice reminder: this will behave 100% like Mistral as long as the sequence length is <= 4096 tokens.
It would be interesting to see how it behaves with longer sequences.

When will ctranslate2 support SWA?

Can you please post your code for me instead of a picture of it??

@register_loader("MistralConfig")
class MistralLoader(ModelLoader):
    @property
    def architecture_name(self):
        return "MistralForCausalLM"

    def get_model_spec(self, model):
        num_layers = model.config.num_hidden_layers

        num_heads = model.config.num_attention_heads
        num_heads_kv = getattr(model.config, "num_key_value_heads", num_heads)
        if num_heads_kv == num_heads:
            num_heads_kv = None

        spec = transformer_spec.TransformerDecoderModelSpec.from_config(
            num_layers,
            num_heads,
            activation=common_spec.Activation.SWISH,
            pre_norm=True,
            ffn_glu=True,
            rms_norm=True,
            rotary_dim=0,
            rotary_interleave=False,
            num_heads_kv=num_heads_kv,
        )

        self.set_decoder(spec.decoder, model.model)
        self.set_linear(spec.decoder.projection, model.lm_head)
        return spec

    def get_vocabulary(self, model, tokenizer):
        tokens = super().get_vocabulary(model, tokenizer)

        extra_ids = model.config.vocab_size - len(tokens)
        for i in range(extra_ids):
            tokens.append("<extra_id_%d>" % i)

        return tokens

    def set_vocabulary(self, spec, tokens):
        spec.register_vocabulary(tokens)

    def set_config(self, config, model, tokenizer):
        config.bos_token = tokenizer.bos_token
        config.eos_token = tokenizer.eos_token
        config.unk_token = tokenizer.unk_token
        config.layer_norm_epsilon = model.config.rms_norm_eps

    def set_layer_norm(self, spec, layer_norm):
        spec.gamma = layer_norm.weight

    def set_decoder(self, spec, module):
        spec.scale_embeddings = False
        self.set_embeddings(spec.embeddings, module.embed_tokens)
        self.set_layer_norm(spec.layer_norm, module.norm)

        for layer_spec, layer in zip(spec.layer, module.layers):
            self.set_layer_norm(
                layer_spec.self_attention.layer_norm, layer.input_layernorm
            )
            self.set_layer_norm(
                layer_spec.ffn.layer_norm, layer.post_attention_layernorm
            )

            wq = layer.self_attn.q_proj.weight
            wk = layer.self_attn.k_proj.weight
            wv = layer.self_attn.v_proj.weight
            wo = layer.self_attn.o_proj.weight

            layer_spec.self_attention.linear[0].weight = torch.cat([wq, wk, wv])
            layer_spec.self_attention.linear[1].weight = wo

            self.set_linear(layer_spec.ffn.linear_0, layer.mlp.gate_proj)
            self.set_linear(layer_spec.ffn.linear_0_noact, layer.mlp.up_proj)
            self.set_linear(layer_spec.ffn.linear_1, layer.mlp.down_proj)

            delattr(layer, "self_attn")
            delattr(layer, "mlp")
            gc.collect()

Here's the snippet with which I successfully did the conversion. Not sure if it's worth sending a PR, given that sliding window support is not there yet.
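For completeness, with the loader above registered, the conversion can also be driven from the Python API instead of the CLI; a sketch, where the output directory name is arbitrary:

# Python-API equivalent of the ct2-transformers-converter command used earlier.
import ctranslate2

converter = ctranslate2.converters.TransformersConverter(
    "mistralai/Mistral-7B-Instruct-v0.1",
    low_cpu_mem_usage=True,  # mirrors --low_cpu_mem_usage on the CLI
)
converter.convert("mistral-7b-instruct-ct2", quantization="int8", force=True)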

@register_loader("MistralConfig")
class MistralLoader(ModelLoader):
    @property
    def architecture_name(self):
        return "MistralForCausalLM"

    def get_model_spec(self, model):
        num_layers = model.config.num_hidden_layers

        num_heads = model.config.num_attention_heads
        num_heads_kv = getattr(model.config, "num_key_value_heads", num_heads)
        if num_heads_kv == num_heads:
            num_heads_kv = None

        spec = transformer_spec.TransformerDecoderModelSpec.from_config(
            num_layers,
            num_heads,
            activation=common_spec.Activation.SWISH,
            pre_norm=True,
            ffn_glu=True,
            rms_norm=True,
            rotary_dim=0,
            rotary_interleave=False,
            num_heads_kv=num_heads_kv,
        )

        self.set_decoder(spec.decoder, model.model)
        self.set_linear(spec.decoder.projection, model.lm_head)
        return spec

    def get_vocabulary(self, model, tokenizer):
        tokens = super().get_vocabulary(model, tokenizer)

        extra_ids = model.config.vocab_size - len(tokens)
        for i in range(extra_ids):
            tokens.append("<extra_id_%d>" % i)

        return tokens

    def set_vocabulary(self, spec, tokens):
        spec.register_vocabulary(tokens)

    def set_config(self, config, model, tokenizer):
        config.bos_token = tokenizer.bos_token
        config.eos_token = tokenizer.eos_token
        config.unk_token = tokenizer.unk_token
        config.layer_norm_epsilon = model.config.rms_norm_eps

    def set_layer_norm(self, spec, layer_norm):
        spec.gamma = layer_norm.weight

    def set_decoder(self, spec, module):
        spec.scale_embeddings = False
        self.set_embeddings(spec.embeddings, module.embed_tokens)
        self.set_layer_norm(spec.layer_norm, module.norm)

        for layer_spec, layer in zip(spec.layer, module.layers):
            self.set_layer_norm(
                layer_spec.self_attention.layer_norm, layer.input_layernorm
            )
            self.set_layer_norm(
                layer_spec.ffn.layer_norm, layer.post_attention_layernorm
            )

            wq = layer.self_attn.q_proj.weight
            wk = layer.self_attn.k_proj.weight
            wv = layer.self_attn.v_proj.weight
            wo = layer.self_attn.o_proj.weight

            layer_spec.self_attention.linear[0].weight = torch.cat([wq, wk, wv])
            layer_spec.self_attention.linear[1].weight = wo

            self.set_linear(layer_spec.ffn.linear_0, layer.mlp.gate_proj)
            self.set_linear(layer_spec.ffn.linear_0_noact, layer.mlp.up_proj)
            self.set_linear(layer_spec.ffn.linear_1, layer.mlp.down_proj)

            delattr(layer, "self_attn")
            delattr(layer, "mlp")
            gc.collect()

Here's a snippet which I succesfully conducted the convertion. Not sure if it's good to send out a PR - given the sliding window support is not there yet.

Awesome, any chance we can get a bfloat16 CTranslate2 edition, since the model is originally in bfloat16? That way we can use quantizations at run time other than int8.
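As a side note, the runtime already lets you pick a compute type when loading a converted model, which may cover part of this; a sketch, with the model path being a placeholder:

# Selecting the compute type at load time for a converted model.
import ctranslate2

generator = ctranslate2.Generator(
    "mistral-7b-instruct-ct2",    # placeholder path to the converted model
    device="cuda",
    compute_type="int8_float16",  # or e.g. "float16", "int8", depending on hardware and version
)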

Most people using Mistral will be using it for RAG, meaning it'll probably break without the sliding window attention.

Speaking of RAG. My other posts have been inquiring about getting ctranslate2 to work with the "instructor" class of embedding models like instructor-xl, for example. I'm being serious here: since you successfully converted Mistral by modifying the ctranslate2 scripts, I will actually pay you (or anyone) to either modify the ctranslate2 codebase or customize the scripts for me personally. This is very important to me, so hit me up if you want to discuss. I'd be happy to share my credentials, law firm website, or whatever it takes so we can do this and make payment remotely. Thanks.

I will actually pay you (or anyone) to either modify the ctranslate2 codebase or customize the scripts for me personally.

We second this, i.e. the pay part, although we are focused on healthcare. CTranslate2 is awesome.

I will actually pay you (or anyone) if they either modify the ctranslate2 codebase or customize the scripts for me personally.

we second this, althought we are focussed on healthcare i..e the pay part. ctranslate2 is awesome.

Let's do this, we'll split the cost 50/50 for whichever freelance programmer actually does it. We'll need to discuss the amount of time and money first, of course. ;-)

Confirmed. We are also looking into fine-tuning this model, although it does not need very much.

From our tests, this model works best out of the box, vanilla, across the variety of tests we have for our use case.

I agree, and even though it's a resource hog (relative to other embedding models) it's worth it IMHO.

Speaking of RAG. My other posts have been inquiring about getting ctranslate2 to work with the "instructor" class of embedding models like instructor-xl

Can I ask why you've been so insistent on using instructor-xl over bge-large-en, when bge-large-en has been shown to be more performant and efficient than instructor-xl embeddings in every metric on the leaderboards?

I've just noticed that it performs significantly better when I use it. Not sure why exactly; I know that different models perform differently depending on the type of text being fed to them, but that's just what I've noticed. Any interest?

Will check out the leaderboard and run some tests, thanks.

I've just noticed that it performs significantly better when I use it.

Are you certain that you've prefixed your queries with the following instruction when using bge-large-en-v1.5?

Represent this sentence for searching relevant passages:
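For reference, a sketch of how that prefix is typically applied with sentence-transformers; the model name and example texts are just placeholders, and the prefix goes on queries only, not on the passages being indexed:

# BGE-style retrieval: prepend the fixed instruction to the query,
# encode passages without it.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
instruction = "Represent this sentence for searching relevant passages: "

query_embedding = model.encode(instruction + "How does sliding window attention work?")
passage_embeddings = model.encode([
    "Sliding window attention restricts each token to a fixed-size window of past tokens.",
    "CTranslate2 is a fast inference engine for Transformer models.",
])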

I'm sorry, are you saying that bge-large-en-v1.5 allows you to enter instructions like instructor-xl does?

@winstxnhdw do you have a use case to test #1528? It would require passing a very long prompt (> 4096 tokens, maybe double that) and checking whether it outputs a consistent completion.

Yeah, easily but I am really busy this week. I can maybe test something this weekend. Will update.

Hey guys,

I am facing a problem as I am shifting one of my codebases from GPT to Mistral.

def get_embedding(text, model="sentence-transformers/all-MiniLM-L6-v2"):
  text = text.replace("\n", " ")
  if not text: 
    text = "this is blank"
  return openai.Embedding.create(
          input=[text], model=model)['data'][0]['embedding']


if __name__ == '__main__':
#   gpt_parameter = {"engine": "text-davinci-003", "max_tokens": 50, 
#                    "temperature": 0, "top_p": 1, "stream": False,
#                    "frequency_penalty": 0, "presence_penalty": 0, 
#                    "stop": ['"']}
  gpt_parameter = {"max_tokens": 50, 
                   "temperature": 0, "top_p": 1, "stream": False,
                   "frequency_penalty": 0, "presence_penalty": 0, 
                   "stop": ['"']}
  
  curr_input = ["driving to a friend's house"]
  prompt_lib_file = "prompt_template/test_prompt_July5.txt"
  prompt = generate_prompt(curr_input, prompt_lib_file)

  def __func_validate(gpt_response): 
    if len(gpt_response.strip()) <= 1:
      return False
    if len(gpt_response.strip().split(" ")) > 1: 
      return False
    return True
  def __func_clean_up(gpt_response):
    cleaned_response = gpt_response.strip()
    return cleaned_response

I wanted to know which "Engine" and "Embedding Model" should be used for Mistral.

Looking forward to your help 🙂

That's not remotely how you should be using any open-source model, and let's not pollute this issue any further with irrelevant topics. You can create a new issue for this. Also, it might be useful for you to learn what an API client library is first.

Ideally, there should be a discussion tab for such matters. Maybe @guillaumekln can help enable the tab?

I closed #1528 and worked with @minhthuc2502 on #1524.

Still WIP, not good so far.

We just merged #1524, great teamwork with @minhthuc2502.
Mistral should now run fine with very long inputs. I just recommend using int8_float16 when converting; plain float16 may go OOM quite easily on a 24GB GPU.
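To put the recommendation together, here is a hedged end-to-end sketch; the output directory, prompt, and generation parameters are placeholders:

# Convert with int8_float16 as recommended, then generate from a (possibly very long) prompt.
import ctranslate2
import transformers

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
output_dir = "mistral-7b-instruct-ct2"  # placeholder path

converter = ctranslate2.converters.TransformersConverter(model_id, low_cpu_mem_usage=True)
converter.convert(output_dir, quantization="int8_float16", force=True)

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
generator = ctranslate2.Generator(output_dir, device="cuda")

prompt = "[INST] Summarize the following document: ... [/INST]"  # "..." stands in for a long document
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch(
    [tokens],
    max_length=512,
    sampling_temperature=0.7,
    include_prompt_in_result=False,
)
print(tokenizer.decode(results[0].sequences_ids[0]))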