Problem converting Phi3-instruct-128k; "su" rope scaling in Phi-3
BBC-Esq opened this issue
Hello peeps, it's me again. The new converter works great with Phi3 but doesn't work with the 128k version located here:
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
After much chagrin, I had a scintillating conversation with Claude Opus and he/she/it gave me an outline of what to do. However, I'm posting the errors I received as well for your benefit. Hope this helps!
Trying to get it to work with the phi3-instruct-128k model. I ran converter.py in the main branch and it gave me this error, in relevant part:
ERROR
```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\Scripts\benchmark_chat\Scripts\ct2-transformers-converter.exe\__main__.py", line 7, in <module>
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\transformers.py", line 2200, in main
    converter.convert_from_args(args)
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\converter.py", line 50, in convert_from_args
    return self.convert(
           ^^^^^^^^^^^^^
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\converter.py", line 89, in convert
    model_spec = self._load()
                 ^^^^^^^^^^^^
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\transformers.py", line 141, in _load
    spec = loader(model, tokenizer)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\transformers.py", line 193, in __call__
    spec = self.get_model_spec(model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\transformers.py", line 1698, in get_model_spec
    rotary_scaling_factor = rope_scaling["factor"]
                            ~~~~~~~~~~~~^^^^^^^^^^
KeyError: 'factor'
```
ChatGPT said to modify it as set forth in this pull request, and now it's giving me a different error saying that CTranslate2 only supports "linear" rope scaling and that it needs to use "su", whatever that is.
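For context, the 128k model's config.json appears to use a rope_scaling block shaped like the sketch below, which has no "factor" key at all; that's where the original KeyError comes from. (The values here are made up for illustration; the real lists are much longer.)

```python
# Rough sketch of the rope_scaling block in Phi-3-mini-128k-instruct's
# config.json (values trimmed and invented for illustration; the real
# short_factor/long_factor lists hold one value per rotary frequency).
rope_scaling = {
    "type": "su",
    "short_factor": [1.05, 1.05, 1.05],  # per-frequency scales for short contexts
    "long_factor": [1.03, 1.05, 1.07],   # per-frequency scales for long contexts
}

# The converter does rope_scaling["factor"], which only exists for
# "linear"-style configs, hence KeyError: 'factor'.
```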
NEW ERROR
```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\Scripts\benchmark_chat\Scripts\ct2-transformers-converter.exe\__main__.py", line 7, in <module>
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\transformers.py", line 2199, in main
    converter.convert_from_args(args)
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\converter.py", line 50, in convert_from_args
    return self.convert(
           ^^^^^^^^^^^^^
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\converter.py", line 89, in convert
    model_spec = self._load()
                 ^^^^^^^^^^^^
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\transformers.py", line 141, in _load
    spec = loader(model, tokenizer)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\transformers.py", line 193, in __call__
    spec = self.get_model_spec(model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\benchmark_chat\Lib\site-packages\ctranslate2\converters\transformers.py", line 1700, in get_model_spec
    raise NotImplementedError(
NotImplementedError: RoPE scaling type 'su' is not yet implemented. The following RoPE scaling types are currently supported: linear
```
Since I don't even know what "rope" is, let alone "linear" or "su," I've done this legwork and am now passing it off to you all as the experts. Hope this helps. Would be good to be able to use this model in general and bench it.
[EDIT]
Here's some additional legwork that I did, hope that it helps!
Here's what Claude Opus said after some minor questioning and feeding of scripts:
Update the `_SUPPORTED_ROPE_SCALING` dictionary:
- Open the `transformers.py` file in the CTranslate2 converter.
- Locate the `_SUPPORTED_ROPE_SCALING` dictionary.
- Add an entry for the 'su' scaling type, mapping it to the corresponding `attention_spec.RotaryScalingType` enum value (see the sketch after this list).
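Here's roughly what I think that first change would look like (just a sketch; the `Su` member name is my guess, and the existing dictionary contents may differ between CTranslate2 versions):

```python
from ctranslate2.specs import attention_spec

# Proposed addition in ctranslate2/converters/transformers.py (sketch only;
# the "su" entry and the RotaryScalingType.Su member don't exist upstream yet).
_SUPPORTED_ROPE_SCALING = {
    "linear": attention_spec.RotaryScalingType.Linear,
    "su": attention_spec.RotaryScalingType.Su,  # new entry for Phi-3 128k
}
```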
Modify the `RotaryScalingType` enum:
- Open the `attention_spec.py` file.
- Find the `RotaryScalingType` enum definition.
- Add a new enum value for the 'su' scaling type (sketched below).
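Presumably something along these lines (the existing `Linear` member and its value are guesses based on the error message, so check the actual file):

```python
import enum

# Proposed change in ctranslate2/specs/attention_spec.py (sketch; existing
# member names/values assumed, since only "linear" is currently supported).
class RotaryScalingType(enum.IntEnum):
    Linear = 0
    Su = 1  # new value for the 'su' scaling type
```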
Update the CTranslate2 library's C++ code:
- Open the `include/ctranslate2/layers/attention.h` file.
- Locate the `RotaryScalingType` enum definition.
- Add a new enum value for the 'su' scaling type.
- Open the `src/layers/attention.cc` file.
- Find the relevant functions that handle RoPE scaling (e.g., `dot_product_attention`).
- Modify these functions to handle the 'su' scaling type correctly based on its mathematical formulation (sketched in Python after this list).
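I obviously can't write the C++, but from what I can tell, the extra scaling those functions would need to reproduce looks like this in Python (paraphrased from the Hugging Face reference code, so treat the names and formula as assumptions):

```python
import math

def su_scaling_factor(max_position_embeddings: int,
                      original_max_position_embeddings: int) -> float:
    """Amplitude correction applied to the cos/sin embeddings under 'su'
    scaling (paraphrased from the Hugging Face Phi-3 reference code)."""
    scale = max_position_embeddings / original_max_position_embeddings
    if scale <= 1.0:
        # No context extension, no correction.
        return 1.0
    return math.sqrt(1 + math.log(scale) / math.log(original_max_position_embeddings))

# Phi-3-mini-128k extends 4096 to 131072, a 32x extension.
print(su_scaling_factor(131072, 4096))  # ~1.19
```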
Did some additional legwork on this "su" scaling and here's what I came up with... hope it helps, and hope that implementing it still allows someone to use the new flash attention. And as I'm learning, apparently it's useful when working with large language models to be knowledgeable about a little thing called "math..."
Link to su rope scaling as a jumping off point for ya...
Here's a summary of how it's implemented overall in the script, unless I'm mistaken...
Phi3SuScaledRotaryEmbedding Class
- Inheritance: inherits from `Phi3RotaryEmbedding`.
- Initialization:
  - Initializes `self.short_factor` and `self.long_factor` from `config.rope_scaling`. These factors are used to scale the frequencies of the rotary embeddings based on the sequence length.
  - Initializes `self.original_max_position_embeddings` from `config.original_max_position_embeddings`. This value is used as the threshold for deciding whether to apply the `short_factor` or the `long_factor` scaling.
- Method Overrides:
  - Overrides the `forward` method to apply "su" rope scaling based on sequence length (see the sketch after this list):
    - If the sequence length is greater than `self.original_max_position_embeddings`, it applies the `long_factor` scaling; otherwise, it applies the `short_factor` scaling.
    - The scaling is done by dividing the inverse frequencies (`self.inv_freq`) by the respective factors.
    - The scaled inverse frequencies are then used to compute the rotary embeddings.
    - The embeddings are further scaled by a `scaling_factor` that depends on the ratio of `max_position_embeddings` to `original_max_position_embeddings`.
    - The resulting scaled cosine and sine embeddings are returned.
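Putting that together, here's a minimal sketch of what the forward pass seems to do, paraphrased from the Hugging Face reference implementation and simplified to 1-D position ids (the names and details are my assumptions, not CTranslate2 code):

```python
import math
import torch

def su_rotary_cos_sin(position_ids, dim, base,
                      short_factor, long_factor,
                      max_position_embeddings,
                      original_max_position_embeddings):
    """Compute 'su'-scaled rotary cos/sin tables (sketch, simplified to 1-D
    position ids; paraphrased from Hugging Face's Phi3SuScaledRotaryEmbedding)."""
    seq_len = int(position_ids.max()) + 1

    # Pick the per-frequency factors by comparing the sequence length
    # to the original (pre-extension) context window.
    factors = long_factor if seq_len > original_max_position_embeddings else short_factor
    ext = torch.tensor(factors, dtype=torch.float32)  # needs dim // 2 entries

    # Scaled inverse frequencies: each frequency is divided by its factor.
    inv_freq = 1.0 / (ext * base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

    # Standard rotary angles for each position/frequency pair.
    freqs = torch.outer(position_ids.float(), inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)

    # Amplitude correction from the context-extension ratio.
    scale = max_position_embeddings / original_max_position_embeddings
    scaling_factor = 1.0 if scale <= 1.0 else math.sqrt(
        1 + math.log(scale) / math.log(original_max_position_embeddings))

    return emb.cos() * scaling_factor, emb.sin() * scaling_factor
```

If I've read the config right, plugging in the 128k model's numbers (head dimension 96, so 48 factors per list, with the window extended from 4096 to 131072) scales the cos/sin tables by roughly 1.19.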
Phi3Attention Class
- Method Details:
  - In the `_init_rope` method (sketched after this list):
    - Checks whether `self.rope_scaling` is `None`.
    - If a rope scaling configuration is provided, it determines the scaling type from `self.config.rope_scaling["type"]`.
    - If `scaling_type == "su"`, it initializes `self.rotary_emb` as an instance of `Phi3SuScaledRotaryEmbedding`, which ensures the "su" rope scaling is applied to the rotary embeddings during the attention computation.
    - The `Phi3SuScaledRotaryEmbedding` instance is created with the appropriate configuration, including the `dim` (head dimension) and `config` (model configuration).
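And the dispatch itself, again paraphrased and simplified from the Hugging Face modeling code rather than anything in CTranslate2 (the real method also handles other scaling types such as "yarn"):

```python
# These classes ship with the transformers versions that added Phi-3 support;
# the import path is an assumption, so verify against your installed version.
from transformers.models.phi3.modeling_phi3 import (
    Phi3RotaryEmbedding,
    Phi3SuScaledRotaryEmbedding,
)

def _init_rope(self):
    if self.config.rope_scaling is None:
        # No scaling: plain rotary embeddings over the base context window.
        self.rotary_emb = Phi3RotaryEmbedding(
            self.head_dim,
            max_position_embeddings=self.max_position_embeddings,
            base=self.rope_theta,
        )
    else:
        scaling_type = self.config.rope_scaling["type"]
        if scaling_type == "su":
            # The class pulls short_factor/long_factor and
            # original_max_position_embeddings out of the config itself.
            self.rotary_emb = Phi3SuScaledRotaryEmbedding(self.head_dim, self.config)
        else:
            raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
```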