Add phi3-mini

Question

kyakuno opened this issue a month ago · comments

Kazuki Kyakuno commented a month ago

A mini-sized LLM from Microsoft.
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

Kazuki Kyakuno · Answer 1 · Wed Apr 24 2024 13:45:59 GMT+0800 (China Standard Time)

An official ONNX version may be provided.
https://onnxruntime.ai/blogs/accelerating-phi-3

Kazuki Kyakuno · Answer 2 · Wed Apr 24 2024 19:45:34 GMT+0800 (China Standard Time)

An official ONNX version has now been released.
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx

Kazuki Kyakuno · Answer 3 · Wed Apr 24 2024 19:49:29 GMT+0800 (China Standard Time)

The generate API will need to be written in Python.

Kazuki Kyakuno · Answer 4 · Wed Apr 24 2024 20:11:36 GMT+0800 (China Standard Time)

An example of inference code.
microsoft/onnxruntime#20448

Kazuki Kyakuno · Answer 5 · Wed Apr 24 2024 20:11:46 GMT+0800 (China Standard Time)

With the beta version of onnxruntime, the following works.

import time

import onnxruntime_genai as og

# Raw string so the Windows backslashes are not read as escape sequences.
model = og.Model(r".\Phi-3-mini-128k-instruct-onnx\cpu_and_mobile\cpu-int4-rtn-block-32")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()


def input_llm(text):
    # Encode the prompt and return a generator primed with its token IDs.
    print("Question:", text)
    input_tokens = tokenizer.encode(text)
    params = og.GeneratorParams(model)
    params.try_use_cuda_graph_with_max_batch_size(1)
    params.input_ids = input_tokens
    generator = og.Generator(model, params)
    return generator


def output_llm(generator):
    # Generate one token at a time until the generator reports completion.
    print("Answer:")
    stt = time.time()
    list_error = []     # token IDs that the streaming decoder failed on
    list_sentence = []  # decoded text fragments, or raw IDs on first failure
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        # Tokens that already failed once are skipped on later occurrences.
        if new_token not in list_error:
            try:
                list_sentence.append(tokenizer_stream.decode(new_token))
            except Exception:
                list_error.append(new_token)
                list_sentence.append(new_token)
    print(list_sentence)
    fin = time.time()
    print(fin - stt)  # elapsed generation time in seconds
    return list_error

Kazuki Kyakuno · Answer 6 · Wed Apr 24 2024 20:13:11 GMT+0800 (China Standard Time)

The onnxruntime_genai source code.
https://github.com/microsoft/onnxruntime-genai

Kazuki Kyakuno · Answer 7 · Wed Apr 24 2024 20:14:02 GMT+0800 (China Standard Time)

generate is implemented in C++, so it would probably be better to port the PyTorch implementation instead.

Kazuki Kyakuno · Answer 8 · Wed Apr 24 2024 20:15:09 GMT+0800 (China Standard Time)

For now, using the tokenizer from transformers seems like a good option.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

messages = [
    {"role": "system", "content": "You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# With do_sample=False the pipeline decodes greedily, so temperature is not used.
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
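
When the pipeline above receives a list of messages, it applies the model's chat template before tokenizing. A hand-written generate loop would have to build that prompt string itself. The following is a minimal sketch of that step, assuming the `<|role|>` / `<|end|>` layout described in the model card's chat format section. `build_phi3_prompt` is a hypothetical helper, and the exact whitespace and the handling of the system role should be checked against the chat template shipped with the tokenizer.

```python
def build_phi3_prompt(messages):
    """Render chat messages in the assumed Phi-3 tag layout and open an assistant turn."""
    parts = []
    for message in messages:
        # Assumed layout: "<|role|>", newline, content, "<|end|>", newline.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_phi3_prompt(
    [{"role": "user", "content": "What about solving a 2x + 3 = 7 equation?"}]
)
print(prompt)
```

The resulting string could then be passed to a tokenizer's encode call, such as the one in the onnxruntime_genai snippet earlier in this thread.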

Kazuki Kyakuno · Answer 9 · Wed Apr 24 2024 20:16:07 GMT+0800 (China Standard Time)

For text generation, something like greedy search should be enough for now.
https://github.com/axinc-ai/ailia-models/blob/master/natural_language_processing/rinna_gpt2/utils_rinna_gpt2.py
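
As a rough illustration of greedy search, the loop below appends the highest-scoring token at each step until an end-of-sequence token or a length limit is reached. This is a sketch under stated assumptions, not the code from the linked file: `greedy_search` and `toy_next_logits` are hypothetical names, and `toy_next_logits` is a stand-in for a real model forward pass (for example an ONNX session returning logits for the last position).

```python
from typing import Callable, List, Sequence


def greedy_search(
    next_logits: Callable[[Sequence[int]], Sequence[float]],
    input_ids: Sequence[int],
    eos_token_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Append the argmax token at each step until EOS or the token limit."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        # Greedy choice: index of the largest logit.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_token_id:
            break
    return ids


def toy_next_logits(ids):
    """Hypothetical model stand-in: always favors (last_id + 1) % 5."""
    vocab_size = 5
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(greedy_search(toy_next_logits, [0], eos_token_id=4))
```

In a real port, `next_logits` would run the model on the current token IDs; this sketch recomputes from the full sequence at every step and leaves out any key/value cache.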

Kazuki Kyakuno · Answer 10 · Wed Apr 24 2024 20:34:24 GMT+0800 (China Standard Time)

It uses LlamaTokenizer.
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/blob/main/tokenizer_config.json

Kazuki Kyakuno · Answer 11 · Wed Apr 24 2024 20:35:18 GMT+0800 (China Standard Time)

LlamaTokenizer
https://github.com/huggingface/transformers/blob/37fa1f654f17b68bbe30440c64e611f1a4d55bc7/src/transformers/models/llama/tokenization_llama.py#L55

Kazuki Kyakuno · Answer 12 · Wed Apr 24 2024 20:36:16 GMT+0800 (China Standard Time)

It looks like a standard SentencePiece tokenizer.
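
If that holds, turning pieces back into text mostly comes down to SentencePiece's whitespace convention: spaces are encoded as the meta symbol "▁" (U+2581). The sketch below shows only that convention. `pieces_to_text` is a hypothetical helper, the example pieces are made up rather than actual Phi-3 tokenizer output, and byte-fallback and special tokens are ignored.

```python
WHITESPACE_MARK = "\u2581"  # "▁", the meta symbol SentencePiece uses for a space


def pieces_to_text(pieces):
    """Join subword pieces and turn the whitespace marker back into spaces."""
    text = "".join(pieces).replace(WHITESPACE_MARK, " ")
    # Drop the single leading space produced by the marker on the first piece.
    return text[1:] if text.startswith(" ") else text


print(pieces_to_text(["▁Hello", "▁wor", "ld", "!"]))
```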