OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

Chat example and TinyLlama/TinyLlama-1.1B-Chat-v1.0

AIWintermuteAI opened this issue

The recently released TinyLlama/TinyLlama-1.1B-Chat-v1.0 can be converted (partially successfully) with the following command:

ct2-transformers-converter --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --output_dir tinyllama --quantization int8_float32 --low_cpu_mem_usage

The tokenizer.model file can be taken from https://huggingface.co/TinyLlama/TinyLlama-1.1B-step-50K-105b.
However, during the conversion process I get the following warning:

Some weights of LlamaForCausalLM were not initialized from the model checkpoint at TinyLlama/TinyLlama-1.1B-Chat-v1.0 and are newly initialized: ['model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
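
For what it is worth, the inv_freq warning should be harmless: the rotary-embedding inverse frequencies are not learned weights but are recomputed deterministically from the model config. A minimal sketch of how that buffer is derived (a simplification of the Hugging Face Llama rotary embedding, so the exact names are assumptions):

import torch

# Sketch: the inv_freq buffer depends only on the head dimension and the
# RoPE base (rope_theta), so "newly initialized" values are identical to
# the ones the checkpoint would have contained.
def rotary_inv_freq(head_dim: int, rope_theta: float = 10000.0) -> torch.Tensor:
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (rope_theta ** exponents)

print(rotary_inv_freq(64)[:4])  # first few inverse frequencies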

After conversion, the output is coherent, but the model does not appear to follow instructions, e.g.:

Loading the model...
[2024-01-01 22:28:19.075] [ctranslate2] [thread 708793] [warning] The compute type inferred from the saved model is bfloat16, but the target device or backend do not support efficient bfloat16 computation. The model weights have been automatically converted to use the float32 compute type instead.

You: Hi!

Llama2: 
Always answer with emojis
<</SYS>>

Can you summarize the instructions for creating a Python script that asks for user input and performs a task using the `random` module?

You: Can you summarize the instructions for creating a Python script that asks for user input and performs a task using the `random` module?

Llama2: 

<|assistant|>
To create a Python script that asks for user input and performs a task using the `random` module, follow these instructions:

1. Open a Python console or IDE (Integrated Development Environment).
2. Create a new file with the `.py` extension and name it `random_task.py`.

Do you think this is an issue with the conversion process or with the instruction prompts in chat.py?
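
Note that the stray <</SYS>> and the leaked system prompt above look like the Llama-2 [INST] <<SYS>> template, while TinyLlama-1.1B-Chat-v1.0 appears to be fine-tuned on the Zephyr-style <|system|> / <|user|> / <|assistant|> template (the <|assistant|> tag showing up in the reply points the same way). A minimal sketch of that template, assuming the format described on the TinyLlama model card:

def build_tinyllama_prompt(system: str, user: str) -> str:
    # Hypothetical helper: Zephyr-style template assumed for
    # TinyLlama-1.1B-Chat-v1.0; each turn is closed with the </s> token.
    return (
        f"<|system|>\n{system}</s>\n"
        f"<|user|>\n{user}</s>\n"
        f"<|assistant|>\n"
    )

print(build_tinyllama_prompt("Always answer with emojis", "Hi!"))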

Okay, running this test script produces correct output:

import ctranslate2
import sentencepiece as spm

# Load the converted model and the SentencePiece tokenizer.
generator = ctranslate2.Generator("tinyllama/")
sp = spm.SentencePieceProcessor("tinyllama/tokenizer.model")

prompt = "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes:"
prompt_tokens = sp.encode(prompt, out_type=str)

# Stream generated tokens one step at a time.
step_results = generator.generate_tokens(
    prompt_tokens,
    sampling_temperature=0.8,
    sampling_topk=20,
    max_length=128,
)

output_ids = []

for step_result in step_results:
    # SentencePiece marks the start of a new word with "▁".
    is_new_word = step_result.token.startswith("▁")

    if is_new_word and output_ids:
        # Decode and print the buffered word before starting the next one.
        word = sp.decode(output_ids)
        print(word, end=" ", flush=True)
        output_ids = []

    output_ids.append(step_result.token_id)

# Flush the last buffered word.
if output_ids:
    word = sp.decode(output_ids)
    print(word)

So it is an issue with the prompting in chat.py.
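
If that is the cause, here is a hedged sketch of how the test script above could be adapted for chat, assuming the Zephyr-style template shown earlier (the sampling settings are just placeholders, not recommended values):

import ctranslate2
import sentencepiece as spm

generator = ctranslate2.Generator("tinyllama/")
sp = spm.SentencePieceProcessor("tinyllama/tokenizer.model")

system = "Always answer with emojis"
user = "Hi!"

# Build the token sequence turn by turn, appending </s> as an explicit token,
# since raw SentencePiece will not map the literal string to the EOS piece.
# The role tags are assumed to be plain text in the TinyLlama vocabulary.
prompt_tokens = (
    ["<s>"]
    + sp.encode(f"<|system|>\n{system}", out_type=str) + ["</s>"]
    + sp.encode(f"<|user|>\n{user}", out_type=str) + ["</s>"]
    + sp.encode("<|assistant|>\n", out_type=str)
)

results = generator.generate_tokens(
    prompt_tokens,
    sampling_temperature=0.8,
    sampling_topk=20,
    max_length=256,
)

output_ids = []
for step in results:
    if step.token == "</s>":  # stop if the end-of-sequence token is yielded
        break
    if step.token.startswith("▁") and output_ids:
        print(sp.decode(output_ids), end=" ", flush=True)
        output_ids = []
    output_ids.append(step.token_id)
if output_ids:
    print(sp.decode(output_ids))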