Error when using TinyLlama
trickster opened this issue · comments
TinyLlama uses the same architecture and tokenizer as Llama 2.
When I try to create a serving, I get the following error. Here is the full output of a Livebook:
```elixir
System.put_env("EXLA_TARGET", "cuda120")

Mix.install([
  {:bumblebee, github: "elixir-nx/bumblebee"},
  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
  {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
  {:kino, "~> 0.11.0"}
])

Application.put_env(:exla, :clients,
  cuda: [platform: :cuda, preallocate: false],
  rocm: [platform: :rocm, preallocate: false],
  tpu: [platform: :tpu, preallocate: false],
  host: [platform: :host, preallocate: false]
)

Nx.global_default_backend(EXLA.Backend)
```
```elixir
llama = "TinyLlama/TinyLlama-1.1B-Chat-v0.4"
# llama = "cognitivecomputations/dolphin-llama2-7b"

{:ok, model_info} = Bumblebee.load_model({:hf, llama})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, llama})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, llama})
generation_config = Bumblebee.configure(generation_config, max_new_tokens: 500)
```
```
09:57:51.608 [info] XLA service 0x7f8280020c90 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
09:57:51.608 [info] StreamExecutor device (0): NVIDIA L4, Compute Capability 8.9
09:57:51.609 [info] Using BFC allocator.
09:57:51.609 [info] XLA backend will use up to 21225406464 bytes on device 0 for BFCAllocator.
09:57:51.844 [info] Loaded cuDNN version 8904
09:57:51.858 [info] Using nvlink for parallel linking
```
```elixir
%Bumblebee.Text.GenerationConfig{
  max_new_tokens: 500,
  min_new_tokens: nil,
  max_length: nil,
  min_length: nil,
  strategy: %{type: :greedy_search},
  decoder_start_token_id: nil,
  forced_bos_token_id: nil,
  forced_eos_token_id: nil,
  forced_token_ids: [],
  suppressed_token_ids: [],
  no_repeat_ngram_length: nil,
  temperature: nil,
  bos_token_id: nil,
  eos_token_id: nil,
  pad_token_id: nil,
  extra_config: nil
}
```
```elixir
tiny_llama_serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    preallocate_params: true,
    stream: true,
    defn_options: [debug: true, client: :cuda, compiler: EXLA]
  )
```
```elixir
%Nx.Serving{
  module: Nx.Serving.Default,
  arg: #Function<0.20657473/2 in Bumblebee.Text.TextGeneration.generation/4>,
  client_preprocessing: #Function<1.20657473/1 in Bumblebee.Text.TextGeneration.generation/4>,
  client_postprocessing: #Function<2.20657473/2 in Bumblebee.Text.TextGeneration.maybe_stream/3>,
  streaming: %{hooks: [:token]},
  batch_size: 1,
  distributed_postprocessing: &Function.identity/1,
  process_options: [batch_keys: [sequence_length: 1028]],
  defn_options: [debug: true, client: :cuda, compiler: EXLA]
}
```
```elixir
Kino.start_child({Nx.Serving, name: TinyLlamaServing, serving: tiny_llama_serving})
```

```
{:error,
 {:shutdown,
  {:failed_to_start_child, Nx.Serving,
   {%Protocol.UndefinedError{protocol: Nx.LazyContainer, value: nil, description: ""},
    [
      {Nx.LazyContainer.Atom, :traverse, 3, [file: ~c"lib/nx/lazy_container.ex", line: 91]},
      {Nx, :to_tensor, 1, [file: ~c"lib/nx.ex", line: 2067]},
      {Nx, :broadcast, 3, [file: ~c"lib/nx.ex", line: 3702]},
      {Bumblebee.Text.Generation, :"__defn:init_sequences__", 3,
       [file: ~c"lib/bumblebee/text/generation.ex", line: 469]},
      {Bumblebee.Text.Generation, :"__defn:greedy__", 7,
       [file: ~c"lib/bumblebee/text/generation.ex", line: 419]},
      {Bumblebee.Text.Generation, :"__defn:generate_impl__", 8,
       [file: ~c"lib/bumblebee/text/generation.ex", line: 357]},
      {Nx.Defn.Compiler, :runtime_fun, 3, [file: ~c"lib/nx/defn/compiler.ex", line: 173]},
      {EXLA.Defn, :"-compile/8-fun-3-", 4, [file: ~c"lib/exla/defn.ex", line: 411]}
    ]}}}}
```
```elixir
user_input = Kino.Input.textarea("User prompt", default: "What is love?")
user = Kino.Input.read(user_input)

prompt = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
#{user} [/INST] \
"""

Nx.Serving.batched_run(Llama, prompt) |> Enum.each(&IO.write/1)
```
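As an aside, the serving above was started under the name `TinyLlamaServing`, while the call references `Llama`. Once the serving starts successfully, the call presumably needs the registered name (a minimal sketch):

```elixir
# The serving was registered as TinyLlamaServing via Kino.start_child,
# so batched_run must use that name; Llama is not registered here.
TinyLlamaServing
|> Nx.Serving.batched_run(prompt)
|> Enum.each(&IO.write/1)
```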
I get the `{:error, {:shutdown, {:failed_to_start_child, Nx.Serving, ...}}}` error shown above.
The generation_config.json doesn't have `pad_token_id` nor `eos_token_id`, which should generally be set. The model card says it's a fine-tuned version of TinyLlama/TinyLlama-1.1B-intermediate-step-715k-1.5T, which does have these in the config. You can set these manually:

```elixir
generation_config = Bumblebee.configure(generation_config, pad_token_id: 0, eos_token_id: 1, bos_token_id: 2)
```
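Applied to the notebook above, the load-and-configure step would look something like this (a sketch only; the token ids are taken from the comment above and should be verified against the base model's generation_config.json):

```elixir
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, llama})

# Workaround: the chat model's generation_config.json omits these ids,
# so set them explicitly alongside the other generation options.
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 500,
    pad_token_id: 0,
    eos_token_id: 1,
    bos_token_id: 2
  )
```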
We should have a better error message, so let's keep this open.
Closed in e59bb28.