Error when using TinyLlama
trickster opened this issue · comments
TinyLlama uses the same architecture and tokenizer as Llama 2.
When I try to create a serving, I get the following error. Here is the full output of a Livebook:
```elixir
System.put_env("EXLA_TARGET", "cuda120")

Mix.install([
  {:bumblebee, github: "elixir-nx/bumblebee"},
  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
  {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
  {:kino, "~> 0.11.0"}
])

Application.put_env(:exla, :clients,
  cuda: [platform: :cuda, preallocate: false],
  rocm: [platform: :rocm, preallocate: false],
  tpu: [platform: :tpu, preallocate: false],
  host: [platform: :host, preallocate: false]
)

Nx.global_default_backend(EXLA.Backend)
```
```elixir
llama = "TinyLlama/TinyLlama-1.1B-Chat-v0.4"
# llama = "cognitivecomputations/dolphin-llama2-7b"

{:ok, model_info} = Bumblebee.load_model({:hf, llama})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, llama})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, llama})
generation_config = Bumblebee.configure(generation_config, max_new_tokens: 500)
```
```
09:57:51.608 [info] XLA service 0x7f8280020c90 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
09:57:51.608 [info] StreamExecutor device (0): NVIDIA L4, Compute Capability 8.9
09:57:51.609 [info] Using BFC allocator.
09:57:51.609 [info] XLA backend will use up to 21225406464 bytes on device 0 for BFCAllocator.
09:57:51.844 [info] Loaded cuDNN version 8904
09:57:51.858 [info] Using nvlink for parallel linking
```
```elixir
%Bumblebee.Text.GenerationConfig{
  max_new_tokens: 500,
  min_new_tokens: nil,
  max_length: nil,
  min_length: nil,
  strategy: %{type: :greedy_search},
  decoder_start_token_id: nil,
  forced_bos_token_id: nil,
  forced_eos_token_id: nil,
  forced_token_ids: [],
  suppressed_token_ids: [],
  no_repeat_ngram_length: nil,
  temperature: nil,
  bos_token_id: nil,
  eos_token_id: nil,
  pad_token_id: nil,
  extra_config: nil
}
```
```elixir
tiny_llama_serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    preallocate_params: true,
    stream: true,
    defn_options: [debug: true, client: :cuda, compiler: EXLA]
  )
```
```elixir
%Nx.Serving{
  module: Nx.Serving.Default,
  arg: #Function<0.20657473/2 in Bumblebee.Text.TextGeneration.generation/4>,
  client_preprocessing: #Function<1.20657473/1 in Bumblebee.Text.TextGeneration.generation/4>,
  client_postprocessing: #Function<2.20657473/2 in Bumblebee.Text.TextGeneration.maybe_stream/3>,
  streaming: %{hooks: [:token]},
  batch_size: 1,
  distributed_postprocessing: &Function.identity/1,
  process_options: [batch_keys: [sequence_length: 1028]],
  defn_options: [debug: true, client: :cuda, compiler: EXLA]
}
```
```elixir
Kino.start_child({Nx.Serving, name: TinyLlamaServing, serving: tiny_llama_serving})
```

```
{:error,
 {:shutdown,
  {:failed_to_start_child, Nx.Serving,
   {%Protocol.UndefinedError{protocol: Nx.LazyContainer, value: nil, description: ""},
    [
      {Nx.LazyContainer.Atom, :traverse, 3, [file: ~c"lib/nx/lazy_container.ex", line: 91]},
      {Nx, :to_tensor, 1, [file: ~c"lib/nx.ex", line: 2067]},
      {Nx, :broadcast, 3, [file: ~c"lib/nx.ex", line: 3702]},
      {Bumblebee.Text.Generation, :"__defn:init_sequences__", 3,
       [file: ~c"lib/bumblebee/text/generation.ex", line: 469]},
      {Bumblebee.Text.Generation, :"__defn:greedy__", 7,
       [file: ~c"lib/bumblebee/text/generation.ex", line: 419]},
      {Bumblebee.Text.Generation, :"__defn:generate_impl__", 8,
       [file: ~c"lib/bumblebee/text/generation.ex", line: 357]},
      {Nx.Defn.Compiler, :runtime_fun, 3, [file: ~c"lib/nx/defn/compiler.ex", line: 173]},
      {EXLA.Defn, :"-compile/8-fun-3-", 4, [file: ~c"lib/exla/defn.ex", line: 411]}
    ]}}}}
```
```elixir
user_input = Kino.Input.textarea("User prompt", default: "What is love?")
user = Kino.Input.read(user_input)

prompt = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
#{user} [/INST] \
"""

Nx.Serving.batched_run(Llama, prompt) |> Enum.each(&IO.write/1)
```
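As an aside, the serving above was started under the name `TinyLlamaServing`, while the call references `Llama`. Once the serving starts successfully, the call presumably needs the registered name (a minimal sketch):

```elixir
# The serving was registered as TinyLlamaServing via Kino.start_child,
# so batched_run must use that name; Llama is not registered here.
TinyLlamaServing
|> Nx.Serving.batched_run(prompt)
|> Enum.each(&IO.write/1)
```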
I get the `{:error, {:shutdown, {:failed_to_start_child, Nx.Serving, ...}}}` error shown above.
The generation_config.json doesn't have `pad_token_id` nor `eos_token_id`, which should generally be set. The model card says it's a fine-tuned version of TinyLlama/TinyLlama-1.1B-intermediate-step-715k-1.5T, which does have these in the config. You can set these manually:

```elixir
generation_config = Bumblebee.configure(generation_config, pad_token_id: 0, eos_token_id: 1, bos_token_id: 2)
```
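Applied to the notebook above, the load-and-configure step would look something like this (a sketch only; the token ids are taken from the comment above and should be verified against the base model's generation_config.json):

```elixir
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, llama})

# Workaround: the chat model's generation_config.json omits these ids,
# so set them explicitly alongside the other generation options.
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 500,
    pad_token_id: 0,
    eos_token_id: 1,
    bos_token_id: 2
  )
```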
We should have a better error message, so let's keep this open.
Closed in e59bb28.