abertsch72 / unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"

TypeError: torch_replacement_knn_gpu() got an unexpected keyword argument 'device'

jordancole21 opened this issue · comments

Hey, it looks like I'm having some issues working with Llama models. This is the modified script I'm using:

!python run_generation.py --model_type llama --model_name_or_path psmathur/orca_mini_3b \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 16 \
    --index_devices 1 --datastore_device 0

But I get this error:

2023-08-14 14:28:33.395015: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
08/14/2023 14:28:35 - WARNING - __main__ - device: cuda, n_gpu: 1, 16-bits training: True
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565, and set the legacy attribute accordingly.
Loading checkpoint shards: 100% 3/3 [00:08<00:00,  2.95s/it]
08/14/2023 14:29:16 - INFO - __main__ - Namespace(model_type='llama', model_name_or_path='psmathur/orca_mini_3b', prompt='example_inputs/harry_potter_full.txt', length=200, num_hidden_layers=None, stop_token=None, temperature=1.0, repetition_penalty=1.0, k=0, p=0.9, prefix='<<SYS>>\\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \\n<</SYS>>\\n\\n [INST] Summarize the following book: ', suffix=' [/INST]', padding_text='', xlm_language='', seed=42, no_cuda=False, stream_output=False, num_return_sequences=1, fp16=True, jit=False, device=device(type='cuda'), n_gpu=1)
08/14/2023 14:29:16 - INFO - Unlimiformer - Encoding 0 to 65 out of 65
Traceback (most recent call last):
  File "/content/unlimiformer/src/run_generation.py", line 577, in <module>
    main()
  File "/content/unlimiformer/src/run_generation.py", line 532, in main
    output_sequences = model.generate(
  File "/content/unlimiformer/src/unlimiformer.py", line 529, in pre_generate_hook
    return self.original_generate_func(input_ids_prefix, **new_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2724, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/unlimiformer/src/unlimiformer.py", line 551, in pre_forward_hook
    result = self.original_forward_func(input_ids=input_ids, labels=labels, attention_mask=attention_mask, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 810, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 698, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 413, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/unlimiformer/src/unlimiformer.py", line 575, in attention_pre_forward_hook
    result = original_cross_attn_forward_func(hidden_states=hidden_states, attention_mask=attention_mask, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 310, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1547, in _call_impl
    hook_result = hook(self, args, result)
  File "/content/unlimiformer/src/unlimiformer.py", line 629, in attention_forward_hook
    _, top_search_key_indices = self.datastore[datastore_index].search(datastore_query, k=topk)
  File "/content/unlimiformer/src/index_building.py", line 34, in search
    scores, values = self.indices[i].search(queries[i], k)
  File "/content/unlimiformer/src/index_building.py", line 144, in search
    scores, values = faiss.knn_gpu(faiss.StandardGpuResources(), queries, self.keys, k, 
TypeError: torch_replacement_knn_gpu() got an unexpected keyword argument 'device'

Any ideas on how to fix that?

Thanks again for all the help and for the new features!

I have the same issue as you; here's my script:
CUDA_VISIBLE_DEVICES=0 python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-7b-chat-hf --prefix "&lt;&lt;SYS&gt;&gt;\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n&lt;&lt;/SYS&gt;&gt;\n\n [INST] Summarize the following book: " --prompt example_inputs/harry_potter_full.txt --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 16

Hi @jordancole21 and @kekekawaii2839 ,
Thank you for your interest in our work!

We developed this with the newest versions of pytorch, transformers, and faiss.

In your case, it seems that the faiss version may be the problem. Can you install faiss-gpu version 1.7.4 from conda, for example https://anaconda.org/conda-forge/faiss-gpu ?

Best,
Uri
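
(A quick way to check which faiss build your runtime is actually importing — just a sketch, relying only on faiss exposing a __version__ string:)

import faiss

# The unexpected 'device' keyword error above points at an older faiss-gpu build;
# per the suggestion above, 1.7.4 from conda-forge is the version to aim for.
print(faiss.__version__)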

Ok, thank you! For some reason it looks like the latest version of faiss I can get in Google Colab is 1.7.2, but I'll see if I can find a way to get 1.7.4 to work!

Thank you! In my case, conda install faiss-gpu from the pytorch channel gives faiss-gpu 1.7.3, but conda install -c conda-forge faiss-gpu gets 1.7.4, and it works!

Great, let us know if you have any questions!

Ok, finally got it to work in Google Colab on an A100 40G. For anyone curious, I used StableBeluga-13B and it took around 9 minutes to get a summary of Harry Potter, which is pretty good, especially since you can't even fit the full book into Claude 100k! I'm thoroughly impressed!

Here is the code I used to get it working in Colab:

First, in order to get the latest version of faiss, you have to upgrade Python to 3.10, since Colab automatically sets it to 3.7:

!wget https://github.com/korakot/kora/releases/download/v0.10/py310.sh
!bash ./py310.sh -b -f -p /usr/local
!python -m ipykernel install --name "py310" --user

Then you'll want to install Miniconda so that you can install faiss using conda.

################################################################################
# INSTALL CONDA ON GOOGLE COLAB
################################################################################
! wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
! chmod +x Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
! bash ./Miniconda3-py310_23.3.1-0-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.10/site-packages/')

Install faiss-gpu using conda-forge:

!conda install -c conda-forge faiss-gpu -y

Then clone the repo in Colab:

!git clone https://github.com/abertsch72/unlimiformer.git

Install the requirements. (In this instance some of these aren't strictly required, but I liked to have them just in case.)

%pip install -r requirements.txt
%pip install -q -U bitsandbytes
%pip install -q -U git+https://github.com/huggingface/transformers.git
%pip install -q -U git+https://github.com/huggingface/peft.git
%pip install -q -U git+https://github.com/huggingface/accelerate.git
%pip install -q datasets
%pip install tensorrt

cd into the src folder in Unlimiformer:

%cd /content/unlimiformer/src

Then you should be good to run the script! Just be sure that --index_devices and --datastore_device are set correctly. In my case I set them to 0.

!python run_generation.py --model_type llama --model_name_or_path stabilityai/StableBeluga-13B \
    --prefix "### System:\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n### User:\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " ### Assistant" --test_unlimiformer --fp16 --length 200 --layer_begin 22 \
    --index_devices 0 --datastore_device 0
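
(If you're not sure which device ids are valid on your runtime, here's a quick sanity check — nothing Unlimiformer-specific, just PyTorch:)

import torch

# --index_devices / --datastore_device take 0-based GPU ids.
# On a single-GPU Colab runtime this prints 1, so 0 is the only valid id.
print(torch.cuda.device_count())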

This worked pretty well after I set the --layer_begin to 22 (a little over half the number of layers in the model). Here's the summary:

Harry Potter is a young boy who discovers he is a wizard, invited to attend the Hogwarts School of Witchcraft and Wizardry. He embarks on an adventure with his friends Ronald Weasley and Hermione Granger to face various challenges and enemies such as Voldemort and Lord Voldemort's supporters. Their journey involves discovering their true identities, unraveling mysteries, and learning valuable lessons about friendship, courage, and the fight against evil.</s>

Thanks again for all the hard work you and your team did @urialon I'm pretty hyped about this!
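
(For reference, the "a little over half the layers" rule of thumb above needs the model's layer count, which you can read from its Hugging Face config. A minimal sketch, assuming the model id is reachable from your environment:)

from transformers import AutoConfig

# Decoder layer count; Llama-2-13B-based models such as StableBeluga-13B have 40 layers,
# so --layer_begin 22 is a little over half, matching the command above.
config = AutoConfig.from_pretrained("stabilityai/StableBeluga-13B")
print(config.num_hidden_layers)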

Cool! I'm using Llama-2-7b-chat-hf on an A100 40G too, and I was trying to figure out how to solve the CUDA Out of Memory error. For me, adding --use_datastore True, --gpu_datastore False, and --gpu_index False lets it handle inputs of around 80k tokens. After using --layer_begin 22 instead of --layer_begin 16, it can handle inputs longer than 130k tokens!
Thanks to all of you!

Awesome! I'm glad to hear it!

Let us know if you have any more questions.

Best,
Uri

Sorry to bother you again. I'm using the command below:

python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --prefix "&amp;lt;&amp;lt;SYS&amp;gt;&amp;gt;\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n&amp;lt;&amp;lt;/SYS&amp;gt;&amp;gt;\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 20 \
    --stream_output --index_devices 1 --datastore_device 1

And the output is very strange:

=== GENERATED SEQUENCE 1 (input length: 131345) ===
||| Harry Potter and the Philosopher's Stone, J.K. Rowling.


He had been famous -- Harry -- since he'd become the
magic's hero... He'd been famous in front of his parents... and behind
them... He was a famous wizard -- that was a pretty
strange and exciting way... they didn't
know, did they?... how
many things... they didn't
-- nor did
Dumbledore. He'd been
several days ago... an' I
acknowledge him... It was the only
thing he didn't
know... Knew about the

So Harry
That night he'd had
already... When
famagic cup -- Harry. His father's

There were -- was in
serv -- points.
You --... first
wonder --
Owl.

Full logs for llama-2-7b:

08/15/2023 22:46:30 - WARNING - __main__ - device: cuda, n_gpu: 2, 16-bits training: True
Using pad_token, but it is not set yet.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:13<00:00,  6.70s/it]
08/15/2023 22:47:42 - INFO - __main__ - Namespace(device=device(type='cuda'), fp16=True, jit=False, k=0, length=200, model_name_or_path='meta-llama/Llama-2-7b-chat-hf', model_type='llama', n_gpu=2, no_cuda=False, num_hidden_layers=None, num_return_sequences=1, p=0.9, padding_text='', prefix='&lt;&lt;SYS&gt;&gt;\\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \\n&lt;&lt;/SYS&gt;&gt;\\n\\n [INST] Summarize the following book: ', prompt='example_inputs/harry_potter_full.txt', repetition_penalty=1.0, seed=42, stop_token=None, stream_output=True, suffix=' [/INST]', temperature=1.0, xlm_language='')
08/15/2023 22:47:43 - INFO - Unlimiformer - Encoding 0 to 4096 out of 131345
08/15/2023 22:47:47 - INFO - Unlimiformer - Encoding 2048 to 6144 out of 131345
...
...
08/15/2023 22:48:42 - INFO - Unlimiformer - Encoding 127249 to 131345 out of 131345
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
miniconda/envs/unlimi/lib/python3.8/site-packages/faiss/contrib/torch_utils.py:44: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  x.storage().data_ptr() + x.storage_offset() * 2)

Also, for stabilityai/StableBeluga-7B, the output is strange too; my command is:

python src/run_generation.py --model_type llama --model_name_or_path stabilityai/StableBeluga-7B \
    --prefix "### System:\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n### User:\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " ### Assistant" --test_unlimiformer --fp16 --length 200 --layer_begin 22 --index_devices 1 --datastore_device 1

Output:

=== GENERATED SEQUENCE 1 (input length: 131310) ===
||| Professor Dumbledore ###</s>

Why does this happen? Is it because the 7B version is less capable than the 13B version, or is there some other reason?

Hi @kekekawaii2839 ,

I'm not sure. We do see that Llama-13B works better, but we also see a large variance when using different values of --layer_begin.

By the way, why are you using this HTML-escaping in the prompt? --prefix "&lt;&lt;SYS&gt;&gt;\n You are a helpful?

Best,
Uri

By the way, why are you using this HTML-escaping in the prompt? --prefix "&lt;&lt;SYS&gt;&gt;\n You are a helpful?

I'm sorry, that's a copy error; I'm actually using the right prompt.

I'm not sure. We do see that LLama-13B works better, but we also see a large variance when using different values of --layer_begin.

I've tried different values of --layer_begin; here are some examples on Llama-2-7b-chat-hf:

--layer_begin 8:
  Harry, you're!

Dumbledore.</s>
=====================================================
--layer_begin 10:
Harry?

"Definitely," said 
... _(I omitted the remaining part since it didn't look like a summary of the book)_
=====================================================
--layer_begin 12:
  However, you have noted that (according to the text) he was the son of a wizard and a witch, which makes him a
 member of the noble and ancient house of Potion. However,
there is little evidence that his parents are able to influence him, and his parents had been attacked and disarmy had been, the -- yes, the Ministry had --"



</s>
=====================================================
--layer_begin 14:
 Harry's father's cloak... "Yes, that's right," said Professor
McGonagall, "he's dead, you see." It was as if
Dumbledore had granted him a little bit of his own house, Gryffindor,



"I think I'll just have a nice cup of tea and then, Professor


"Not a chance," said Dumbledore, who, according to




"I don't know what you're talking about," said McGonagall.
... _(I omitted the remaining part since it didn't look like a summary of the book too)_
=====================================================
--layer_begin 16:
  However, there are no additional chapters in "Harry Potter and the Philosopher's Stone." This book ends with Chapter Seventeen, "The lasting -- "




"</s>
=====================================================
--layer_begin 18:
 Harry's parents had been dead for years, but he never forgot them. He carried a photo of them in his pocket everywhere he went, and it made him feel a bit less alone.



Harry was grateful for the help of his friends, who stood by him
even when everyone
else had given up on him. He had an owlery. It was dark, and
Dark."





"I don't know what I'd --"
... _(I omitted the remaining part since it didn't look like a summary of the book too)_
=====================================================
--layer_begin 20:
 Harry Potter and the Philosopher's Stone, J.K. Rowling.


He stood up and did a little dance, his eyes shining.
This is my birthday present," he said. "What do yeh think?

"Excellent," said Hagrid, and he sounded as though he meant it
He wouldn't mind it

"Not bad," said Hermione, "But that's only
-- opened it --"

"And I'm going ter celebrate," said Ron. "Belt up an --


"I say, Harry -- sit down
-- no -- shut up



"It's not a present -- for you -- he's." said



"Oh, let me, my dear, tuck in."



"Gryffindor!" said Hagrid, and he
--
=====================================================
--layer_begin 22:
 Harry Potter and the Philosopher's Stone, J.K. Rowling.


He'd have/ve died! Ah... Dumbledore gave him the day off fervoldemort! Hagrid, however, was spotted me
Then Harry pictured this, their wardrobes were empty, their trunks were packed... they took their notes
There was a pogramma's up, students, an' it was
"Com -- er, five minutes... I mean ter tell you, he was -- d'yeh want ter
Gryffindor. It's goin' ter be --"
"He'd have/ve died!" He'd -- exams -- he'd -- you know the end o'f course... Every Flavor Beans... his was up, so -- Dumbledore approved -- pro -- you --"

Owl.

It seems the model didn't understand the summarization instruction.

I suggest trying larger values as well
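
(For anyone who wants to try that systematically, here's a rough sketch of a sweep over --layer_begin that shells out to run_generation.py. The flags are just the ones from the commands above; the --prefix/--suffix strings are omitted for brevity, so add yours back before running:)

import subprocess

# Purely illustrative sweep; Llama-2-7B has 32 decoder layers, so these count as "larger" values.
for layer_begin in (20, 24, 28):
    subprocess.run([
        "python", "src/run_generation.py",
        "--model_type", "llama",
        "--model_name_or_path", "meta-llama/Llama-2-7b-chat-hf",
        "--prompt", "example_inputs/harry_potter_full.txt",
        "--test_unlimiformer", "--fp16", "--length", "200",
        "--layer_begin", str(layer_begin),
    ], check=True)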

Thanks, I tried larger values, but unfortunately they didn't work on 7B models :(
I think it's because 7B models can't handle very long inputs like the full version of Harry Potter. Now I'm working with shorter inputs to see whether 7B models can handle them.

And also, @jordancole21 , can you tell me more about running the 13B model on an A100 40G? Every time I run the 13B model using this command on 3 A100 40Gs:

python src/run_generation.py --model_type llama --model_name_or_path stabilityai/StableBeluga-13B \
    --prefix "### System:\nYou are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n\n### User: Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix "\n\n### Assistant:\n" --test_unlimiformer --fp16 --length 200 --layer_begin 22 \
    --index_devices 1 --datastore_device 2 --stream_output

I encounter a CUDA OOM error.

OK, now I'm using a shorter input of about 80k tokens with the 7B model, and here's the result:

This is a journal entry written by Peter Allison, an Englishman who traveled to the Amazon jungle in South America to find a jaguar and learn about the Huaorani people. He spent three weeks in the jungle, during which he saw several species of monkeys, but did not spot a jaguar. He spent two nights and three days at a salt lick, where he observed howler monkeys, capuchin, and spider monkeys, as well as aCaimana (armadillo) after they crossed the river. He was glad for the rain that stopped him from having to sit in his hammock and be annoyed with his guide. The most unique part is the way he described his adventure and how he found joy not from seeing what he did not have. 

I'm very excited about this. Thank you @urialon and your team for this wonderful work and for helping me solve so many problems!
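
(If you want to know roughly how many tokens an input file is before running it, here's a minimal sketch using the model's tokenizer — assuming you have access to the gated meta-llama repo. The count should land in the same ballpark as the "input length" numbers printed in the logs above:)

from transformers import AutoTokenizer

# Tokenize the raw input file and count the resulting ids.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
with open("example_inputs/harry_potter_full.txt") as f:
    text = f.read()
print(len(tokenizer(text).input_ids))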

Amazing @kekekawaii2839 !
Thank you for trying it out, and feel free to continue sharing your findings.

Just for future reference, which command line did you use to generate the last output? (I am curious about the exact model, exact layer_begin, and exact prompt.)

Best,
Uri

Here's the command for the 80k-token input:

python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/1.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 20 \
    --index_devices 1 --datastore_device 2 --stream_output

And surprisingly, I modified the instruction in the prefix a little and used a 135k-token input to test again. Here's the result:

=== GENERATED SEQUENCE 1 (input length: 135830) ===
||| This is a cookbook with recipes for various types of salads. The book is divided into seasons, with a chapter for each season, and each chapter contains a variety of salad recipes, each with a brief description of the dish and an explanation of how to prepare it. Each recipe is accompanied by notes and suggestions for complementary dishes, and wine or beer pairings.

* Winter: Three-Alarm Salad, Galette of Greens and Goat cheese, with Braised Mushrooms and Mustard, Bitter greens, carrot and Pickled Zucchini.
* Spring: Sautéed Chicken, Kitchen Garden Greens, Chèvre, and Chive. Spring Onion Tarts, Toppings, 45
* Summer: Zucchettis, Cauliflower, Ratatelli, Chiveroli, and Chilled, and Lemon Dress.

Amazing! And here's the command for the above:

python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the content of the following book: " \
    --prompt example_inputs/1.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 20 \
    --index_devices 1 --datastore_device 2 --stream_output

But it's weird that the model's output for summarizing Harry Potter is still strange, even though I used the same flags as for the cookbook above:

=== GENERATED SEQUENCE 1 (input length: 131318) ===
||| Harry Potter and the Philosopher's Stone, J.K. Rowling.


He had been famous -- Harry -- since he'd become the
magic's hero... He'd been famous in front of the Muggles, too. 
It was a weird feeling, famous -- being a footloos an' talked to'...' 



They'd worked out how to get past Fluffy without trying

"...Midgit... it was Death E ly... Theoretical..."

(Yes, the model really did output a lot of \n; maybe it's Harry's magic, I guess...)

Hello,
I got this error message while trying to run the prompt from the README file:

python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-7b-chat-hf     --prefix "### System:\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n### User:\n\n [INST] Summarize the following book: "     --prompt example_inputs/harry_potter_full.txt     --suffix " ### Assistant" --test_unlimiformer --fp16 --length 200 --layer_begin 22     --index_devices 0 --datastore_device 0

File "python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 402, in forward
kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
AttributeError: 'list' object has no attribute 'get_usable_length'