zetavg / LLaMA-LoRA-Tuner

UI tool for fine-tuning and testing your own LoRA models based on LLaMA, GPT-J and more. One-click run on Google Colab. + A Gradio ChatGPT-like Chat UI to demonstrate your language models.

Use finetuned model for inference programmatically

lighteternal opened this issue · comments

Hi and kudos for this awesome tool! 💯
I have finetuned a model on my own dataset and I can quantitatively assess its performance via the inference tab.
However, I'd prefer to have a script that allows me to use it locally.

Am I right to assume that the generate method in llama_lora/lib/inference.py can be used to load the model and use it for prediction? A snippet/notebook would be extremely helpful!

Many thanks! ❤️

Yes, it's intended to work without dependencies on the UI, as I want to build other kinds of UIs or even a CLI interface. You can try to import and call it and see if it works! Let me know if you encounter any problems and I'll be happy to help.

So, I tried to import and use generate, but I'm not sure what values to pass for most of its arguments:

def generate(
    model,
    tokenizer,
    prompt,
    generation_config,
    max_new_tokens,
    stopping_criteria=[],
    stream_output=False
):

I am confused about whether the model and tokenizer passed to this method should be those of the base LLaMA model or those of the finetuned LoRA model. I assume it's the former, but if so, how/where is the adapter_config loaded?

If you could provide a snippet that, given a prompt and an input, performs inference using the alpaca-lora-7b-yoda-v01 model (non-UI, just printing the output on Colab), it would be super easy for me to adjust it to my own use case.

!git clone https://github.com/zetavg/LLaMA-LoRA-Tuner.git llm_tuner
!cd llm_tuner && git checkout dfc944d
!pip install -r llm_tuner/requirements.lock.txt

from transformers import AutoModelForCausalLM, LlamaTokenizer
from peft import PeftModel

tokenizer = LlamaTokenizer.from_pretrained('decapoda-research/llama-7b-hf')
model = AutoModelForCausalLM.from_pretrained('decapoda-research/llama-7b-hf', load_in_8bit=True, device_map={'': 0})
model = PeftModel.from_pretrained(model, 'zetavg/alpaca-lora-7b-yoda-v01', device_map={'': 0})

from transformers import GenerationConfig
from llm_tuner.llama_lora.lib.inference import generate

generation_config = GenerationConfig(
    temperature=0.7,
    top_p=0.75,
    top_k=10,
    repetition_penalty=1.8,
    num_beams=2,
    do_sample=True,
)

output = next(generate(model, tokenizer, "### Human:\nWho is Yoda?\n\n### AI:\n", generation_config, 128))
print(output)
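
Note that generate is a generator. Depending on the version, each yield may be a plain decoded string or a tuple such as (decoded_output, output, completed), as in the UI code further down this thread, so a slightly more defensive sketch (same model, tokenizer, and generation_config as above) is to consume all yields and inspect the last one:

# Hedged sketch: iterate the generator and keep the last yield. Depending on the
# version, `result` may be a plain string or a tuple like
# (decoded_output, output, completed); inspect it before unpacking.
result = None
for result in generate(model, tokenizer, "### Human:\nWho is Yoda?\n\n### AI:\n",
                       generation_config, 128):
    pass
print(result)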

This is really helpful. Thanks!

I am able to load the model without issues. However, I notice a big difference in the results compared to those presented in the UI. More precisely, the UI outputs are almost always correct, while the console/Colab outputs are mostly wrong.

My assumption is that the input to the generate function is not being parsed correctly. To make your code work in my case, I replaced the following line:

output = next(generate(model, tokenizer, "### Human:\nWho is Yoda?\n\n### AI:\n", generation_config, 128))

with

output = next(generate(model, tokenizer, input, generation_config, 128))

and my input variable contained the entire instruction+input as shown in the Preview window in Gradio. For example:

input = '''
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Given the following text, discover any automobile brands that are mentioned in it:

### Input:
With a low starting price, the 2023 2-series Gran Coupe plants its flag on the affordable end of the BMW lineup but it lacks the harmonious feel of its stablemates.

### Response:
'''

In the Gradio GUI I get the response [BMW], which is consistent with my training data. However, in Colab I get either an inf error, a more verbose response (like "2023 2-series Gran Coupe"), or 'No brand mentioned'.

I tested both the Gradio and Colab implementations with the same generation config. I also tested with temp = 0, but the difference in outputs remains.

Another possible explanation is that the adapter weights are not loaded correctly (?), but that seems unlikely.

Am I missing something?

Edit:
I forgot to mention that in the Colab version, the print(output) command prints the whole input + output text, like this:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Given the following text, discover any automobile brands that are mentioned in it:

### Input:
With a low starting price, the 2023 2-series Gran Coupe plants its flag on the affordable end of the BMW lineup but it lacks the harmonious feel of its stablemates.

### Response:
2023 2-series Gran Coupe

while the Gradio one just provides the response in brackets, as intended.
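
One way to keep only the completion in the script version, under the assumption that the decoded text begins with the prompt verbatim, is to strip the prompt prefix off the decoded string. A minimal sketch using the input variable from above:

# Minimal sketch. Assumptions: the decoded output starts with the prompt
# verbatim, and a tuple yield puts the decoded text first.
decoded = output if isinstance(output, str) else output[0]
response = decoded[len(input):].lstrip() if decoded.startswith(input) else decoded
print(response)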

Try setting do_sample=False in GenerationConfig and removing temperature, top_k, top_p. You can check https://github.com/zetavg/LLaMA-LoRA-Tuner/blob/fcc807e/llama_lora/ui/inference_ui.py#L77

Or maybe it has something to do with load_in_8bit=True. That won't be necessary if you're using an A100 GPU.
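
For reference, a sketch of loading the base model and the adapter in fp16 without 8-bit quantization (assuming the GPU has enough memory, e.g. an A100; model names taken from the earlier snippet):

import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer
from peft import PeftModel

# Load the base model in fp16 instead of 8-bit (assumes enough GPU memory).
tokenizer = LlamaTokenizer.from_pretrained('decapoda-research/llama-7b-hf')
model = AutoModelForCausalLM.from_pretrained(
    'decapoda-research/llama-7b-hf',
    torch_dtype=torch.float16,
    device_map={'': 0},
)
model = PeftModel.from_pretrained(model, 'zetavg/alpaca-lora-7b-yoda-v01', device_map={'': 0})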

Unfortunately, this doesn't seem to work.
This is my current snippet:

from transformers import AutoModelForCausalLM, LlamaTokenizer
from peft import PeftModel
import torch

tokenizer = LlamaTokenizer.from_pretrained('decapoda-research/llama-7b-hf')
model = AutoModelForCausalLM.from_pretrained('decapoda-research/llama-7b-hf', torch_dtype = torch.float16, load_in_8bit=True, device_map={'': 0})
model = PeftModel.from_pretrained(model, '/content/drive/MyDrive/Colab Data/LLaMA-LoRA Tuner/lora_models/custom-model', device_map={'': 0})

from transformers import GenerationConfig
from llm_tuner.llama_lora.lib.inference import generate

generation_config = GenerationConfig(
            num_beams=2,
            do_sample=False
)

doc = "<...>"
instruction = "<...>"

prompt_input = f"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{doc}\n\n### Response:\n"

output = next(generate(model, tokenizer, prompt_input, generation_config, 128))

The model hallucinates and/or parrots back the input text. It almost never matches the answer from the Gradio UI.

😢 I have no clue for now. Maybe you can try overriding the generate function to intercept the arguments passed to it via the Gradio UI and see if there are any differences. This can be done in the Colab notebook by executing the following code right before launching the UI (after initialize_global()):

from llm_tuner.llama_lora.globals import Global
from llm_tuner.llama_lora.lib.inference import generate

def custom_inference_generate_fn(**kwargs):
    print('Args for generate:', kwargs)
    for output in generate(**kwargs):
        yield output

Global.inference_generate_fn = custom_inference_generate_fn

(llm_tuner might be llama_lora_tuner or llama_lora, depending on how the git clone was done)
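
A small, hypothetical variation of the same idea also stashes the intercepted arguments, so they can be compared afterwards against the ones used in the standalone Colab script:

from llm_tuner.llama_lora.globals import Global
from llm_tuner.llama_lora.lib.inference import generate

# Hypothetical variation: keep every intercepted argument set around for later
# comparison with the standalone script.
captured_calls = []

def custom_inference_generate_fn(**kwargs):
    captured_calls.append(kwargs)
    print('Args for generate:', kwargs)
    yield from generate(**kwargs)

Global.inference_generate_fn = custom_inference_generate_fn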

I've just cannibalised inference_ui to do exactly this. Not sure how much of this is actually necessary, but at least it works:

def prep_base_model():
    # @title Load the App (set config, prepare data dir, load base model)

    # @markdown For a LLaMA-7B model, it will take about ~5m to load for the first execution,
    # @markdown including download. Subsequent executions will take about 2m to load.

    base_model = "eachadea/vicuna-7b-1.1"  # @param {type:"string"}

    # Set Configs
    from llama_lora.config import Config, process_config
    from llama_lora.globals import initialize_global
    Config.default_base_model_name = base_model
    Config.base_model_choices = [base_model]
    data_dir_realpath = !realpath ./data
    Config.data_dir = data_dir_realpath[0]
    Config.load_8bit = True
    process_config()
    initialize_global()

    # Prepare Data Dir
    from llama_lora.utils.data import init_data_dir
    init_data_dir()

    # Load the Base Model
    from llama_lora.models import prepare_base_model
    prepare_base_model()


def run_tests():
    from llama_lora.utils.prompter import Prompter
    from transformers import GenerationConfig

    from llama_lora.config import Config
    from llama_lora.globals import Global
    from llama_lora.models import get_model, get_tokenizer, get_device

    lora_model_name = "addition_fullrange_lefttoright_count_1.1_best"
    prompt_template = "alpaca"
    temperature = 0
    top_p = 0.5
    top_k = 0.5
    repetition_penalty = 0.01
    num_beams = 2
    max_new_tokens = 128
    stream_output = False

    print('>>>>>')
    base_model_name = Global.base_model_name
    print(base_model_name)

    def generate_text(model, tokenizer, instruction, temperature, top_p, top_k, repetition_penalty, num_beams):
        variable_0 = instruction
        variable_1 = None
        variable_2 = None
        variable_3 = None
        variable_4 = None
        variable_5 = None
        variable_6 = None
        variable_7 = None

        variables = [variable_0, variable_1, variable_2, variable_3,
                     variable_4, variable_5, variable_6, variable_7]
        prompter = Prompter(prompt_template)
        prompt = prompter.generate_prompt(variables)

        generation_config = GenerationConfig(
            # to avoid ValueError('`temperature` has to be a strictly positive float, but is 2')
            temperature=float(temperature),
            top_p=top_p,
            top_k=top_k,
            repetition_penalty=repetition_penalty,
            num_beams=num_beams,
            # https://github.com/huggingface/transformers/issues/22405#issuecomment-1485527953
            do_sample=temperature > 0,
        )

        def ui_generation_stopping_criteria(input_ids, score, **kwargs):
            if Global.should_stop_generating:
                return True
            return False

        Global.should_stop_generating = False

        generation_args = {
            'model': model,
            'tokenizer': tokenizer,
            'prompt': prompt,
            'generation_config': generation_config,
            'max_new_tokens': max_new_tokens,
            'stopping_criteria': [ui_generation_stopping_criteria],
            'stream_output': stream_output
        }

        for (decoded_output, output, completed) in Global.inference_generate_fn(**generation_args):
            response = prompter.get_response(decoded_output)

        return response

    tokenizer = get_tokenizer(base_model_name)
    model = get_model(base_model_name, lora_model_name)

    instruction = "What is the capital of Mars?"
    try:
        result = generate_text(model, tokenizer, instruction, temperature, top_p, top_k, repetition_penalty, num_beams)
        print(result)
    except Exception as e:
        print(e)


prep_base_model()
run_tests()
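
For readers who want just the core of the script above, here is a condensed sketch. It assumes prep_base_model() has already been executed, and the LoRA model name is a placeholder:

# Condensed sketch of the script above. Assumes prep_base_model() has run and
# that 'my-lora-model' is a LoRA model available to the tuner (placeholder name).
from transformers import GenerationConfig
from llama_lora.globals import Global
from llama_lora.models import get_model, get_tokenizer
from llama_lora.utils.prompter import Prompter

tokenizer = get_tokenizer(Global.base_model_name)
model = get_model(Global.base_model_name, 'my-lora-model')

prompter = Prompter('alpaca')
prompt = prompter.generate_prompt(['What is the capital of Mars?'] + [None] * 7)

generation_config = GenerationConfig(num_beams=2, do_sample=False)

response = None
for decoded_output, _output, _completed in Global.inference_generate_fn(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    generation_config=generation_config,
    max_new_tokens=128,
    stopping_criteria=[],
    stream_output=False,
):
    response = prompter.get_response(decoded_output)

print(response)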

Oh -- there's also an API that gets spun up with the Gradio UI. Scroll to the bottom of the page for the link.

[screenshot of the API link at the bottom of the Gradio page]
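
As a rough, hypothetical sketch of calling that API from a script (the exact endpoint names and parameters depend on the app version, so check the "Use via API" page of your running instance first):

# Hypothetical sketch: inspect the API of a running instance with gradio_client.
# The URL is an assumption; use the one your Gradio app prints on startup.
from gradio_client import Client

client = Client('http://127.0.0.1:7860')
client.view_api()  # lists the available endpoints and their expected arguments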

Thank you all! For some weird reason, the newly trained model (with all q, k, v, o target modules trained) performs as intended with the generate function. Maybe the previous one had a wrong config that resulted in only the base model being loaded.

I probably also have to tweak the output a bit so that it contains only the model response (rather than the instruction and input as well).