Use finetuned model for inference programmatically
lighteternal opened this issue
Hi and kudos for this awesome tool! 💯
I have finetuned a model on my own dataset and I can quantitatively assess its performance via the inference tab.
However, I'd prefer to have a script that allows me to use it locally.
Am I right to assume that the generate method in llama_lora/lib/inference.py can be used to load the model and use it for prediction? A snippet/notebook would be extremely helpful!
Many thanks! ❤️
Yes, it's assumed to work without dependencies on the UI, as I want to build other kinds of UI or even a CLI interface. You can try to import and call it and see if it works! Let me know if you encounter any problems and I'll be happy to help.
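Something like this should work (a sketch; the package prefix assumes the repo was cloned into a directory named llm_tuner, so adjust it to your clone directory):

# Import the UI-independent generate function.
# Assumption: the repo was cloned as "llm_tuner".
from llm_tuner.llama_lora.lib.inference import generate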
So, I tried to import and use generate, but I am missing the values of most of its arguments:
def generate(
    model,
    tokenizer,
    prompt,
    generation_config,
    max_new_tokens,
    stopping_criteria=[],
    stream_output=False
):
I am confused about whether the model and tokenizer passed to this method should be the base LLaMA ones or the finetuned LoRA ones. I assume it's the former, but if so, how/where is the adapter_config loaded?
If you could provide a snippet that, given a prompt and an input, performs inference using the alpaca-lora-7b-yoda-v01 model (non-UI, just print the output on Colab), it would be super easy for me to adjust it to my own use case.
!git clone https://github.com/zetavg/LLaMA-LoRA-Tuner.git llm_tuner
!cd llm_tuner && git checkout dfc944d
!pip install -r llm_tuner/requirements.lock.txt
from transformers import AutoModelForCausalLM, LlamaTokenizer
from peft import PeftModel

# Load the base model and tokenizer.
tokenizer = LlamaTokenizer.from_pretrained('decapoda-research/llama-7b-hf')
model = AutoModelForCausalLM.from_pretrained('decapoda-research/llama-7b-hf', load_in_8bit=True, device_map={'': 0})

# Apply the LoRA adapter; this reads adapter_config.json and the adapter
# weights from the given repo, so there is no need to load them manually.
model = PeftModel.from_pretrained(model, 'zetavg/alpaca-lora-7b-yoda-v01', device_map={'': 0})
from transformers import GenerationConfig
from llm_tuner.llama_lora.lib.inference import generate
generation_config = GenerationConfig(
    temperature=0.7,
    top_p=0.75,
    top_k=10,
    repetition_penalty=1.8,
    num_beams=2,
    do_sample=True,
)
output = next(generate(model, tokenizer, "### Human:\nWho is Yoda?\n\n### AI:\n", generation_config, 128))
print(output)
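Note that generate is a generator, so next(...) takes its first yield. With stream_output=True it should instead yield progressively longer outputs; a sketch (depending on the version, each yield may be the decoded text or a (decoded_output, output, completed) tuple, as in the UI code further below):

# Sketch: consume streamed partial outputs instead of a single final one.
for output in generate(model, tokenizer, "### Human:\nWho is Yoda?\n\n### AI:\n",
                       generation_config, 128, stream_output=True):
    print(output)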
This is really helpful. Thanks!
I am able to load the model without issues. However, I notice a big difference in the results compared to those presented in the UI. More precisely, the UI outputs are almost always correct, while the console/Colab outputs are mostly wrong.
My assumption is that the input to the generate function is not correctly parsed. To make your code work in my case, I replaced the following line:
output = next(generate(model, tokenizer, "### Human:\nWho is Yoda?\n\n### AI:\n", generation_config, 128))
with
output = next(generate(model, tokenizer, input, generation_config, 128))
and my input variable contained the entire instruction+input as shown in the Preview window in Gradio. For example:
input = '''
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Given the following text, discover any automobile brands that are mentioned in it:
### Input:
With a low starting price, the 2023 2-series Gran Coupe plants its flag on the affordable end of the BMW lineup but it lacks the harmonious feel of its stablemates.
### Response:
'''
In the Gradio GUI I get the response [BMW], which is consistent with my training data. However, in Colab I get either an inf error, a more verbose response (like 2023 2-series Gran Coupe), or 'No brand mentioned'.
I tested both the Gradio and Colab implementations with the same generation config. I also tested with temperature = 0, but the difference in outputs remains.
Another possible explanation is that the adapter weights are not loaded correctly (?), but that seems unlikely.
Am I missing something?
Edit: I forgot to mention that in the Colab version, the print(output) command prints the whole input + output text, like this:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Given the following text, discover any automobile brands that are mentioned in it:
### Input:
With a low starting price, the 2023 2-series Gran Coupe plants its flag on the affordable end of the BMW lineup but it lacks the harmonious feel of its stablemates.
### Response:
2023 2-series Gran Coupe
while the Gradio one just provides the response in brackets, as intended.
Try setting do_sample=False in GenerationConfig and removing temperature, top_k, and top_p. You can check https://github.com/zetavg/LLaMA-LoRA-Tuner/blob/fcc807e/llama_lora/ui/inference_ui.py#L77
Or maybe it has something to do with load_in_8bit=True. That won't be necessary if you're using an A100 GPU.
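For example, loading the base model in fp16 instead (a sketch; a 7B model needs roughly 14 GB of VRAM in fp16, which an A100 fits easily):

# Sketch: load the base model in fp16 instead of 8-bit quantization.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'decapoda-research/llama-7b-hf',
    torch_dtype=torch.float16,
    device_map={'': 0},
)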
Unfortunately, this doesn't seem to work.
This is my current snippet:
from transformers import AutoModelForCausalLM, LlamaTokenizer
from peft import PeftModel
import torch

tokenizer = LlamaTokenizer.from_pretrained('decapoda-research/llama-7b-hf')
model = AutoModelForCausalLM.from_pretrained('decapoda-research/llama-7b-hf', torch_dtype=torch.float16, load_in_8bit=True, device_map={'': 0})
model = PeftModel.from_pretrained(model, '/content/drive/MyDrive/Colab Data/LLaMA-LoRA Tuner/lora_models/custom-model', device_map={'': 0})
from transformers import GenerationConfig
from llm_tuner.llama_lora.lib.inference import generate
generation_config = GenerationConfig(
    num_beams=2,
    do_sample=False
)
doc = "<...>"
instruction = "<...>"
prompt_input = f"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{doc}\n\n### Response:\n"
output = next(generate(model, tokenizer, prompt_input, generation_config, 128))
The model hallucinates and/or rambles on, repeating the input text. It almost never matches the answer from the Gradio UI.
😢 I have no clue for now. Maybe you can try overriding the generate function and intercepting the arguments passed to it via the Gradio UI to see if there are any differences. It can be done in the Colab notebook by executing this code right before launching the UI (after initialize_global()):
from llm_tuner.llama_lora.globals import Global
from llm_tuner.llama_lora.lib.inference import generate
def custom_inference_generate_fn(**kwargs):
    print('Args for generate:', kwargs)
    for output in generate(**kwargs):
        yield output

Global.inference_generate_fn = custom_inference_generate_fn
(llm_tuner might be llama_lora_tuner or llama_lora based on how git clone is done)
I've just cannibalised inference_ui to do exactly this. Not sure how much of this is actually necessary, but at least it works:
def prep_base_model():
    # @title Load the App (set config, prepare data dir, load base model)
    # @markdown For a LLaMA-7B model, it will take about ~5m to load for the first execution,
    # @markdown including download. Subsequent executions will take about 2m to load.
    base_model = "eachadea/vicuna-7b-1.1"  # @param {type:"string"}

    # Set Configs
    from llama_lora.config import Config, process_config
    from llama_lora.globals import initialize_global
    Config.default_base_model_name = base_model
    Config.base_model_choices = [base_model]
    data_dir_realpath = !realpath ./data  # IPython shell syntax (works in Colab/Jupyter)
    Config.data_dir = data_dir_realpath[0]
    Config.load_8bit = True
    process_config()
    initialize_global()

    # Prepare Data Dir
    from llama_lora.utils.data import init_data_dir
    init_data_dir()

    # Load the Base Model
    from llama_lora.models import prepare_base_model
    prepare_base_model()


def run_tests():
    from llama_lora.utils.prompter import Prompter
    from transformers import GenerationConfig
    from llama_lora.config import Config
    from llama_lora.globals import Global
    from llama_lora.models import get_model, get_tokenizer, get_device

    lora_model_name = "addition_fullrange_lefttoright_count_1.1_best"
    prompt_template = "alpaca"
    temperature = 0
    top_p = 0.5  # unused: temperature == 0 makes do_sample False below
    top_k = 0.5  # unused for the same reason
    repetition_penalty = 0.01
    num_beams = 2
    max_new_tokens = 128
    stream_output = False

    print('>>>>>')
    base_model_name = Global.base_model_name
    print(base_model_name)

    def generate_text(model, tokenizer, instruction, temperature, top_p, top_k, repetition_penalty, num_beams):
        variable_0 = instruction
        variable_1 = None
        variable_2 = None
        variable_3 = None
        variable_4 = None
        variable_5 = None
        variable_6 = None
        variable_7 = None
        variables = [variable_0, variable_1, variable_2, variable_3,
                     variable_4, variable_5, variable_6, variable_7]
        prompter = Prompter(prompt_template)
        prompt = prompter.generate_prompt(variables)

        generation_config = GenerationConfig(
            # to avoid ValueError('`temperature` has to be a strictly positive float, but is 2')
            temperature=float(temperature),
            top_p=top_p,
            top_k=top_k,
            repetition_penalty=repetition_penalty,
            num_beams=num_beams,
            # https://github.com/huggingface/transformers/issues/22405#issuecomment-1485527953
            do_sample=temperature > 0,
        )

        def ui_generation_stopping_criteria(input_ids, score, **kwargs):
            if Global.should_stop_generating:
                return True
            return False

        Global.should_stop_generating = False

        generation_args = {
            'model': model,
            'tokenizer': tokenizer,
            'prompt': prompt,
            'generation_config': generation_config,
            'max_new_tokens': max_new_tokens,
            'stopping_criteria': [ui_generation_stopping_criteria],
            'stream_output': stream_output
        }

        for (decoded_output, output, completed) in Global.inference_generate_fn(**generation_args):
            # With stream_output=False the generator yields once; strip the
            # prompt from the decoded output and return just the response.
            response = prompter.get_response(decoded_output)
            return response

    tokenizer = get_tokenizer(base_model_name)
    model = get_model(base_model_name, lora_model_name)

    instruction = "What is the capital of Mars?"
    try:
        result = generate_text(model, tokenizer, instruction, temperature, top_p, top_k, repetition_penalty, num_beams)
        print(result)
    except Exception as e:
        print(e)


prep_base_model()
run_tests()
Oh -- there's also an API that gets spun up with the Gradio UI. Scroll to the bottom of the page for the link.
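A sketch of how that might look with the gradio_client package; the URL, endpoint name, and argument order here are placeholders, so check the linked API page for the real signature:

# Sketch: call the inference endpoint exposed by the running Gradio app.
# The URL and api_name below are placeholders; the "Use via API" page
# at the bottom of the UI lists the actual endpoint and arguments.
from gradio_client import Client

client = Client('http://127.0.0.1:7860/')
result = client.predict(
    'What is the capital of Mars?',  # placeholder argument
    api_name='/inference',           # placeholder endpoint name
)
print(result)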
Thank you all! For some weird reason, the newly trained model (with all q,k,v,o target modules trained) performs as intended using the generate function. Maybe I had a wrong config in the previous one that led to loading the base model only.
I probably also have to tweak the output a bit so that it only contains the model response (rather than the instruction and input as well).
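One way to do that, mirroring what the UI does in the script above, is to pass the decoded output through Prompter.get_response; a plain string split is a rough fallback that assumes the Alpaca template:

# Strip the prompt from the decoded output, as the UI does.
# (The import path may need the clone-directory prefix, e.g. llm_tuner.)
from llama_lora.utils.prompter import Prompter

prompter = Prompter('alpaca')
response = prompter.get_response(decoded_output)  # decoded_output from generate

# Rough fallback without Prompter (assumes the Alpaca "### Response:" marker):
response = decoded_output.split('### Response:')[-1].strip()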