BatsResearch / bonito

A lightweight library for generating synthetic instruction tuning datasets for your data without GPT.


What is the bare-bones template for Bonito?

pacozaa opened this issue · comments

I would like to use this model with Ollama or llama.cpp, but first I would like a bare-bones explanation of Bonito's prompt template. Would you mind giving a short explanation?

Ah ha, your paper explains it!

<|tasktype|>
Yes-no question answering
<|context|>
Zinedine Zidane -- After retiring as a player, Zidane
transitioned into coaching, becoming assistant coach at
Real Madrid… after the victory, he resigned as Real
Madrid coach.
<|task|>
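So if I read that right, the prompt is just three tagged sections concatenated, and the model's completion after <|task|> is the generated task. Here is a minimal sketch of how I would build the prompt by hand (build_bonito_prompt is my own hypothetical helper, nothing from the library, just plain string formatting):

def build_bonito_prompt(task_type: str, context: str) -> str:
    # Concatenate the three tagged sections in order:
    # task type, context passage, then the open <|task|> marker.
    return (
        "<|tasktype|>\n" + task_type.strip()
        + "\n<|context|>\n" + context.strip()
        + "\n<|task|>\n"
    )

print(build_bonito_prompt(
    "yes-no question answering",
    "Zinedine Zidane -- After retiring as a player, Zidane "
    "transitioned into coaching...",
))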

I still wouldn't mind more explanation and examples, though.

Cheers!

You are right. We have included the template in the paper. We have also included the preprocessing step in abstract.py. Please look at the following lines of code.

def _prepare_bonito_input(
    self, context_dataset: Dataset, task_type: str, context_col: str, **kwargs
) -> Dataset:
    """
    Prepares the input for the Bonito model.

    This method takes a context dataset, a task type, and a context
    column name, and prepares the dataset for the Bonito model.
    If the task type is not recognized, it raises a ValueError.

    Args:
        context_dataset (Dataset): The dataset that provides the
            context for the task.
        task_type (str): The type of the task. This can be a
            short form or a full form. If the task type is not
            recognized, a ValueError is raised.
        context_col (str): The name of the column in the dataset
            that provides the context for the task.
        **kwargs: Additional keyword arguments.

    Returns:
        Dataset: The prepared dataset for the Bonito model.
    """
    # get the task type name
    if task_type in SHORTFORM_TO_FULL_TASK_TYPES.values():
        full_task_type = task_type
    elif task_type in SHORTFORM_TO_FULL_TASK_TYPES:
        full_task_type = SHORTFORM_TO_FULL_TASK_TYPES[task_type]
    else:
        raise ValueError(f"Task type {task_type} not recognized")

    def process(example):
        input_text = "<|tasktype|>\n" + full_task_type.strip()
        input_text += (
            "\n<|context|>\n" + example[context_col].strip() + "\n<|task|>\n"
        )
        return {
            "input": input_text,
        }

    return context_dataset.map(
        process,
        remove_columns=context_dataset.column_names,
        num_proc=kwargs.get("num_proc", 1),
    )
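On the output side, the model generates the task input and the task output separated by the <|pipe|> marker, and the postprocessing splits that into an (input, output) pair. A rough sketch of that step (split_bonito_generation is a simplified stand-in, not the library's actual API; see the postprocessing in abstract.py for the real thing):

def split_bonito_generation(generation: str):
    # The generated task comes back as "<input><|pipe|><output>";
    # generations without exactly one marker are treated as malformed.
    pair = generation.split("<|pipe|>")
    if len(pair) != 2:
        return None
    return {"input": pair[0].strip(), "output": pair[1].strip()}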

Hope this helps! 😄

In case anyone stumbles on this, here is an Ollama model you can run: https://ollama.com/pacozaa/bonito

And here is an article on quantizing the PyTorch model and converting it to GGUF so it runs on Ollama:
https://medium.com/@sarinsuriyakoon/convert-pytorch-model-to-quantize-gguf-to-run-on-ollama-5c5dbc458208
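If you go the Ollama route, you can send the same prompt to the local REST API. A minimal sketch, assuming Ollama is running on its default port (11434) and you have pulled pacozaa/bonito; the request shape is Ollama's standard /api/generate endpoint, nothing Bonito-specific:

import requests

prompt = (
    "<|tasktype|>\nyes-no question answering\n"
    "<|context|>\nZinedine Zidane -- After retiring as a player, Zidane "
    "transitioned into coaching, becoming assistant coach at Real Madrid...\n"
    "<|task|>\n"
)

# stream=False makes Ollama return a single JSON object instead of a stream.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "pacozaa/bonito", "prompt": prompt, "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])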