helena-intel / test-prompt-generator

Create prompts with a given token length for testing LLMs and other transformers text models.

Test Prompt Generator

Pre-created prompts for popular model architectures are provided in .jsonl files in the prompts directory.

To generate one or a few prompts, or to test the functionality, you can use the Test Prompt Generator Space on Hugging Face.

Install

pip install git+https://github.com/helena-intel/test-prompt-generator.git transformers

Some tokenizers may require additional dependencies, such as sentencepiece or protobuf.

Usage

Specify a tokenizer, and the number of tokens the prompt should have. A prompt will be returned that, when tokenized with the given tokenizer, contains the requested number of tokens.

For tokenizer, use a model_id from the Hugging Face hub, a path to a local file, or one of the preset tokenizers: ['bert', 'blenderbot', 'bloom', 'bloomz', 'chatglm3', 'falcon', 'gemma', 'gpt-neox', 'llama', 'magicoder', 'mistral', 'mpt', 'opt', 'phi-2', 'pythia', 'qwen', 'redpajama', 'roberta', 'starcoder', 't5', 'vicuna', 'zephyr']. The preset tokenizers should work for most models with that architecture, but if you want to be sure, use an exact model_id. This list shows the exact tokenizers used for the presets.

Prompts are generated by truncating a given source text at the requested number of tokens. By default, Alice in Wonderland is used as the source text; you can also provide your own. A prefix can optionally be prepended to the text, to create prompts like "Please summarize the following text: [text]". The prompts are returned by the function/command line app, and can also optionally be saved to a .jsonl file.
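The truncation idea described above can be sketched in a few lines. This is an illustrative sketch, not the library's actual implementation: the real tool uses a Hugging Face tokenizer, while here a simple whitespace split stands in as the "tokenizer" so the example is self-contained. The function name `truncate_to_num_tokens` is hypothetical.

```python
# Illustrative sketch of the truncation approach (NOT the library's code).
# A whitespace split stands in for a real Hugging Face tokenizer.

def truncate_to_num_tokens(source_text, num_tokens, prefix=""):
    """Build a prompt whose token count (per the stand-in tokenizer) is num_tokens."""
    prefix_tokens = prefix.split()
    body_budget = num_tokens - len(prefix_tokens)
    if body_budget <= 0:
        raise ValueError("prefix alone uses up the token budget")
    body_tokens = source_text.split()[:body_budget]
    if len(body_tokens) < body_budget:
        raise ValueError("source text is too short for the requested token count")
    return " ".join(prefix_tokens + body_tokens)


source = "Alice was beginning to get very tired of sitting by her sister on the bank"
prompt = truncate_to_num_tokens(source, num_tokens=8, prefix="Summarize:")
print(prompt)               # Summarize: Alice was beginning to get very tired
print(len(prompt.split()))  # 8
```

With a real subword tokenizer the cut point falls on token boundaries rather than whitespace, but the principle is the same: tokenize, keep the first `num_tokens` tokens (including any prefix), and decode back to text.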

Python API

Basic usage

from test_prompt_generator import generate_prompt

# use preset value for opt tokenizer
prompt = generate_prompt(tokenizer_id="opt", num_tokens=32)
# use model_id
prompt = generate_prompt(tokenizer_id="facebook/opt-2.7b", num_tokens=32)

Slightly less basic usage

Add a source_text_file and prefix. Instead of source_text_file, you can also pass source_text containing a string with the source text.

from test_prompt_generator import generate_prompt

prompt = generate_prompt(
    tokenizer_id="mistral",
    num_tokens=32,
    source_text_file="source.txt",
    prefix="Please translate to Dutch:",
    output_file="prompt_32.jsonl",
)

Use multiple token sizes. When using multiple token sizes, output_file is required, and the generate_prompt function returns a dictionary of prompts instead of a single string. The output_file will contain one line for each token size.

prompts = generate_prompt(
    tokenizer_id="mistral",
    num_tokens=[32, 64, 128],
    output_file="prompts.jsonl",
)

NOTE: When specifying a single token size, the prompt is returned as a string, making it easy to copy and use in a test scenario where you need one prompt. When specifying multiple token sizes, a dictionary with the prompts is returned. The output file is always in .jsonl format, regardless of the number of generated prompts.
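Since the output file is JSON Lines, reading the saved prompts back is straightforward: each line is one JSON object. The helper below parses generically; the field names in the demo record (`prompt`, `token_size`) are illustrative assumptions, not the tool's documented schema, so inspect one line of your own output file to see the actual keys.

```python
import json


def load_jsonl(path):
    """Parse a .jsonl file into a list of dicts, one per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write a small demo file; field names here are assumptions for illustration.
with open("prompts_demo.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"prompt": "Alice was beginning", "token_size": 32}) + "\n")
    f.write(json.dumps({"prompt": "Alice was beginning to get", "token_size": 64}) + "\n")

records = load_jsonl("prompts_demo.jsonl")
print(len(records))  # 2
```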

Command Line App

test-prompt-generator -t mistral -n 32

Use test-prompt-generator --help to see all options:

usage: test-prompt-generator [-h] -t TOKENIZER -n NUM_TOKENS [-p PREFIX] [-o OUTPUT_FILE] [--overwrite] [-v] [-f FILE]

options:
  -h, --help            show this help message and exit
  -t TOKENIZER, --tokenizer TOKENIZER
                        preset tokenizer id, model_id from Hugging Face hub, or path to local directory with tokenizer files. Options for presets are: ['bert', 'bloom', 'gemma', 'chatglm3', 'falcon', 'gpt-neox',
                        'llama', 'magicoder', 'mistral', 'opt', 'phi-2', 'pythia', 'roberta', 'qwen', 'starcoder', 't5']
  -n NUM_TOKENS, --num_tokens NUM_TOKENS
                        Number of tokens the generated prompt should have. To specify multiple token sizes, use e.g. `-n 16 32`
  -p PREFIX, --prefix PREFIX
                        Optional: prefix that the prompt should start with. Example: 'Translate to Dutch:'
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Optional: Path to store the prompt as .jsonl file
  --overwrite           Overwrite output_file if it already exists.
  -v, --verbose
  -f FILE, --file FILE  Optional: path to text file to generate prompts from. Default text_files/alice.txt

Disclaimer

This software is provided "as is" and for testing purposes only. The author makes no warranties, express or implied, regarding the software's operation, accuracy, or reliability.
