ComfyUI ExLlama Nodes

A simple local text generator for ComfyUI using ExLlamaV2.

Installation

Clone the repository to custom_nodes and install the requirements:

git clone https://github.com/Zuellni/ComfyUI-ExLlama-Nodes custom_nodes/ComfyUI-ExLlamaV2-Nodes
pip install -r custom_nodes/ComfyUI-ExLlamaV2-Nodes/requirements.txt
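If you run the portable ComfyUI build on Windows, install the requirements with the embedded interpreter instead. The path below assumes the default portable layout and is run from the portable root:

.\python_embeded\python.exe -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-ExLlamaV2-Nodes\requirements.txt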

On Windows, use prebuilt wheels for ExLlamaV2 and FlashAttention:

pip install exllamav2-X.X.X+cuXXX.torch2.X.X-cp3XX-cp3XX-win_amd64.whl
pip install flash_attn-X.X.X+cuXXX.torch2.X.X-cp3XX-cp3XX-win_amd64.whl

Usage

Only EXL2, 4-bit GPTQ and FP16 models are supported. You can find them on Hugging Face.

To use a model with the nodes, either clone its repository with git or manually download all of its files into a folder under models/llm. For example, to download the 4-bit Llama-3.1-8B-Instruct, run:

git lfs install
git clone https://huggingface.co/turboderp/Llama-3.1-8B-Instruct-exl2 -b 4.0bpw models/llm/Llama-3.1-8B-Instruct-exl2-4.0bpw
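If you prefer Python over git, the same files can be fetched with huggingface_hub. A minimal sketch, assuming the package is installed; the target folder mirrors the git example above:

from huggingface_hub import snapshot_download

# Download the 4.0bpw revision of the EXL2 quant into models/llm.
snapshot_download(
    repo_id="turboderp/Llama-3.1-8B-Instruct-exl2",
    revision="4.0bpw",
    local_dir="models/llm/Llama-3.1-8B-Instruct-exl2-4.0bpw",
)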

Tip

You can add your own llm path to the extra_model_paths.yaml file and put the models there instead.
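For example, an entry along these lines, following the layout of ComfyUI's extra_model_paths.yaml.example; the key name and paths are placeholders to adjust:

my_models:
    base_path: /path/to/your/models
    llm: llm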

Nodes

ExLlama Nodes

Loader: Loads models from the llm directory.
  cache_bits: A lower value reduces VRAM usage, but also affects generation speed and quality.
  fast_tensors: Enabling reduces RAM usage and speeds up model loading.
  flash_attention: Enabling reduces VRAM usage. Not supported on cards with compute capability lower than 8.0.
  max_seq_len: Maximum context length; a higher value means higher VRAM usage. 0 defaults to the model config.

Formatter: Formats messages using the model's chat template (see the sketch after this list).
  add_assistant_role: Appends the assistant role to the formatted output.

Tokenizer: Tokenizes input text using the model's tokenizer.
  add_bos_token: Prepends the input with a bos token if enabled.
  encode_special_tokens: Encodes special tokens such as bos and eos if enabled; otherwise treats them as normal strings.

Settings: Optional sampler settings node. Refer to SillyTavern for parameter descriptions.

Generator: Generates text based on the given input.
  unload: Unloads the model after each generation to reduce VRAM usage.
  stop_conditions: A list of strings to stop generation on, e.g. "\n" to stop on a newline. Leave empty to stop only on eos.
  max_tokens: Maximum number of new tokens to generate. 0 uses all available context.

Text Nodes

Convert: Strips punctuation and whitespace from the input and changes its case.
Message: A message for the Formatter node. Can be chained to create a conversation.
Preview: Displays generated text in the UI.
Replace: Replaces variable names in curly brackets, e.g. {a}, with their values.
String: A string. That's it.
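The Formatter's templating is conceptually similar to apply_chat_template in the transformers library. Below is a minimal sketch of the idea, not the node's implementation, assuming transformers is installed and the model folder contains the usual tokenizer files:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llm/Llama-3.1-8B-Instruct-exl2-4.0bpw")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about llamas."},
]

# add_generation_prompt mirrors add_assistant_role: it appends the
# assistant header so the model starts its reply.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)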

Workflow

An example workflow is embedded in the image below and can be opened in ComfyUI.

[workflow image]

License

MIT