Exa
Boost your LLM inference performance by up to 300% on everyday GPU hardware, in just a few minutes of setup and with no additional hardware costs.
Principles
- Radical Simplicity (use super-powerful LLMs in as few lines of code as possible)
- Ultra-Optimized Performance (high-performance code that extracts all the power from these LLMs)
- Fluidity & Shapelessness (plug and play, and re-architect as you please)
Schedule a 1-on-1 Session
Book a 1-on-1 Session with Kye, the Creator, to discuss any issues, provide feedback, or explore how we can improve Exa for you.
Installation
You can install the package using pip:
pip install exxa
Usage
Inference
Generate text using pretrained models, with optional quantization, minimal configuration, and straightforward usage.
- Load specified pre-trained models with device flexibility (CPU/CUDA).
- Set a default maximum length for the generated sequences.
- Choose to quantize model weights for faster inference.
- Use a custom configuration for quantization as needed.
- Generate text through either a direct call or the run method.
- Simple usage for quick text generation based on provided prompts.
from exa import Inference

# Load a pretrained model, with weights quantized for faster inference
model = Inference(
    model_id="georgesung/llama2_7b_chat_uncensored",
    quantize=True
)

# Generate text from a prompt
model.run("What is your name")
GPTQ Inference
Efficiently generate text with GPTQ-quantized versions of HuggingFace's pretrained models, instantiated and run in only a few lines of code.
- Load specified pre-trained models with an option for quantization.
- Define custom bit depth for the quantization (default is 4 bits).
- Fine-tune quantization parameters using specific datasets.
- Set maximum length for generated sequences to maintain consistency.
- Tokenize prompts and generate text based on them seamlessly.
# !pip install exxa
from exa import GPTQInference

model_id = "gpt2-medium"

# 2-bit quantization calibrated on the c4 dataset
inference = GPTQInference(
    model_id,
    quantization_config_bits=2,
    max_length=400,
    quantization_config_dataset='c4'
)

output_text = inference.run("The future of AI is")
print(output_text)
CInference
- Optimized inference built on the ctransformers library.
from exa import CInference

model = CInference('marella/gpt-2-ggml', hf=True)

# Run the model with explicit sampling parameters
output = model.run(
    "ai is going to.....",
    max_new_tokens=256,
    top_k=40,
    top_p=0.95,
    temperature=0.8,
    repetition_penalty=1.1
)
print(output)
Quantize
Achieve smaller model sizes and faster inference through a unified interface tailored to HuggingFace's framework; all that's needed is a single class instantiation with the desired parameters.
- Efficiently quantize HuggingFace's pretrained models with specified bits (default is 4 bits).
- Set custom thresholds for quantization for precision management.
- Ability to skip specific modules during quantization for sensitive model parts.
- Offload parts of the model to CPU in FP32 format for GPU memory management.
- Specify if model weights are already in FP16 format.
- Choose from multiple quantization types like "fp4", "int8", and more.
- Option to enable double quantization for more compression.
- Verbose logging for a detailed understanding of the quantization process.
- Seamlessly push to and load models from the HuggingFace model hub.
- In-built logger initialization tailored for quantization logs.
- Log metadata for state and settings introspection.
from exa import Quantize

# Usage: quantize a model to 8 bits, offloading FP32 modules to the CPU
quantize = Quantize(
    model_id="bigscience/bloom-1b7",
    bits=8,
    enable_fp32_cpu_offload=True,
)

quantize.load_model()

# Push the quantized model to the HuggingFace Hub, then load it back
quantize.push_to_hub("my model")
quantize.load_from_hub('my model')
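The bit width, quantization type, double quantization, and FP32 CPU offload options above line up with HuggingFace's bitsandbytes integration. As a point of reference, and assuming Exa wraps that integration (its internals may differ), the equivalent raw transformers call looks roughly like this:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit weights with FP32 CPU offload, mirroring the Quantize call above;
# 4-bit variants would instead set load_in_4bit=True with bnb_4bit_quant_type="fp4"
# and bnb_4bit_use_double_quant=True
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    quantization_config=bnb_config,
    device_map="auto",
)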
API
To deploy your model as an API, we provide a simple helper that serves the model with FastAPI.
from exa.utils import Deploy
from exa import Inference

model = Inference(
    model_id="georgesung/llama2_7b_chat_uncensored",
    quantize=True
)

# Wrap the model in a FastAPI app, warm it up, and serve it
api = Deploy()
api.load_model()
api.generate("Hello, my name is whaaa")
api.run()
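Once api.run() is serving, the endpoint can be queried over HTTP. The route, port, and payload shape below are assumptions for illustration; check Deploy's source for the actual interface:

import requests

# Hypothetical route and payload shape; the real ones depend on Deploy's implementation
response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Hello, my name is"},
)
print(response.json())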
Features
- World-Class Quantization: Get the most out of your models with top-tier performance and preserved accuracy!
- Automated PEFT: Simplify your workflow! Let our toolkit handle the optimizations.
- LoRA Configuration: Dive into the potential of flexible LoRA configurations, a game-changer for performance! (See the sketch after this list.)
- Seamless Integration: Designed to work seamlessly with popular models like LLaMA, Falcon, and more!
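Exa's own PEFT and LoRA interface isn't shown in this README; as a reference for what a LoRA configuration involves, here is a minimal sketch using HuggingFace's peft library directly (the model and target modules are illustrative, not Exa API):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2-medium")

# Low-rank adapters on GPT-2's fused attention projection;
# r and dropout values are typical starting points, not tuned settings
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["c_attn"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable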
Feedback & Contributions
We're excited about the journey ahead and would love to have you with us! For feedback, suggestions, or contributions, feel free to open an issue or a pull request. Let's shape the future of fine-tuning together!
Check out our project board for our current backlog and the features we're implementing.
Benchmarks
We benchmark according to the LLM-Perf Leaderboard benchmarks:
- https://huggingface.co/spaces/optimum/llm-perf-leaderboard
- https://github.com/huggingface/optimum-benchmark
Metrics
- Backend
- Dtype
- Optimizations
- Quantization
- Class
- Type
- Memory (MB)
- Throughput (tokens/s) (a quick local sketch follows this list)
- Energy (tokens/kWh)
- Best Score (%)
- Best Scored LLM
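For a quick local sanity check of the throughput metric, generation can be timed directly. A minimal sketch using the Inference class from above, assuming run() returns the generated string (whitespace splitting is a crude stand-in for a real tokenizer):

import time
from exa import Inference

model = Inference(
    model_id="georgesung/llama2_7b_chat_uncensored",
    quantize=True
)

start = time.perf_counter()
output = model.run("Explain quantization in one paragraph.")
elapsed = time.perf_counter() - start

# Approximate the token count by splitting on whitespace
n_tokens = len(output.split())
print(f"~{n_tokens / elapsed:.1f} tokens/s over {elapsed:.2f}s")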
License
MIT
Todo
- Set up utils logger classes for metric logging, with useful metadata such as tokens inferred per second, latency, and memory consumption (a rough sketch follows this list)
- Add CUDA C++ extensions for radically optimized quantization and inference classes for high performance on the edge
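A rough sketch of what such a metrics logger could look like; everything here is hypothetical, not an existing Exa API:

import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("exa.metrics")

def log_generation_metrics(generate_fn, prompt):
    """Hypothetical helper: time a generation call and log latency and tokens/s."""
    start = time.perf_counter()
    output = generate_fn(prompt)
    latency = time.perf_counter() - start
    n_tokens = len(output.split())  # crude proxy for the real token count
    tokens_per_s = n_tokens / latency if latency > 0 else 0.0
    logger.info("latency=%.3fs tokens/s=%.1f tokens=%d", latency, tokens_per_s, n_tokens)
    return output

On CUDA devices, peak memory consumption could additionally be captured with torch.cuda.max_memory_allocated().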