
Unleash the full potential of exascale LLMs on consumer-class GPUs, proven by extensive benchmarks, with no long-term adjustments and minimal learning curve.

Home Page: https://exa.apac.ai




Exa

Boost LLM performance by 300% on everyday consumer GPUs, as validated by renowned developers, in just 5 minutes of setup and with no additional hardware costs.


Principles

  • Radical Simplicity (utilize super-powerful LLMs in as few lines of code as possible)
  • Ultra-Optimized Performance (high-performance code that extracts all the power from these LLMs)
  • Fluidity & Shapelessness (plug in, play, and re-architect as you please)

🀝 Schedule a 1-on-1 Session

Book a 1-on-1 Session with Kye, the Creator, to discuss any issues, provide feedback, or explore how we can improve Exa for you.


πŸ“¦ Installation πŸ“¦

You can install the package using pip:

pip install exxa

Usage

Inference

Generate text using pretrained models, with optional quantization, minimal configuration, and straightforward usage.

  • Load specified pre-trained models with device flexibility (CPU/CUDA).
  • Set a default maximum length for the generated sequences.
  • Choose to quantize model weights for faster inference.
  • Use a custom configuration for quantization as needed.
  • Generate text through either a direct call or the run method (see the sketch after the example below).
  • Simple usage for quick text generation based on provided prompts.
from exa import Inference

model = Inference(
    model_id="georgesung/llama2_7b_chat_uncensored",
    quantize=True  # quantize weights for faster inference
)

# Generate text from a prompt
model.run("What is your name?")
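
As the list above notes, the model object can also be invoked directly instead of through run. A minimal sketch, assuming the direct call shares run's signature; the max_length keyword below is likewise an assumption based on the feature list, not a confirmed parameter name:

from exa import Inference

model = Inference(
    model_id="georgesung/llama2_7b_chat_uncensored",
    quantize=True,
    max_length=200,  # assumed keyword, per "set a default maximum length" above
)

# Direct call, described above as equivalent to run()
print(model("What is your name?"))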

GPTQ Inference

Efficiently generate text with quantized, GPT-like models built on HuggingFace's pretrained checkpoints; instantiation and generation take only a few lines of code.

  • Load specified pre-trained models with an option for quantization.
  • Define custom bit depth for the quantization (default is 4 bits).
  • Fine-tune quantization parameters using specific datasets.
  • Set maximum length for generated sequences to maintain consistency.
  • Tokenize prompts and generate text based on them seamlessly.
# !pip install exxa
from exa import GPTQInference

model_id = "gpt2-medium"
inference = GPTQInference(
    model_id,
    quantization_config_bits=2,       # bit depth used for quantization
    max_length=400,                   # cap on generated sequence length
    quantization_config_dataset='c4'  # calibration dataset for quantization
)
output_text = inference.run("The future of AI is")
print(output_text)

CInference

  • This is optimized inference built on the ctransformers library!
from exa import CInference

model = CInference('marella/gpt-2-ggml', hf=True)

# Run generation with sampling parameters
output = model.run(
    "ai is going to.....",
    max_new_tokens=256,
    top_k=40,
    top_p=0.95,
    temperature=0.8,
    repetition_penalty=1.1
)

print(output)

Quantize

Achieve smaller model sizes and faster inference through a unified interface tailored to HuggingFace's framework; only a simple class instantiation with a few parameters is needed.

  • Efficiently quantize HuggingFace's pretrained models with specified bits (default is 4 bits).
  • Set custom thresholds for quantization for precision management.
  • Ability to skip specific modules during quantization for sensitive model parts.
  • Offload parts of the model to CPU in FP32 format for GPU memory management.
  • Specify if model weights are already in FP16 format.
  • Choose from multiple quantization types like "fp4", "int8", and more.
  • Option to enable double quantization for more compression (a combined sketch of these options follows the example below).
  • Verbose logging for a detailed understanding of the quantization process.
  • Seamlessly push to and load models from the HuggingFace model hub.
  • In-built logger initialization tailored for quantization logs.
  • Log metadata for state and settings introspection.
from exa import Quantize

# usage
quantize = Quantize(
    model_id="bigscience/bloom-1b7",
    bits=8,
    enable_fp32_cpu_offload=True,  # offload parts of the model to CPU in FP32
)

quantize.load_model()
quantize.push_to_hub("my-model")   # Hub repo IDs cannot contain spaces
quantize.load_from_hub("my-model")
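
The remaining options from the list above can be combined in a single instantiation. The following is a sketch only: the keyword names threshold, skip_modules, quant_type, double_quant, and verbose are assumptions inferred from the feature list, not a confirmed signature.

from exa import Quantize

# All keywords below except model_id and bits are assumed names,
# inferred from the feature list above
quantize = Quantize(
    model_id="bigscience/bloom-1b7",
    bits=4,                    # default bit width
    threshold=6.0,             # assumed: custom quantization threshold
    skip_modules=["lm_head"],  # assumed: modules excluded from quantization
    quant_type="fp4",          # assumed: one of "fp4", "int8", ...
    double_quant=True,         # assumed: double quantization for extra compression
    verbose=True,              # assumed: detailed quantization logging
)
quantize.load_model()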

API

To deploy your model as an API, we've provided a simple script that serves the model with FastAPI.

from exa.utils import Deploy
from exa import Inference

model = Inference(
    model_id="georgesung/llama2_7b_chat_uncensored",
    quantize=True
)

api = Deploy()
api.load_model()                         # load the model into the server
api.generate("Hello, my name is whaaa")  # sanity-check generation locally
api.run()                                # start the FastAPI server
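
Once api.run() has started the server, the model can be queried over HTTP. A sketch using requests, where the port, route, and payload shape are all assumptions to be adjusted to the routes Deploy actually registers:

import requests

# Assumed endpoint and payload; adjust to your Deploy configuration
response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Hello, my name is"},
)
print(response.json())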

πŸŽ‰ Features πŸŽ‰

  • World-Class Quantization: Get the most out of your models with top-tier performance and preserved accuracy! πŸ‹οΈ‍♂️

  • Automated PEFT: Simplify your workflow! Let our toolkit handle the optimizations. πŸ› οΈ

  • LoRA Configuration: Dive into the potential of flexible LoRA configurations, a game-changer for performance! (See the sketch after this list.) 🌌

  • Seamless Integration: Designed to work seamlessly with popular models like LLAMA, Falcon, and more! πŸ€–
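
Exa's own automated PEFT interface is not shown in this README; as an illustration of what a flexible LoRA configuration involves, here is a minimal sketch using HuggingFace's peft library directly (target module names vary per architecture):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7")

# Low-rank adapters on the attention projections; r controls the adapter rank
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable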


πŸ’Œ Feedback & Contributions πŸ’Œ

We're excited about the journey ahead and would love to have you with us! For feedback, suggestions, or contributions, feel free to open an issue or a pull request. Let's shape the future of fine-tuning together! 🌱

Check out our project board for our current backlog and the features we're implementing.


Benchmarks

The following is what we benchmark, following the πŸ€— LLM-Perf Leaderboard πŸ‹οΈ benchmarks:

Metrics

  • Backend 🏭
  • Dtype πŸ“₯
  • Optimizations πŸ› οΈ
  • Quantization πŸ—œοΈ
  • Class πŸ‹οΈ
  • Type πŸ€—
  • Memory (MB) ⬇️
  • Throughput (tokens/s) ⬆️
  • Energy (tokens/kWh) ⬇️
  • Best Score (%) ⬆️
  • Best Scored LLM πŸ†

License

MIT

Todo

  • Set up utils logger classes for metric logging with useful metadata such as tokens inferred per second, latency, and memory consumption (a sketch follows below)
  • Add CUDA C++ extensions for radically optimized classes for high-performance quantization + inference on the edge
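
For the first item, a rough sketch of what such a metric logger could measure around a generate call (illustrative only, not part of the current package):

import time

def timed_run(model, prompt, **kwargs):
    """Call model.run() and report latency and an approximate tokens/s rate."""
    start = time.perf_counter()
    output = model.run(prompt, **kwargs)
    latency = time.perf_counter() - start
    # Whitespace split is a rough proxy; a real logger would count tokenizer tokens
    tokens_per_s = len(output.split()) / latency
    print(f"latency: {latency:.2f}s | ~{tokens_per_s:.1f} tokens/s")
    return output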
