Intel® Extension for Transformers: Accelerating Transformer-based Models on Intel Platforms

Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, in particular effective on 4th Intel Xeon Scalable processor Sapphire Rapids (codenamed Sapphire Rapids). The toolkit provides the key features and examples as below:

Seamless user experience of model compressions on Transformers-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
Advanced software optimizations and unique compression-aware runtime (released with NeurIPS 2022's paper Fast Distilbert on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and NeurIPS 2021's paper Prune Once for All: Sparse Pre-Trained Language Models)
Accelerated end-to-end Transformer-based applications such as Stable Diffusion, GPT-J-6B, BLOOM-176B, T5, and SetFit

Installation

Install from Pypi

pip install intel-extension-for-transformers

For more installation method, please refer to Installation Page

Getting Started

Sentiment Analysis with Quantization

Prepare Dataset

from datasets import load_dataset, load_metric
from transformers import AutoConfig,AutoModelForSequenceClassification,AutoTokenizer

raw_datasets = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
raw_datasets = raw_datasets.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)

Quantization

from intel_extension_for_transformers.optimization import QuantizationConfig, metrics, objectives
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

config = AutoConfig.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english",num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english",config=config)
model.config.label2id = {0: 0, 1: 1}
model.config.id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(model=model, 
    train_dataset=raw_datasets["train"], 
    eval_dataset=raw_datasets["validation"],
    tokenizer=tokenizer
)
q_config = QuantizationConfig(metrics=[metrics.Metric(name="eval_loss", greater_is_better=False)])
model = trainer.quantize(quant_config=q_config)

input = tokenizer("I like Intel Extension for Transformers", return_tensors="pt")
output = model(**input).logits.argmax().item()

For more quick samples, please refer to Get Started Page. For more validated examples, please refer to Support Model Matrix

Documentation

OVERVIEW
Model Compression	Neural Engine	Kernel Libraries	Examples
MODEL COMPRESSION
Quantization	Pruning	Distillation	Orchestration
Neural Architecture Search	Export	Metrics/Objectives	Pipeline
NEURAL ENGINE
Model Compilation	Custom Pattern	Deployment	Profiling
KERNEL LIBRARIES
Sparse GEMM Kernels	Custom INT8 Kernels	Profiling	Benchmark
ALGORITHMS
Length Adaptive		Data Augmentation
TUTORIALS AND RESULTS
Tutorials	Supported Models	Model Performance	Kernel Performance

Selected Publications/Events

Blog published on Medium: MLefficiency — Optimizing transformer models for efficiency (Dec 2022)
NeurIPS'2022: Fast Distilbert on CPUs (Nov 2022)
NeurIPS'2022: QuaLA-MiniLM: a Quantized Length Adaptive MiniLM (Nov 2022)
Blog published by Cohere: Top NLP Papers—November 2022 (Nov 2022)
Blog published by Alibaba: Deep learning inference optimization for Address Purification (Aug 2022)
NeurIPS'2021: Prune Once for All: Sparse Pre-Trained Language Models (Nov 2021)

About

Extending Hugging Face transformers APIs for Transformer-based models and improve the productivity of inference deployment. With extremely compressed models, the toolkit can greatly improve the inference efficiency on Intel platforms.

Apache License 2.0

Languages

Language:C++ 57.0%Language:Python 41.7%Language:CMake 0.9%Language:Shell 0.3%