Instella is a family of state-of-the-art open language models trained on AMD Instinct™ MI300X GPUs by the AMD GenAI team. Instella models significantly outperform existing fully open language models of similar size and bridge the gap between fully open and open-weight models by achieving performance competitive with Llama-3.2-3B and Qwen2.5-3B. We provide the model weights, training code, and training data to accelerate the development of open-source language models.

First install PyTorch according to the instructions specific to your operating system. For AMD GPUs, you can also start with a rocm/pytorch Docker image.
To install from source (recommended for training/fine-tuning) run:
git clone https://github.com/AMD-AIG-AIMA/Instella.git
cd Instella
# install Flash-Attention on MI300X
GPU_ARCH=gfx942 MAX_JOBS=$(nproc) pip install git+https://github.com/Dao-AILab/flash-attention.git -v
# install other dependencies
pip install -e .[all]
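After installation, you can quickly verify that PyTorch sees the GPUs. This is a minimal sanity check that only assumes the ROCm build of PyTorch installed above:

```python
# Minimal sanity check: verify that the ROCm build of PyTorch sees the GPUs.
import torch

print(torch.__version__)                      # PyTorch version
print(getattr(torch.version, "hip", None))    # ROCm/HIP version on ROCm builds, None otherwise
print(torch.cuda.is_available())              # True if MI300X devices are visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))      # device name, e.g. the MI300X
```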
To run inference with the instruction-tuned model using Hugging Face Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "amd/Instella-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", trust_remote_code=True)
prompt = [{"role": "user", "content": "What are the benefits of open-source AI research?"}]
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
)
tokens = model.generate(
    inputs.to(model.device),
    max_new_tokens=1024,
    temperature=0.8,
    do_sample=True
)
print(tokenizer.decode(tokens[0], skip_special_tokens=False))
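If you only want the model's reply without the prompt and chat-template tokens, you can slice off the input before decoding. This small addition reuses the inputs and tokens variables from the example above:

```python
# Decode only the newly generated tokens, dropping the prompt portion.
generated = tokens[0][inputs.shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```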
You can also use the TRL CLI to chat with the model from the terminal:
pip install trl
trl chat --model_name_or_path amd/Instella-3B-Instruct --trust_remote_code --max_new_tokens 1024
# <root>:
# which is bigger 9.8 or 9.11?
# <amd/Instella-3B-Instruct>:
# 9.8 is bigger than 9.11. The difference between the two numbers is 0.69 (9.8 - 9.11 = 0.69), which indicates that 9.8 is 0.69 units larger than 9.11.
We use the OLMoE-mix-0924 dataset for stage 1 pretraining. After downloading the dataset, run the following to tokenize the text data:
pip install dolma
bash scripts/prepare_pretrain_data_stage1.sh
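If you want to see what the raw corpus contains before downloading it, the sketch below lists the dataset's files on the Hugging Face Hub; the repository id allenai/OLMoE-mix-0924 is an assumption here, so adjust it if the dataset is hosted elsewhere:

```python
# List the files of the stage-1 corpus on the Hugging Face Hub without downloading them.
# NOTE: the repo id "allenai/OLMoE-mix-0924" is an assumption, not taken from this repo.
from huggingface_hub import list_repo_files

files = list_repo_files("allenai/OLMoE-mix-0924", repo_type="dataset")
print(len(files), "files, e.g.:", files[:5])
```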
To prepare the second stage training data, download the dolmino-mix-1124, python-edu, and dm_math datasets, then run the data preparation script:
bash scripts/prepare_pretrain_data_stage2.sh
The configs used to train the Instella-3B models are provided in the configs directory.
Once you've updated the data paths in the config, you can launch a training run via torchrun. For example, to launch the 3B model training on a single 8x GPU node, you would run:
torchrun --nproc_per_node=8 scripts/train.py configs/instella-3b-pretrain-stage1.yaml
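If you want to verify that the data paths you edited are actually picked up, you can inspect the YAML config directly. The sketch below assumes an OLMo-style config layout with top-level model and data sections; adapt the keys if your config differs:

```python
# Quick look at a training config before launching; key names assume the OLMo-style layout.
import yaml

with open("configs/instella-3b-pretrain-stage1.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg.get("model", {}))                                             # model architecture settings, if present
print(cfg.get("data", {}).get("paths", "no data paths under data.paths"))
```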
To resume training from a checkpoint, you can pass its path to scripts/train.py with the --load_path argument. For example, to resume training from the latest checkpoint of the Instella-3B stage-1 pretraining run:
torchrun --nproc_per_node=8 scripts/train.py configs/instella-3b-pretrain-stage1.yaml --load_path output/pretrain/Instella-3B-pretrain-stage1/latest
To launch multi-node jobs, run the following on each of the nodes:
torchrun --nnodes=$NUM_NODES --nproc_per_node=8 --rdzv_id=$JOB_ID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT scripts/train.py configs/instella-3b-pretrain-stage1.yaml
where NUM_NODES is the total number of nodes, JOB_ID is a user-defined job ID, MASTER_ADDR is the IP address of the master node, and MASTER_PORT is a port on the master node that can be used to host the C10d TCP store. Please refer to this documentation for torchrun to understand the arguments used to configure the rendezvous backend for multi-node training.
For the second stage pretraining, we trained the model from the first stage checkpoints with three random seeds (see the configs: 5796, 6198, and 8915), and then merged the checkpoints with this script.
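Conceptually, the merge averages the weights of the three seed checkpoints (model souping). The sketch below only illustrates the idea with placeholder paths; the merge script linked above is the authoritative implementation and also handles the actual checkpoint format:

```python
# Illustrative weight averaging across seed checkpoints; paths are placeholders
# and the repo's merge script may handle the checkpoint format differently.
import torch

paths = ["seed_5796/model.pt", "seed_6198/model.pt", "seed_8915/model.pt"]  # hypothetical paths
state_dicts = [torch.load(p, map_location="cpu") for p in paths]

merged = {
    key: sum(sd[key].float() for sd in state_dicts) / len(state_dicts)
    for key in state_dicts[0]
}

torch.save(merged, "merged/model.pt")
```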
Run the following commands to prepare the SFT data:
bash scripts/prepare_sft_data.sh
Launch the SFT job with the SFT config file:
torchrun --nproc_per_node=8 scripts/train.py configs/instella-3b-sft.yaml
Note: please make sure to update load_path to your final pretrain checkpoint.
We conduct DPO after SFT using open-instruct with this commit. Please follow their instructions to install the package and then run the DPO training:
accelerate launch \
--mixed_precision bf16 \
--num_machines 1 \
--num_processes 8 \
--use_deepspeed \
--deepspeed_config_file configs/ds_stage2.conf \
scripts/dpo_tune.py \
configs/instella-3b-dpo.yaml
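For reference, the objective optimized in this stage can be sketched as follows. This is a conceptual illustration of the standard DPO loss, not the open-instruct implementation; beta controls how strongly the policy is kept close to the reference model:

```python
# Conceptual sketch of the DPO loss: increase the policy/reference log-prob ratio
# of chosen responses relative to rejected ones, scaled by beta.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    chosen_ratio = policy_chosen_logps - reference_chosen_logps
    rejected_ratio = policy_rejected_logps - reference_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Dummy usage with per-example summed log-probs.
lp = torch.tensor([-12.3, -15.1])
print(dpo_loss(lp, lp - 1.0, lp - 0.2, lp - 0.8).item())
```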
Please refer to this folder for detailed instructions for model evaluation.
Synthetic data generation for GSM8k is a multi-step process:
- Original question -> Masked question (The numerical values in the question are replaced by variables).
- Masked question -> Program (Code to solve the masked question).
- Program -> Perturbed questions (New questions where the values have been perturbed).
- Perturbed questions -> Chain of thought solutions.
Some steps are repeated multiple times until we know that the output is correct. Specifically, in steps 2 and 4 we already know the answer, so if the answer from the generated programs (or CoTs) doesn't match the expected answer, we re-run the previous steps (the idea is illustrated in the sketch below).
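The toy sketch below is purely illustrative (it is not the repo's pipeline): a masked question becomes a small program, perturbed values are substituted back in, and the program's output serves as the known answer for checking generated solutions:

```python
# Toy illustration of the masked-question -> program -> perturbed-question idea.
import random

TEMPLATE = "A baker fills {a} trays with {b} cookies each. How many cookies are there in total?"

def solve(a, b):
    # "Program" for the masked question; its output is the known answer
    # used to verify generated CoT solutions.
    return a * b

def perturbed_sample():
    a, b = random.randint(2, 20), random.randint(2, 20)
    return TEMPLATE.format(a=a, b=b), solve(a, b)

question, expected_answer = perturbed_sample()
print(question, "->", expected_answer)
```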
For steps 1 and 2, please run the following command:
python -W ignore scripts/generate_gsm8k_programs.py
For steps 3 and 4, please run the following command:
python -W ignore scripts/generate_gsm8k_new_samples.py 0
- Pre-trained models:
- Instella-3B-Stage1: amd/Instella-3B-Stage1, First stage pre-training checkpoint.
- Instella-3B: amd/Instella-3B, Final pre-training checkpoint.
- Instruction-tuned models:
- Instella-3B-SFT: amd/Instella-3B-SFT, Supervised fine-tuned checkpoint.
- Instella-3B-Instruct: amd/Instella-3B-Instruct, Final instruction-tuned checkpoint.
- Second stage pre-training GSM8K synthetic dataset: amd/Instella-GSM8K-synthetic
- The dataset consists of two splits: “train” and “train_119K”.
- For Instella-3B model second stage pre-training we used the “train_119K” split, which is a subset of the larger “train” split.
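To load the split we used, a minimal sketch with Hugging Face datasets (the dataset id and split names come from the dataset card above; the field names printed below are not assumed):

```python
# Load the "train_119K" subset used for Instella-3B second-stage pre-training.
from datasets import load_dataset

ds = load_dataset("amd/Instella-GSM8K-synthetic", split="train_119K")
print(len(ds), "examples; fields:", list(ds[0].keys()))
```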
Please refer to the following blogs to get started with using these techniques on AMD GPUs:
- PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm™
- Accelerating Large Language Models with Flash Attention on AMD GPUs
- Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm™
- Introducing the First AMD 1B Language Models: AMD OLMo
This codebase is built from OLMo.
- The Instella-3B models are licensed for academic and research purposes under a ResearchRAIL license.
- The amd/Instella-GSM8K-synthetic dataset used in second stage pre-training is built with Qwen2.5-72B-Instruct, and is licensed for academic and research purposes under a ResearchRAIL license. Refer to the LICENSE and NOTICES in the amd/Instella-GSM8K-synthetic dataset card files for more information.
- Refer to the LICENSE and NOTICES files for more information.
Feel free to cite our Instella-3B models:
@misc{Instella,
title = {Instella: Fully Open Language Models with Stellar Performance},
url = {https://huggingface.co/amd/Instella-3B},
author = {Jiang Liu and Jialian Wu and Xiaodong Yu and Prakamya Mishra and Sudhanshu Ranjan and Zicheng Liu and Chaitanya Manem and Yusheng Su and Pratik Prabhanjan Brahma and Gowtham Ramesh and Ximeng Sun and Ze Wang and Emad Barsoum},
month = {March},
year = {2025}
}
Footnotes
- Here, even for instruct models, we compared against pre-training tokens because 1) the exact number of training tokens for open-weight instruct models is unknown, and 2) adding the instruct-model training tokens (in billions) shifts the trends only marginally. ↩