Optimizer Utils
pip install git+https://github.com/anminhhung/oput
import oput
# model = ...
optimizer = oput.SophiaG(
model.parameters(),
lr=2e-4,
betas=(0.965, 0.99),
rho = 0.01,
weight_decay=1e-1
)
optimizer.step()
The learning rate is a crucial hyperparameter that controls the step size of the parameter updates during the optimization process. In Decoupled Sophia, the update is written as
p.data.addcdiv_(-group['lr'], m, h.add(group['rho']))
which is equivalent to the update in the paper up to a re-parameterization.
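To make the update line above concrete, here is a minimal element-wise sketch in plain Python. It assumes `m` is the gradient momentum and `h` is the diagonal Hessian estimate, as the variable names in the snippet suggest. The function name `decoupled_sophia_update` is hypothetical and not part of oput. The legacy call `addcdiv_(value, t1, t2)` computes `self += value * t1 / t2`, so each coordinate moves by `-lr * m / (h + rho)`.

```python
def decoupled_sophia_update(p, m, h, lr, rho):
    """Return updated parameters: p_i - lr * m_i / (h_i + rho)."""
    return [p_i - lr * m_i / (h_i + rho) for p_i, m_i, h_i in zip(p, m, h)]


# Illustrative values only (not taken from any experiment).
params = [1.0, -2.0]
momentum = [0.5, -0.1]
hessian = [0.24, 0.04]

new_params = decoupled_sophia_update(params, momentum, hessian, lr=0.1, rho=0.01)
print(new_params)
```

In current PyTorch releases the same in-place update is spelled with a keyword argument: `p.data.addcdiv_(m, h.add(group['rho']), value=-group['lr'])`.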
Choose the learning rate to be about half the learning rate that you would use for AdamW. Some partial ongoing results indicate that the learning rate can be made even larger, possibly leading to faster convergence.
Rho (rho)
The rho parameter is used in the update rule to control the Hessian's influence on the parameter updates. It is essential to choose an appropriate value for rho to balance the trade-off between the gradient and the Hessian information.
Consider choosing rho in the range of 0.03 to 0.04. The rho value seems transferable across different model sizes. For example, rho = 0.03 can be used in 125M and 335M Sophia-G models.
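The trade-off that rho controls can be seen numerically. The sketch below uses the `lr * m / (h + rho)` form of the update and compares a flat coordinate (small `h`) with a sharp one (large `h`). The helper name `step_size` and the sample curvature values are illustrative assumptions, not measurements.

```python
def step_size(m, h, lr, rho):
    """Magnitude of one coordinate's update under p -= lr * m / (h + rho)."""
    return abs(lr * m / (h + rho))


lr, m = 2e-4, 1.0
flat_h, sharp_h = 0.001, 1.0  # hypothetical curvature estimates

ratios = []
for rho in (0.01, 0.03, 0.04):
    flat = step_size(m, flat_h, lr, rho)
    sharp = step_size(m, sharp_h, lr, rho)
    # A larger rho shrinks the gap between flat and sharp coordinates,
    # i.e. the Hessian estimate has less influence on the step.
    ratios.append(flat / sharp)
    print(f"rho={rho}: flat={flat:.2e} sharp={sharp:.2e} ratio={flat / sharp:.1f}")
```

As rho grows, the flat-to-sharp ratio falls, so the update behaves more like a plain momentum step. As rho shrinks, the curvature estimate dominates.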
The (lr, rho) for 335M Sophia-G is chosen to be (2e-4, 0.03). Though we suspect that the learning rate can be larger, it's essential to experiment with different values to find the best combination for your specific use case.
While the learning rate and rho are the most critical hyperparameters to tune, you may also experiment with other hyperparameters such as betas, weight_decay, and k (the frequency of Hessian updates). However, the default values provided in the optimizer should work well for most cases.
Remember that hyperparameter tuning is an iterative process, and the best values may vary depending on the model architecture and dataset. Don't hesitate to experiment with different combinations and validate the performance on a held-out dataset or using cross-validation.
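As a starting point for that iteration, the guidance above (about half the AdamW learning rate, rho between 0.03 and 0.04) can be turned into a small candidate grid. This is only a sketch: `candidate_grid`, `pick_best`, and the `evaluate` callback are hypothetical names, and `evaluate` stands in for whatever routine trains briefly and returns a held-out loss.

```python
from itertools import product


def candidate_grid(adamw_lr, lr_scales=(0.5, 1.0), rhos=(0.03, 0.04)):
    """Build (lr, rho) pairs; 0.5 * adamw_lr follows the rule of thumb above."""
    return [(adamw_lr * scale, rho) for scale, rho in product(lr_scales, rhos)]


def pick_best(candidates, evaluate):
    """Return the (lr, rho) pair with the lowest held-out loss."""
    return min(candidates, key=lambda pair: evaluate(*pair))


grid = candidate_grid(adamw_lr=4e-4)


# Placeholder objective so the sketch runs on its own; replace it with a
# function that trains with the chosen (lr, rho) and returns validation loss.
def fake_validation_loss(lr, rho):
    return abs(lr - 2e-4) + abs(rho - 0.03)


best_lr, best_rho = pick_best(grid, fake_validation_loss)
print(best_lr, best_rho)
```

Keeping the search space in one function makes it easy to widen the learning-rate scales later if larger values turn out to work for your model.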
Feel free to share your findings and experiences during hyperparameter tuning. Your valuable feedback and comments can help improve the optimizer and its usage in various scenarios.
Ready-to-train, plug-and-play file for your own model or Andromeda
Performance improvements: Investigate and implement potential performance improvements to further reduce training time and computational resources -> Decoupled Sophia + heavy metric logging + implement in Triton and/or JAX?
Additional Hessian estimators: Research and implement other Hessian estimators to provide more options for users.
Hyperparameter tuning: Develop a set of recommended hyperparameters for various use cases and model architectures.
Integration with Andromeda model: Train the Andromeda model using the Sophia optimizer and compare its performance with other optimizers.
Sophia optimizer variants: Explore and develop variants of the Sophia optimizer tailored for specific tasks, such as computer vision, multi-modality AI, natural language processing, and reinforcement learning.
Distributed training: Implement support for distributed training to enable users to train large-scale models using Sophia across multiple devices and nodes.
Automatic hyperparameter tuning: Develop an automatic hyperparameter tuning module to help users find the best hyperparameters for their specific use case.
Training multiple models in parallel: Develop a framework for training multiple models concurrently with different optimizers, allowing users to test and compare the performance of various optimizers, including Sophia, on their specific tasks.
Sophia optimizer for other domains: Adapt the Sophia optimizer for other domains, such as optimization in reinforcement learning, Bayesian optimization, and evolutionary algorithms.
By following this roadmap, we aim to make the Sophia optimizer a powerful and versatile tool for the deep learning community, enabling users to train their models more efficiently and effectively.
Use Momo optimizer
import oput
# model = ...
optimizer = oput.Momo(
model.parameters(),
lr=1e-2
)
Use MomoAdam optimizer
import oput
# model = ...
optimizer = oput.MomoAdam(
model.parameters(),
lr=1e-2
)
Note that Momo needs access to the value of the batch loss. In the .step() method, you need to pass either
- the loss tensor (when backward has already been done) to the argument loss
- or a callable closure to the argument closure that computes gradients and returns the loss.
For example:
running = {"loss": 0.0}  # mutable container so the closure can update the running total

def loss_fn(criterion, running, outputs, labels):
    loss = criterion(outputs, labels)
    running["loss"] += loss.item()  # a plain float argument would not be updated for the caller
    loss.backward()
    return loss

# in each training step, use:
outputs = model(images)
optimizer.zero_grad()
closure = lambda: loss_fn(criterion, running, outputs, labels)  # define a closure that returns the loss
optimizer.step(closure)
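The two calling conventions listed above can also be illustrated without any dependencies. The class below is a toy stand-in written for this example: it is not oput's Momo, and it performs plain gradient descent on a single scalar. Only the argument names `loss` and `closure` come from the text above; everything else is hypothetical.

```python
class ToyLossAwareOptimizer:
    """Toy stand-in: plain gradient descent on one scalar weight."""

    def __init__(self, weight, lr):
        self.weight, self.grad, self.lr = weight, 0.0, lr
        self.last_loss = None

    def step(self, closure=None, loss=None):
        if closure is not None:
            loss = closure()  # the closure computes gradients and returns the loss
        if loss is None:
            raise ValueError("pass either `loss` or `closure`")
        self.last_loss = loss  # a loss-aware method would use this value
        self.weight -= self.lr * self.grad
        return loss


def forward_backward(opt, target=3.0):
    """Quadratic loss (w - target)^2; stores its gradient on the optimizer."""
    loss = (opt.weight - target) ** 2
    opt.grad = 2.0 * (opt.weight - target)
    return loss


# Style 1: backward already done, pass the loss value.
opt_a = ToyLossAwareOptimizer(weight=0.0, lr=0.1)
loss_value = forward_backward(opt_a)
opt_a.step(loss=loss_value)

# Style 2: pass a closure that computes gradients and returns the loss.
opt_b = ToyLossAwareOptimizer(weight=0.0, lr=0.1)
opt_b.step(closure=lambda: forward_backward(opt_b))

print(opt_a.weight, opt_b.weight)
```

Both styles end at the same weight. The difference is only whether the caller or the optimizer triggers the forward and backward pass.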
import oput
# model = ...
optimizer = oput.Lion(
model.parameters(),
betas=(0.9, 0.99),
weight_decay=0.0
)
optimizer.step()
import oput
# model = ...
optimizer = oput.Adan(
model.parameters(),
lr = 1e-3,
betas = (0.1, 0.1, 0.001),
weight_decay = 0
)
optimizer.step()
import torch
import oput
# model = ...
base_optimizer = torch.optim.SGD # define an optimizer for the "sharpness-aware" update
optimizer = oput.SAM(
model.parameters(),
base_optimizer,
lr=0.1,
momentum=0.9
)
optimizer.step()
import oput
# model = ...
optimizer = oput.A2GradExp(
model.parameters(),
beta=10.0,
lips=10.0,
rho=0.5,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.A2GradInc(
model.parameters(),
beta=10.0,
lips=10.0,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.A2GradUni(
model.parameters(),
beta=10.0,
lips=10.0,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.AccSGD(
model.parameters(),
lr=1e-3,
kappa=1000.0,
xi=10.0,
small_const=0.7,
weight_decay=0
)
optimizer.step()
import oput
# model = ...
optimizer = oput.AdaBelief(
model.parameters(),
lr= 1e-3,
betas=(0.9, 0.999),
eps=1e-3,
weight_decay=0,
amsgrad=False,
weight_decouple=False,
fixed_decay=False,
rectify=False,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.AdaBound(
model.parameters(),
lr= 1e-3,
betas= (0.9, 0.999),
final_lr = 0.1,
gamma=1e-3,
eps= 1e-8,
weight_decay=0,
amsbound=False,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.AdaMod(
model.parameters(),
lr= 1e-3,
betas=(0.9, 0.999),
beta3=0.999,
eps=1e-8,
weight_decay=0,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.AdamP(
model.parameters(),
lr= 1e-3,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=0,
delta = 0.1,
wd_ratio = 0.1
)
optimizer.step()
import oput
# model = ...
optimizer = oput.AggMo(
model.parameters(),
lr= 1e-3,
betas=(0.0, 0.9, 0.99),
weight_decay=0,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.Apollo(
model.parameters(),
lr= 1e-2,
beta=0.9,
eps=1e-4,
warmup=0,
init_lr=0.01,
weight_decay=0,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.DiffGrad(
model.parameters(),
lr= 1e-3,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=0,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.Lamb(
model.parameters(),
lr= 1e-3,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=0,
)
optimizer.step()
import oput
# model = ...
# base optimizer, any other optimizer can be used like Adam or DiffGrad
yogi = oput.Yogi(
model.parameters(),
lr= 1e-2,
betas=(0.9, 0.999),
eps=1e-3,
initial_accumulator=1e-6,
weight_decay=0,
)
optimizer = oput.Lookahead(yogi, k=5, alpha=0.5)
optimizer.step()
import oput
# model = ...
optimizer = oput.MADGRAD(
model.parameters(),
lr=1e-2,
momentum=0.9,
weight_decay=0,
eps=1e-6,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.NovoGrad(
model.parameters(),
lr= 1e-3,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=0,
grad_averaging=False,
amsgrad=False,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.PID(
model.parameters(),
lr=1e-3,
momentum=0,
dampening=0,
weight_decay=1e-2,
integral=5.0,
derivative=10.0,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.QHAdam(
model.parameters(),
lr= 1e-3,
betas=(0.9, 0.999),
nus=(1.0, 1.0),
weight_decay=0,
decouple_weight_decay=False,
eps=1e-8,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.QHM(
model.parameters(),
lr=1e-3,
momentum=0,
nu=0.7,
weight_decay=1e-2,
weight_decay_type='grad',
)
optimizer.step()
import oput
# model = ...
optimizer = oput.RAdam(
model.parameters(),
lr= 1e-3,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=0,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.Ranger(
model.parameters(),
lr=1e-3,
alpha=0.5,
k=6,
N_sma_threshhold=5,
betas=(.95, 0.999),
eps=1e-5,
weight_decay=0
)
optimizer.step()
import oput
# model = ...
optimizer = oput.RangerQH(
model.parameters(),
lr=1e-3,
betas=(0.9, 0.999),
nus=(.7, 1.0),
weight_decay=0.0,
k=6,
alpha=.5,
decouple_weight_decay=False,
eps=1e-8,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.RangerVA(
model.parameters(),
lr=1e-3,
alpha=0.5,
k=6,
n_sma_threshhold=5,
betas=(.95, 0.999),
eps=1e-5,
weight_decay=0,
amsgrad=True,
transformer='softplus',
smooth=50,
grad_transformer='square'
)
optimizer.step()
import oput
# model = ...
optimizer = oput.SGDP(
model.parameters(),
lr= 1e-3,
momentum=0,
dampening=0,
weight_decay=1e-2,
nesterov=False,
delta = 0.1,
wd_ratio = 0.1
)
optimizer.step()
import oput
# model = ...
optimizer = oput.SGDW(
model.parameters(),
lr= 1e-3,
momentum=0,
dampening=0,
weight_decay=1e-2,
nesterov=False,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.SWATS(
model.parameters(),
lr=1e-1,
betas=(0.9, 0.999),
eps=1e-3,
weight_decay= 0.0,
amsgrad=False,
nesterov=False,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.Shampoo(
model.parameters(),
lr=1e-1,
momentum=0.0,
weight_decay=0.0,
epsilon=1e-4,
update_freq=1,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.Yogi(
model.parameters(),
lr= 1e-2,
betas=(0.9, 0.999),
eps=1e-3,
initial_accumulator=1e-6,
weight_decay=0,
)
optimizer.step()
import oput
# model = ...
optimizer = oput.Adam(
model.parameters(),
lr=0.0001,
betas=(0.9, 0.999),
eps=1e-8,
weight_decay=0
)
optimizer.step()
import oput
# model = ...
optimizer = oput.SGD(
model.parameters(),
lr=0.0001,
momentum=0,
dampening=0,
weight_decay=0
)
optimizer.step()