Zeroth-order Training for Lottery-pruned Models & Tensor-compressed Models

This is a PyTorch implementation of zeroth-order (ZO) training on the MNIST dataset.

Sparse training: Prune at initialization (GraSP https://arxiv.org/abs/2002.07376)
Tensor-compressed models: TT-format and TTM-format
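
To give a rough sense of what TTM compression buys, the sketch below counts the entries of a dense weight matrix against those of TTM cores with the standard shape (r[k-1], m[k], n[k], r[k]). The factorizations of 784 and 512 and the ranks are illustrative assumptions, not values taken from this repo's configs, and `ttm_num_params` is a hypothetical helper.

```python
def product(values):
    """Multiply all entries of an iterable (avoids math.prod, which needs Python 3.8+)."""
    result = 1
    for v in values:
        result *= v
    return result


def ttm_num_params(in_factors, out_factors, ranks):
    """Count entries of TTM cores shaped (ranks[k], in_factors[k], out_factors[k], ranks[k+1])."""
    assert len(in_factors) == len(out_factors) == len(ranks) - 1
    assert ranks[0] == ranks[-1] == 1  # boundary ranks are 1 by convention
    return sum(
        ranks[k] * in_factors[k] * out_factors[k] * ranks[k + 1]
        for k in range(len(in_factors))
    )


# Illustrative (assumed) factorization of a 784 -> 512 layer.
in_factors = (4, 7, 4, 7)    # 4 * 7 * 4 * 7 = 784 input features
out_factors = (4, 4, 8, 4)   # 4 * 4 * 8 * 4 = 512 output features
ranks = (1, 8, 8, 8, 1)      # assumed TT-ranks

dense_params = product(in_factors) * product(out_factors)
ttm_params = ttm_num_params(in_factors, out_factors, ranks)
print("dense:", dense_params, "TTM:", ttm_params)
```

Fewer trainable parameters means fewer coordinates for a zeroth-order method to perturb.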

Requirements:

Python >= 3.6
PyTorch >= 1.8.0
TensorFlow >= 2.5.0
pyutils >= 0.0.1. See pyutils for installation.
- Note: comment out line 32 of ./setup.py when installing pyutils, since tensorflow-gpu is currently not supported.
NVIDIA GPUs and CUDA >= 10.2
Others are listed in requirements.txt

Usage:

MNIST

For GraSP-pruned FC layers:

# FO-benchmark
python -u main_prune_MNIST.py -config configs/MNIST/FC/FO.yml

# ZO-gradient estimator
python main_prune_MNIST.py -config configs/MNIST/TTM/SGD.yml

# ZO-finite difference
python -u main_prune_MNIST.py -config configs/MNIST/FC/SCD_esti.yml
# ZO-coordinate descent
python -u main_prune_MNIST.py -config configs/MNIST/FC/SCD_batch.yml

For TTM layers:

# FO-benchmark
python -u main_prune_MNIST.py -config configs/MNIST/TTM/FO.yml

# ZO-gradient estimator
python main_prune_MNIST.py -config configs/MNIST/TTM/SGD.yml

# ZO-finite difference
python main_prune_MNIST.py -config configs/MNIST/TTM/SCD_esti.yml
# ZO-coordinate descent
python main_prune_MNIST.py -config configs/MNIST/TTM/SCD_batch.yml

2-layer Encoder:

Select one of the provided experiments in ./run_tensors.sh, then run:

./run_tensors.sh

Zeroth-order Optimizer

ZO_SGD_mask:

./optimizer/ZO_SGD_mask.py

Based on stochastic gradient estimator

perturb all parameters with i.i.d. Gaussian perturbations
evaluate the change in the loss function, which approximates the directional derivative along the selected random direction
obtain a single-shot gradient estimate
In expectation, the gradient estimate is a biased estimate of the true gradient, with bounded bias

def __init__(
        self,
        model: nn.Module,
        criterion: Callable,
        masks,
        lr: float = 0.01,
        sigma: float = 0.1,
        n_sample: int = 20,
        signSGD: bool = False,
        layer_by_layer: bool = False,
        opt_layers_strs: list = []
    ):
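
The estimator described above can be sketched without any framework. This is a minimal illustration on a flat list of parameters, not the code in ./optimizer/ZO_SGD_mask.py: `zo_gradient`, `zo_sgd_step` and the quadratic loss are hypothetical, and only `lr`, `sigma`, `n_sample` and the idea of a mask are borrowed from the signature. The `signSGD` and `layer_by_layer` options are not modeled.

```python
import random


def zo_gradient(loss_fn, theta, mask, sigma=0.1, n_sample=20, rng=random):
    """Average n_sample one-sided random-direction gradient estimates."""
    base = loss_fn(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_sample):
        # Gaussian direction; masked (pruned) coordinates are never perturbed.
        u = [m * rng.gauss(0.0, 1.0) for m in mask]
        perturbed = [t + sigma * ui for t, ui in zip(theta, u)]
        # Finite-difference approximation of the directional derivative along u.
        slope = (loss_fn(perturbed) - base) / sigma
        for i, ui in enumerate(u):
            grad[i] += slope * ui / n_sample
    return grad


def zo_sgd_step(loss_fn, theta, mask, lr=0.01, sigma=0.1, n_sample=20, rng=random):
    """One SGD step that uses the zeroth-order estimate in place of the gradient."""
    grad = zo_gradient(loss_fn, theta, mask, sigma, n_sample, rng)
    return [t - lr * g for t, g in zip(theta, grad)]


def quadratic(theta):
    """Toy loss standing in for the training criterion."""
    return sum(t * t for t in theta)


rng = random.Random(0)
theta = [1.0, -2.0, 3.0, 0.5]
mask = [1.0, 1.0, 0.0, 1.0]  # third parameter is treated as pruned
initial_loss = quadratic(theta)
for _ in range(200):
    theta = zo_sgd_step(quadratic, theta, mask, lr=0.05, sigma=0.01, n_sample=20, rng=rng)
final_loss = quadratic(theta)
print(initial_loss, final_loss, theta)
```

Only forward evaluations of the loss are needed. The pruned coordinate keeps its initial value because its perturbation, and therefore its gradient estimate, is always zero.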

ZO_SCD_mask

./optimizer/ZO_SCD_mask.py

def __init__(
        self,
        model: nn.Module,
        criterion: Callable,
        masks,
        lr: float = 0.1,
        grad_sparsity: float = 0.1,
        h_smooth: float = 0.001,
        grad_estimator: str = 'sign',
        opt_layers_strs: list = [],
        STP: bool = True,
        momentum: float = 0,
        weight_decay: float = 0,
        dampening: float = 0,
        adam: bool = False,
        beta_1: float = 0.9,
        beta_2: float = 0.98,
        eps: float = 1e-06
    ):

grad_estimator: update rule

'sign': ZO-det coordinate descent; parameters are updated one by one
'batch': ZO-det coordinate descent; all parameters are updated at the end of the evaluation
'esti': ZO finite difference; all parameters are updated at the end of the evaluation
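
One plausible reading of the three update rules is sketched below on a flat parameter list. It is an assumption-laden illustration, not the repo's implementation: `zo_coordinate_step` and `sign` are hypothetical helpers, the one-sided probe with step `h_smooth` is assumed, and `grad_sparsity`, `STP`, momentum, weight decay and the Adam options are not modeled.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def zo_coordinate_step(loss_fn, theta, mask, lr=0.1, h_smooth=0.001,
                       grad_estimator="sign"):
    """One pass over the unmasked coordinates using forward loss evaluations only."""
    if grad_estimator not in ("sign", "batch", "esti"):
        raise ValueError("unknown grad_estimator: %r" % (grad_estimator,))
    theta = list(theta)
    pending = [0.0] * len(theta)  # updates deferred to the end of the pass
    base = loss_fn(theta)
    for i, m in enumerate(mask):
        if not m:
            continue  # pruned coordinate: never probed or updated
        probe = list(theta)
        probe[i] += h_smooth
        slope = (loss_fn(probe) - base) / h_smooth  # one-sided finite difference
        if grad_estimator == "sign":
            theta[i] -= lr * sign(slope)  # applied immediately, one by one
            base = loss_fn(theta)         # later probes see the updated value
        elif grad_estimator == "batch":
            pending[i] = lr * sign(slope)  # sign step, applied after the pass
        else:  # "esti"
            pending[i] = lr * slope        # finite-difference step, applied after the pass
    return [t - p for t, p in zip(theta, pending)]


def quadratic(theta):
    """Toy loss standing in for the training criterion."""
    return sum(t * t for t in theta)


start = [1.0, -2.0, 3.0]
mask = [1, 1, 0]  # third parameter is treated as pruned
for mode in ("sign", "batch", "esti"):
    updated = zo_coordinate_step(quadratic, start, mask, grad_estimator=mode)
    print(mode, updated, quadratic(updated))
```

In this reading, 'sign' and 'batch' both move each coordinate by a fixed step of size `lr` and differ only in when the update is applied, while 'esti' scales the step by the estimated partial derivative.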

opt_layers_strs: layers to be trained. Currently supported:

'nn.Linear': nn.Linear,
'nn.Conv2d': nn.Conv2d,
'TensorizedLinear': TensorizedLinear,
'TensorizedLinear_module': TensorizedLinear_module,
'TensorizedLinear_module_tonn': TensorizedLinear_module_tonn

olokevin / GraSP_ZO