Implementation of the robust detection/aggregation ML framework of Aspis and Aspis+.
This project builds on ByzShield and implements the detection and subsequent aggregation of our proposed Aspis and Aspis+ algorithms for robust distributed machine learning. The gradient tasks are assigned to the workers using a subset assignment in the case of Aspis, while Aspis+ uses balanced incomplete block designs (BIBDs) for the task assignment. We test against the most sophisticated attacks in an omniscient-adversary scenario and compare with competing methods; weaker attacks have also been considered.
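To make the Aspis subset assignment concrete: every one of the (K choose r) subsets of r workers is assigned its own gradient file, so any two workers share several files whose gradients can later be cross-checked. A minimal sketch of the idea (a toy illustration in our own notation, not the project's actual code):

```python
from itertools import combinations

# Toy values: K = 5 workers, replication r = 3.
K, r = 5, 3

# Aspis subset assignment: one file (gradient task) per r-subset of
# workers, so each file is computed by exactly r workers and any two
# workers share common files that can be cross-checked for agreement.
files = list(combinations(range(K), r))

for file_id, workers in enumerate(files):
    print("file %d -> workers %s" % (file_id, workers))

print("%d files in total, i.e., (K choose r)" % len(files))
```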
We will be working with Python 2 for the local machine (to execute the bash scripts which configure the remote cluster and initiate training/testing) and with Python 3 for the remote cluster of PS/worker nodes (to execute the actual training/testing). We recommend using an Anaconda (tested with 2020.02) environment in both cases. The local machine would typically be a Linux system (tested with Ubuntu). Below, we have reported the exact version of each module that worked for us; however, your mileage may vary.
This project is intended to be launched on AWS EC2. It also supports local execution (for MNIST), which we won't discuss here; the procedure is very similar (email me if you need instructions).
Before installing the required packages, we need to:
- Install and configure AWS CLI on the local machine (tested with version 2.0.16).
- Launch an AWS EC2 instance of the AMI "Ubuntu Server 16.04 LTS (HVM), SSD Volume Type (64-bit (x86))". We will install the packages on this instance and use it as a basis to create the PS/worker instances (see the AMI step below). Most of the instance specs can be left at their default values, but we will need a minimum of 20 GiB of storage and a security group with the following settings:
Type | Protocol | Port Range | Source |
---|---|---|---|
All traffic | All | 0-65535 | Anywhere |
Create a security group (see the AWS instructions) with the above settings and give it a name, e.g., `aspis_security_group`. We will also use it later.
Note: The above port range guarantees that MPI can run; otherwise, `mpirun` may hang indefinitely.
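If you prefer to script the security group creation, a sketch along the following lines should work with boto3 (this is our own snippet, not part of the repository; the region is a placeholder):

```python
import boto3

# Region is a placeholder; use the one you will launch the cluster in.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the group and open all protocols/ports from any source,
# mirroring the "All traffic / Anywhere" rule in the table above.
sg = ec2.create_security_group(
    GroupName="aspis_security_group",
    Description="Allow all traffic so that MPI does not hang",
)
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "-1",  # -1 means all protocols and all ports
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)
print("Created security group", sg["GroupId"])
```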
sudo apt-get update && sudo apt-get upgrade
# Find the latest Anaconda version from https://www.anaconda.com/products/individual (tested with 2020.02) and download
cd ~ && sudo apt-get install curl && curl -O https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
# Install Anaconda (press Enter multiple times until the license agreement asks you to type 'yes' and press Enter)
bash Anaconda3-2020.02-Linux-x86_64.sh
# Press Enter to install in default location...
# Type 'yes' and press Enter to the prompt of the installer
# "Do you wish the installer to initialize Anaconda3 by running conda init?"...
# Apply the changes immediately so that you don't have to reboot/relogin
. ~/.bashrc
# To disable each shell session having the base environment auto-activated (optional)
conda config --set auto_activate_base False
The tested dependency versions for the local/remote machines are given next:
Module | Local | Remote |
---|---|---|
Python | 2.7.18 | 3.7.7 |
pip | 20.1.1 | 20.1.1 |
setuptools | 44.1.0 | 47.1.1 |
python-blosc | 1.7.0 | 1.7.0 |
joblib | 0.13.2 | 0.15.1 |
paramiko | 1.18.4 | 2.7.1 |
boto3 | 1.12.39 | 1.9.66 |
pytorch | N/A | 1.0.1 |
torchvision | N/A | 0.2.2 |
libgcc | N/A | 7.2.0 |
pandas | N/A | 1.0.3 |
scipy | N/A | 1.4.1 |
mpi4py | N/A | 3.0.3 |
hdmedians | N/A | 0.13 |
networkx | N/A | 2.5 |
The exact series of commands for the local machine is
conda create -n aspis_local_python2 python=2.7
conda activate aspis_local_python2
conda install pip
python -m pip install --upgrade pip
pip install --upgrade setuptools
conda install -y -c conda-forge python-blosc
conda install -y -c anaconda joblib
conda install -y -c anaconda paramiko
conda install -y -c anaconda boto3
The exact series of commands for the remote machine is
conda create -n aspis python=3.7
conda activate aspis
conda install pip
python -m pip install --upgrade pip
pip install --upgrade setuptools
conda install -y pytorch==1.0.1 torchvision cpuonly -c pytorch
conda install -y -c anaconda python-blosc
conda install -y -c anaconda joblib
conda install -y -c anaconda paramiko
conda install -y -c anaconda boto3
conda install -y -c anaconda libgcc
conda install -y -c anaconda pandas
conda install -y -c anaconda scipy
conda install -y -c anaconda mpi4py
conda install -y -c anaconda networkx
# Install hdmedians
sudo apt-get install gcc && sudo apt-get install git
git clone https://github.com/daleroberts/hdmedians.git
cd hdmedians
python setup.py install
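After the installations, a quick sanity check like the following (our own snippet, not part of the repository) prints the installed versions so you can compare them against the table above; on the local machine, the remote-only modules are expected to be missing:

```python
import importlib

# python-blosc is imported as 'blosc'.
for name in ["blosc", "joblib", "paramiko", "boto3", "torch",
             "torchvision", "pandas", "scipy", "mpi4py",
             "hdmedians", "networkx"]:
    try:
        mod = importlib.import_module(name)
        print(name, getattr(mod, "__version__", "installed"))
    except ImportError:
        print(name, "NOT INSTALLED")
```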
We will now discuss how one can launch a cluster and train/test a model. In the sequel, we will use the notation `{x}` to denote a piece of a script that should be substituted with the value `x`. Some notation used in the paper that we will refer to is:

- `K`: number of workers.
- `q`: number of Byzantine workers.
- `r`: replication factor.
- `b`: batch size.
Now that we have installed all needed dependencies on the remote EC2 instance, we need to make an AMI image of it so that we can quickly launch PS/worker instances out of it (see the AWS documentation for instructions). Make note of the created AMI ID, like `ami-xxxxxxxxxxxxxxxx`.
We will use Amazon Elastic File System (EFS) to share a folder with the trained model among the machines. Follow the AWS instructions to create an EFS. We will probably need a security group with the settings discussed above for the instance (you can reuse the same one). Make note of the IP address of the EFS, `xxx.xxx.xxx.xxx`.
The script `pytorch_ec2.py` will launch the instances automatically. Before running it, you need to copy your AWS private key file `xxxxxx.pem` to the folder `./tools` and then edit the following part of the configuration (the rest of the parameters can be left unchanged and have been omitted here):
cfg = Cfg({
    "key_name": "{Your AWS private key file without the .pem extension}", # string
    "n_workers" : {Number of workers (K)}, # integer
    "region" : "{Your AWS region, e.g., us-east-1}", # string
    "availability_zone" : "{Your AWS subnet, e.g., us-east-1c}", # string
    "master_type" : "{Instance type of PS, e.g., r3.xlarge}", # string
    "worker_type" : "{Instance type of workers, e.g., r3.xlarge}", # string
    "image_id": "{AMI ID created before, like ami-xxxxxxxxxxxxxxxx}", # string
    "spot_price" : "{Dollar amount per machine per hour, at least max(price of PS, price of worker), e.g., 3.5}", # float
    "path_to_keyfile" : "{Your AWS private key file with the .pem extension, like xxxxxx.pem}", # string
    "nfs_ip_address" : "{IP address of the EFS xxx.xxx.xxx.xxx}", # string
    "nfs_mount_point" : "{Path to the EFS folder named 'shared', set to /home/ubuntu/shared}", # string
    "security_group": ["{Name of the AWS security group, mentioned before as aspis_security_group}"], # list of one string
})
Note: The data sets will be downloaded by the PS to the EFS folder and then fetched from there by all machines. For this to work, the EFS folder needs to be named `shared` and located in the home directory `~`. Since `~` is the same as `/home/ubuntu` on AWS Ubuntu instances, you need to set the EFS folder above to `~/shared` or `/home/ubuntu/shared`.
Note: In the above configuration, make sure that the chosen instance types `master_type` and `worker_type` are available in the selected AWS region (`region`) and availability zone (`availability_zone`).
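For concreteness, a hypothetically filled-in configuration (all values below are made-up examples; keep the AMI ID and EFS IP placeholders from your own setup) could look like:

```python
cfg = Cfg({
    "key_name": "my_aws_key",                  # key file is my_aws_key.pem
    "n_workers": 15,                           # K = 15 workers
    "region": "us-east-1",
    "availability_zone": "us-east-1c",
    "master_type": "r3.xlarge",
    "worker_type": "r3.xlarge",
    "image_id": "ami-xxxxxxxxxxxxxxxx",        # the AMI created before
    "spot_price": "3.5",
    "path_to_keyfile": "my_aws_key.pem",
    "nfs_ip_address": "xxx.xxx.xxx.xxx",       # the EFS IP noted before
    "nfs_mount_point": "/home/ubuntu/shared",  # must be named 'shared'
    "security_group": ["aspis_security_group"],
})
```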
Next, use the chmod command to make sure your private key file isn't publicly viewable:
chmod 400 {xxxxxx}.pem
Now, from the local machine and the Python 2 environment `aspis_local_python2`, launch replicas of the AMI (these will be the PS and worker instances) by running the following:
conda activate aspis_local_python2
python ./tools/pytorch_ec2.py launch
Wait for the command to finish so that all instances are ready. Then, run the following command to generate a file `hosts_address` with the private IPs of all instances. The first line of the file is the IP of the PS (this will also be printed on the terminal):
python ./tools/pytorch_ec2.py get_hosts
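As a hypothetical illustration (the format below is our reading of the description above; your private IPs will differ), the `hosts_address` of a 4-node cluster could look like this, with the PS on the first line:

```
172.31.10.1
172.31.10.2
172.31.10.3
172.31.10.4
```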
Use the private IP of the PS fetched above to do some SSH configuration and copy the project files to the PS:
bash ./tools/local_script.sh {private IP of the PS}
SSH into the PS; the rest of the commands will be executed there:
ssh -i {Your AWS private key file with the .pem extension} ubuntu@{private IP of the PS}
On the PS, download, split and normalize the MNIST/Cifar10/SVHN/Cifar100 data sets:
conda activate aspis && bash ./src/data_prepare.sh
Note: This requires `sudo` permissions to save the data sets to the EFS folder. Since `sudo` uses a different path than your typical environment, you need to specify that you want the Anaconda Python 3 environment we created before rather than the system `python`. To do that, make sure that `data_prepare.sh` points to that environment:
sudo {Path to Anaconda "python" file, e.g., /home/ubuntu/anaconda3/envs/aspis/bin/python} ./datasets/data_prepare.py
On the PS, run `remote_script.sh` to configure SSH and copy the project files and data sets to the workers:
bash ./tools/remote_script.sh
The training algorithm is run by the PS instance executing `run_pytorch.sh`. The basic arguments of this script, along with all possible values, are given below. This is not an exhaustive list; the remaining arguments can be left at their default values in `run_pytorch.sh`.
Argument | Values/description |
---|---|
`n` | Total number of nodes (PS and workers), equal to K+1 in the paper. |
`hostfile` | Path to the MPI hostfile that contains the private IPs of all nodes of the cluster. If run on AWS, this file will be `hosts_address`, discussed above. If run locally, this file can be a plain text file with content `localhost:{n+1}`. |
`lr` | Initial learning rate. |
`momentum` | Value of momentum. |
`network` | Deep neural net: `LeNet`, `ResNet18`, `ResNet34`, `ResNet50`, `DenseNet`, `VGG11` or `VGG13`. |
`dataset` | Data set: `MNIST`, `Cifar10`, `SVHN` or `Cifar100`. |
`batch-size` | Batch size: equal to b in Aspis, equal to br/K in DETOX and equal to b/K in vanilla batch-SGD (baseline). |
`mode` | Robust aggregation method: `coord-median`, `bulyan`, `multi-krum`, `sign-sgd` or `geometric_median` (the latter only supported in the baseline). |
`adversarial-detection` | Method used to detect and potentially exclude adversaries: `clique` (look for a unique maximum clique of size K-q) or `degree` (look for workers with degree less than K-q-1); see the sketch after this table. The surviving gradients must be aggregated with a `mode=coord-median` aggregator. If set to anything else, no detection will take place. |
`approach` | Distributed learning scheme: `baseline` (vanilla), `mols` (ByzShield MOLS), `rama_one` (ByzShield Ramanujan Case 1), `rama_two` (ByzShield Ramanujan Case 2), `draco_lite` (DETOX), `draco_lite_attack` (our attack on DETOX), `maj_vote`, `subset` (proposed Aspis assignment), `cyclic_c3les` (cyclic code in the C3LES paper (Figure 3)) or `hard_coded` (hard-coded assignment defined in `util.py`). |
`eval-freq` | Frequency (in iterations) at which the trained model is backed up (for evaluation). |
`err-mode` | Byzantine attack to simulate: `rev_grad` (reversed gradient), `constant` (constant gradient) or `foe` ("Fall of Empires"); refer to `src/model_ops/util.py` for details. |
`err-choice` | How adversaries choose which files to distort: `fixed_disagreement` (all q adversaries disagree with a fixed set of honest workers), `all` (all adversaries distort all files) or `only_majorities` (adversaries distort only files with an adversarial majority). `only_majorities` currently works only with ALIE and "Fall of Empires"; `fixed_disagreement` and `all` work with the other attacks as well. |
`epochs` | Number of epochs to train. |
`max-steps` | Total number of iterations (across all epochs). |
`worker-fail` | Number of Byzantine workers, equal to q in the paper. |
`group-size` | Replication factor, equal to r in the paper. |
`lis-simulation` | Attack "A Little Is Enough" (ALIE): `simulate` (enabled) or `no` (disabled); `err-mode` will be ignored if the ALIE attack is enabled. |
`train-dir` | Directory to save model backups for evaluation (for AWS, this should be the EFS folder). |
`local-remote` | `local` (for local training) or `remote` (for training on AWS). |
`rama-m` | Value of m (in the paper), only needed for Ramanujan Case 2. |
`detox-attack` | Our attack on DETOX (see `approach`): `worst` (optimally attacks the majority within groups), `benign` or `whole_group`. If `approach=draco_lite_attack`, this method overrides `byzantine-gen`. |
`byzantine-gen` | Type of Byzantine set generation: `random` (random for each iteration), `hard_coded` (fixed for all iterations and set in `util.py`) or `random_window` (random adversarial set in each contiguous window of `adv-win-length` iterations). These are supported by all `approach`es. |
`gamma` | Learning rate decay (linear). |
`lr-step` | Frequency of learning rate decay (measured in number of iterations). Applies only if `lr-warmup=no` (see below). |
`max-grad-l2norm` | `0` (disabled) or `> 0` (enabled). If `> 0`, it is the maximum L2 norm to which the PS clips the final gradient for the global model update at the end of each iteration. |
`cyclic-ell` | If `approach=cyclic_c3les`, the computation load per worker for the cyclic code in the C3LES paper (Figure 3). |
`pair-groups` | Number of joint files of each pair of workers. Required only if (`adversarial-detection=clique` and `approach!=subset`) or `adversarial-detection=degree`. It is mostly useful for `approach=hard_coded`. |
`adv-win-length` | How often to pick a new set of adversaries (in number of iterations); only needed if `byzantine-gen=random_window`. |
`det-win-length` | How often to reset the agreement counters for detection at the PS (in number of iterations); only needed if `adversarial-detection=degree`. |
`permute-files` | `yes` or `no`. If `yes`, the file assignment is randomly permuted after each iteration (the adversarial indices won't change). Works only if `approach` is one of: `mols`, `rama_one`, `rama_two`, `subset`, `cyclic_c3les` or `hard_coded`. |
`tolerance` | Tolerance for the gradient equality checks done with `np.allclose()`, e.g., for majority voting or for determining the edges of the detection graph. A value equal to 10 seems to work well. Used if `approach` is one of: `draco_lite`, `draco_lite_attack`, `mols`, `rama_one`, `rama_two`, `subset`, `cyclic_c3les` or `hard_coded`. |
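To give a feel for what `adversarial-detection=clique` does, here is a simplified sketch (our own illustration, not the repository's implementation): build a graph with one node per worker, connect two workers if their gradients agree on all of their common files, and trust the workers only if a unique maximum clique of the expected size exists.

```python
import networkx as nx

def detect_honest_workers(agree, K, q):
    """Toy clique-based detection. `agree[i][j]` is True if workers i and
    j returned equal gradients (up to the tolerance) on every common file."""
    G = nx.Graph()
    G.add_nodes_from(range(K))
    for i in range(K):
        for j in range(i + 1, K):
            if agree[i][j]:
                G.add_edge(i, j)

    # Enumerate maximal cliques, keep only those of maximum size.
    cliques = list(nx.find_cliques(G))
    max_size = max(len(c) for c in cliques)
    maximum = [c for c in cliques if len(c) == max_size]

    # Unique maximum clique of size at least K-q: treat it as honest.
    if max_size >= K - q and len(maximum) == 1:
        return sorted(maximum[0])
    return None  # ambiguous; fall back to robust aggregation alone

# Example with K=5, q=1 where worker 4 disagrees with everyone.
K, q = 5, 1
agree = [[i != 4 and j != 4 for j in range(K)] for i in range(K)]
print(detect_honest_workers(agree, K, q))  # -> [0, 1, 2, 3]
```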
This functionality has been added to incorporate ideas from the paper "Stochastic Training is Not Necessary for Generalization", Geiping et al., 2021. The following arguments work only if `approach` is one of: `mols`, `rama_one`, `rama_two`, `subset`, `cyclic_c3les` or `hard_coded`, i.e., when the utilized PS is `byzshield_master.py`; do not use them with other `approach`es. The method supports learning rate warmup followed by cosine annealing (only the warmup, or both of them, are supposed to be activated; do not enable only the annealing).
Argument | Values/description |
---|---|
`lr-warmup` | `yes` or `no`. If `yes`, it warms up the learning rate in small increments, starting from `(1/maxlr-steps)*maxlr` and going up to `maxlr` at the step indexed by `maxlr-steps`. |
`maxlr` | If `lr-warmup=yes`, this is the maximum learning rate to warm up to. Also, if `lr-annealing=yes`, this is the maximum value of the cosine wave for the learning rate. |
`maxlr-steps` | See `lr-warmup`. |
`lr-annealing` | `yes` or `no`. If `yes`, it applies cosine annealing to the learning rate between the values `lr-annealing-minlr` and `maxlr`. The period of the cosine wave is `lr-annealing-cycle-steps`. Cosine annealing starts after the learning rate warmup, so do not enable it if `lr-warmup=no`. |
`lr-annealing-minlr` | See `lr-annealing`. |
`lr-annealing-cycle-steps` | See `lr-annealing`. |
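For intuition, one plausible reading of this warmup-plus-annealing schedule is sketched below (our own illustration of the formulas implied by the table; the repository's exact implementation may differ in details):

```python
import math

def learning_rate(t, maxlr, maxlr_steps, annealing=True,
                  minlr=0.001, cycle_steps=1000):
    """Linear warmup for the first `maxlr_steps` iterations, then
    (optionally) cosine annealing between `minlr` and `maxlr` with
    period `cycle_steps`."""
    if t <= maxlr_steps:
        # Warmup: (1/maxlr_steps)*maxlr, (2/maxlr_steps)*maxlr, ..., maxlr.
        return (t / float(maxlr_steps)) * maxlr
    if not annealing:
        return maxlr
    phase = 2 * math.pi * (t - maxlr_steps) / cycle_steps
    return minlr + 0.5 * (maxlr - minlr) * (1 + math.cos(phase))

# Example: warm up to maxlr=0.1 over 100 steps, then a 1000-step cosine cycle.
for t in [1, 50, 100, 350, 600, 1100]:
    print(t, round(learning_rate(t, maxlr=0.1, maxlr_steps=100), 4))
```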
To initiate training, from the PS run:
bash ./src/run_pytorch.sh 1
By convention, worker 1 will fetch the model from the shared EFS folder and evaluate it. To achieve this, run the following from the PS (for the definitions of `q`, `lr` and `gamma` see above; they should match the values used for training):
bash ./src/evaluate_pytorch.sh 1 {q} {lr} {gamma}
The basic arguments of this script along with all possible values are below.
Argument | Values/description |
---|---|
`eval-batch-size` | Number of samples on which the model is iteratively evaluated. |
`eval-freq` | Set to the same value as the `eval-freq` used for training in `run_pytorch.sh` (any multiple of it will also work). |
`network` | Deep neural net: `LeNet`, `ResNet18`, `ResNet34`, `ResNet50`, `DenseNet`, `VGG11` or `VGG13`. |
`dataset` | Data set: `MNIST`, `Cifar10`, `SVHN` or `Cifar100`. |
`model-dir` | Set to the same value as the `train-dir` used for training in `run_pytorch.sh`. |
If you use this code, please cite our paper, available on arXiv. The BibTeX is:
@inproceedings{konstantinidis_ramamoorthy_aspis_aspis_plus,
  title  = {Efficient Detection and Filtering Systems for Distributed Training},
  author = {Konstantinos Konstantinidis and Aditya Ramamoorthy},
  year   = {2022},
  month  = {August},
  url    = {https://arxiv.org/abs/2208.08085}
}