Error: list index out of range, <class 'IndexError'>, param_server.py, 535

Question

li1553770945 opened this issue a year ago

Hello, I am reproducing your project. I have followed the instructions and installed the required environment correctly using conda.

My configuration file is as follows.

ps_ip: 10.128.201.129

worker_ips: 
    - 10.128.201.129:[2,2] # worker_ip: [(# processes on gpu) for gpu in available_gpus]

exp_path: /home/jsac/CodeFolder/PyramidFL/training
python_path: /home/jsac/anaconda3/envs/oort/bin

auth:
    ssh_user: "jsac"
    ssh_private_key: ~/.ssh/id_rsa

setup_commands:
    - source $HOME/anaconda3/bin/activate oort    
    - export NCCL_SOCKET_IFNAME='eno1'         # Run "ifconfig" to ensure the right NIC for nccl if you have multiple NICs


job_conf: 
    - log_path: /home/jsac/CodeFolder/PyramidFL/training/evals # Path of log files
    - job_name: openimage                   # Generate logs under this folder: log_path/job_name/time_stamp
    - total_worker: 100                    # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: openImg                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: /home/jsac/CodeFolder/FedScaleOrigin/benchmark/dataset/data/openImg   # Path of the dataset
    - data_mapfile: /home/jsac/CodeFolder/FedScaleOrigin/benchmark/dataset/data/openImg/client_data_mapping/clientDataMap             # Allocation of data to each client, turn to iid setting if not provided
    - client_path: /home/jsac/CodeFolder/FedScaleOrigin/benchmark/dataset/data/device_info/client_device_capacity 
    - user_trace: /home/jsac/CodeFolder/FedScaleOrigin/benchmark/dataset/data/device_info/client_behave_trace
    - sample_mode: oort                                  # Client selection: random, oort
    - model: shufflenet_v2_x2_0                            # Models: shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - gradient_policy: yogi                 
    # - proxy_avg: True # Commenting out these two lines will turn to "FedAvg"
    - round_penalty: 2.0                    # Penalty factor in our paper (\alpha), \alpha -> 0 turns to (Oort w/o sys)
    - eval_interval: 10                     # How many rounds to run a testing on the testing set
    - epochs: 1000                           # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 30                       # Remove clients w/ fewer than this many samples
    - batch_size: 16
    - pacer_delta: 10
    - upload_epoch: 20
    - enable_adapt_local_epoch: True
    - enable_dropout: True
    - adaptive_epoch_beta: 0.5
    # - enforce_random: True
    # - enable_obs_client: True

The aggregator's log file reports the error [param_server.py:575] ====Error: list index out of range, <class 'IndexError'>, param_server.py, 535. Line 535 of param_server.py reads param.data += sumDeltaWeights[idx]. The sumDeltaWeights variable is populated at line 362 of param_server.py:

if received_updates == 0:
    sumDeltaWeights.append(model_weight * ratioSample)
else:
    sumDeltaWeights[idx] += model_weight * ratioSample

This code sits inside the loop for i, clientId in enumerate(clientIds):. I printed the clientIds variable and found that it is an empty list in the first round, so the loop body never runs and nothing is appended. sumDeltaWeights therefore stays empty, and indexing it later raises the IndexError.
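The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the project's actual code: aggregate_updates and apply_updates are hypothetical stand-ins for the logic around lines 362 and 535 of param_server.py, and plain floats stand in for tensors. It shows that an empty clientIds leaves the aggregate empty, and how an explicit length check could replace the bare IndexError with a more useful message.

```python
def aggregate_updates(client_ids, client_weights, ratio_sample):
    """Accumulate weighted per-parameter updates across clients (hypothetical helper)."""
    sum_delta_weights = []
    received_updates = 0
    for client_id in client_ids:
        for idx, model_weight in enumerate(client_weights[client_id]):
            if received_updates == 0:
                # First client: create one slot per parameter.
                sum_delta_weights.append(model_weight * ratio_sample)
            else:
                # Later clients: accumulate into the existing slots.
                sum_delta_weights[idx] += model_weight * ratio_sample
        received_updates += 1
    return sum_delta_weights


def apply_updates(params, sum_delta_weights):
    """Add aggregated deltas to params, failing loudly if the lengths do not match."""
    if len(sum_delta_weights) != len(params):
        raise RuntimeError(
            f"expected {len(params)} aggregated updates, got {len(sum_delta_weights)}; "
            "check the worker logs for failed clients"
        )
    return [p + d for p, d in zip(params, sum_delta_weights)]


params = [1.0, 2.0]

# Normal round: two clients report updates.
updates = aggregate_updates([3, 7], {3: [0.5, 0.5], 7: [1.5, 0.5]}, ratio_sample=0.5)
new_params = apply_updates(params, updates)

# Failed round: no client reported, so the aggregate stays empty.
empty = aggregate_updates([], {}, ratio_sample=0.5)
```

With the empty aggregate, indexing empty[0] raises the same IndexError as in the aggregator log, while apply_updates raises a RuntimeError that points at the worker logs instead.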

Also, I noticed that the worker's log reads:

2023-08-18:15:07:04,417 INFO     [divide_data.py:510] ========= End of Class/Worker =========

2023-08-18:15:07:04,439 INFO     [learner.py:445] ====Worker: Start running
2023-08-18:15:07:05,230 INFO     [learner.py:483] 
Namespace(adam_epsilon=1e-08, adaptive_epoch_beta=0.5, backend='nccl', batch_size=16, bidirectional=True, blacklist_max_len=0.3, blacklist_rounds=-1, block_size=64, cache_dir=None, capacity_bin=True, clf_block_size=100, client_path='/home/jsac/CodeFolder/FedScaleOrigin/benchmark/dataset/data/device_info/client_device_capacity', clip_bound=0.98, clock_factor=2.906137184115524, conf_path='~/dataset/', config_name=None, cut_off_util=0.7, data_dir='/home/jsac/CodeFolder/FedScaleOrigin/benchmark/dataset/data/openImg', data_mapfile='/home/jsac/CodeFolder/FedScaleOrigin/benchmark/dataset/data/openImg/client_data_mapping/clientDataMap', data_set='openImg', decay_epoch=5, decay_factor=0.95, display_step=20, do_eval=False, do_train=False, dropout_high=0.6, dropout_low=0.1, dump_epoch=1000, duplicate_data=1, enable_adapt_local_epoch=True, enable_dropout=True, enable_importance=False, enable_obs_client=False, enable_obs_importance=False, enable_obs_local_epoch=False, enforce_random=False, epochs=1000, eval_all_checkpoints=False, eval_data_file='', eval_interval=10, eval_interval_prior=9999999, evaluate_during_training=False, exploration_alpha=0.3, exploration_decay=0.95, exploration_factor=0.9, exploration_min=0.2, filter_class=0, filter_less=30, filter_more=100000.0, finetune=False, fixed_clients=False, force_read=False, forward_pass=False, fp16=False, fp16_opt_level='O1', full_gradient_interval=20, gpu_device=0, gradient_accumulation_steps=1, gradient_policy='yogi', hetero_allocation='1.0-1.0-1.0-1.0-1.0-1.0', heterogeneity=1.0, hidden_layers=7, hidden_size=256, home_path='', input_dim=0, is_even_avg=True, job_name='openimage', labels_path='labels.json', learners='1-2-3-4', learning_rate=0.04, line_by_line=False, load_epoch=1, load_model=False, load_time_stamp='0615_194942', local_rank=-1, log_path='/home/jsac/CodeFolder/PyramidFL/training/evals', logging_steps=500, loss_decay=0.2, malicious_clients=0, manager_port=48615, max_grad_norm=1.0, max_iter_store=100, max_steps=-1, 
min_learning_rate=0.0001, mlm=True, mlm_probability=0.1, model='shufflenet_v2_x2_0', model_avg=True, model_name_or_path=None, model_path=None, model_size=65536, model_type='', no_cuda=False, noise_dir=None, noise_factor=0, noise_max=0.5, noise_min=0.0, noise_prob=0.4, num_class=60, num_loaders=2, num_train_epochs=1.0, output_dim=0, output_dir=None, overcommit=1.3, overwrite_cache=False, overwrite_output_dir=False, pacer_delta=10.0, pacer_step=20, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=4, proxy_avg=False, proxy_mu=0.1, ps_ip='jh-gpu06', ps_port='42893', read_models_path=False, release_cache=False, resampling_interval=1, rnn_type='lstm', round_penalty=2.0, round_threshold=10, run_all=False, sample_mode='oort', sample_rate=16000, sample_seed=233, sample_window=5.0, sampler_path=None, save_path='./', save_steps=500, save_total_limit=None, score_mode='loss', seed=42, sequential='0', server_ip='', server_port='', should_continue=False, single_sim=0, skip_partition=False, sleep_up=0, spec_augment=False, speed_volume_perturb=False, stale_threshold=0, task='cv', test_bsz=128, test_interval=20, test_manifest='data/test_manifest.csv', test_only=False, test_ratio=1.0, test_train_data=False, this_rank=2, threads=4, time_stamp='0818_145153_56858', timeout=9999999, to_device='cuda', tokenizer_name=None, total_worker=100, train_data_file='', train_manifest='data/train_manifest.csv', upload_epoch=20, user_trace='/home/jsac/CodeFolder/FedScaleOrigin/benchmark/dataset/data/device_info/client_behave_trace', validate_interval=999999, vocab_tag_size=500, vocab_token_size=10000, warmup_steps=0, weight_decay=0.0, window='hamming', window_size=0.02, window_stride=0.01, yogi_beta=0.999, yogi_beta2=-1, yogi_eta=0.005, yogi_tau=0.001, zipf_alpha='5')

2023-08-18:15:07:05,451 INFO     [learner.py:526] ====Start train round 1
2023-08-18:15:07:05,542 INFO     [learner.py:164] Start to run client 3 on rank 2...
2023-08-18:15:07:06,81 INFO     [learner.py:403] ====Error: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED, <class 'RuntimeError'>, learner.py, 312
2023-08-18:15:07:06,124 INFO     [learner.py:437] ====Failed to run client 3
2023-08-18:15:07:06,124 INFO     [learner.py:439] Completed to run client 3
2023-08-18:15:07:06,125 INFO     [learner.py:606] ====Pushing takes 0.0006704330444335938 s

It's not clear to me whether Error: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED, <class 'RuntimeError'>, learner.py, 312 is what causes clientIds to be an empty list.
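To correlate worker failures with the aggregator error, it can help to pull every ====Error line out of the logs. The sketch below is an illustration only: find_errors and the regular expression are assumptions based on the log format shown in this issue, not part of the project.

```python
import re

# Matches lines shaped like the ones quoted in this issue:
# "... ====Error: <message>, <class 'RuntimeError'>, learner.py, 312"
ERROR_LINE = re.compile(
    r"====Error: (?P<message>.*), <class '(?P<exc>[\w.]+)'>, "
    r"(?P<file>[^,\s]+), (?P<line>\d+)\s*$"
)


def find_errors(log_lines):
    """Return (exception, message, file, line) tuples for each ====Error log line."""
    found = []
    for text in log_lines:
        match = ERROR_LINE.search(text)
        if match:
            found.append(
                (match["exc"], match["message"], match["file"], int(match["line"]))
            )
    return found


sample = [
    "2023-08-18:15:07:05,542 INFO     [learner.py:164] Start to run client 3 on rank 2...",
    "2023-08-18:15:07:06,81 INFO     [learner.py:403] ====Error: cuDNN error: "
    "CUDNN_STATUS_EXECUTION_FAILED, <class 'RuntimeError'>, learner.py, 312",
    "[param_server.py:575] ====Error: list index out of range, "
    "<class 'IndexError'>, param_server.py, 535",
]
errors = find_errors(sample)
```

In the sample, the worker's RuntimeError is extracted alongside the aggregator's IndexError, which makes the order of failures easier to see.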

Chenning Li · Answer 1 · Sat Aug 19 2023 04:52:56 GMT+0800 (China Standard Time)

I believe you are right. My guess is that the cuDNN error makes the client process fail, so it cannot return its model updates to the parameter server. As a result, the parameter server has an empty list when it aggregates the model updates.

Please fix the cuDNN error first.
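One way to confirm that cuDNN works independently of the training code is to run a single convolution on the GPU. This is a rough sketch under assumptions: check_cudnn is a hypothetical helper, the layer sizes are arbitrary, and passing this check does not guarantee that the full model will run.

```python
def check_cudnn():
    """Return a short status string describing whether a GPU convolution runs."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    if not torch.cuda.is_available():
        return "cuda-unavailable"
    if not torch.backends.cudnn.is_available():
        return "cudnn-unavailable"
    try:
        # A small convolution on the GPU normally goes through cuDNN.
        conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
        conv(torch.randn(2, 3, 32, 32, device="cuda"))
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
    except RuntimeError as err:
        return f"failed: {err}"
    return "ok"


status = check_cudnn()
print(status)
```

Running this inside the same conda environment the workers use keeps the check comparable to the failing run.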

Li Yaning · Answer 2 · Sat Aug 19 2023 06:11:54 GMT+0800 (China Standard Time)

Thanks, I've fixed the cuDNN error and it works fine now.

Interestingly, I had previously replicated your project on another server where the same cuDNN error was reported, yet training still ran successfully, so I initially assumed the cuDNN error could not be the cause of this failure.