alibaba / FederatedScope

An easy-to-use federated learning platform

Home Page: https://www.federatedscope.io

Question about LLaMA-based federated training

Polaris-JZ opened this issue · comments

Hi,

I used your llama.yaml config to run federated training, but the training log shows that the loss is not decreasing/converging and the test loss is very high. I was wondering whether there is a problem with my setup.

I only changed two things in llama.yaml (see the sketch after this list):

  • train.is_enable_half: True
  • model.type: 'baffo32/decapoda-research-llama-7B-hf@huggingface_llm'
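
Concretely, the changed entries in my llama.yaml look roughly like this (only the edited keys are shown; the indentation assumes the usual FederatedScope config layout):

    train:
      is_enable_half: True   # switch training to half precision
    model:
      type: 'baffo32/decapoda-research-llama-7B-hf@huggingface_llm'   # HF model id with the '@huggingface_llm' source tag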

[Screenshot: training log with non-decreasing loss, 2023-12-27]

The yaml you used is only a test case. Could you please try the configs in https://github.com/alibaba/FederatedScope/tree/llm/federatedscope/llm/baseline/exp_yaml/alpaca with tuned hyperparameters?

Thanks for your advice.

I have tried https://github.com/alibaba/FederatedScope/blob/llm/federatedscope/llm/baseline/exp_yaml/alpaca/alpaca_federate.yaml; the test loss becomes lower, but the training loss still fluctuates a lot.

Additionally, I tried to create another dataset following the Alpaca format, and the same thing happens: the loss is not decreasing/converging, and the test loss is very high (around 3000). Could you please give me some advice?

Assuming your dataset is good enough, you can try adjusting the following hyper-parameters (a config sketch follows the list):

  • Learning Rate: Fluctuations in training loss might be due to a learning rate that's too high. Try using a smaller learning rate or a learning rate scheduler.
  • Batch Size: SGD is used by default, and the mini-batch size can significantly affect training. Try using a larger batch size.
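
A minimal sketch of where these knobs live in a FederatedScope yaml config (the values below are illustrative starting points, not tuned recommendations, and exact key names may differ slightly across branches):

    train:
      optimizer:
        lr: 0.0003        # smaller learning rate to damp loss fluctuations
      scheduler:
        type: 'StepLR'    # optional scheduler (a torch.optim.lr_scheduler name)
        step_size: 10     # illustrative scheduler argument
    dataloader:
      batch_size: 4       # larger mini-batch size than the default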

Thanks for your advice.
I have another problem. When I try to use the DeepSpeed acceleration, the following error is raised: KeyError: 'Non-existent config key: llm.accelation'. My config is:
[Screenshot of the config, 2024-01-01]

Sorry for the outdated document. Please use the following configs to set up Deepspeed (for other usage, please refer to https://github.com/alibaba/FederatedScope/blob/llm/federatedscope/core/configs/cfg_llm.py):

    # ---------------------------------------------------------------------- #
    # Deepspeed related options
    # ---------------------------------------------------------------------- #
    cfg.llm.deepspeed = CN()
    cfg.llm.deepspeed.use = False
    cfg.llm.deepspeed.ds_config = ''

We'll fix it ASAP.
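
In a user yaml, enabling Deepspeed would then look roughly like this (the ds_config path is a placeholder for your own Deepspeed JSON config):

    llm:
      deepspeed:
        use: True
        ds_config: 'path/to/ds_config.json'   # placeholder path to your Deepspeed config file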

Thanks for your reply. I was wondering whether there is any way to enable multi-GPU training for a client, or under the centralized training setting.

You can set cfg.train.data_para_dids (a list of torch.nn.DataParallel device ids, empty by default) to enable DataParallel training, as sketched below.
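
For example, in yaml form (the device ids are placeholders for whichever GPUs you want DataParallel to use):

    train:
      data_para_dids: [0, 1]   # illustrative torch.nn.DataParallel device ids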