alibaba / FederatedScope

An easy-to-use federated learning platform

Home Page: https://www.federatedscope.io

Question about LLaMA-based federated training

Polaris-JZ opened this issue · comments

Hi,

I used your llama.yaml config to run federated training, but the training log shows that the loss is not decreasing/converging and the test loss is very high. I was wondering whether there is a problem with my setup.

I only changed two things in llama.yaml (see the sketch after this list):

  • train.is_enable_half: True
  • model.type: 'baffo32/decapoda-research-llama-7B-hf@huggingface_llm'
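
Concretely, the changed entries in my llama.yaml look roughly like this (only the edited keys are shown; the indentation assumes the usual FederatedScope config layout):

    train:
      is_enable_half: True   # switch training to half precision
    model:
      type: 'baffo32/decapoda-research-llama-7B-hf@huggingface_llm'   # HF model id with the '@huggingface_llm' source tag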

[Screenshot: training log with non-decreasing loss, 2023-12-27]

The yaml you used is only a test case. Could you please try the configs in https://github.com/alibaba/FederatedScope/tree/llm/federatedscope/llm/baseline/exp_yaml/alpaca with tuned hyperparameters?

Thanks for your advice.

I have tried https://github.com/alibaba/FederatedScope/blob/llm/federatedscope/llm/baseline/exp_yaml/alpaca/alpaca_federate.yaml; the test loss becomes lower, but the training loss still fluctuates a lot.

Additionally, I tried to create another dataset following the Alpaca format, and the same thing happens: the loss is not decreasing/converging, and the test loss is very high (around 3000). Could you please give me some advice?

Assuming your dataset is good enough, you can try adjusting the following hyper-parameters (a config sketch follows the list):

  • Learning Rate: Fluctuations in training loss might be due to a learning rate that's too high. Try using a smaller learning rate or a learning rate scheduler.
  • Batch Size: SGD is used by default, and the mini-batch size can significantly affect training. Try using a larger batch size.
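
A minimal sketch of where these knobs live in a FederatedScope yaml config (the values below are illustrative starting points, not tuned recommendations, and exact key names may differ slightly across branches):

    train:
      optimizer:
        lr: 0.0003        # smaller learning rate to damp loss fluctuations
      scheduler:
        type: 'StepLR'    # optional scheduler (a torch.optim.lr_scheduler name)
        step_size: 10     # illustrative scheduler argument
    dataloader:
      batch_size: 4       # larger mini-batch size than the default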

Thanks for your advice.
I have another problem. When I try to use the DeepSpeed acceleration, the following error is raised: KeyError: 'Non-existent config key: llm.accelation'. My config is:
[Screenshot of the config, 2024-01-01]

Sorry for the outdated document. Please use the following configs to set up Deepspeed (for other usage, please refer to https://github.com/alibaba/FederatedScope/blob/llm/federatedscope/core/configs/cfg_llm.py):

    # ---------------------------------------------------------------------- #
    # Deepspeed related options
    # ---------------------------------------------------------------------- #
    cfg.llm.deepspeed = CN()
    cfg.llm.deepspeed.use = False
    cfg.llm.deepspeed.ds_config = ''

We'll fix it ASAP.
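
In a user yaml, enabling Deepspeed would then look roughly like this (the ds_config path is a placeholder for your own Deepspeed JSON config):

    llm:
      deepspeed:
        use: True
        ds_config: 'path/to/ds_config.json'   # placeholder path to your Deepspeed config file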

Thanks for your reply. I was wondering whether there is any way to enable multi-GPU training for a client, or under the centralized training setting.

You can set cfg.train.data_para_dids (a list of torch.nn.DataParallel device ids, empty by default) to enable DataParallel training, as sketched below.
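
For example, in yaml form (the device ids are placeholders for whichever GPUs you want DataParallel to use):

    train:
      data_para_dids: [0, 1]   # illustrative torch.nn.DataParallel device ids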