How do I save checkpoints?

gctian opened this issue a year ago · comments

gctian commented a year ago

🚀 The feature

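I would like to save a checkpoint at regular intervals during training. Right now it seems that once the epochs parameter is set, I have to wait until training finishes to get a single model, and I cannot tell which step that model came from.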

yuxin.wang · Answer 1 · Fri Aug 04 2023 11:02:12 GMT+0800 (China Standard Time)

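Sorry, this is not mentioned in the documentation. The FineTuner.run function has a parameter that controls this behavior, save_on_epoch_end. Setting it to True saves a checkpoint at the end of each epoch. I still have not paid off the documentation debt... I have been too busy lately.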

finetuner.run(save_on_epoch_end=True)
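
To illustrate what saving at each epoch boundary means, here is a minimal stdlib-only sketch. It is not uniem's implementation: `train_one_epoch`, `run`, the `state.json` file, and the `checkpoint_<epoch>` directory layout are hypothetical names chosen for illustration, and the "training" just advances counters.

```python
import json
import tempfile
from pathlib import Path


def train_one_epoch(state: dict) -> dict:
    """Stand-in for a real training epoch: it only advances the counters."""
    return {"epoch": state["epoch"] + 1, "step": state["step"] + 100}


def run(output_dir: Path, epochs: int, save_on_epoch_end: bool = False) -> dict:
    """Toy training loop that optionally writes one checkpoint per epoch."""
    state = {"epoch": 0, "step": 0}
    for epoch in range(epochs):
        state = train_one_epoch(state)
        if save_on_epoch_end:
            # Hypothetical layout: one numbered directory per finished epoch.
            ckpt_dir = output_dir / "checkpoints" / f"checkpoint_{epoch}"
            ckpt_dir.mkdir(parents=True)
            # Storing the counters records which step each checkpoint came from.
            (ckpt_dir / "state.json").write_text(json.dumps(state))
    return state


with tempfile.TemporaryDirectory() as tmp:
    final_state = run(Path(tmp), epochs=3, save_on_epoch_end=True)
    saved = sorted(p.name for p in (Path(tmp) / "checkpoints").iterdir())
    print(saved)
```

Because each directory carries the epoch index and the stored counters, it is possible to tell afterwards which point in training a given checkpoint corresponds to.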

Chen · Answer 2 · Mon Aug 07 2023 11:12:40 GMT+0800 (China Standard Time)

But with the current multi-GPU code, saving a checkpoint raises an error. Each GPU saves independently, and then a "file already exists" error is thrown.

yuxin.wang · Answer 3 · Mon Aug 07 2023 13:44:43 GMT+0800 (China Standard Time)

Which multi-GPU training strategy are you using? FSDP, DeepSpeed, or something else? Could you share your exact configuration?

Chen · Answer 4 · Mon Aug 07 2023 14:44:53 GMT+0800 (China Standard Time)

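I simply followed your code and used accelerate for multi-GPU training. However, from what I can see, the current framework looks more like DP than DDP, and when a checkpoint is saved, each GPU tries to save its own copy of the model, which causes the duplicate-file error. My current plan is to rewrite parts of the framework. Do you have any other suggestions?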

yuxin.wang · Answer 5 · Mon Aug 07 2023 15:34:46 GMT+0800 (China Standard Time)

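accelerate supports DP, DDP, FSDP, DeepSpeed, and so on, and you can choose among them. Because each distributed method saves checkpoints differently, some setups may hit this bug. I mainly want to reproduce it.

The code I provided is actually quite simple, so rewriting parts of the code or the framework is entirely feasible.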

grok · Answer 6 · Tue Aug 08 2023 10:42:21 GMT+0800 (China Standard Time)

FSDP raises an error. Could you please provide a solution?

Chen · Answer 7 · Tue Aug 08 2023 13:07:27 GMT+0800 (China Standard Time)

When saving a checkpoint on a single machine with multiple GPUs, the following error occurs:

ValueError: Checkpoint directory experiments/test/checkpoints/checkpoint_0 (0)
already exists. Please manually override self.save_iteration with what
iteration to start with.

After some debugging, I found that the error is raised when execution reaches this code in trainer.py:

if self.save_on_epoch_end:
    self.accelerator.save_state()

Is there a solution? Here is my accelerate config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
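
The error message is consistent with more than one process reaching the same "directory must not already exist" check. Below is a minimal simulation of that pattern and of one common way to avoid it, namely letting only the main process check for and create the directory. This is a stdlib-only sketch, not the actual code of accelerate or uniem: `save_state_naive`, `save_state_guarded`, `simulate`, and the `rank` argument are hypothetical, and the ranks are simulated sequentially rather than as real GPU processes.

```python
import tempfile
from pathlib import Path


def save_state_naive(ckpt_dir: Path, rank: int) -> None:
    """Every rank checks for and creates the directory (the failing pattern)."""
    if ckpt_dir.exists():
        raise ValueError(f"Checkpoint directory {ckpt_dir} already exists.")
    ckpt_dir.mkdir(parents=True)
    (ckpt_dir / f"rank_{rank}.txt").write_text("state")


def save_state_guarded(ckpt_dir: Path, rank: int) -> None:
    """Only rank 0 checks for and creates the directory; other ranks just write."""
    if rank == 0:
        if ckpt_dir.exists():
            raise ValueError(f"Checkpoint directory {ckpt_dir} already exists.")
        ckpt_dir.mkdir(parents=True)
    # In a real multi-process run, a barrier would be needed here so that
    # the other ranks wait until rank 0 has created the directory.
    (ckpt_dir / f"rank_{rank}.txt").write_text("state")


def simulate(save_fn, num_processes: int) -> list:
    """Call save_fn once per simulated rank and collect the ranks that failed."""
    errors = []
    with tempfile.TemporaryDirectory() as tmp:
        ckpt_dir = Path(tmp) / "checkpoints" / "checkpoint_0"
        for rank in range(num_processes):
            try:
                save_fn(ckpt_dir, rank)
            except ValueError as exc:
                errors.append(f"rank {rank}: {type(exc).__name__}")
    return errors


naive_errors = simulate(save_state_naive, num_processes=8)
guarded_errors = simulate(save_state_guarded, num_processes=8)
print(len(naive_errors), len(guarded_errors))
```

In real accelerate code, the analogous building blocks are `accelerator.is_main_process` and `accelerator.wait_for_everyone()`. Whether the library's own checkpointing code uses them in this way is not stated in this thread.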

Chen · Answer 8 · Tue Aug 08 2023 13:09:08 GMT+0800 (China Standard Time)

> FSDP raises an error. Could you please provide a solution?

Is this the error you are getting?
ValueError: Checkpoint directory experiments/test/checkpoints/checkpoint_0 (0)
already exists. Please manually override self.save_iteration with what
iteration to start with.

Have you solved this problem?

yuxin.wang · Answer 9 · Tue Aug 08 2023 16:03:18 GMT+0800 (China Standard Time)

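> FSDP raises an error. Could you please provide a solution?

Using the latest code resolves this problem. It is actually a bug in the accelerate package, so Hugging Face takes the blame, haha.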

pip install -U uniem
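
After upgrading, it can help to confirm which versions are actually installed in the active environment. The following sketch uses only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the thread does not say which version contains the fix, so no version number is checked here.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the two packages discussed in this thread.
for name in ("uniem", "accelerate"):
    print(name, installed_version(name) or "not installed")
```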

yuxin.wang · Answer 10 · Tue Aug 08 2023 16:03:31 GMT+0800 (China Standard Time)

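> > FSDP raises an error. Could you please provide a solution?

> Is this the error you are getting: ValueError: Checkpoint directory experiments/test/checkpoints/checkpoint_0 (0) already exists. Please manually override self.save_iteration with what iteration to start with.

> Have you solved this problem?

Using the latest code resolves this problem. It is actually a bug in the accelerate package, so Hugging Face takes the blame, haha.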

pip install -U uniem

Oasis · Answer 11 · Thu Aug 10 2023 11:12:04 GMT+0800 (China Standard Time)

How should the saved checkpoint be used? Can I simply replace the original model's pytorch_model.bin with the checkpoint's pytorch_model.bin? Thanks.

yuxin.wang · Answer 12 · Fri Aug 11 2023 15:45:27 GMT+0800 (China Standard Time)

The purpose of saving a checkpoint is to resume the training process. If you only need inference, just use the saved model directly.
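
The distinction can be sketched with a toy example: a training checkpoint bundles everything needed to resume (weights plus optimizer state and progress counters), while inference needs only the weights. This is a stdlib-only illustration with hypothetical names (`save_checkpoint`, `export_model`, and the JSON layout); it does not reflect the actual file formats written by uniem or accelerate.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, weights: dict, optimizer_state: dict, epoch: int) -> None:
    """Write everything needed to resume training (toy JSON format)."""
    payload = {"weights": weights, "optimizer_state": optimizer_state, "epoch": epoch}
    path.write_text(json.dumps(payload))


def export_model(path: Path, weights: dict) -> None:
    """Write only what inference needs: the weights."""
    path.write_text(json.dumps({"weights": weights}))


with tempfile.TemporaryDirectory() as tmp:
    weights = {"w": 0.5}
    ckpt_path = Path(tmp) / "checkpoint.json"
    model_path = Path(tmp) / "model.json"

    save_checkpoint(ckpt_path, weights, {"lr": 1e-5, "momentum": 0.9}, epoch=2)
    export_model(model_path, weights)

    resumed = json.loads(ckpt_path.read_text())    # used to continue training
    exported = json.loads(model_path.read_text())  # used for inference

# Resuming picks up from the epoch after the one stored in the checkpoint.
next_epoch = resumed["epoch"] + 1
print(sorted(resumed), sorted(exported))
```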