Composer crashes when attempting to load sharded checkpoint

Question

growlix opened this issue 7 months ago

When attempting to load a sharded checkpoint, we (@prigoyal and I) hit the following error:

│ /usr/lib/python3/dist-packages/composer/utils/checkpoint.py:287 in           │
│ load_checkpoint                                                              │
│                                                                              │
│    284 │   │   using_legacy_sharded = is_checkpoint_legacy_sharded(object_st │
│    285 │                                                                     │
│    286 │   if state.fsdp_elastic_sharded_enabled and not using_legacy_sharde │
│ ❱  287 │   │   rng_state_dicts = load_sharded_checkpoint(                    │
│    288 │   │   │   source_path=path,                                         │
│    289 │   │   │   state=state,                                              │
│    290 │   │   │   logger=logger,                                            │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/utils/checkpoint.py:530 in           │
│ load_sharded_checkpoint                                                      │
│                                                                              │
│    527 │   │   │                                                             │
│    528 │   │   │   # 2. Optionally load optimizer                            │
│    529 │   │   │   if not load_weights_only:                                 │
│ ❱  530 │   │   │   │   optim_state = load_sharded_optimizer_state_dict(model │
│    531 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   optim │
│    532 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   stora │
│    533 │   │   │   │   state.load_optim_state(optim_state)                   │
│                                                                              │
│ /usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py:264 │
│ in load_sharded_optimizer_state_dict                                         │
│                                                                              │
│   261 │   """                                                                │
│   262 │   metadata = storage_reader.read_metadata()                          │
│   263 │                                                                      │
│ ❱ 264 │   layout_specs, dp_pg = _get_state_dict_2d_layout(model_state_dict)  │
│   265 │   dp_pg_device_type = dist.distributed_c10d._get_pg_default_device(d │
│   266 │   device_module = _get_device_module(dp_pg_device_type)              │
│   267                                                                        │
│                                                                              │
│ /usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py:128 │
│ in _get_state_dict_2d_layout                                                 │
│                                                                              │
│   125 │   specs: STATE_DICT_2D_LAYOUT = {}                                   │
│   126 │   dp_pg: Optional[dist.ProcessGroup] = None                          │
│   127 │   for key, value in state_dict.items():                              │
│ ❱ 128 │   │   specs[key] = (None, value.size())                              │
│   129 │   │   if _is_nested_tensor(value):                                   │
│   130 │   │   │   assert (                                                   │
│   131 │   │   │   │   len(value.local_shards()) == 1                         │
╰──────────────────────────────────────────────────────────────────────────────╯
AttributeError: '_io.BytesIO' object has no attribute 'size'
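
The last frame shows the mechanism: `_get_state_dict_2d_layout` calls `.size()` on every value in the model state dict, so a single non-tensor entry is enough to abort the load. The sketch below reproduces that failure without PyTorch. `FakeTensor`, `get_layout`, and the `_extra_state` key are illustrative assumptions, not Composer or PyTorch code; only the loop body mirrors the line shown in the traceback.

```python
import io


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor; only size() is modeled."""

    def __init__(self, *shape):
        self._shape = shape

    def size(self):
        return self._shape


def get_layout(state_dict):
    """Mirror the failing loop: every value is assumed to expose .size()."""
    specs = {}
    for key, value in state_dict.items():
        specs[key] = (None, value.size())
    return specs


# Assumed example: one tensor-like entry plus one serialized-bytes entry.
state_dict = {
    "linear.weight": FakeTensor(4, 4),
    "linear._extra_state": io.BytesIO(b"serialized metadata"),
}

try:
    get_layout(state_dict)
    error_message = None
except AttributeError as exc:
    # The BytesIO entry has no size() method, so the loop raises here.
    error_message = str(exc)

print(error_message)
```

With only tensor-like values the same function returns a layout normally, which suggests the problem lies in what the state dict contains rather than in the checkpoint files themselves.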

Environment

0: Collecting system information...
0: ---------------------------------
0: System Environment Report
0: Created: 2024-02-27 02:31:05 UTC
0: ---------------------------------
0:
0: PyTorch information
0: -------------------
0: PyTorch version: 2.1.0+cu121
0: Is debug build: False
0: CUDA used to build PyTorch: 12.1
0: ROCM used to build PyTorch: N/A
0:
0: OS: Ubuntu 20.04.6 LTS (x86_64)
0: GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
0: Clang version: Could not collect
0: CMake version: version 3.16.3
0: Libc version: glibc-2.31
0:
0: Python version: 3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0] (64-bit runtime)
0: Python platform: Linux-5.15.0-1047-aws-x86_64-with-glibc2.31
0: Is CUDA available: True
0: CUDA runtime version: 12.1.105
0: CUDA_MODULE_LOADING set to: LAZY
0: GPU models and configuration:
0: GPU 0: NVIDIA H100 80GB HBM3
0: GPU 1: NVIDIA H100 80GB HBM3
0: GPU 2: NVIDIA H100 80GB HBM3
0: GPU 3: NVIDIA H100 80GB HBM3
0: GPU 4: NVIDIA H100 80GB HBM3
0: GPU 5: NVIDIA H100 80GB HBM3
0: GPU 6: NVIDIA H100 80GB HBM3
0: GPU 7: NVIDIA H100 80GB HBM3
0:
0: Nvidia driver version: 535.104.12
0: cuDNN version: Probably one of the following:
0: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
0: /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
0: HIP runtime version: N/A
0: MIOpen runtime version: N/A
0: Is XNNPACK available: True
0:
0: CPU:
0: Architecture:                       x86_64
0: CPU op-mode(s):                     32-bit, 64-bit
0: Byte Order:                         Little Endian
0: Address sizes:                      48 bits physical, 48 bits virtual
0: CPU(s):                             192
0: On-line CPU(s) list:                0-191
0: Thread(s) per core:                 2
0: Core(s) per socket:                 48
0: Socket(s):                          2
0: NUMA node(s):                       2
0: Vendor ID:                          AuthenticAMD
0: CPU family:                         25
0: Model:                              1
0: Model name:                         AMD EPYC 7R13 Processor
0: Stepping:                           1
0: CPU MHz:                            2650.000
0: BogoMIPS:                           5300.00
0: Hypervisor vendor:                  KVM
0: Virtualization type:                full
0: L1d cache:                          3 MiB
0: L1i cache:                          3 MiB
0: L2 cache:                           48 MiB
0: L3 cache:                           384 MiB
0: NUMA node0 CPU(s):                  0-47,96-143
0: NUMA node1 CPU(s):                  48-95,144-191
0: Vulnerability Gather data sampling: Not affected
0: Vulnerability Itlb multihit:        Not affected
0: Vulnerability L1tf:                 Not affected
0: Vulnerability Mds:                  Not affected
0: Vulnerability Meltdown:             Not affected
0: Vulnerability Mmio stale data:      Not affected
0: Vulnerability Retbleed:             Not affected
0: Vulnerability Spec rstack overflow: Mitigation; safe RET
0: Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
0: Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
0: Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
0: Vulnerability Srbds:                Not affected
0: Vulnerability Tsx async abort:      Not affected
0: Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
0:
0: Versions of relevant libraries:
0: [pip3] numpy==1.26.2
0: [pip3] pytorch-ranger==0.1.1
0: [pip3] torch==2.1.0+cu121
0: [pip3] torch-optimizer==0.3.0
0: [pip3] torchmetrics==1.0.3
0: [pip3] torchvision==0.16.0+cu121
0: [pip3] triton==2.1.0
0: [pip3] triton-pre-mlir==2.0.0
0: [conda] Could not collect
0:
0:
0: Composer information
0: --------------------
0: Composer version: 0.17.2
0: Composer commit hash: None
0: Host processor model name: AMD EPYC 7R13 Processor
0: Host processor core count: 96
0: Number of nodes: 1
0: Accelerator model name: NVIDIA H100 80GB HBM3
0: Accelerators per node: 1
0: CUDA Device Count: 8
0:
0:

To reproduce

Steps to reproduce the behavior:

1. Save a model checkpoint by setting fsdp_config.state_dict: sharded in the config.
2. Attempt to load it by setting load_path to the directory containing the checkpoint files.
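
As a rough sketch of the two steps above, the settings can be written as plain Python dictionaries. The key names `fsdp_config.state_dict` and `load_path` are taken from the steps as written; the directory path and the helper `resume_problems` are hypothetical, and the exact key spelling may differ between Composer versions.

```python
# Step 1 (assumed layout): save with sharded FSDP state dicts.
save_cfg = {
    "fsdp_config": {"state_dict": "sharded"},
}

# Step 2: resume from the directory holding the checkpoint files.
# The path is a placeholder, not one taken from the report.
load_cfg = {
    "fsdp_config": {"state_dict": "sharded"},
    "load_path": "/path/to/checkpoint_dir",
}


def resume_problems(cfg):
    """Return reasons this config would not attempt a sharded resume."""
    problems = []
    if cfg.get("fsdp_config", {}).get("state_dict") != "sharded":
        problems.append("fsdp_config.state_dict is not 'sharded'")
    if not cfg.get("load_path"):
        problems.append("load_path is not set")
    return problems


print(resume_problems(save_cfg))
print(resume_problems(load_cfg))
```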

Expected behavior

The checkpoint should be loaded and the model should continue training and/or evaluating.

Additional context

Hanlin Tang · Answer 1 · Tue Feb 27 2024 12:57:53 GMT+0800 (China Standard Time)

Hello @growlix, are you running this in fp8?

If so, this issue was fixed in mosaicml/composer#2907 and released in v0.19.0, so you should upgrade your composer version.
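
Before retrying, it can help to confirm which Composer version the job actually imports. The sketch below is one way to check, assuming the package is installed under the distribution name `mosaicml`; `parse_version`, `is_at_least`, and `installed_composer_version` are hypothetical helpers written for this example.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.19.0' into (0, 19, 0); drops any '+local' suffix."""
    core = text.split("+", 1)[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def is_at_least(installed, minimum):
    """Compare two dotted version strings numerically."""
    return parse_version(installed) >= parse_version(minimum)


def installed_composer_version():
    """Return the installed version string, or None if not installed.

    Assumes the distribution is published under the name 'mosaicml'.
    """
    try:
        return metadata.version("mosaicml")
    except metadata.PackageNotFoundError:
        return None


version = installed_composer_version()
if version is None:
    print("Composer does not appear to be installed")
elif is_at_least(version, "0.19.0"):
    print(version, "is at least 0.19.0")
else:
    print(version, "is older than 0.19.0")
```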

Matthew · Answer 2 · Tue Feb 27 2024 13:04:26 GMT+0800 (China Standard Time)

Thank you so much, @hanlint! We are running in fp8. We'll update to v0.19.0 and give it a whirl!

Priya Goyal · Answer 3 · Wed Feb 28 2024 00:18:53 GMT+0800 (China Standard Time)

@hanlint, we tried Composer 0.19.0 but we are still hitting the issue. Is there any change to the config we need to make?
We are specifying the load path as the shard prefix, following this.

    trainer = Trainer(
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1493, in __init__
    self._rng_state = checkpoint.load_checkpoint(
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 366, in load_checkpoint
    rng_state_dicts = load_sharded_checkpoint(
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 558, in load_sharded_checkpoint
    optim_state = load_sharded_optimizer_state_dict(model_state_dict=state.state_dict()['model'],
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 264, in load_sharded_optimizer_state_dict
    layout_specs, dp_pg = _get_state_dict_2d_layout(model_state_dict)
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 128, in _get_state_dict_2d_layout
    specs[key] = (None, value.size())
AttributeError: '_io.BytesIO' object has no attribute 'size'
Traceback (most recent call last):
  File "/fsx/users/prigoyal/experiments/prigoyal/science/20240227-16-10-13_bump-composer/bench-MPT1b-RPJ-fp8-noactckpt-noaccum-bs160-v5docker-flash-noqknorm-sharded-resume15ba/science/tools/train_llms.py", line 632, in <module>
    main(cfg)
  File "/fsx/users/prigoyal/experiments/prigoyal/science/20240227-16-10-13_bump-composer/bench-MPT1b-RPJ-fp8-noactckpt-noaccum-bs160-v5docker-flash-noqknorm-sharded-resume15ba/science/tools/train_llms.py", line 564, in main
    trainer = Trainer(
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1493, in __init__
    self._rng_state = checkpoint.load_checkpoint(
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 366, in load_checkpoint
    rng_state_dicts = load_sharded_checkpoint(
  File "/usr/lib/python3/dist-packages/composer/utils/checkpoint.py", line 558, in load_sharded_checkpoint
    optim_state = load_sharded_optimizer_state_dict(model_state_dict=state.state_dict()['model'],
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 264, in load_sharded_optimizer_state_dict
    layout_specs, dp_pg = _get_state_dict_2d_layout(model_state_dict)
  File "/usr/lib/python3/dist-packages/torch/distributed/checkpoint/optimizer.py", line 128, in _get_state_dict_2d_layout
    specs[key] = (None, value.size())
AttributeError: '_io.BytesIO' object has no attribute 'size'