hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI

ZeRO without using shard_param

powermano opened this issue · comments

πŸ› Describe the bug

When I use ZeRO without shard_param, the following error occurs:

Traceback (most recent call last):
  File "train.py", line 175, in <module>
    main()
  File "train.py", line 39, in main
    with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
  File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 75, in __init__
    self.config = ZeroContextConfig(target_device=target_device, replicated=True, shard_param=shard_param)
  File "/usr/local/Python-3.8.6/lib/python3.8/site-packages/colossalai/zero/init_ctx/init_context.py", line 37, in __init__
    assert target_device.type == 'cuda', "Replicated no-shard paramters should locate in cuda."
AttributeError: 'int' object has no attribute 'type'

My init code is:

def main():
    parser = colossalai.get_default_parser()
    parser.add_argument('--use_trainer', action='store_true', help='whether to use trainer')
    args = parser.parse_args()

    colossalai.launch_from_torch(config='./config.py')

    logger = get_dist_logger()

    rank = int(os.environ['RANK'])
    # build resnet
    use_zero3 = hasattr(gpc.config, 'zero')
    if use_zero3:
        shard_strategy = TensorShardStrategy()
        with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
            model = resnet34(num_classes=10)
    else:
        model = resnet34(num_classes=10)

My config is:

from colossalai.amp import AMP_TYPE
from colossalai.zero.shard_utils import TensorShardStrategy
from colossalai.nn.optimizer import HybridAdam

zero = dict(
    model_config=dict(
        tensor_placement_policy='cuda',
        shard_strategy=TensorShardStrategy(),
        reuse_fp16_shard=False
    ),
    optimizer_config=dict()
)

optimizer = dict(
    type=HybridAdam,
    lr=0.001,
    # weight_decay=1e-2,
)

BATCH_SIZE = 64
NUM_EPOCHS = 20
LOGGING_FREQUNCE = 20
OUTPUT = './'

gradient_clipping = 5.0

Environment

pip install colossalai==0.1.5+torch1.10cu11.1 -f https://release.colossalai.org

ubuntu 18.04

If I modify the code as follows, it actually works:

    rank = int(os.environ['RANK'])
    # build resnet
    use_zero3 = hasattr(gpc.config, 'zero')
    if use_zero3:
        shard_strategy = TensorShardStrategy()

        # with ZeroInitContext(target_device=torch.cuda.current_device(), shard_strategy=shard_strategy, shard_param=False):
        #     model = resnet34(num_classes=10)
        with ZeroInitContext(target_device=torch.device('cuda', rank), shard_strategy=shard_strategy, shard_param=False):
            model = resnet34(num_classes=10)
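
The root cause, as far as I can tell, is that torch.cuda.current_device() returns a plain integer device index rather than a torch.device, so the assert in ZeroInitContext finds no .type attribute. A minimal illustration:

import torch

dev = torch.cuda.current_device()
print(type(dev))   # <class 'int'> -> no .type attribute, hence the AttributeError

dev = torch.device('cuda', torch.cuda.current_device())
print(dev.type)    # 'cuda' -> passes the assert in ZeroInitContext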

@fastalgo I do not know how to save the ZeRO model params. When using the save_checkpoint API, the saved file is pretty small.

I tested ZeRO using a private dataset and ir18 (which is slightly different from the original resnet18). The table below shows the specific results.
When I used PyTorch's native AMP, the GPU memory usage was much smaller than with ColossalAI. Why?
The config is:

from colossalai.amp import AMP_TYPE
from colossalai.zero.shard_utils import TensorShardStrategy
from colossalai.nn.optimizer import HybridAdam

fp16 = dict(
    mode=AMP_TYPE.TORCH,
)

optimizer = dict(
    type=HybridAdam,
    lr=0.001,
    # weight_decay=1e-2,
)
| model | dataset | machine | batch | gradient accumulate size | ZeRO | speed | GPU memory | OPT | tensor_placement_policy | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 2089/8549 [02:51<08:39, 12.43it/s] | 8703M | HybridAdam | | single machine + Engine |
| ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 1599/8549 [02:24<10:21, 11.17it/s] | 5769M | HybridAdam | | single machine + w/o Engine + PyTorch native fp16 |
| ir18 | private dataset | 2 | 64 | 2 | no ZeRO | 1598/4274 [02:32<04:14, 10.50it/s] | 9011M | SGD | | common data parallel |
| ir18 | private dataset | 2 | 64 | 1 | ZeRO + no shard params | 606/4275 [01:25<08:27, 7.23it/s] | 9141M | HybridAdam | cuda | |
| ir18 | private dataset | 2 | 64 | 1 | ZeRO + shard params | 571/4275 [01:32<10:32, 5.85it/s] | 9073M | HybridAdam | cuda | |
| ir18 | private dataset | 2 | 64 | 1 | ZeRO + shard params | 217/4275 [01:37<29:16, 2.31it/s] | 6819M | HybridAdam | cpu | |

The code without using Engine is shown below:

model = ...
optimizer = ...
criterion = ...
amp = torch.cuda.amp.grad_scaler.GradScaler(growth_interval=1000)
global_step = 0
optimizer.zero_grad()
for epoch in range(gpc.config.NUM_EPOCHS):
    model.train()
    for idx, (img, label) in enumerate(train_dl):
        img = img.cuda()
        label = label.cuda()
        output, _ = model(img, label)
        train_loss = criterion(output, label)
        amp.scale(train_loss).backward()
        amp.unscale_(optimizer)  # unscale grads before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
        amp.step(optimizer)
        amp.update()
        optimizer.zero_grad()
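
For comparison, the Engine path from the first row of the table looks roughly like this (a sketch following the standard ColossalAI Engine loop; it assumes the same model, optimizer, criterion and train_dl built above):

# sketch: assumes model / optimizer / criterion / train_dl are constructed as before
engine, train_dl, _, _ = colossalai.initialize(model, optimizer, criterion, train_dl)

for epoch in range(gpc.config.NUM_EPOCHS):
    engine.train()
    for idx, (img, label) in enumerate(train_dl):
        img = img.cuda()
        label = label.cuda()
        engine.zero_grad()
        output, _ = engine(img, label)
        train_loss = engine.criterion(output, label)
        engine.backward(train_loss)
        engine.step()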

The difference between my original PyTorch implementation and ColossalAI is the convert_to_amp API, which uses TorchAMPModel to wrap the original model.
I have tested three different cases:

1. Using torch.cuda.amp.autocast(True) inside the model's forward function:

class model(nn.Module):
    def __init__(self):
        ....
    def forward(self, x, label):
        with torch.cuda.amp.autocast(True):
            .....
            .....
        return x

2. Using the @torch.cuda.amp.autocast() decorator:

class model(nn.Module):
    def __init__(self):
        ....
    @torch.cuda.amp.autocast()
    def forward(self, x, label):
        .....
        .....
        return x

3. Using TorchAMPModel:

class model(nn.Module):
    def __init__(self):
        ....
    def forward(self, x, label):
        .....
        .....
        return x

model = model()
model = TorchAMPModel(model)

The first two behave normally and only need 5769M of GPU memory, but the third one needs 8703M of GPU memory.
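
To double-check the peak memory of each case, a minimal sketch is the following (note that torch.cuda.max_memory_allocated only tracks the PyTorch caching allocator, so it can differ from what nvidia-smi reports):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training iterations of the case under test ...
peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
print(f'peak allocated: {peak_mb:.0f} MiB')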

@feifeibear Can you help verify the above problem? Thanks.