IBM / ModuleFormer

ModuleFormer is a MoE-based architecture that includes two different types of experts: stick-breaking attention heads and feedforward experts. We released a collection of ModuleFormer-based Language Models (MoLM) ranging in scale from 4 billion to 8 billion parameters.

torch.dtype is not respected

Vectorrent opened this issue

commented

Great implementation. Easy to read, very few lines of code, and I'm glad you extended the Huggingface API - because it's really good, too! I've already pretrained a couple ModuleFormers, and they're working well!

Sadly, the torch_dtype argument doesn't seem to be respected when loading the model.

For now, I can work around it with model.half(), and that seems to work alright... but it would be better if I didn't have to load the full model into memory just to halve it again. And it would be REALLY nice to have bitsandbytes 4/8-bit support.
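
For reference, this is roughly what the workaround looks like right now. It's just a sketch, and it assumes the ModuleFormer classes are already registered with the Auto* factories:

from transformers import AutoModelForCausalLM

# Load at full precision first, then cast down. The extra memory cost comes
# from the full-precision load that happens before .half().
model = AutoModelForCausalLM.from_pretrained(
    "ibm/MoLM-350M-4B",
    cache_dir="/data/models",
)
model = model.half()  # casts every floating-point parameter to fp16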

Just opening this issue for posterity. I may try to tackle the problem some day (though I wouldn't even know how to start, now).

Thanks again!

Hi, thanks for raising this issue. Please feel free to post the error message here, and we can look into it together.
The model parameters are stored in BF16, so you can load it directly in 16 bits. The 4/8-bit support should be straightforward for ModuleFormer, because it only relies on standard PyTorch operations (e.g. linear). We also plan to look into it later, and we are open to accepting contributions that add this support.

commented

Thanks for the response. Here is the sample code I'm working with:

import torch
from moduleformer import (
    ModuleFormerConfig,
    ModuleFormerForCausalLM,
    ModuleFormerForSequenceClassification,
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# Register the ModuleFormer classes with the Auto* factories so from_pretrained can resolve them
AutoConfig.register("moduleformer", ModuleFormerConfig)
AutoModelForCausalLM.register(ModuleFormerConfig, ModuleFormerForCausalLM)
AutoModelForSequenceClassification.register(
    ModuleFormerConfig, ModuleFormerForSequenceClassification
)

model_name = "ibm/MoLM-350M-4B"

tokenizer = AutoTokenizer.from_pretrained(
    model_name, cache_dir="/data/models", padding_side="left"
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir="/data/models",
    output_hidden_states=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

string = "Once upon a time,"

inputs = tokenizer(string, return_tensors="pt")
generated = model.generate(
    input_ids=inputs["input_ids"].to(model.device.type),
    attention_mask=inputs["attention_mask"].to(model.device.type),
    max_new_tokens=33,
    do_sample=True,
    temperature=0.7,
    top_k=4,
    penalty_alpha=0.6,
    eta_cutoff=0.0003,
    repetition_penalty=2.3,
    no_repeat_ngram_size=9,
    output_hidden_states=True,
    return_dict_in_generate=True,
)
string = tokenizer.decode(generated["sequences"][0], skip_special_tokens=False)
print(string)

This example does load and run inference. However, there is a warning:

vtx-lab-1  | /usr/local/lib/python3.10/dist-packages/torch/jit/annotations.py:386: UserWarning: TorchScript will treat type annotations of Tensor dtype-specific subtypes as if they are normal Tensors. dtype constraints are not enforced in compilation either.

Whether loaded in bfloat16 or float32, the model uses ~6GB of VRAM. However, oddly, float32 will also consume 10+ GB of system RAM, whereas bfloat16 does not.
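
(In case it helps anyone reproduce these numbers, a quick sanity check along these lines, run right after the from_pretrained call above, shows the effective dtype and a rough footprint; the exact figures will differ from mine:)

import torch

# dtype the parameters actually landed in after loading
print(next(model.parameters()).dtype)

# approximate model size in bytes, as reported by transformers
print(model.get_memory_footprint())

# VRAM currently allocated by PyTorch on the default device, in GiB
print(torch.cuda.memory_allocated() / 2**30)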

If I try to load the model in 4-bit quantization with bitsandbytes, I get an error:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir="/data/models",
    output_hidden_states=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

Results in this error:

vtx-lab-1  | Traceback (most recent call last):
vtx-lab-1  |   File "/src/main.py", line 12, in <module>
vtx-lab-1  |     main()
vtx-lab-1  |   File "/src/main.py", line 8, in main
vtx-lab-1  |     import machine
vtx-lab-1  |   File "/src/machine.py", line 10, in <module>
vtx-lab-1  |     from lab import dev
vtx-lab-1  |   File "/src/lab/dev.py", line 49, in <module>
vtx-lab-1  |     model = AutoModelForCausalLM.from_pretrained(
vtx-lab-1  |   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
vtx-lab-1  |     return model_class.from_pretrained(
vtx-lab-1  |   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3261, in from_pretrained
vtx-lab-1  |     modules_to_not_convert = get_keys_to_not_convert(model)
vtx-lab-1  |   File "/usr/local/lib/python3.10/dist-packages/transformers/integrations/bitsandbytes.py", line 256, in get_keys_to_not_convert
vtx-lab-1  |     tied_model = deepcopy(model)  # this has 0 cost since it is done inside `init_empty_weights` context manager`
vtx-lab-1  |   [... repeated deepcopy/_reconstruct/_deepcopy_dict frames from /usr/lib/python3.10/copy.py elided ...]
vtx-lab-1  |   File "/usr/lib/python3.10/copy.py", line 161, in deepcopy
vtx-lab-1  |     rv = reductor(4)
vtx-lab-1  |   File "/usr/local/lib/python3.10/dist-packages/torch/jit/_script.py", line 69, in _reduce
vtx-lab-1  |     raise pickle.PickleError("ScriptFunction cannot be pickled")
vtx-lab-1  | _pickle.PickleError: ScriptFunction cannot be pickled
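
For completeness: I believe recent transformers versions expect the bnb_4bit_* options to be grouped into a BitsAndBytesConfig and passed as quantization_config, rather than as loose kwargs. I haven't confirmed whether that changes anything here (the deepcopy of the scripted modules may still fail), but the equivalent call would look something like this:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same settings as above, grouped the way the transformers docs describe.
# Assumes the ModuleFormer Auto* registration from the earlier snippet has run.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "ibm/MoLM-350M-4B",
    cache_dir="/data/models",
    device_map="auto",
    quantization_config=bnb_config,
)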

Finally, I'll leave you with the output of the transformers-cli env command:

- `transformers` version: 4.35.2
- Platform: Linux-6.5.9-arch2-1-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.0
- Accelerate version: 0.24.1
- Accelerate config:    not found
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: True
- Using distributed or parallel set-up in script?: False

I'm looking forward to getting this fixed. Let me know if there's anything I can do to contribute!

commented

And to follow up on the model.half() thing - it appears that this is not necessary when running inference. Whether you call model.half() or not, the model loads in the correct dtype and memory usage is what you would expect.

However, I'm hitting this problem while pretraining my own model. If I set the dtype to bfloat16 and try to train with PyTorch Lightning, I run out of memory:

Traceback (most recent call last):
  File "/src/harness.py", line 517, in <module>
    main()
  File "/src/harness.py", line 272, in main
    prototype.train(
  File "/src/aigen/aigen/aigen.py", line 715, in train
    trainer.fit(train_model, final_train, val_split)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 240, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 180, in run
    closure()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 140, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 135, in closure
    self._backward_fn(step_output.closure_loss)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 236, in backward_fn
    call._call_strategy_hook(self.trainer, "backward", loss, optimizer)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/strategy.py", line 204, in backward
    self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/plugins/precision/precision.py", line 69, in backward
    model.backward(tensor, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/module.py", line 1069, in backward
    loss.backward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/autocast_mode.py", line 140, in decorate_bwd
    return bwd(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/moduleformer/utils/parallel_experts.py", line 86, in backward
    return ParallelLinear.backward_scriptable(
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/moduleformer/utils/parallel_experts.py", line 115, in backward_scriptable
        d_input_buf = torch.empty_like(input)
        d_input_buf_list = d_input_buf.split(expert_size_list, dim=0)
        d_weight_buf = torch.empty_like(weight)
                       ~~~~~~~~~~~~~~~~ <--- HERE
    
        weight_t = weight.permute(0, 2, 1)
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacty of 7.92 GiB of which 26.50 MiB is free. Process 2832140 has 7.89 GiB memory in use. Of the allocated memory 7.74 GiB is allocated by PyTorch, and 37.80 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

However, if I call model.half() on my model before training it, then memory usage is much lower.

Thus, the bug seems to be related to training, and I'm not sure if it's me or the model. I don't have this problem with other models.

Anyway, here's my training loop, if that would help.
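
One thing I may try next: letting Lightning handle the low-precision cast via its precision flag instead of calling model.half() myself. Just a sketch; train_module and train_loader stand in for my own LightningModule and DataLoader:

import lightning.pytorch as pl

# bf16 mixed precision: the weights stay in their original dtype, and ops are
# autocast to bf16 during forward/backward. train_module and train_loader are
# placeholders for my own LightningModule and DataLoader.
trainer = pl.Trainer(
    precision="bf16-mixed",
    accelerator="gpu",
    devices=1,
    max_steps=10_000,
)
trainer.fit(train_module, train_loader)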

EDIT: I guess model.half() casts all weights to fp16, even the ones that should remain in fp32. So yes, it reduces VRAM... but it's probably not a good idea. Stranger still, when the model is loaded in bfloat16 and then saved with save_pretrained(), the checkpoint comes out as float32:
[screenshot: saved checkpoint reported as float32]
Surely this isn't expected, right? As far as I can tell, float32/float16/bfloat16 all require the exact same amount of VRAM for training (i.e. a lot; ~8 GB for a 118M-parameter model).
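
For what it's worth, here is the kind of check I'd use to confirm what actually landed on disk. The path is hypothetical, and it assumes save_pretrained wrote a model.safetensors file rather than a .bin:

from safetensors import safe_open

# Collect the dtypes stored in the saved checkpoint (hypothetical path).
with safe_open("/data/models/my-moduleformer/model.safetensors", framework="pt") as f:
    dtypes = {f.get_tensor(name).dtype for name in f.keys()}

print(dtypes)  # {torch.float32} here would confirm the unexpected upcast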

commented

Well heck, after reducing my model's size considerably, 8-bit quantized training does seem to work, and VRAM usage is slightly lower as well.

4-bit does train, but the model cannot be saved:

Traceback (most recent call last):
  File "/src/harness.py", line 517, in <module>
    main()
  File "/src/harness.py", line 272, in main
    prototype.train(
  File "/src/aigen/aigen/aigen.py", line 719, in train
    self.save(output_dir)
  File "/src/aigen/aigen/aigen.py", line 727, in save
    self.model.save_pretrained(target_folder)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2063, in save_pretrained
    model_to_save.config.save_pretrained(save_directory)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 460, in save_pretrained
    self.to_json_file(output_config_file, use_diff=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 940, in to_json_file
    writer.write(self.to_json_string(use_diff=use_diff))
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 926, in to_json_string
    return json.dumps(config_dict, indent=2, sort_keys=True) + "\n"
  File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type dtype is not JSON serializable
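
My guess (untested) is that one of the dtype-valued quantization kwargs ends up stored on the config object, and json.dumps then chokes on the raw torch.dtype. A quick check on the model right before save_pretrained() should show which attribute is responsible:

import torch

# List config attributes that are raw torch.dtype objects, since those are
# what the JSON encoder cannot serialize.
offenders = {k: v for k, v in vars(model.config).items() if isinstance(v, torch.dtype)}
print(offenders)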

Overall, VRAM requirements seem very high across the board, but I suppose that's to be expected. I did read somewhere that this MoE architecture requires more VRAM than traditional dense models.

For the VRAM problem, you can set gradient_checkpointing=True in this line; it will save a large amount of memory. We pretrained the model in BF16, which should use less VRAM than FP32.
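
A sketch of what that looks like, assuming you build the config yourself before loading (the flag corresponds to the line linked above):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Assumes the ModuleFormer classes are registered with the Auto* factories,
# as in your earlier snippet. Gradient checkpointing trades extra compute in
# the backward pass for much lower activation memory.
config = AutoConfig.from_pretrained("ibm/MoLM-350M-4B")
config.gradient_checkpointing = True

model = AutoModelForCausalLM.from_pretrained(
    "ibm/MoLM-350M-4B",
    config=config,
    torch_dtype=torch.bfloat16,
)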

commented

Thanks for the suggestion; gradient checkpointing does indeed save a lot of memory.

At this point I'll go ahead and close this issue; most of my questions have been answered, and most of these problems are resolved.

I created a PR that implements the gradient_checkpointing_enable() method, if that's something you want.