AssertionError related to tied parameters during `train_tiny_llama.sh` execution
xffxff opened this issue · comments
zhou fan commented
I encountered an error while running the examples/train_tiny_llama.sh script from commit ff3c774 on an 8-GPU node, without any modifications to the code or configuration. Below is the error log:
[default5]:     self._mark_tied_parameters(model=model, parallel_context=parallel_context, parallel_config=parallel_config)
[default5]:   File "nanotron/src/nanotron/trainer.py", line 767, in _mark_tied_parameters
[default5]:     mark_tied_parameters(model=model, parallel_context=parallel_context, parallel_config=parallel_config)
[default5]:   File "nanotron/src/nanotron/trainer.py", line 790, in mark_tied_parameters
[default5]:     tie_parameters(
[default5]:   File "nanotron/src/nanotron/parallel/tied_parameters.py", line 60, in tie_parameters
[default5]:     len(dp_ranks) == 1
[default5]: AssertionError: Tying weights has to happen with a replica of a model. Got the ranks from the following replicas: (0, 1)
(The same traceback is raised on ranks default4, default6, and default7; the interleaved duplicates are omitted here.)
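For context, the failing assertion enforces that all copies of a tied parameter live inside a single data-parallel replica (one dp rank); getting ranks from replicas (0, 1) means the tied copies spanned two DP replicas. Here is a minimal sketch of that invariant, with hypothetical names (`check_tied_param_ranks`, `get_dp_rank`) that are not nanotron's actual API, assuming dp rank is derived from the global rank by the process-group layout:

```python
def check_tied_param_ranks(world_ranks, get_dp_rank):
    """world_ranks: global ranks holding copies of one tied parameter.
    get_dp_rank: maps a global rank to its data-parallel replica index."""
    dp_ranks = {get_dp_rank(rank) for rank in world_ranks}
    # All copies must sit in the same replica, otherwise tying is ill-defined.
    assert len(dp_ranks) == 1, (
        "Tying weights has to happen with a replica of a model. "
        f"Got the ranks from the following replicas: {tuple(sorted(dp_ranks))}"
    )

# Example layout: 8 GPUs as dp=2 x pp=4, so dp rank = global_rank // 4.
dp_of = lambda r: r // 4

check_tied_param_ranks([0, 3], dp_of)      # both in replica 0: passes
try:
    check_tied_param_ranks([0, 4], dp_of)  # spans replicas 0 and 1: fails
except AssertionError as e:
    print(e)
```

A mismatch like this typically points at the parallelism config (dp/tp/pp sizes) not matching how the tied-parameter groups were built.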
Nouamane Tazi commented
Thanks for reporting this! Can you try with this fix: #100?
zhou fan commented
Sorry, I cannot reproduce the issue at the moment. It's possible that a previous configuration error on my part caused the problem. I will close this issue now.