grad accum tests failing on gpu w/ amp_bf16 precision
sagadre opened this issue
Changing precision from `fp32` to `amp_bf16` leads to `pytest tests/test_grad_accum.py` failing:
```
FAILED tests/test_grad_accum.py::test_grad_acc - AssertionError: Failed gradient checks at: ['tok_embeddings.weight', 'layers.0.attention.in_proj.weight', 'layers.0...
FAILED tests/test_grad_accum.py::test_grad_acc_fsdp - torch.multiprocessing.spawn.ProcessRaisedException:
```
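As a hedged aside (not confirmed by the issue itself): one plausible reason such gradient checks fail under `amp_bf16` but pass under `fp32` is that bf16 keeps only 8 mantissa bits, so accumulating many small per-micro-batch contributions at reduced precision can drift far from the full-batch gradient, beyond the test's tolerances. The sketch below simulates bf16 by truncating the float32 bit pattern (the helper `to_bf16` is hypothetical, purely for illustration) and shows how a running sum stalls once each contribution falls below one bf16 ulp:

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate a Python float to bfloat16 precision by keeping only
    the top 16 bits of its float32 representation (illustrative only)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# 1000 micro-batch gradient contributions of 1e-3 each.
exact = sum(1e-3 for _ in range(1000))  # fp64 reference, ~1.0

acc = 0.0
for _ in range(1000):
    # Accumulate in bf16: each partial sum is truncated back to bf16.
    acc = to_bf16(acc + to_bf16(1e-3))

# Once the running sum is large enough, 1e-3 falls below one bf16 ulp
# and further contributions are truncated away entirely.
print(f"exact={exact:.6f}  bf16-accumulated={acc:.6f}")
```

This is why frameworks typically keep fp32 master gradients under mixed precision, and why tests comparing accumulated vs. full-batch gradients usually need looser tolerances (or fp32 accumulation) when run under `amp_bf16`.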