pytorch / torchtitan

A native PyTorch Library for large model training

Hard release criteria: Run and get convergence data on long running tests

gnadathur opened this issue

  • Run on 64 A100 GPUs first.
  • Later, run on 64 H100 GPUs.

What are the hyperparameters for the convergence run?

  • Batch size adjusted to 1.
  • What should the learning rate be? @wanchaol, @lessw2020: maybe duplicate the earlier convergence tests from FSDP1.