broadinstitute / ml4h

Determine cause of MGH workstation crashes

erikr opened this issue

What
olympus and cybertronpc keep crashing. Is this due to a RAM shortage or a PSU limitation?

Why
Crashes interrupt training and require a manual reboot.

How
Determine how much RAM train mode uses on ECGs.
Determine power draw during train and see whether it exceeds what the PSU can supply (we only have 450W PSUs and may need 650W+).
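
A rough sketch of how both measurements could be logged while train runs, assuming a Linux workstation with NVIDIA GPUs and `nvidia-smi` on the PATH. The function names, the CSV layout, and the `usage_log.csv` path are all made up for illustration; none of this is in ml4h. Note that `nvidia-smi` reports GPU power only, so it gives a lower bound on total system draw (CPU, drives, and fans come on top), and a wall-socket power meter would be needed for the real number. Each row is flushed to disk so the last samples survive a hard crash.

```python
import os
import subprocess
import time
from datetime import datetime
from typing import Dict, Optional, Tuple


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse the contents of /proc/meminfo into {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def used_ram_gib(meminfo: Dict[str, int]) -> float:
    """RAM in use, in GiB, computed as MemTotal - MemAvailable."""
    return (meminfo["MemTotal"] - meminfo["MemAvailable"]) / 1024 ** 2


def parse_gpu_power(csv_text: str) -> float:
    """Sum per-GPU power draw (watts) from nvidia-smi's one-value-per-line output."""
    total = 0.0
    for line in csv_text.splitlines():
        try:
            total += float(line.strip())
        except ValueError:
            continue  # skip blank lines and non-numeric entries such as "[N/A]"
    return total


def sample() -> Tuple[float, Optional[float]]:
    """Take one (RAM in GiB, GPU watts or None) reading from this machine."""
    with open("/proc/meminfo") as f:
        ram = used_ram_gib(parse_meminfo(f.read()))
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        gpu_watts = parse_gpu_power(out)
    except (FileNotFoundError, subprocess.CalledProcessError):
        gpu_watts = None  # nvidia-smi missing or failed
    return ram, gpu_watts


def log_usage(path: str, interval_s: float = 5.0, n_samples: Optional[int] = None) -> None:
    """Append timestamp,ram_gib,gpu_watts rows to path until stopped."""
    with open(path, "a") as f:
        taken = 0
        while n_samples is None or taken < n_samples:
            ram, gpu_watts = sample()
            gpu_field = "" if gpu_watts is None else f"{gpu_watts:.1f}"
            f.write(f"{datetime.now().isoformat()},{ram:.2f},{gpu_field}\n")
            f.flush()
            os.fsync(f.fileno())  # force the row to disk in case the box dies
            taken += 1
            time.sleep(interval_s)
```

To use it, start `log_usage("usage_log.csv")` in a separate terminal before launching train, then look at the last rows of the file after a crash (or after the run finishes) to see peak RAM and GPU power.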

Acceptance Criteria

  • decide if we need to revamp code to be more RAM-efficient
    and/or
  • buy larger PSUs

Since running train on olympus makes it crash, I don't think it'd be good to measure RAM/power usage there. Instead, I'll try to log the info on mithril.

@ndiamant has montserrat ever crashed while you were training models on it?