andnp / PyExpUtils

Experiment utility code, specifically designed for use with Compute Canada.

Data format v3

andnp opened this issue

It's time to allow partial data to be stored in the results folder. Currently, we hold all data in memory (in the Collector object) until the end of the run. Then we dump that data into the results database all at once. Doing this allows us to avoid incomplete/partial states. There are several drawbacks, though:

  • This costs a lot of memory, particularly for long-running experiments (e.g. continual learning experiments).
  • This puts a large, sudden write pressure on the database at the end of a run. If many parallel processes do this at once, the chance of long lockups increases dramatically.
  • The full collector object currently needs to be checkpointed, which can be very expensive as it fills up.

The new format faces several challenges. Some of these already exist today, and we have simply been ignoring them because the probability of hitting them was small. With partial writes, that will no longer be the case:

  • Data consistency. If a run is preempted, we need to ensure that restarting the run doesn't invalidate the data. Possibly this means needing to delete some overlapping rows or other safeguards. Preferably this is done only once, for instance when the checkpoint loads.
  • Buffered writes. We don't want to hit the database frequently: there is too much potential for lockups and too much disk activity.
  • Database lock handling.
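
To make these three challenges concrete, here is a minimal sketch of one way they could be handled together. It assumes the results database is a SQLite file accessed through Python's standard `sqlite3` module, and the `BufferedResults` class, the `results` table, and its columns are all hypothetical names made up for illustration, not the existing PyExpUtils API. Rows are buffered in memory and written in batches, the connection `timeout` makes a writer wait on a locked database rather than fail immediately, and `truncate_after` deletes overlapping rows once when a checkpoint loads.

```python
import sqlite3


class BufferedResults:
    """Hypothetical helper: buffers rows in memory and writes them in batches."""

    def __init__(self, path, flush_every=1000, lock_timeout=30.0):
        # `timeout` is how long sqlite3 waits on a locked database before raising.
        self.conn = sqlite3.connect(path, timeout=lock_timeout)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(run INTEGER, step INTEGER, metric TEXT, value REAL)"
        )
        self.conn.commit()
        self.flush_every = flush_every
        self.buffer = []

    def truncate_after(self, run, last_step):
        # Call once when a checkpoint loads: drop rows that were written
        # after the checkpointed step, since the run will regenerate them.
        with self.conn:
            self.conn.execute(
                "DELETE FROM results WHERE run = ? AND step > ?", (run, last_step)
            )

    def collect(self, run, step, metric, value):
        # Only touch the database once the buffer is full.
        self.buffer.append((run, step, metric, value))
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One transaction per batch keeps each write atomic.
        with self.conn:
            self.conn.executemany(
                "INSERT INTO results VALUES (?, ?, ?, ?)", self.buffer
            )
        self.buffer.clear()

    def count(self, run):
        cur = self.conn.execute("SELECT COUNT(*) FROM results WHERE run = ?", (run,))
        return cur.fetchone()[0]
```

In this sketch a resumed run would call `truncate_after(run, checkpointed_step)` before collecting anything, so rows written between the last checkpoint and the preemption are not duplicated.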

Some concrete implementation notes:

  • Experiment description -> experiment metadata -> experiment. The description should be used once to construct the metadata in a db. Then the experiment will be run from that db only. This allows us to synchronize states and retain consistency. This also makes the path for computed .py experiment descriptions much simpler.
    • Need to handle synchronizing the metadata db across local and server machines. Possibly embedding this into the results database is sufficient.
  • Use some tmp location (e.g. SLURM_TMPDIR) for a partial results db, then synchronize that with the final results db. This should alleviate write pressure and make buffering less important.
  • Have the db writer run asynchronously in a background thread.
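
The last two notes could fit together roughly as sketched below. This again assumes a SQLite results database and the standard `sqlite3`, `queue`, and `threading` modules. `AsyncWriter`, `sync_to_final`, and the `results` schema are hypothetical names for illustration only. A background thread drains a queue into a partial db in a tmp location, and a separate function copies the partial rows into the final db in a single transaction.

```python
import os
import queue
import sqlite3
import tempfile
import threading

# Hypothetical schema, shared by the partial and final databases.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS results "
    "(run INTEGER, step INTEGER, metric TEXT, value REAL)"
)
_STOP = object()  # sentinel telling the writer thread to finish


class AsyncWriter:
    """Hypothetical writer: rows are queued and written by a background thread."""

    def __init__(self, tmp_path, batch_size=500):
        self.tmp_path = tmp_path
        self.batch_size = batch_size
        self.q = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        # Open the connection inside the thread, because sqlite3 connections
        # are bound to the thread that created them by default.
        conn = sqlite3.connect(self.tmp_path)
        conn.execute(SCHEMA)
        conn.commit()
        batch = []
        while True:
            item = self.q.get()
            if item is not _STOP:
                batch.append(item)
            # Write when the batch is full, or when shutting down.
            if batch and (item is _STOP or len(batch) >= self.batch_size):
                with conn:
                    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", batch)
                batch = []
            if item is _STOP:
                break
        conn.close()

    def collect(self, run, step, metric, value):
        # Returns immediately; the experiment loop never waits on disk.
        self.q.put((run, step, metric, value))

    def close(self):
        self.q.put(_STOP)
        self.thread.join()


def sync_to_final(tmp_path, final_path, timeout=60.0):
    """Copy all rows from the partial db into the final db in one transaction."""
    conn = sqlite3.connect(final_path, timeout=timeout)
    conn.execute(SCHEMA)
    conn.commit()
    conn.execute("ATTACH DATABASE ? AS partial", (tmp_path,))
    with conn:
        conn.execute("INSERT INTO results SELECT * FROM partial.results")
    conn.execute("DETACH DATABASE partial")
    conn.close()


def default_tmp_dir():
    # Falls back to the system temp dir when not running under SLURM.
    return os.environ.get("SLURM_TMPDIR", tempfile.gettempdir())
```

A run would create `AsyncWriter(os.path.join(default_tmp_dir(), "partial.db"))`, call `collect` inside the experiment loop, then call `close()` followed by `sync_to_final(...)` at a checkpoint or at the end of the run. The shared final db is then only locked for one short bulk insert per sync, not for every batch.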