mlfoundations / open_lm

A repository for research on medium sized language models.

Improve dataloading.

GeorgiosSmyrnis opened this issue:

Some items that need to be addressed:

  • Clean up the code in data.py.
  • Make --dataset-resampled and --dataset-manifest the only possible options.
  • Make --accurate-total-tokens the default.
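
For illustration only, here is a rough sketch of how the last two items might look in an argparse-based CLI. The flag names come from the list above; the function name `parse_data_args`, the argument types, and the validation rule are assumptions made for this example and are not taken from open_lm's actual code. It also reads "the only possible options" as "at least one of the two must be given", which is one possible interpretation.

```python
import argparse


def parse_data_args(argv=None):
    # Hypothetical parser: the flag names come from the issue, but the types,
    # defaults, and validation below are assumptions, not open_lm's code.
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-manifest", type=str, nargs="+", default=None)
    parser.add_argument("--dataset-resampled", action="store_true")
    # BooleanOptionalAction (Python 3.9+) also creates --no-accurate-total-tokens,
    # so the behavior is on by default but can still be switched off.
    parser.add_argument(
        "--accurate-total-tokens",
        action=argparse.BooleanOptionalAction,
        default=True,
    )
    args = parser.parse_args(argv)

    # Reject runs that select neither of the two supported data options.
    if not (args.dataset_manifest or args.dataset_resampled):
        parser.error("pass --dataset-manifest and/or --dataset-resampled")
    return args


if __name__ == "__main__":
    print(parse_data_args())
```

With this sketch, a run that passes only `--dataset-manifest` gets accurate token counting without asking for it, and a run that passes neither data flag fails at argument parsing rather than later in the dataloader.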

Should we close this @GeorgiosSmyrnis? :)

Improvement is a continuous process :)

But I agree, closing this thanks to #111 being merged.