Simplify dataset code
nirum opened this issue
Niru Maheswaranathan commented
Currently, the datasets.py module has a number of helper functions for loading data. However, these functions are a mishmash of multiple responsibilities:
- Loading raw data, from either TFDS or csv files
- Tokenization
- Getting inputs / labels / index into a standard format
- Applying custom filters/transformations
- Batching
- Caching / Shuffling
It might make sense to refactor the code so that each of these steps is a separate, explicit stage of the pipeline, and to do so in a way that lets users easily customize individual stages. That would also make it easier to reason about what custom filters/transformations are doing.
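
To make the idea concrete, here is a rough sketch of what a staged pipeline could look like. It is not the current datasets.py API: every name in it (`tokenize`, `standardize`, `keep_if`, `batch`, `pipeline`) is hypothetical, it uses plain Python generators instead of TFDS, and whitespace splitting stands in for a real tokenizer. Caching and shuffling are left out to keep it short.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# All names below are hypothetical; this is an illustration, not the
# existing datasets.py interface.
Example = Dict[str, object]


def tokenize(examples: Iterable[Example]) -> Iterator[Example]:
    """Add a 'tokens' field (whitespace split stands in for a real tokenizer)."""
    for ex in examples:
        yield {**ex, "tokens": ex["text"].split()}


def standardize(examples: Iterable[Example]) -> Iterator[Example]:
    """Map each example to the standard inputs / labels / index format."""
    for i, ex in enumerate(examples):
        yield {"inputs": ex["tokens"], "labels": ex["label"], "index": i}


def keep_if(predicate: Callable[[Example], bool]):
    """Build a stage that keeps only the examples matching `predicate`."""
    def stage(examples: Iterable[Example]) -> Iterator[Example]:
        return (ex for ex in examples if predicate(ex))
    return stage


def batch(batch_size: int):
    """Build a stage that groups examples into lists of up to `batch_size`."""
    def stage(examples: Iterable[Example]) -> Iterator[List[Example]]:
        buf: List[Example] = []
        for ex in examples:
            buf.append(ex)
            if len(buf) == batch_size:
                yield buf
                buf = []
        if buf:  # emit the final, possibly smaller, batch
            yield buf
    return stage


def pipeline(source, *stages):
    """Apply each stage to the output of the previous one, in order."""
    data = source
    for stage in stages:
        data = stage(data)
    return data


# Toy in-memory data standing in for examples loaded from TFDS or a csv file.
raw = [
    {"text": "a fine film", "label": 1},
    {"text": "dull", "label": 0},
    {"text": "not bad at all", "label": 1},
]

batches = list(pipeline(
    raw,
    tokenize,
    standardize,
    keep_if(lambda ex: len(ex["inputs"]) > 1),  # user-supplied custom filter
    batch(2),
))
```

Because each stage is its own function, the order is visible at the call site. Here the custom filter runs after `standardize`, so the `index` field still refers to an example's position in the raw data even after short examples are dropped.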