Simplify dataset code

Question

nirum opened this issue 4 years ago

Niru Maheswaranathan commented 4 years ago

Currently, the datasets.py module has a number of helper functions for loading data. However, these functions are a mishmash of multiple responsibilities:

Loading raw data from either TFDS or CSV files
Tokenization
Getting inputs / labels / index into a standard format
Applying custom filters/transformations
Batching
Caching / Shuffling

It might make sense to refactor the code a bit so that it expresses this pipeline more cleanly, and to do so in a way that lets users customize it efficiently. That would also make it easier to reason about what custom filters/transformations are doing.
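
As a rough illustration of what such a refactor could look like, here is a minimal sketch in plain Python where each responsibility is its own small stage and the stages are chained explicitly. All names here (`compose`, `tokenize`, `standardize`, `keep_if`, `batch`) are hypothetical and not taken from the existing datasets.py. The whitespace tokenizer and the in-memory list are stand-ins for the real tokenizer and for the TFDS/CSV loaders.

```python
from functools import reduce


def compose(*stages):
    """Chain stages left to right into a single pipeline function."""
    def pipeline(examples):
        return reduce(lambda data, stage: stage(data), stages, examples)
    return pipeline


def tokenize(text_key="text"):
    # Hypothetical whitespace tokenizer standing in for a real one.
    def stage(examples):
        for ex in examples:
            yield {**ex, "tokens": str(ex[text_key]).lower().split()}
    return stage


def standardize(label_key="label"):
    # Map each example to a common inputs / labels / index layout.
    def stage(examples):
        for index, ex in enumerate(examples):
            yield {"inputs": ex["tokens"], "labels": ex[label_key], "index": index}
    return stage


def keep_if(predicate):
    # Custom filter: keep only the examples for which predicate is true.
    def stage(examples):
        return (ex for ex in examples if predicate(ex))
    return stage


def batch(size):
    # Group examples into lists of at most `size` items.
    def stage(examples):
        chunk = []
        for ex in examples:
            chunk.append(ex)
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk
    return stage


# Toy in-memory data in place of a TFDS or CSV loader.
raw = [
    {"text": "A great movie", "label": 1},
    {"text": "Bad", "label": 0},
    {"text": "Not my favorite film", "label": 0},
]

pipeline = compose(
    tokenize(),
    standardize(),
    keep_if(lambda ex: len(ex["inputs"]) >= 2),  # drop one-word examples
    batch(2),
)
batches = list(pipeline(raw))
```

With this layout, a custom filter or transformation is just another stage in the `compose(...)` call, so its position relative to tokenization and batching is visible at a glance. Caching and shuffling are left out of the sketch but could be added as further stages in the same style.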