Find a way to not load all the tasks infos.
thomasw21 opened this issue · comments
Running
from promptsource.seqio_tasks import tasks
takes a huge amount of time. One of the main reasons is that the import queries all dataset infos:
- One has to load ALL dataset infos as soon as one uses one task.
- Even when the infos are cached, it still queries URLs to check that they haven't changed. One can bypass this by setting
HF_DATASETS_OFFLINE=1
as described in #703 (comment)
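For reference, a minimal sketch of setting that variable from Python instead of the shell. The helper name `enable_hf_offline` is made up for illustration. It assumes the variable is set before `datasets` (and therefore `promptsource`) is first imported, since as far as I understand the value is read at import time.

```python
import os


def enable_hf_offline() -> None:
    """Ask Hugging Face datasets to rely on the local cache (hypothetical helper)."""
    os.environ["HF_DATASETS_OFFLINE"] = "1"


enable_hf_offline()

# The slow import would go here, after the variable is set:
# from promptsource.seqio_tasks import tasks
```

This only avoids the URL checks from the second bullet; all dataset infos are still loaded from the cache.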
IMO both are unnecessary and should be fixed. Is there a reason why one cannot load seqio tasks dynamically, in the sense of fetching only what is necessary? Something along the lines of:
def add_seqio_task(task_name):
    seqio.TaskRegistry.add(...)
In order to use the module import functionality of seqio, importing the module needs to add the task you want to use to the task registry without calling any additional code. So, we either need to have a separate file for each task or change the underlying functionality in HF datasets.