bigscience-workshop / promptsource

Toolkit for creating, sharing and using natural language prompts.

Find a way to not load all the tasks infos.

thomasw21 opened this issue

Running from promptsource.seqio_tasks import tasks takes a huge amount of time. One of the main reasons is that the following line queries all dataset infos:

dataset_splits = utils.get_dataset_splits(dataset_name, subset_name)

This is problematic for two reasons:

  • One has to load ALL dataset infos as soon as one uses one task.
  • Even when the infos are cached, it still queries URLs to check that they haven't changed. One can bypass this by setting HF_DATASETS_OFFLINE=1, as described in #703 (comment)
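
For reference, the offline workaround from the second point can also be applied from Python rather than the shell. This is a minimal sketch, assuming the variable has to be set before datasets (or anything that imports it) is first imported, because the library reads it at import time. The promptsource import is left commented out so the snippet runs without the package installed.

```python
import os

# Set the flag before `datasets` is imported anywhere in the process;
# setting it afterwards is assumed to have no effect on an already-loaded module.
os.environ["HF_DATASETS_OFFLINE"] = "1"

# Uncomment in an environment where promptsource is installed:
# from promptsource.seqio_tasks import tasks
```

This only avoids the network checks on cached infos. It does not address the first point: every dataset info is still loaded on import.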

IMO both are unnecessary and should be fixed. Is there a reason why one cannot load seqio tasks dynamically, in the sense of fetching only what is necessary? Something along the lines of:

def add_seqio_task(task_name):
    seqio.TaskRegistry.add(...)

In order to use the module import functionality of seqio, importing the module needs to add the task you want to use to the task registry without calling any additional code. So, we either need to have a separate file for each task or change the underlying functionality in HF datasets.