google-research / FLAN


[Question] How much memory is required to generate wiki_dialog?

TheExGenesis opened this issue

I've been running out of memory generating wiki_dialog on machines with 88GB RAM, whereas the dataset is supposed to be only ~37GB.

@TheExGenesis Sorry, I haven't measured this. I imagine the Few-shot creation is the culprit.

If I have time later this week, I can look into capping the dataset, or having it sample from a subset of few-shot exemplars. As a workaround, you could also download ~100k examples and just have the Wiki Dialog Task Configs read from that file rather than work with all 13M examples; I don't think you'd really lose anything.

To do that, you would edit this code to read from a tsv file like the CoT datasets do here.
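For concreteness, here is a minimal sketch of what a TSV-backed reader could look like. This is not the repo's actual task config: the file path, split naming, and the two-column inputs/targets layout are all assumptions you would adapt to however you dump your subset.

```python
# A minimal sketch (not the actual FLAN task config) of reading a local TSV
# subset instead of the full TFDS wiki_dialog dataset.
# The path and the "<inputs>\t<targets>" column layout are assumptions.
import tensorflow as tf


def _parse_tsv_line(line):
  # Assumes each line is "<inputs>\t<targets>"; adjust to your dump's format.
  fields = tf.io.decode_csv(
      line, record_defaults=["", ""], field_delim="\t", use_quote_delim=False)
  return {"inputs": fields[0], "targets": fields[1]}


def wiki_dialog_subset_dataset(split, shuffle_files=False, seed=None):
  # "wiki_dialog_100k.<split>.tsv" is a hypothetical file you create yourself
  # by dumping ~100k WikiDialog examples (see the dump sketch further down).
  del shuffle_files, seed  # not needed for a single local file
  path = f"/path/to/wiki_dialog_100k.{split}.tsv"
  return tf.data.TextLineDataset(path).map(
      _parse_tsv_line, num_parallel_calls=tf.data.AUTOTUNE)
```

A function like this could then be plugged into the task definition (for example via something like seqio.FunctionDataSource, if that is how the surrounding config is structured) in place of the TFDS source.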

@shayne-longpre How would you download only a subset of the examples? I'm doing tfds.load("wiki_dialog:1.0.0", with_info=True, split="validation[:1%]"), which should be about 1k examples, but it's taking a long time and is already using upwards of 30GB of RAM, which doesn't seem reasonable.

Given that I'm only trying to download with tfds, I don't think few-shot creation is to blame.

@TheExGenesis As you pointed out, WikiDialog is 13M examples and ~37GB of dialog data. The blow-up is a quirk of TensorFlow Datasets: it downloads and prepares the entire dataset before taking the 1% slice. I'd recommend following my earlier suggestion: download the dataset yourself, subset it, then change the reader function as described above ("To do that, you would edit this code to read from a tsv file like the CoT datasets do here.").
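One way to produce that subset, once the dataset has been prepared somewhere with enough resources, is to iterate over it and write the first ~100k examples to a TSV. A rough sketch is below; the feature keys ("passage", "utterances") are assumptions based on the WikiDialog schema, so check ds_info.features before relying on them.

```python
# Rough sketch: dump a ~100k-example subset of wiki_dialog to a TSV file.
# Feature names ("passage", "utterances") are assumptions -- verify them
# against ds_info.features on your machine.
import tensorflow_datasets as tfds

ds, ds_info = tfds.load("wiki_dialog:1.0.0", split="train", with_info=True)
print(ds_info.features)  # confirm the actual feature keys first

with open("wiki_dialog_100k.train.tsv", "w") as f:
  for ex in ds.take(100_000).as_numpy_iterator():
    passage = ex["passage"].decode("utf-8")
    dialog = " ".join(u.decode("utf-8") for u in ex["utterances"])
    # Strip tabs/newlines so the TSV stays one example per line.
    passage = passage.replace("\t", " ").replace("\n", " ")
    dialog = dialog.replace("\t", " ").replace("\n", " ")
    f.write(f"{passage}\t{dialog}\n")
```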

I managed to do it on a machine with 300GB of RAM.

This is a hack, but if the data loading is changed from Apache Beam to regular Python (a change of fewer than 5 lines of code), it runs much more easily on a single machine.
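I haven't looked at the actual wiki_dialog builder code, but the kind of change being described usually amounts to replacing the Beam pipeline in _generate_examples with a plain generator, roughly like the sketch below. The file list, parsing, and key scheme here are placeholders, not the real builder.

```python
# Illustrative only: the general shape of swapping a Beam pipeline for a
# plain Python generator in a TFDS builder's _generate_examples. Not the
# real wiki_dialog builder; filenames, parsing, and keys are placeholders.
import gzip
import json


def _generate_examples(filepaths):
  # Beam version (roughly):
  #   return beam.Create(filepaths) | beam.FlatMap(parse_file)
  # Plain-Python version: yield examples one file at a time, which keeps
  # memory bounded by a single shard instead of a whole Beam pipeline.
  for path in filepaths:
    with gzip.open(path, "rt") as f:
      for i, line in enumerate(f):
        record = json.loads(line)
        yield f"{path}-{i}", record
```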