[Question] How much memory is required to generate wiki_dialog?
TheExGenesis opened this issue · comments
I've been running out of memory generating wiki_dialog on machines with 88GB RAM, whereas the dataset is supposed to be only ~37GB.
@TheExGenesis Sorry, I haven't measured this. I imagine the Few-shot creation is the culprit.
If I have time later this week, I can look into capping the dataset, or having it sample from a subset of few-shot exemplars. As a workaround, you could also download ~100k examples and have the Wiki Dialog Task Configs read from that file rather than work with all 13M examples — I don't think you'd really lose anything.
To do that, you would edit this code to read from a tsv file like the CoT datasets do here.
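The suggestion above can be sketched as a small TSV reader. This is only an illustration: the function name, column layout (input text, target text), and file path are assumptions, not the repo's actual reader — adapt it to match how the CoT datasets are read.

```python
import csv

def read_wikidialog_subset(path, max_examples=100_000):
    """Yield (input, target) pairs from a pre-downloaded TSV subset.

    Hypothetical column layout: one example per row, tab-separated into
    input and target text. Streaming the file row by row keeps memory
    usage flat regardless of file size.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        for i, row in enumerate(reader):
            if i >= max_examples:
                break
            yield row[0], row[1]
```

A generator like this can then back the task config in place of the full 13M-example loader.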
@shayne-longpre How would you download only a subset of the examples? I'm doing `tfds.load("wiki_dialog:1.0.0", with_info=True, split="validation[:1%]")`, which should be ~1k examples, yet it's taking a long time and is already using over 30 GB of RAM, which doesn't seem reasonable.
Given I'm only trying to download with tfds, I don't think few-shot creation is to blame.
@TheExGenesis As you pointed out, WikiDialog is 13M examples and 37GB of dialog data. This is a quirk of TensorFlow Datasets: it downloads and prepares the full dataset before taking the 1% slice. I would recommend following my earlier suggestion of downloading the dataset yourself, subsetting it, then changing the reader function: "To do that, you would edit this code to read from a tsv file like the CoT datasets do here."
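One way to do that subsetting step, sketched here under assumptions (the feature names `context`/`response` are hypothetical — check the actual WikiDialog features): iterate over the examples once and write only the first N rows to a TSV, so later runs never touch the full 13M examples.

```python
import csv

def export_subset(examples, path, n=100_000):
    """Write the first n examples to a TSV so later runs can read a
    small file instead of materializing all 13M WikiDialog examples.

    `examples` is any iterable of dicts; the field names used here are
    placeholders and must match the real dataset features.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for i, ex in enumerate(examples):
            if i >= n:
                break
            writer.writerow([ex["context"], ex["response"]])
```

You would feed it something like `tfds.as_numpy(tfds.load("wiki_dialog", split="train"))` with byte fields decoded to `str` first — though note the one-time full download/prepare cost still applies.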
I managed to do it on a 300GB machine.
This is a hack, but if the data loading is changed from Apache Beam to regular Python (< 5 lines of code change), it runs much more easily on a single machine.
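The spirit of that change can be sketched as follows — the filename, JSONL format, and function name are assumptions for illustration, not the repo's actual Beam pipeline:

```python
import json

def iter_examples(path):
    """Plain-Python replacement for a Beam read stage: stream one JSON
    line at a time with a generator instead of constructing a pipeline,
    so memory stays flat on a single machine.

    (The JSONL layout assumed here is hypothetical; adjust parsing to
    whatever format the raw dump actually uses.)
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

Beam buys parallelism across workers, which matters for preprocessing 13M examples on a cluster but mostly adds overhead on one machine — hence the memory savings from a plain generator.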