google-research / FLAN


[Question] How much memory is required to generate wiki_dialog?

TheExGenesis opened this issue

I've been running out of memory generating wiki_dialog on machines with 88GB RAM, whereas the dataset is supposed to be only ~37GB.

@TheExGenesis Sorry, I haven't measured this. I imagine the Few-shot creation is the culprit.

If I have time later this week, I can look into capping the dataset, or having it sample from a subset of few-shot exemplars. As a workaround, you could also download ~100k examples and just have the Wiki Dialog Task Configs read from that file rather than work with all 13M examples; I don't think you'd really lose anything.

To do that, you would edit this code to read from a tsv file like the CoT datasets do here.
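For concreteness, here is a minimal sketch of what a TSV-backed reader could look like. This is not the repo's actual task config: the file path, split naming, and the two-column inputs/targets layout are all assumptions you would adapt to however you dump your subset.

```python
# A minimal sketch (not the actual FLAN task config) of reading a local TSV
# subset instead of the full TFDS wiki_dialog dataset.
# The path and the "<inputs>\t<targets>" column layout are assumptions.
import tensorflow as tf


def _parse_tsv_line(line):
  # Assumes each line is "<inputs>\t<targets>"; adjust to your dump's format.
  fields = tf.io.decode_csv(
      line, record_defaults=["", ""], field_delim="\t", use_quote_delim=False)
  return {"inputs": fields[0], "targets": fields[1]}


def wiki_dialog_subset_dataset(split, shuffle_files=False, seed=None):
  # "wiki_dialog_100k.<split>.tsv" is a hypothetical file you create yourself
  # by dumping ~100k WikiDialog examples (see the dump sketch further down).
  del shuffle_files, seed  # not needed for a single local file
  path = f"/path/to/wiki_dialog_100k.{split}.tsv"
  return tf.data.TextLineDataset(path).map(
      _parse_tsv_line, num_parallel_calls=tf.data.AUTOTUNE)
```

A function like this could then be plugged into the task definition (for example via something like seqio.FunctionDataSource, if that is how the surrounding config is structured) in place of the TFDS source.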

@shayne-longpre How would you download only a subset of the examples? I'm doing tfds.load("wiki_dialog:1.0.0", with_info=True, split="validation[:1%]"), which should be about 1k examples, but it's taking a long time and is already using upwards of 30GB of RAM, which doesn't seem reasonable.

Given that I'm only trying to download with tfds, I don't think few-shot creation is to blame.

@TheExGenesis As you pointed out, WikiDialog is 13M examples and ~37GB of dialog data. The blow-up is a quirk of TensorFlow Datasets: it downloads and prepares the entire dataset before taking the 1% slice. I'd recommend following my earlier suggestion: download the dataset yourself, subset it, then change the reader function as described above ("To do that, you would edit this code to read from a tsv file like the CoT datasets do here.").
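One way to produce that subset, once the dataset has been prepared somewhere with enough resources, is to iterate over it and write the first ~100k examples to a TSV. A rough sketch is below; the feature keys ("passage", "utterances") are assumptions based on the WikiDialog schema, so check ds_info.features before relying on them.

```python
# Rough sketch: dump a ~100k-example subset of wiki_dialog to a TSV file.
# Feature names ("passage", "utterances") are assumptions -- verify them
# against ds_info.features on your machine.
import tensorflow_datasets as tfds

ds, ds_info = tfds.load("wiki_dialog:1.0.0", split="train", with_info=True)
print(ds_info.features)  # confirm the actual feature keys first

with open("wiki_dialog_100k.train.tsv", "w") as f:
  for ex in ds.take(100_000).as_numpy_iterator():
    passage = ex["passage"].decode("utf-8")
    dialog = " ".join(u.decode("utf-8") for u in ex["utterances"])
    # Strip tabs/newlines so the TSV stays one example per line.
    passage = passage.replace("\t", " ").replace("\n", " ")
    dialog = dialog.replace("\t", " ").replace("\n", " ")
    f.write(f"{passage}\t{dialog}\n")
```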

I managed to do it on a machine with 300GB of RAM.

This is a hack, but if the data loading is changed from Apache Beam to regular Python (a change of fewer than 5 lines of code), it runs much more easily on a single machine.
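I haven't looked at the actual wiki_dialog builder code, but the kind of change being described usually amounts to replacing the Beam pipeline in _generate_examples with a plain generator, roughly like the sketch below. The file list, parsing, and key scheme here are placeholders, not the real builder.

```python
# Illustrative only: the general shape of swapping a Beam pipeline for a
# plain Python generator in a TFDS builder's _generate_examples. Not the
# real wiki_dialog builder; filenames, parsing, and keys are placeholders.
import gzip
import json


def _generate_examples(filepaths):
  # Beam version (roughly):
  #   return beam.Create(filepaths) | beam.FlatMap(parse_file)
  # Plain-Python version: yield examples one file at a time, which keeps
  # memory bounded by a single shard instead of a whole Beam pipeline.
  for path in filepaths:
    with gzip.open(path, "rt") as f:
      for i, line in enumerate(f):
        record = json.loads(line)
        yield f"{path}-{i}", record
```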