hotpotqa / hotpot


Preprocessing step - memory consumption

AndresFRJ98 opened this issue · comments

Hello,

I am trying to get the initial repo to work by following the steps provided. In the preprocessing step, I run:

python main.py --mode prepro --data_file hotpot_train_v1.1.json --para_limit 2250 --data_split train
The preprocessing begins, but the memory consumption on my machine becomes enormous (up to 10 GB), so I decided to terminate it. How many 'tasks' (as displayed while running) does the preprocessing have to go through?

Is this amount of memory consumption normal, or is something wrong with my setup/environment?

Thanks

That amount of memory consumption is expected. You can reduce it by processing fewer tasks at a time (lower n_jobs in Parallel()). There is one "task" per example in the dataset, so depending on which split you're processing, it could be roughly 90k (train) or 8k (dev).