Preprocessing step - memory consumption
AndresFRJ98 opened this issue · comments
Hello,
I am trying to get the initial repo to work by following the steps provided. In the preprocessing step, upon running:
python main.py --mode prepro --data_file hotpot_train_v1.1.json --para_limit 2250 --data_split train
The preprocessing begins. However, the memory consumption on my machine becomes enormous (up to 10 GB), so I decided to terminate it. How many "tasks" (as displayed while running) does the preprocessing have to go through?
Is this amount of memory consumption normal, or is something wrong with my setup/environment?
Thanks
That amount of memory consumption is expected. You can reduce it by processing fewer tasks at a time (the `n_jobs` argument of `Parallel()`). There's one "task" for each example in the dataset, so depending on which split you're processing, it could be ~90k or ~8k.
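The trade-off can be sketched with Python's standard-library `multiprocessing.Pool` as a stand-in for joblib's `Parallel` (a hypothetical illustration, not the repo's actual preprocessing code — `preprocess_example` and the worker count are assumptions):

```python
from multiprocessing import Pool


def preprocess_example(example):
    # Placeholder for the real per-example preprocessing work.
    return example * 2


if __name__ == "__main__":
    examples = list(range(100))  # stand-in for the dataset's examples
    # Fewer concurrent workers means fewer examples being processed
    # (and held in memory) at once, which caps peak memory usage at
    # the cost of longer wall-clock time. This mirrors lowering
    # n_jobs in joblib's Parallel().
    with Pool(processes=2) as pool:
        results = pool.map(preprocess_example, examples)
    print(len(results))
```

With joblib, the equivalent change would be lowering the `n_jobs` value passed to `Parallel()` in the preprocessing code.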