What's the meaning of sent_id and part_id in this dataset?
wbqhb opened this issue · comments
What's the meaning of sent_id and part_id in this dataset?
Hi!
Part_id is kept mainly for debugging, since several documents can share the same name and differ only in part_id. The expected CoNLL output format also contains a part_id column, so this information has to be preserved.
Sent_id is used in two places:
- When splitting the document into windows of 512 subtokens max, we don't want to split in the middle of a sentence.
- When predicting spans from head words, we only consider possible boundaries within the same sentence.
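The two uses of sent_id above can be sketched in plain Python. This is only an illustration, not the repository's actual code: it assumes sent_ids is a flat list holding one sentence id per token, and the helper names (sentence_spans, split_into_windows, same_sentence_bounds) are hypothetical.

```python
from typing import List, Tuple


def sentence_spans(sent_ids: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, one per run of equal sent_id."""
    spans = []
    start = 0
    for i in range(1, len(sent_ids) + 1):
        if i == len(sent_ids) or sent_ids[i] != sent_ids[start]:
            spans.append((start, i))
            start = i
    return spans


def split_into_windows(sent_ids: List[int], max_len: int = 512) -> List[Tuple[int, int]]:
    """Greedily pack whole sentences into windows of at most max_len tokens."""
    windows = []
    cur_start, cur_end = 0, 0
    for start, end in sentence_spans(sent_ids):
        if end - cur_start <= max_len:
            cur_end = end  # the sentence still fits into the current window
            continue
        if cur_end > cur_start:
            windows.append((cur_start, cur_end))  # close the current window
        # Assumption: a single sentence longer than max_len is hard-split.
        while end - start > max_len:
            windows.append((start, start + max_len))
            start += max_len
        cur_start, cur_end = start, end
    if cur_end > cur_start:
        windows.append((cur_start, cur_end))
    return windows


def same_sentence_bounds(sent_ids: List[int], head: int) -> Tuple[int, int]:
    """Return the (start, end) range, end exclusive, of the head's sentence."""
    start = head
    while start > 0 and sent_ids[start - 1] == sent_ids[head]:
        start -= 1
    end = head + 1
    while end < len(sent_ids) and sent_ids[end] == sent_ids[head]:
        end += 1
    return start, end


# Three sentences of 3, 4 and 2 tokens.
example_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2]
example_windows = split_into_windows(example_ids, max_len=7)
example_bounds = same_sentence_bounds(example_ids, head=4)
```

With max_len=7, the first two sentences share a window and the third starts a new one; a span predicted from the head token at index 4 may only use boundaries inside the range 3 to 7.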
Thank you for your reply.
Another question: I am new to PyTorch, and the memory on my single GPU is not enough. How can I run this code in parallel on multiple GPUs?
Do you have memory issues when training or when evaluating?
How much GPU memory do you have?
I'm asking because there's no simple way to run the code on multiple GPUs, but depending on your needs, I could suggest some approaches that might help.
I would try wrapping model.bert and model.a_scorer in this function. They are the most memory-intensive modules during training.
Alternatively, you can place them on different devices (using .to method).
Another option would be to try LMS.
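As a rough sketch of the suggestion to place modules on different devices with .to: the example below uses two small nn.Linear layers as stand-ins for model.bert and model.a_scorer. The wrapper class TwoDeviceModel and the helper pick_devices are hypothetical names, not part of this repository, and the code falls back to a single GPU or the CPU when two GPUs are not available.

```python
import torch
from torch import nn


def pick_devices():
    """Use two GPUs when available, otherwise fall back to one GPU or the CPU."""
    n_gpus = torch.cuda.device_count()
    if n_gpus >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    if n_gpus == 1:
        return torch.device("cuda:0"), torch.device("cuda:0")
    return torch.device("cpu"), torch.device("cpu")


class TwoDeviceModel(nn.Module):
    """Hypothetical wrapper: encoder on one device, scorer on another."""

    def __init__(self, encoder, scorer, enc_device, scorer_device):
        super().__init__()
        self.enc_device = enc_device
        self.scorer_device = scorer_device
        self.encoder = encoder.to(enc_device)
        self.scorer = scorer.to(scorer_device)

    def forward(self, x):
        # Inputs and activations must be moved to the device of the
        # module that consumes them.
        hidden = self.encoder(x.to(self.enc_device))
        return self.scorer(hidden.to(self.scorer_device))


enc_device, scorer_device = pick_devices()
# nn.Linear layers stand in for the real, much larger modules.
model = TwoDeviceModel(nn.Linear(16, 32), nn.Linear(32, 1), enc_device, scorer_device)
scores = model(torch.randn(4, 16))
```

Splitting modules this way spreads the memory across devices but does not run them in parallel: each batch still passes through the encoder first and the scorer second.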
Thank you for your help. I will give it a try.