vdobrovolskii / wl-coref

This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"


What's the meaning of sent_id and part_id in this dataset?

wbqhb opened this issue · comments


Hi!

part_id is mainly there for debugging: there can be multiple documents with the same name but different part_id values. The expected CoNLL output format also contains a part_id column, so this information has to be preserved.

sent_id is used in two places:

  1. When splitting the document into windows of 512 subtokens max, we don't want to split in the middle of a sentence.
  2. When predicting spans from head words, we only consider possible boundaries within the same sentence.
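The first point above can be sketched in plain Python. This is only an illustration of the idea, not the repository's actual implementation; split_into_windows and its (subtoken-level) sent_ids input are hypothetical names:

```python
def split_into_windows(sent_ids, max_len=512):
    """Split a document into windows of at most max_len subtokens,
    backing off so that no window ends mid-sentence.

    sent_ids: one sentence id per subtoken, e.g. [0, 0, 0, 1, 1, ...]
    Returns a list of (start, end) index pairs.
    """
    windows, start, n = [], 0, len(sent_ids)
    while start < n:
        end = min(start + max_len, n)
        if end < n:
            # back off to the last sentence boundary inside the window
            while end > start and sent_ids[end] == sent_ids[end - 1]:
                end -= 1
            if end == start:
                # a single sentence longer than max_len: hard cut as a fallback
                end = min(start + max_len, n)
        windows.append((start, end))
        start = end
    return windows

# Example: three sentences of 3, 4 and 2 subtokens, window size 6
print(split_into_windows([0, 0, 0, 1, 1, 1, 1, 2, 2], max_len=6))
# → [(0, 3), (3, 9)]
```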
commented

Thank you for your reply.

commented

Another question: I am a novice in PyTorch, and my single GPU does not have enough memory. How can I run this code in parallel on multiple GPUs?

Do you have memory issues when training or when evaluating?
How much GPU memory do you have?

I'm asking because there is no simple way to run the code on multiple GPUs, but depending on your needs, I could suggest some approaches that might help.

commented

I want to train your model with my own modifications added. I have ten GPUs with 16 GB of memory each.

(screenshot of code, not preserved)

Actually, I don't know where to add this to your code.

I would try wrapping model.bert and model.a_scorer in this function; they are the most memory-intensive modules during training.
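The screenshot above is not preserved, so the exact function is an assumption here; a likely candidate is PyTorch's gradient checkpointing, torch.utils.checkpoint.checkpoint, which trades compute for memory by recomputing activations during the backward pass instead of storing them. A minimal sketch, where TinyScorer is a hypothetical stand-in for a memory-intensive module such as model.a_scorer:

```python
import torch
from torch.utils.checkpoint import checkpoint


class TinyScorer(torch.nn.Module):
    """Hypothetical stand-in for a memory-intensive module."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return torch.relu(self.linear(x))


model = TinyScorer()
x = torch.randn(4, 8, requires_grad=True)

# Wrap the forward pass in checkpoint(): activations inside the wrapped
# call are not stored but recomputed during backward, saving memory.
y = checkpoint(model, x, use_reentrant=False)
loss = y.sum()
loss.backward()  # gradients flow through the checkpointed segment
```

The trade-off is extra compute: the wrapped forward runs twice (once for the loss, once during backward), which is often acceptable when GPU memory is the bottleneck.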

Alternatively, you can place them on different devices (using the .to method) and move the intermediate tensors between devices in the forward pass.
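A minimal sketch of this kind of manual model parallelism, assuming a toy TwoPartModel whose two linear layers stand in for model.bert and model.a_scorer (it falls back to CPU when fewer than two GPUs are visible, so it also runs on a CPU-only machine):

```python
import torch


class TwoPartModel(torch.nn.Module):
    """Toy model split across two devices via .to()."""
    def __init__(self, dev_a, dev_b):
        super().__init__()
        self.encoder = torch.nn.Linear(16, 16).to(dev_a)  # stand-in for model.bert
        self.scorer = torch.nn.Linear(16, 4).to(dev_b)    # stand-in for model.a_scorer
        self.dev_a, self.dev_b = dev_a, dev_b

    def forward(self, x):
        h = self.encoder(x.to(self.dev_a))
        # move the intermediate activations to the scorer's device
        return self.scorer(h.to(self.dev_b))


# Use two GPUs when available, otherwise fall back to CPU
dev_a = "cuda:0" if torch.cuda.device_count() > 0 else "cpu"
dev_b = "cuda:1" if torch.cuda.device_count() > 1 else "cpu"

model = TwoPartModel(dev_a, dev_b)
out = model(torch.randn(2, 16))
```

Note that this spreads memory across devices but does not parallelize computation: the scorer waits for the encoder, so it helps with out-of-memory errors rather than training speed.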

Another way would be to try LMS (Large Model Support).

commented

Thank you for your help. I will have a try.