vdobrovolskii / wl-coref

This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"


What's the meaning of sent_id and part_id in this dataset?

wbqhb opened this issue · comments


Hi!

part_id is mainly there for debugging: there can be multiple documents with the same name but different part_id values. The expected CoNLL output format also contains a part_id column, so this information has to be preserved.

sent_id is used in two places:

  1. When splitting the document into windows of 512 subtokens max, we don't want to split in the middle of a sentence.
  2. When predicting spans from head words, we only consider possible boundaries within the same sentence.
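The first point above can be sketched in plain Python. This is only an illustration of the idea, not the repository's actual implementation; split_into_windows and its (subtoken-level) sent_ids input are hypothetical names:

```python
def split_into_windows(sent_ids, max_len=512):
    """Split a document into windows of at most max_len subtokens,
    backing off so that no window ends mid-sentence.

    sent_ids: one sentence id per subtoken, e.g. [0, 0, 0, 1, 1, ...]
    Returns a list of (start, end) index pairs.
    """
    windows, start, n = [], 0, len(sent_ids)
    while start < n:
        end = min(start + max_len, n)
        if end < n:
            # back off to the last sentence boundary inside the window
            while end > start and sent_ids[end] == sent_ids[end - 1]:
                end -= 1
            if end == start:
                # a single sentence longer than max_len: hard cut as a fallback
                end = min(start + max_len, n)
        windows.append((start, end))
        start = end
    return windows

# Example: three sentences of 3, 4 and 2 subtokens, window size 6
print(split_into_windows([0, 0, 0, 1, 1, 1, 1, 2, 2], max_len=6))
# → [(0, 3), (3, 9)]
```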
commented

Thank you for your reply.

commented

Another question: I am a novice in PyTorch, and my single GPU does not have enough memory. How can I run this code in parallel on multiple GPUs?

Do you have memory issues when training or when evaluating?
How much GPU memory do you have?

I'm asking because there is no simple way to run the code on multiple GPUs, but depending on your needs, I could suggest some approaches that might help.

commented

I want to train your model with my own modifications added. I have ten GPUs with 16 GB of memory each.

(screenshot of code, not preserved)

Actually, I don't know where to add this to your code.

I would try wrapping model.bert and model.a_scorer in this function; they are the most memory-intensive modules during training.
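The screenshot above is not preserved, so the exact function is an assumption here; a likely candidate is PyTorch's gradient checkpointing, torch.utils.checkpoint.checkpoint, which trades compute for memory by recomputing activations during the backward pass instead of storing them. A minimal sketch, where TinyScorer is a hypothetical stand-in for a memory-intensive module such as model.a_scorer:

```python
import torch
from torch.utils.checkpoint import checkpoint


class TinyScorer(torch.nn.Module):
    """Hypothetical stand-in for a memory-intensive module."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return torch.relu(self.linear(x))


model = TinyScorer()
x = torch.randn(4, 8, requires_grad=True)

# Wrap the forward pass in checkpoint(): activations inside the wrapped
# call are not stored but recomputed during backward, saving memory.
y = checkpoint(model, x, use_reentrant=False)
loss = y.sum()
loss.backward()  # gradients flow through the checkpointed segment
```

The trade-off is extra compute: the wrapped forward runs twice (once for the loss, once during backward), which is often acceptable when GPU memory is the bottleneck.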

Alternatively, you can place them on different devices (using the .to method) and move the intermediate tensors between devices in the forward pass.
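A minimal sketch of this kind of manual model parallelism, assuming a toy TwoPartModel whose two linear layers stand in for model.bert and model.a_scorer (it falls back to CPU when fewer than two GPUs are visible, so it also runs on a CPU-only machine):

```python
import torch


class TwoPartModel(torch.nn.Module):
    """Toy model split across two devices via .to()."""
    def __init__(self, dev_a, dev_b):
        super().__init__()
        self.encoder = torch.nn.Linear(16, 16).to(dev_a)  # stand-in for model.bert
        self.scorer = torch.nn.Linear(16, 4).to(dev_b)    # stand-in for model.a_scorer
        self.dev_a, self.dev_b = dev_a, dev_b

    def forward(self, x):
        h = self.encoder(x.to(self.dev_a))
        # move the intermediate activations to the scorer's device
        return self.scorer(h.to(self.dev_b))


# Use two GPUs when available, otherwise fall back to CPU
dev_a = "cuda:0" if torch.cuda.device_count() > 0 else "cpu"
dev_b = "cuda:1" if torch.cuda.device_count() > 1 else "cpu"

model = TwoPartModel(dev_a, dev_b)
out = model(torch.randn(2, 16))
```

Note that this spreads memory across devices but does not parallelize computation: the scorer waits for the encoder, so it helps with out-of-memory errors rather than training speed.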

Another way would be to try LMS (Large Model Support).

commented

Thank you for your help. I will have a try.