izuna385 / Zero-Shot-Entity-Linking

Zero-shot entity linking with a blitz start in 3 minutes. Hard negative mining and an encoder for all entities are also included in this implementation.

[Bug] CUDA out-of-memory error during hard negative mining

ruanchaves opened this issue · comments

Issue description

Performing the standard experiment from README.md results in a CUDA out-of-memory error.

Although README.md says ~3 GB of CPU RAM and ~1.1 GB of GPU memory are enough to run the script, I get this error on a machine with 32 GB of GPU memory and 500 GB of RAM.

Furthermore, I get the same error even when using two 32 GB GPUs (64 GB in total) through the command CUDA_VISIBLE_DEVICES=0,1 python3 ./src/train.py -num_epochs 1 -cuda_devices 0,1

Steps to reproduce the issue

  1. Start a virtualenv in a PyTorch Docker container.
docker run --gpus all -it pytorch/pytorch
python -m venv venv
source venv/bin/activate
  2. Install the dependencies.
torch==1.4.0
transformers==2.8.0
allennlp==0.9.0
faiss-gpu==1.6.3
  3. Run the commands from README.md.
git clone https://github.com/izuna385/Zero-Shot-Entity-Linking.git
cd Zero-Shot-Entity-Linking
sh preprocessing.sh  # ~3 min
python3 ./src/train.py -num_epochs 1

What's the expected result?

The experiment started by the command python3 ./src/train.py -num_epochs 1 completes successfully.

What's the actual result?

On a 32GB NVIDIA V100 GPU: RuntimeError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 31.72 GiB total capacity; 30.05 GiB already allocated; 81.81 MiB free; 30.58 GiB reserved in total by PyTorch)

Full traceback (single GPU)

(venv) root@1a9122ab5fe4:/home/repositories/Zero-Shot-Entity-Linking# python3 ./src/train.py -num_epochs 1
===experiment starts===

===PARAMETERS===
debug False
bert_name bert-base-uncased
word_embedding_dropout 0.05
cuda_devices 0
allen_lazyload True
batch_size_for_train 32
batch_size_for_eval 8
hard_negatives_num 10
num_epochs 1
lr 1e-05
weight_decay 0
beta1 0.9
beta2 0.999
epsilon 1e-08
amsgrad False
max_title_len 12
max_desc_len 50
max_context_len_after_tokenize 100
add_mse_for_biencoder False
search_method indexflatip
add_hard_negatives True
metionPooling CLS
entityPooling CLS
dimentionReduction False
dimentionReductionToThisDim 300
extracted_first_token_for_description 100
extracted_first_token_for_title 16
dataset_dir ./data/
documents_dir ./data/documents/
mentions_dir ./data/mentions/
mentions_splitbyworld_dir ./data/mentions_split_by_world/
mention_leftandright_tokenwindowwidth 40
debugSampleNum 100000000
dir_for_each_world ./data/worlds/
experiment_logdir ./src/experiment_logdir/
===PARAMETERS END===

experiment_logdir: ./src/experiment_logdir/200817_040200/
World american_football is now being loaded...
  0%|          | 0/1 [00:00<?, ?it/s]======Encoding all entites from title and description=====
100%|##########| 31929/31929 [03:44<00:00, 141.96it/s]
250it [03:45,  1.11it/s]1929 [03:44<00:00, 141.88it/s]

########
HARD NEGATIVE MININGS started
########

100%|##########| 3898/3898 [02:45<00:00, 23.53it/s]
488it [02:45,  2.94it/s]98 [02:45<00:00, 19.44it/s]
  0%|          | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "./src/train.py", line 131, in <module>
    main()
  File "./src/train.py", line 76, in main
    trainer.train()
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/training/trainer.py", line 256, in batch_loss
    output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/training/util.py", line 331, in data_parallel
    outputs = parallel_apply(replicas, inputs, moved, used_device_ids)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/src/model.py", line 62, in forward
    encoded_entities_from_hard_negatives_idx0isgold = self.entity_encoder(docked_tokenlist).view(batch_, gold_plus_negs_num, -1)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/src/encoders.py", line 41, in forward
    entity_emb = self.word_embedder(title_and_desc_concatnated_text)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 131, in forward
    token_vectors = embedder(*tensors, **forward_params_values)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 26, in forward
    return self.transformer_model(token_ids)[0]
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 715, in forward
    head_mask=head_mask)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 437, in forward
    layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 417, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 389, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 142, in gelu
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
RuntimeError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 31.72 GiB total capacity; 30.05 GiB already allocated; 81.81 MiB free; 30.58 GiB reserved in total by PyTorch)

  1%|          | 31/3898 [00:03<06:44,  9.56it/s]

Full traceback (multi-GPU)

(venv) root@a236a2cd8f06:/home/repositories/Zero-Shot-Entity-Linking# CUDA_VISIBLE_DEVICES=0,1 python3 ./src/train.py -num_epochs 1 -cuda_devices 0,1
===experiment starts===

===PARAMETERS===
debug False
bert_name bert-base-uncased
word_embedding_dropout 0.05
cuda_devices 0,1
allen_lazyload True
batch_size_for_train 32
batch_size_for_eval 8
hard_negatives_num 10
num_epochs 1
lr 1e-05
weight_decay 0
beta1 0.9
beta2 0.999
epsilon 1e-08
amsgrad False
max_title_len 12
max_desc_len 50
max_context_len_after_tokenize 100
add_mse_for_biencoder False
search_method indexflatip
add_hard_negatives True
metionPooling CLS
entityPooling CLS
dimentionReduction False
dimentionReductionToThisDim 300
extracted_first_token_for_description 100
extracted_first_token_for_title 16
dataset_dir ./data/
documents_dir ./data/documents/
mentions_dir ./data/mentions/
mentions_splitbyworld_dir ./data/mentions_split_by_world/
mention_leftandright_tokenwindowwidth 40
debugSampleNum 100000000
dir_for_each_world ./data/worlds/
experiment_logdir ./src/experiment_logdir/
===PARAMETERS END===

experiment_logdir: ./src/experiment_logdir/200817_043656/
100%|##########| 433/433 [00:00<00:00, 277696.27B/s]
100%|##########| 440473133/440473133 [01:12<00:00, 6085356.67B/s] 
100%|##########| 231508/231508 [00:00<00:00, 448844.71B/s]
100%|##########| 407873900/407873900 [01:17<00:00, 5267746.95B/s] 
World american_football is now being loaded...
  0%|          | 0/1 [00:00<?, ?it/s]======Encoding all entites from title and description=====
100%|##########| 31929/31929 [03:33<00:00, 149.64it/s]
250it [03:33,  1.17it/s]1929 [03:33<00:00, 153.28it/s]

########
HARD NEGATIVE MININGS started
########

100%|##########| 3898/3898 [02:34<00:00, 25.24it/s]
488it [02:34,  3.16it/s]98 [02:34<00:00, 24.44it/s]
  0%|          | 0/1 [00:11<?, ?it/s]
Traceback (most recent call last):
  File "./src/train.py", line 131, in <module>
    main()
  File "./src/train.py", line 76, in main
    trainer.train()
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/training/trainer.py", line 256, in batch_loss
    output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/training/util.py", line 331, in data_parallel
    outputs = parallel_apply(replicas, inputs, moved, used_device_ids)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/src/model.py", line 62, in forward
    encoded_entities_from_hard_negatives_idx0isgold = self.entity_encoder(docked_tokenlist).view(batch_, gold_plus_negs_num, -1)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/src/encoders.py", line 41, in forward
    entity_emb = self.word_embedder(title_and_desc_concatnated_text)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 131, in forward
    token_vectors = embedder(*tensors, **forward_params_values)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 26, in forward
    return self.transformer_model(token_ids)[0]
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 715, in forward
    head_mask=head_mask)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 437, in forward
    layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 417, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 389, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/repositories/Zero-Shot-Entity-Linking/venv/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 142, in gelu
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
RuntimeError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 31.72 GiB total capacity; 30.05 GiB already allocated; 57.81 MiB free; 30.58 GiB reserved in total by PyTorch)

  2%|1         | 63/3898 [00:11<11:38,  5.49it/s]

I got around the error by reducing the batch_size_for_train from 32 down to 8. Then I was able to run ./src/train.py as expected on my setup.

Maybe it would be a good idea to put the following on README.md:
python3 ./src/train.py -num_epochs 1 -batch_size_for_train 1 -batch_size_for_eval 1

and, similarly,

CUDA_VISIBLE_DEVICES=0,1 python3 ./src/train.py -num_epochs 1 -batch_size_for_train 1 -batch_size_for_eval 1 -cuda_devices 0,1

given that the README currently says ~3 GB of CPU RAM and ~1.1 GB of GPU memory are enough to run the script.
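
For context, here is a back-of-the-envelope count of why the default settings overflow the GPU. This is only a sketch using the parameter values printed in the logs above; actual memory use also depends on sequence lengths and the BERT hidden size.

# Rough sketch based on the logged parameters (not measured numbers).
# With add_hard_negatives True, every mention in a batch is paired with its
# gold entity plus hard_negatives_num mined negatives, and all of them go
# through BERT in one forward pass (see the .view(batch_, gold_plus_negs_num, -1)
# call in the tracebacks above).
batch_size_for_train = 32
hard_negatives_num = 10

entity_sequences_per_step = batch_size_for_train * (1 + hard_negatives_num)
print(entity_sequences_per_step)  # 352 entity sequences encoded per training step

# The workaround above (batch size 8) shrinks this to 88:
print(8 * (1 + hard_negatives_num))  # 88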

I have the same problem. It seems that a single-GPU training mode is used in the encoders,
e.g. self.cuda_device = 0 and batch = nn_util.move_to_device(batch, self.cuda_device).
So even if I use CUDA_VISIBLE_DEVICES=0,1, in fact only a single GPU is used.
I want to know how to use AllenNLP with multiple GPUs. Can you please help?

Only BiEncoderTopXRetriever in utils.py uses a single GPU.
train.py calls Trainer, which uses all available GPUs by default.
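
For reference, this is roughly how AllenNLP 0.9 enables multi-GPU training: Trainer accepts either a single device id or a list of ids, and a list triggers the data_parallel path visible in the tracebacks above. A minimal sketch; the model, optimizer, iterator, and dataset arguments are placeholders, not this repository's exact objects.

# Sketch of AllenNLP 0.9 multi-GPU training; arguments are placeholders.
from typing import List, Union

from allennlp.training.trainer import Trainer

def build_trainer(model, optimizer, iterator, train_dataset,
                  cuda_device: Union[int, List[int]] = -1,
                  num_epochs: int = 1) -> Trainer:
    # cuda_device=[0, 1] replicates the model across both GPUs per batch;
    # a single int pins training to that one device; -1 means CPU.
    return Trainer(model=model,
                   optimizer=optimizer,
                   iterator=iterator,
                   train_dataset=train_dataset,
                   num_epochs=num_epochs,
                   cuda_device=cuda_device)

Components that do not go through the Trainer, such as the entity encoding used for hard negative mining, do not benefit from this and stay on a single device.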

I executed the command CUDA_VISIBLE_DEVICES=0,1 python3 ./src/train.py -num_epochs 1 -batch_size_for_train 1 -batch_size_for_eval 1 -cuda_devices 0,1 and then I had no problem training the model on multiple GPUs.

Did you forget to add -cuda_devices 0,1 at the end of your command?

I used this command
CUDA_VISIBLE_DEVICES=3,4,5 python3 train.py -num_epochs 1 -batch_size_for_train 8 -batch_size_for_eval 8 -cuda_devices 3,4,5

and the problem arises during the "Encoding all entities from title and description" step:
experiment_logdir: ../src/experiment_logdir/201120_125315/
World american_football is now being loaded...
  0%|          | 0/1 [00:00<?, ?it/s]======Encoding all entites from title and description=====
  0%|          | 0/1 [00:13<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 190, in <module>
    main()
  File "train.py", line 83, in main
    hardNegativeSearcher.hardNegativesSearcherandSetter()
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/hardnegative_searcher.py", line 41, in hardNegativesSearcherandSetter
    dui2encoded_emb, duidx2encoded_emb = self.dui2EncoderEntityEmbReturner()
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/hardnegative_searcher.py", line 76, in dui2EncoderEntityEmbReturner
    duidx2encoded_emb = self.encodeAllEntitiesEncoder.encoding_all_entities()
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/encoders.py", line 129, in encoding_all_entities
    duidxs, embs = self._extract_cuidx_and_its_encoded_emb(batch)
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/encoders.py", line 141, in _extract_cuidx_and_its_encoded_emb
    out_dict = self.entity_encoder_wrapping_model(**batch)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/model.py", line 108, in forward
    encoded_entites = self.entity_encoder(title_and_desc_concatnated_text=title_and_desc_concatnated_text)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/zqx/Zero-Shot-Entity-Linking-master/src/encoders.py", line 46, in forward
    entity_emb = self.word_embedder(title_and_desc_concatnated_text)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 131, in forward
    token_vectors = embedder(*tensors, **forward_params_values)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 26, in forward
    return self.transformer_model(token_ids)[0]
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 715, in forward
    head_mask=head_mask)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 437, in forward
    layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 417, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 389, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/zhg/anaconda3/envs/zsel/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 142, in gelu
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
RuntimeError: CUDA out of memory. Tried to allocate 3.72 GiB (GPU 0; 10.76 GiB total capacity; 9.73 GiB already allocated; 143.12 MiB free; 9.77 GiB reserved in total by PyTorch)
 16%|#5        | 4999/31929 [00:13<01:14, 363.16it/s]

The code in encoders.py seems to use a single GPU.

There are two workarounds:

  1. Use Docker containers. Execute docker run with the flag --gpus '"device=3,4,5"'. In this way the GPUs 3, 4 and 5 will be mapped to 0, 1 and 2 inside your container. More information here.

  2. If Docker containers are not available on your machine, or if you are not familiar with Docker, you can simply do the following (see the sketch after this list):

  • Replace self.cuda_device = 0 on line 201 of utils.py with self.cuda_device = 3

  • Replace self.cuda_device = 0 on line 107 of encoders.py with self.cuda_device = 3
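
A sketch of the second workaround in a slightly more reusable form: instead of editing the hard-coded constant in two files, the device id could be read from an environment variable. The variable name and class below are illustrative, not part of the repository.

import os

# Illustrative only: replaces the hard-coded `self.cuda_device = 0` in
# utils.py and encoders.py with a value read from an assumed environment
# variable, falling back to GPU 0 when it is unset.
class EncoderDeviceExample:
    def __init__(self) -> None:
        self.cuda_device = int(os.environ.get("ZSEL_ENCODER_DEVICE", "0"))

enc = EncoderDeviceExample()
print(enc.cuda_device)  # prints 3 when run with ZSEL_ENCODER_DEVICE=3

Note that the two workarounds should not be combined: once CUDA_VISIBLE_DEVICES=3,4,5 is set, PyTorch renumbers the visible GPUs from 0, so device 0 inside the process already refers to physical GPU 3.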

Thank you for your help.

100%|##########| 440473133/440473133 [01:12<00:00, 6085356.67B/s]

Hello, I want to know why your speed is so fast. Mine is shown below.

experiment_logdir: ./src/experiment_logdir/201217_102331/
61%|##########################################3 | 266586112/440473133 [12:25<02:27, 1178137.12B/s]

Does it depend on the CPUs?

I ran these experiments on two Tesla V100 GPUs in an NVIDIA DGX-1 32 GB server.
So yes, it depends on your setup.

By the way, @DRosemei and @doudouzqx, please let me know if you succeed in your experiments with the code in this repository or the BLINK repository. Although I was able to run the code and train the model, I couldn't achieve the results I was looking for.

@ruanchaves I've run into a problem now. I have downloaded the model named "bert-base-uncased", but I don't know where to put it.
The errors are shown below:
Model name 'bert-base-uncased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz' was a path or url but couldn't find any file associated to this path or url.
Traceback (most recent call last):
  File "./src/train.py", line 131, in <module>
    main()
  File "./src/train.py", line 44, in main
    mention_encoder = Pooler_for_mention(args=opts, word_embedder=textfieldEmbedder)
  File "/media/rose/Doc/projects/xiaofan/Zero-Shot-Entity-Linking/src/encoders.py", line 63, in __init__
    self.bertpooler_sec2vec = BertPooler(pretrained_model=self.bert_weight_filepath)
  File "/home/rose/anaconda3/envs/el/lib/python3.7/site-packages/allennlp/modules/seq2vec_encoders/bert_pooler.py", line 51, in __init__
    self.pooler = model.pooler
AttributeError: 'NoneType' object has no attribute 'pooler'

@DRosemei Can you post the command you are trying to run? What are your arguments to python3 ./src/train.py ?

@ruanchaves Yes, I used python3 ./src/train.py -num_epochs 1, and I was able to train it after putting "bert-base-uncased" in ./src/
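
For anyone hitting the same AttributeError: pytorch-transformers can also load the weights from an explicit local directory, which avoids guessing where the code resolves the model name. A sketch, assuming the files were unpacked under ./src/bert-base-uncased; the path is illustrative, and the repository's own loading goes through AllenNLP's BertPooler, so its lookup rules may differ.

# Sketch: loading bert-base-uncased from a local directory. The directory
# should contain config.json, pytorch_model.bin, and vocab.txt (the contents
# of the archive named in the error message above).
from pytorch_transformers import BertModel, BertTokenizer

local_dir = "./src/bert-base-uncased"  # assumed location
model = BertModel.from_pretrained(local_dir)
tokenizer = BertTokenizer.from_pretrained(local_dir)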

@ruanchaves I have completed 1 epoch, and the final results are below:
{
"entire_h1_percent": 20.28,
"entire_h10_percent": 42.88,
"entire_h50_percent": 54.42,
"entire_h64_percent": 55.96,
"entire_h100_percent": 59.440000000000005,
"entire_h500_percent": 71.00999999999999
}
The results are not so good. Have you ever trained for more than 1 epoch?

Yes, I have already trained for several epochs, but I couldn't achieve acceptable results.