GanjinZero / CODER

CODER: Knowledge infused cross-lingual medical term embedding for term normalization. [JBI, ACL-BioNLP 2022]

Home Page: https://www.sciencedirect.com/science/article/pii/S1532046421003129

Impact of padding strategy on CODER embeddings

mgh1 opened this issue · comments

commented

Dear Authors,

Thank you for the great work!

I was reviewing the code and noticed that the way you extract embeddings differs a bit from what is typically done, in that you pad every input to a max length (32 tokens). I don't usually see others do this when extracting embeddings; they just tokenize and pass the inputs through the model.

I experimented with different token lengths, both with and without additional padding. The result is that the cosine similarity scores between embeddings differ significantly depending on whether padding is used and on how much (i.e., what the max token length is).

I re-read your CODER papers and didn't find anything about padding, nor could I find anything more in this repo. Can you explain why you chose this padding strategy? Have you experimented with removing or adjusting the padding and measured its impact on cosine similarity between embeddings and on overall performance?

This was early work, so I used a somewhat unusual padding strategy. Medical terms are usually short, so 32 tokens are enough. The best way to use CODER is to use the same padding length as in my pretraining.
We have fixed this padding strategy to the normal one in our CODER++ paper.

The main reason is that we missed the attention_mask in pretraining, and we did not have enough resources to re-train the model at that time.
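The sensitivity to padding length can be illustrated without the model itself: if pooling does not exclude PAD positions via the attention mask, the pooled vector drifts toward the PAD embedding as padding grows. A toy numpy sketch with random vectors (all names and values hypothetical, not CODER's actual weights):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
tokens = rng.normal(size=(3, dim))   # vectors for the 3 real tokens of a term
pad = rng.normal(size=(dim,))        # vector for the [PAD] token

def naive_mean(pad_to):
    # Mean over ALL positions, ignoring the attention mask.
    seq = np.vstack([tokens, np.tile(pad, (pad_to - len(tokens), 1))])
    return seq.mean(axis=0)

def masked_mean(pad_to):
    # Mean over real tokens only, as attention-mask-aware pooling does;
    # padding never enters the sum, so pad_to is irrelevant.
    return tokens.mean(axis=0)

print(np.allclose(masked_mean(8), masked_mean(32)))  # True: invariant to padding
print(np.allclose(naive_mean(8), naive_mean(32)))    # False: same term, different vector
```

This is the benign case where only the pooling is unmasked; when the mask is also missing inside the transformer layers, every token's representation is contaminated, not just the pooled one.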

commented

We have fixed this padding strategy to the normal one in our CODER++ paper.

Can you please point to the code that fixes this? I see there are two nearly identical copies of generate_faiss_index.py; is it one of those, or somewhere else?

We use attention_mask in https://github.com/GanjinZero/CODER/blob/master/coderpp/train/model.py (i.e., in pretraining); the inference code in faiss may not be the latest.
If you are using CODER++ yourself, you can pad to any length and use the CLS representation as the term representation.
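Why masking makes the CLS representation padding-invariant can be shown with a single toy self-attention layer (identity Q/K/V projections, random embedding table — a minimal sketch, not the repo's model):

```python
import numpy as np

def self_attention(x, attention_mask):
    # One attention layer with identity Q/K/V projections, for illustration.
    # x: (seq, dim); attention_mask: (seq,), 1 = real token, 0 = padding.
    scores = x @ x.T / np.sqrt(x.shape[-1])                           # (seq, seq)
    scores = np.where(attention_mask[None, :] == 1, scores, -np.inf)  # hide pad keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
emb = rng.normal(size=(30, 8))   # toy token-embedding table (vocab 30, dim 8)
ids = np.array([3, 7, 11])       # a 3-token "term"; token 0 acts as [PAD]

def first_token_repr(pad_to, use_mask):
    padded = np.zeros(pad_to, dtype=int)
    padded[:len(ids)] = ids
    mask = (np.arange(pad_to) < len(ids)).astype(int)
    if not use_mask:
        mask = np.ones_like(mask)  # what happens when attention_mask is dropped
    return self_attention(emb[padded], mask)[0]  # position 0 stands in for [CLS]

print(np.allclose(first_token_repr(8, True), first_token_repr(32, True)))    # True
print(np.allclose(first_token_repr(8, False), first_token_repr(32, False)))  # False
```

With the mask, pad keys get zero attention weight, so the first position's output depends only on the real tokens and any padding length gives the same vector; without it, attention mass leaks onto the PAD positions and grows with their count.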

commented

That's helpful! I was actually using the old CODER model (coder_eng) because I found it performed better than coder_eng_pp. But that may be because I was not performing inference the right way: I may have been using the inference procedure that is best for coder_eng but not for coder_eng_pp.

coder_eng can be used for entity linking/entity disambiguation/entity normalization, while coder_eng_pp is used for entity clustering.

commented

@GanjinZero, based on your feedback I believe I now have the inference working correctly for both coder_eng_pp and coder_eng, but to be sure, can you please double-check whether these cosine similarities look right? I used the examples from both of your papers in these tests.

model=coder_eng_pp; summary_method=CLS:

{"'Type 1 Diabetes' vs. 'Type 2 Diabetes'": 0.6157483, "'xyloglucan endotransglycosylase' vs. 'xyloglucan endoglucanas'": 0.59852695, "'poisoned by eating pufferfish' vs. 'food poisoning'": 0.23570026, "'rheumatoid arthritis' vs. 'osteoarthritis'": 0.32876152, "'rheumatoid arthritis' vs. 'rheumatoid pleuritis'": 0.14604773}

model=coder_eng_pp; summary_method=MEAN:

{"'Type 1 Diabetes' vs. 'Type 2 Diabetes'": 0.4327795, "'xyloglucan endotransglycosylase' vs. 'xyloglucan endoglucanas'": 0.6181985, "'poisoned by eating pufferfish' vs. 'food poisoning'": 0.3112754, "'rheumatoid arthritis' vs. 'osteoarthritis'": 0.47793296, "'rheumatoid arthritis' vs. 'rheumatoid pleuritis'": 0.2677078}

model=coder_eng; summary_method=CLS:

{"'Type 1 Diabetes' vs. 'Type 2 Diabetes'": 0.8112792, "'xyloglucan endotransglycosylase' vs. 'xyloglucan endoglucanas'": 0.9539995, "'poisoned by eating pufferfish' vs. 'food poisoning'": 0.62217593, "'rheumatoid arthritis' vs. 'osteoarthritis'": 0.7335832, "'rheumatoid arthritis' vs. 'rheumatoid pleuritis'": 0.7832214}

model=coder_eng; summary_method=MEAN:

{"'Type 1 Diabetes' vs. 'Type 2 Diabetes'": 0.86106145, "'xyloglucan endotransglycosylase' vs. 'xyloglucan endoglucanas'": 0.96454996, "'poisoned by eating pufferfish' vs. 'food poisoning'": 0.73919964, "'rheumatoid arthritis' vs. 'osteoarthritis'": 0.8229953, "'rheumatoid arthritis' vs. 'rheumatoid pleuritis'": 0.83531755}
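For reference, the two summary_method settings above presumably differ only in how the final hidden states are pooled. A minimal sketch with dummy tensors (`summarize` is a hypothetical helper, not the repo's code):

```python
import numpy as np

def summarize(hidden, attention_mask, method):
    # hidden: (batch, seq, dim) final-layer states; attention_mask: (batch, seq).
    if method == "CLS":
        return hidden[:, 0]                      # first ([CLS]) position only
    mask = attention_mask[..., None]             # (batch, seq, 1)
    return (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens only

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
h = rng.normal(size=(1, 5, 4))      # dummy hidden states: 3 real tokens + 2 pads
m = np.array([[1, 1, 1, 0, 0]])
score = cosine(summarize(h, m, "CLS")[0], summarize(h, m, "MEAN")[0])
```

Scores like those listed would then be `cosine` between the summaries of two terms.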

it seems ok

commented

Thank you!