GanjinZero / CODER

CODER: Knowledge infused cross-lingual medical term embedding for term normalization. [JBI, ACL-BioNLP 2022]

Home Page: https://www.sciencedirect.com/science/article/pii/S1532046421003129

Impact of padding strategy on CODER embeddings

mgh1 opened this issue · comments

commented

Dear Authors,

Thank you for the great work!

I was reviewing the code and noticed that the way you extract embeddings differs a bit from what is typically done, in that you pad every input to a max length (32 tokens). I don't usually see others do this when extracting embeddings; they just tokenize and pass the inputs through the model.

I experimented with different token lengths, both with and without additional padding. The result is that the cosine similarity scores between embeddings differ significantly depending on whether padding is used and on how much (i.e., what the max token length is).

I re-read your CODER papers and didn't find anything about padding, nor could I find anything more in this repo. Can you explain why you chose this padding strategy? Have you experimented with removing or adjusting the padding and measured its impact on cosine similarity between embeddings and on overall performance?

This was early work, so I used a somewhat unusual padding strategy. Medical terms are usually short, so 32 tokens are enough. The best way to use CODER is to use the same padding length as in my pretraining.
We have fixed this padding strategy to the normal one in our CODER++ paper.

The main reason is that we missed the attention_mask in pretraining, and we did not have enough resources to re-train the model at that time.
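The sensitivity to padding length can be illustrated without the model itself: if pooling does not exclude PAD positions via the attention mask, the pooled vector drifts toward the PAD embedding as padding grows. A toy numpy sketch with random vectors (all names and values hypothetical, not CODER's actual weights):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
tokens = rng.normal(size=(3, dim))   # vectors for the 3 real tokens of a term
pad = rng.normal(size=(dim,))        # vector for the [PAD] token

def naive_mean(pad_to):
    # Mean over ALL positions, ignoring the attention mask.
    seq = np.vstack([tokens, np.tile(pad, (pad_to - len(tokens), 1))])
    return seq.mean(axis=0)

def masked_mean(pad_to):
    # Mean over real tokens only, as attention-mask-aware pooling does;
    # padding never enters the sum, so pad_to is irrelevant.
    return tokens.mean(axis=0)

print(np.allclose(masked_mean(8), masked_mean(32)))  # True: invariant to padding
print(np.allclose(naive_mean(8), naive_mean(32)))    # False: same term, different vector
```

This is the benign case where only the pooling is unmasked; when the mask is also missing inside the transformer layers, every token's representation is contaminated, not just the pooled one.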

commented

We have fixed this padding strategy to the normal one in our CODER++ paper.

Can you please point to the code that fixes this? I see there are two nearly identical copies of generate_faiss_index.py; is it one of those, or somewhere else?

We use attention_mask in https://github.com/GanjinZero/CODER/blob/master/coderpp/train/model.py (i.e., in pretraining); the inference code in faiss may not be the latest.
If you are using CODER++ yourself, you can pad to any length and use the CLS representation as the term representation.
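Why masking makes the CLS representation padding-invariant can be shown with a single toy self-attention layer (identity Q/K/V projections, random embedding table — a minimal sketch, not the repo's model):

```python
import numpy as np

def self_attention(x, attention_mask):
    # One attention layer with identity Q/K/V projections, for illustration.
    # x: (seq, dim); attention_mask: (seq,), 1 = real token, 0 = padding.
    scores = x @ x.T / np.sqrt(x.shape[-1])                           # (seq, seq)
    scores = np.where(attention_mask[None, :] == 1, scores, -np.inf)  # hide pad keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
emb = rng.normal(size=(30, 8))   # toy token-embedding table (vocab 30, dim 8)
ids = np.array([3, 7, 11])       # a 3-token "term"; token 0 acts as [PAD]

def first_token_repr(pad_to, use_mask):
    padded = np.zeros(pad_to, dtype=int)
    padded[:len(ids)] = ids
    mask = (np.arange(pad_to) < len(ids)).astype(int)
    if not use_mask:
        mask = np.ones_like(mask)  # what happens when attention_mask is dropped
    return self_attention(emb[padded], mask)[0]  # position 0 stands in for [CLS]

print(np.allclose(first_token_repr(8, True), first_token_repr(32, True)))    # True
print(np.allclose(first_token_repr(8, False), first_token_repr(32, False)))  # False
```

With the mask, pad keys get zero attention weight, so the first position's output depends only on the real tokens and any padding length gives the same vector; without it, attention mass leaks onto the PAD positions and grows with their count.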

commented

That's helpful! I was actually using the old CODER model (coder_eng) because I found it performed better than coder_eng_pp. But that may be because I was not performing inference the right way: I may have been using the inference procedure that is best for coder_eng but not for coder_eng_pp.

coder_eng can be used for entity linking/entity disambiguation/entity normalization, while coder_eng_pp is used for entity clustering.

commented

@GanjinZero, based on your feedback I believe I now have the inference working correctly for both coder_eng_pp and coder_eng, but to be sure, can you please double-check whether these cosine similarities look right? I used the examples from both of your papers in these tests.

model=coder_eng_pp; summary_method=CLS:

{"'Type 1 Diabetes' vs. 'Type 2 Diabetes'": 0.6157483, "'xyloglucan endotransglycosylase' vs. 'xyloglucan endoglucanas'": 0.59852695, "'poisoned by eating pufferfish' vs. 'food poisoning'": 0.23570026, "'rheumatoid arthritis' vs. 'osteoarthritis'": 0.32876152, "'rheumatoid arthritis' vs. 'rheumatoid pleuritis'": 0.14604773}

model=coder_eng_pp; summary_method=MEAN:

{"'Type 1 Diabetes' vs. 'Type 2 Diabetes'": 0.4327795, "'xyloglucan endotransglycosylase' vs. 'xyloglucan endoglucanas'": 0.6181985, "'poisoned by eating pufferfish' vs. 'food poisoning'": 0.3112754, "'rheumatoid arthritis' vs. 'osteoarthritis'": 0.47793296, "'rheumatoid arthritis' vs. 'rheumatoid pleuritis'": 0.2677078}

model=coder_eng; summary_method=CLS:

{"'Type 1 Diabetes' vs. 'Type 2 Diabetes'": 0.8112792, "'xyloglucan endotransglycosylase' vs. 'xyloglucan endoglucanas'": 0.9539995, "'poisoned by eating pufferfish' vs. 'food poisoning'": 0.62217593, "'rheumatoid arthritis' vs. 'osteoarthritis'": 0.7335832, "'rheumatoid arthritis' vs. 'rheumatoid pleuritis'": 0.7832214}

model=coder_eng; summary_method=MEAN:

{"'Type 1 Diabetes' vs. 'Type 2 Diabetes'": 0.86106145, "'xyloglucan endotransglycosylase' vs. 'xyloglucan endoglucanas'": 0.96454996, "'poisoned by eating pufferfish' vs. 'food poisoning'": 0.73919964, "'rheumatoid arthritis' vs. 'osteoarthritis'": 0.8229953, "'rheumatoid arthritis' vs. 'rheumatoid pleuritis'": 0.83531755}
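For reference, the two summary_method settings above presumably differ only in how the final hidden states are pooled. A minimal sketch with dummy tensors (`summarize` is a hypothetical helper, not the repo's code):

```python
import numpy as np

def summarize(hidden, attention_mask, method):
    # hidden: (batch, seq, dim) final-layer states; attention_mask: (batch, seq).
    if method == "CLS":
        return hidden[:, 0]                      # first ([CLS]) position only
    mask = attention_mask[..., None]             # (batch, seq, 1)
    return (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens only

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
h = rng.normal(size=(1, 5, 4))      # dummy hidden states: 3 real tokens + 2 pads
m = np.array([[1, 1, 1, 0, 0]])
score = cosine(summarize(h, m, "CLS")[0], summarize(h, m, "MEAN")[0])
```

Scores like those listed would then be `cosine` between the summaries of two terms.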

it seems ok

commented

Thank you!