HazyResearch / m2

Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture"


training data

41924076 opened this issue

Hi, thank you for your nice work! Did you train your M2-BERT-128 through M2-BERT-32K models (shown in the paper) on the LoCo V0 or the LoCo V1 training set?

Thank you for your reply!

Hello, you mentioned in the paper: "For all M2-BERT configurations, we use a learning rate of 5e-6, a true batch size of 32, 1 epoch of fine-tuning, a maximum gradient norm of 1.0, and a ratio of 32 negative passages per query-positive passage pair."

Do you use random negatives or BM25 hard negatives?

When computing the OPL loss at each step, do you use only the similarity between the query and a single passage, where that passage is one of the 33 passages (32 negatives plus 1 positive)?
Or do you use both the query-positive similarity and the query-negative similarity when computing the MSE in the OPL loss at each step?

Thank you so much!

We use random negatives for fine-tuning. At each step for OPL, we calculate the similarity between the query and the positive passage as well as the similarity between the query and a negative passage.
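For concreteness, here is a minimal PyTorch sketch of one fine-tuning step that combines the settings quoted above (learning rate 5e-6, max gradient norm 1.0, 32 negatives per query-positive pair) with the similarity computation described in this reply. The encoder, optimizer choice, and random features are placeholders for illustration, not the repo's actual training code.

import torch
import torch.nn.functional as F

# Stand-ins for illustration: a linear layer in place of the M2-BERT encoder,
# and pre-tokenized inputs replaced by random 768-d features.
encoder = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=5e-6)  # lr from the paper; optimizer choice assumed

query_feats = torch.randn(1, 768)
passage_feats = torch.randn(33, 768)               # 1 positive followed by 32 random negatives
labels = torch.cat([torch.ones(1), torch.zeros(32)])

q_emb = encoder(query_feats)                       # (1, d)
p_embs = encoder(passage_feats)                    # (33, d)
sims = F.cosine_similarity(q_emb, p_embs, dim=-1)  # (33,) query-passage cosine similarities
loss = F.mse_loss(sims, labels)                    # push the positive toward 1, negatives toward 0

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(encoder.parameters(), max_norm=1.0)  # max grad norm 1.0 from the paper
optimizer.step()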

Thank you for your reply!

Hello, I find it a little hard to get the QMSum score above 45. Do you use the whole LoCo V0 as the training set, with no duplication, deletion, or particular data proportions? :)

Besides, do the public checkpoints only go through pretraining and LoCo V0 fine-tuning? :)

We use all of LoCo V0 as our fine-tuning dataset with 32 negatives for every query-positive passage pair. For QMSUM and the other Tau Scrolls datasets on HuggingFace, we use the given train-validation-test split and evaluate on the validation split. The public checkpoints only go through pretraining and LoCoV0 fine-tuning. However, we plan to release an updated version of the QMSUM dataset (as well as several new datasets) in LoCoV1 soon!
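For reference, the Tau Scrolls QMSum splits can be pulled from HuggingFace roughly as follows. This is only a sketch: the dataset and config names follow the public tau/scrolls dataset card, and the LoCo packaging of this data may differ.

from datasets import load_dataset

# QMSum as packaged in Tau Scrolls on HuggingFace, with its predefined splits.
qmsum = load_dataset("tau/scrolls", "qmsum")

train_split = qmsum["train"]       # training split (the data available for fine-tuning)
val_split = qmsum["validation"]    # validation split, used for evaluation per the reply above
print(len(train_split), len(val_split))
print(val_split[0].keys())         # SCROLLS-style fields, e.g. 'id', 'pid', 'input', 'output'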

Thank you for your reply! Have a nice day!

Hello, sorry to bother you. I'd like to try reproducing the training of M2-BERT on LoCo V0 or V1 at a sequence length of 2048.

According to your replies and the paper, M2-BERT for LoCo V0 was trained with the OPL loss. However, the main branch on GitHub does not provide a training command with parameter values (https://github.com/HazyResearch/m2/blob/main/bert/EMBEDDINGS.md#training), and the code does not include the OPL loss.

Additionally, I noticed that the jonsf branch (https://github.com/HazyResearch/m2/tree/jonsf-patch-1) includes the OPL loss, but I'm unsure whether gather_loco_training_example in that branch works with OPL.

The training script provided in the jonsf branch (https://github.com/HazyResearch/m2/blob/jonsf-patch-1/bert/EMBEDDINGS.md#training) also uses GradCache's multiple_negatives_ranking_loss instead of OPL. Can that command produce results similar to those reported in the paper?

If it's convenient for you, could you please provide an OPL (or multiple_negatives_ranking) training script that can roughly reproduce the public checkpoints?

Hello, in the jonsf branch, we include both OPL and multiple negatives ranking loss (MNRL) with grad caching. You can use either for training your own checkpoints of M2-BERT-2k.

We are currently exploring improved training techniques with both loss functions, so we will be sure to share which turns out better! Thanks!
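If it helps in the meantime, here is a generic MNRL fine-tuning sketch using the standard sentence-transformers API. It is not the jonsf-branch script: the checkpoint path is a placeholder, the data is a toy list, and whether plain MNRL matches the paper's numbers is exactly the open question above.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("path/to/m2-bert-2k-checkpoint")  # placeholder: your M2-BERT embedding model

# MNRL treats the other positives in a batch as negatives; an explicit hard or
# random negative can be supplied as a third text in each InputExample.
train_examples = [
    InputExample(texts=["query 1", "relevant passage 1"]),
    InputExample(texts=["query 2", "relevant passage 2"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

# One epoch with the learning rate quoted earlier in the thread; the jonsf-branch
# script adds grad caching on top of this idea to fit long sequences in memory.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    optimizer_params={"lr": 5e-6},
)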

Thank you for your reply!!!

Hi, thank you for publishing your results and sharing your training code!
It looks like you are importing orthogonal passage loss (OPL) as sentence_transformers.losses.OrthogonalPassageLoss.
However, when I check sentence-transformers, it doesn't have this loss, so I'm assuming this is a fork of sentence_transformers that you haven't shared?

From the above and your paper, it sounds like you are doing something like the pseudocode below:

import torch.nn.functional as F

def opl_loss(model, query, documents, labels):
    q_embedding = model(query)          # (1, d) query embedding
    d_embeddings = model(documents)     # (33, d): 1 positive + 32 randomly sampled negatives
    # cosine similarity between the query and every candidate passage
    pairwise_cosine_sim = F.cosine_similarity(q_embedding, d_embeddings, dim=-1)
    # regress the similarities onto the relevance labels (1 for the positive, 0 for each negative)
    loss = F.mse_loss(pairwise_cosine_sim, labels)
    return loss

Does this sound about right?

Yes, we have a fork of the SentenceTransformers codebase, in which we add orthogonal projection loss (OPL). We include the instructions for importing it in the M2 codebase, but here is the link to that fork as well.
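Roughly, using it mirrors the other sentence-transformers losses. The snippet below is only a sketch of that pattern based on the import path mentioned above: the checkpoint path is a placeholder, and the constructor arguments and label convention are assumptions, not the fork's documented interface.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses  # the fork, installed in place of upstream

model = SentenceTransformer("path/to/m2-bert-2k-checkpoint")  # placeholder checkpoint

# Assumed label convention, matching the pseudocode above: 1.0 for the
# query-positive pair and 0.0 for each query-negative pair.
train_examples = [
    InputExample(texts=["query 1", "positive passage 1"], label=1.0),
    InputExample(texts=["query 1", "random negative passage"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

train_loss = losses.OrthogonalPassageLoss(model)  # hypothetical constructor; exists only in the fork
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)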

Let me know if you have any further questions!

Thank you for the quick reply. This is exactly what I was looking for.