abertsch72 / unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"


reproducing your results

patrickocal opened this issue · comments

Hi folks, thanks for your help with understanding Unlimiformer so far. My team and I are trying to reproduce your training results from the paper using the following:

python src/run.py \                                                           
    src/configs/training/base_training_args.json \                                 
    src/configs/data/gov_report.json \                                             
    --output_dir output_train_bart_base_local/ \                                   
    --learning_rate 1e-5 \                                                         
    --model_name_or_path facebook/bart-base \                                      
    --eval_steps 1000 --save_steps 1000 \                                          
    --per_device_eval_batch_size 1 --per_device_train_batch_size 2 \               
    --extra_metrics bertscore \                                                    
    --unlimiformer_training \                                                      
    --max_source_length 16384 \                                                    
    --test_unlimiformer --eval_max_source_length 999999  --do_eval=True \          
    > output/output${SLURM_JOB_ID}.txt  

My understanding is that we should be reproducing Table 4: (56.6 / 26.3 / 27.6 / 68.2) for (ROUGE-1 / ROUGE-2 / ROUGE-L / BERTScore). Here is a link to a wandb report of a full run we have produced (it took about 11 hours):
https://api.wandb.ai/links/unlimiformer-kg/y29tbk1n

The max_source_length 16384 setting is a concern given that the training set contains some enormous documents. The dataset has a very long tail: plenty of documents over 50k tokens, and even one with 250k tokens.
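For reference, this is roughly how I checked the length tail (a quick sketch, not from the repo; the dataset name and column here are my assumptions, so adjust them to whatever src/configs/data/gov_report.json actually loads):

    # Quick sketch (not from the repo): measure the source-length tail of GovReport
    # with the bart-base tokenizer. Dataset name and column are assumptions.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("facebook/bart-base")
    ds = load_dataset("ccdv/govreport-summarization", split="train")

    lengths = sorted(len(tok(doc, truncation=False)["input_ids"]) for doc in ds["report"])
    for q in (0.5, 0.9, 0.99):
        print(f"p{int(q * 100)}: {lengths[int(q * (len(lengths) - 1))]} tokens")
    print("max:", lengths[-1], "tokens")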

I'll let you know how a second run goes overnight. (I've just re-cloned your repo.) Just to be sure, here is a screenshot of my Slurm job:
[Screenshot 2023-11-26 at 11:05:30 pm]

Hi folks,

Here is the output from the latest run (though it was with an older clone of your repo, about four weeks old).
[Screenshot 2023-11-27 at 6:35:55 am]

I've just started a new run with the latest clone:
[Screenshot 2023-11-27 at 6:44:39 am]

Also, I've confirmed there is no issue with the test set. I'm assuming your reported results (in Table 3 of the paper) are for the evaluation set, is that right?

The default behaviour must be that training picks up an existing checkpoint and continues where it left off. That would explain the lack of learning (improvement) in this training report, since I had some prior runs in the same output folder: https://api.wandb.ai/links/unlimiformer-kg/y29tbk1n
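If run.py follows the usual Hugging Face example-script pattern (I haven't verified this line by line, so treat it as an assumption), it looks for the last checkpoint in --output_dir and resumes from it unless --overwrite_output_dir is set. A minimal way to check what it would pick up:

    # Minimal check of the resume behaviour I suspect (an assumption about run.py,
    # based on the standard HF example scripts): a checkpoint in --output_dir means
    # training resumes from it unless --overwrite_output_dir is passed.
    import os
    from transformers.trainer_utils import get_last_checkpoint

    output_dir = "output_train_bart_base_local/"
    ckpt = get_last_checkpoint(output_dir) if os.path.isdir(output_dir) else None
    print("would resume from:", ckpt)  # None means a genuinely fresh run

So for a clean run I now either point --output_dir at a fresh folder or pass --overwrite_output_dir.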

Here is the current run, which looks much better (and it is still improving):
https://api.wandb.ai/links/unlimiformer-kg/dzrhchh8
While the BERTScores look like they're in the right ballpark, the ROUGE scores are much less so. I'll post the final output once it arrives.

[Screenshot 2023-11-28 at 2:10:22 pm] https://api.wandb.ai/links/unlimiformer-kg/0ac7j17w

In any case, my main concern about whether learning is happening at all is basically resolved, but I still can't replicate your results. I assume your results are indeed computed on the test set? Any advice on this would be helpful.

I've found that the key to getting near your results on the evaluation set is the length of the generated summary of the Long Document (LD).

I found that the default Unlimiformer training settings (as above, after simply cloning your repo) lead to short summaries of 70-130 words for the GovReport dataset. Unlimiformer did improve on bart-base without Unlimiformer, and that's great, but apart from high precision the summaries were meaningless because they were so short relative to the targets (400-1000 words, with a few exceptions). Note that BART didn't learn to generate longer summaries: the average length would start high and then drop (or start low and then rise), but it consistently converged to around 120-130 words.

I hope you don't mind me explaining a little more about my findings.

My team and I have generated knowledge graphs for each example in the GovReport dataset (https://huggingface.co/datasets/patrickocal/gov_report_kg). We trained bart-base (with Unlimiformer enabled) with the KGs as input and, as a third experiment, with the KG and LD combined. Note that the KGs are fed into the model as a single string of the form <s> rel1 </s><s> rel2 </s> ... <s> relN </s>. (In the combined case, we concatenated the KG followed by the LD into one string: Shakespeare style.)
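For concreteness, here is roughly how we flatten a KG into that single string (a simplified sketch; the triples below are made up for illustration, and the exact relation strings are in the gov_report_kg dataset):

    # Simplified sketch of how we serialise a knowledge graph into one input string.
    def kg_to_string(triples):
        # each relation becomes "<s> head relation tail </s>", concatenated in order
        return "".join(f"<s> {h} {r} {t} </s>" for h, r, t in triples)

    triples = [("GAO", "reviewed", "the program"), ("the program", "cost", "$1.2 billion")]
    kg_string = kg_to_string(triples)

    # combined case: KG first, then the long document, as one string ("Shakespeare style")
    combined_input = kg_string + " " + "Full text of the GovReport document ..."
    print(kg_string)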

The KG (and KG+LD) experiments resulted in significantly longer summaries (600-1000 words, just under 900 on average). ROUGE-1 was double the Unlimiformer baseline, at 40, but otherwise the ROUGE-2 / ROUGE-L / BERTScore F1 scores were pretty similar across the board. (Summary generation was significantly slower.)

I think this surprising difference has something to do with bart-base and its treatment of BOS and EOS tokens.

But I also found that the summary length is highly dependent on the conda environment. (I didn't plan to run these experiments, but my original conda environment was somehow corrupted and I didn't realise that there was a conda_environment.yaml file in the wandb directory. Still: evolution often arises through error.)

So now I have three conda environments: two with transformers 4.34.1 and one with transformers 4.35.2. (There are other differences and I am happy to share.) In the transformers 4.35.2 environment (which I had created without any of the constraints in your requirements.txt file), all summaries became short. If I constrained the length (using min_new_tokens) to be greater than 130, the model would converge to that minimum (with or without Unlimiformer, and with or without KGs).
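To be explicit, by "constraining" I mean a generation-time minimum along these lines (a sketch only; run.py wires up its own generation arguments, and min_new_tokens needs a reasonably recent transformers release):

    # Sketch: forcing a minimum summary length at generation time with bart-base.
    # The min_new_tokens=130 here mirrors the constraint I describe above.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("facebook/bart-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

    inputs = tok("Some long government report text ...", return_tensors="pt", truncation=True)
    out = model.generate(
        **inputs,
        min_new_tokens=130,   # summaries converge to this floor in that environment
        max_new_tokens=1000,
        num_beams=4,
    )
    print(tok.decode(out[0], skip_special_tokens=True))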

For the two transformers 4.34.1 environments, one has PyTorch 2.1 and the other has PyTorch 1.12. What I found is that if I start training with PyTorch 2.1 and then, after about 18k steps, switch to the PyTorch 1.12 environment, I get a jump in performance: BERTScore/F1 jumps from 60 to 65 and ROUGE/geometric_mean jumps from 21 to 30!
https://wandb.ai/unlimiformer-kg/unlimiformer-07-dec-src/reports/unlimiformer_kg_comb--Vmlldzo2MjM4MDUy?accessToken=fi7384z9jrz212aed0lt6b4jpwp1d677tghit15xkds9sb32ecdmff3p3u0dfnt0

In any case, as a result, I have very nearly matched your results with this bizarre training process. The only thing that remains is to run the new model on the test set and submit to SCROLLS.

@urialon and @abertsch72 and team: your insights would be very welcome!

PS. micromamba for resolving conda environments: it just rocks.

I think I now understand: I think it is the add_special_tokens parameter. If this is set to False, then the training process slows right down in some environments, but not in others.
In my initial environment, it made little difference: both KG and KG+LD run slowly and produce long summaries (regardless of this parameter's value).
In the new environment, the value of add_special_tokens seems to matter a lot. When it is False, training slows right down and there is a strong improvement across the board for KG+LD combined. I am now running the corresponding experiments for LD only and KG only ... (starting from the same KG+LD checkpoint to make it a fair horse race, and to see whether the long summaries continue to be generated).
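To make the add_special_tokens point concrete, this is the difference at the tokenizer level (a sketch; where exactly the flag takes effect inside run.py is the part I'm still unsure about):

    # What add_special_tokens changes for bart-base: whether the tokenizer itself
    # wraps the input in BOS (<s>) and EOS (</s>). Our KG inputs already contain
    # literal <s>/</s> markers, so the two settings interact.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("facebook/bart-base")
    text = "<s> GAO reviewed the program </s>"

    with_special = tok(text, add_special_tokens=True)["input_ids"]
    without_special = tok(text, add_special_tokens=False)["input_ids"]

    print(tok.convert_ids_to_tokens(with_special))     # extra <s> ... </s> wrapping added
    print(tok.convert_ids_to_tokens(without_special))  # only tokens from the text itself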

Hey @patrickocal -- apologies for the lack of response earlier on this, but this is a really interesting thread. Your knowledge graph setting is cool -- how are you generating these knowledge graphs?

That jump in performance from swapping PyTorch versions is really wild -- I wonder if this is a general issue (did you see this with bart-base as well)?