Cornell-RL / tril


tl;dr ppo generation

vwxyzjn opened this issue · comments

Hello, nice repo! I have some questions about https://api.wandb.ai/links/coactivelearning/ga4r1uqd. Do you have some generation examples from PPO trained on the tl;dr dataset?

Also, when clicking on the run it gives me an error:

[screenshot: wandb error page; the run is not publicly accessible]

Another question: how do you calculate the ROUGE score? Is it computed between the generated summary and the reference summary (excluding the prompt)?

I added PPO generations to the report. Please let us know if you would like us to add any other metrics to the report.

At the moment, we can't make the wandb run public because the wandb project contains logs for runs pertaining to other private projects we are working on, and wandb does not support making a single run public inside a private project (at least I don't think so).

We plan to create a public wandb project for results from the algorithms in the repo going forward, but we haven't gotten around to it yet.

Thanks for sharing this. Was curious about the rouge score calculation as well.

Also, I noticed that a lot of the generated summaries have " :) :) :) :) :) :) :) :) :)". Do you have any idea why that's happening?

Hi! Thanks for showing interest.

  • ROUGE: we use the HF ROUGE score between the generations and the references (see `class RougeMetric(BaseMetric)` in the repo; a minimal sketch follows this list).
  • Generations: the log posted is from pretty far into training, and we suspected the repeated emoticons were a form of reward hacking we had started seeing.
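For concreteness, here is a minimal sketch of scoring generations against references with the Hugging Face `evaluate` package. This is an illustration under my own assumptions, not the repo's actual `RougeMetric` implementation: the helper name is hypothetical, and I assume the prompt has already been stripped so both lists contain summary text only.

```python
# Requires: pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")

def rouge_scores(generations, references):
    # `generations` and `references` are lists of summary strings only;
    # the prompt/post text is assumed to already be excluded from both.
    return rouge.compute(predictions=generations, references=references)

scores = rouge_scores(
    ["the cat sat on the mat"],           # model-generated summary
    ["a cat was sitting on the mat"],     # human reference summary
)
print(scores["rougeL"])  # aggregated ROUGE-L F-measure
```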

Thanks! Can you also report the score, the non_score_reward (the sum of the KL penalties), and the total reward (score + non_score_reward), corresponding to the following figure in the original paper?

[figure: score, KL penalty, and total reward curves from the original paper]
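For reference, a hedged sketch of the decomposition being asked for, following the usual PPO-with-KL-penalty convention. The function and variable names are my own assumptions, not necessarily tril's logging keys, and the per-token KL approximation and `kl_coef` value are illustrative.

```python
import torch

def decompose_reward(score, logprobs, ref_logprobs, kl_coef=0.05):
    """Sketch of the score / non_score_reward / total-reward split.

    score:        (batch,) reward-model score per generated summary
    logprobs:     (batch, seq_len) log-probs of generated tokens under the policy
    ref_logprobs: (batch, seq_len) log-probs under the frozen reference model
    """
    kl = logprobs - ref_logprobs                    # per-token approximate KL
    non_score_reward = -kl_coef * kl.sum(dim=-1)    # summed KL penalty per sample
    total_reward = score + non_score_reward         # what the figure plots
    return score, non_score_reward, total_reward
```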

@vwxyzjn I updated the report to include total_rewards, kl_rewards, and rewards.

Also, if you have time, Jonathan and I would be up for setting up Slack Connect and/or a Zoom meeting to chat about a few things we have been running into. These issues mainly pertain to different base models for this task and to using LoRA adapters.

Hi @xkianteb, thank you for sharing! Happy to set up Slack Connect. My company email is costa@huggingface.co. FYI, at TRL we have also run into similar issues: https://twitter.com/vwxyzjn/status/1705226977408389587. Training with different base models can indeed give you different results. With the sentiment task it's difficult to tell whether those differences are meaningful, which is why I am working on a summarization benchmark.

@vwxyzjn I agree 100%. We also concluded that the sentiment task is too easy to see a meaningful difference.