Repositories under the safe-rlhf topic:
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
Reading list on adversarial perspectives and robustness in deep reinforcement learning.
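For context, a minimal sketch of how the BeaverTails data might be pulled for safety-alignment experiments, assuming it is hosted on the Hugging Face Hub under the id PKU-Alignment/BeaverTails with a 30k_train split (both the id and split name are assumptions, not stated in this listing):

```python
# Minimal sketch: load a BeaverTails subset for safety-alignment research.
# The dataset id and split name are assumptions about the Hub hosting.
from datasets import load_dataset

# Fetch the (assumed) 30k-annotation training split.
beavertails = load_dataset("PKU-Alignment/BeaverTails", split="30k_train")

# Inspect one record to see what fields the safety annotations provide.
example = beavertails[0]
print(example.keys())
```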