There are 12 repositories under ai-safety topic.
A curated list of awesome responsible machine learning resources.
🐢 Open-Source Evaluation & Testing for LLMs and ML models
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
Code accompanying the paper Pretraining Language Models with Human Preferences
📚 A curated list of papers & technical articles on AI Quality & Safety
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
Attack to induce LLMs within hallucinations
Feature Space Singularity for Out-of-Distribution Detection. (SafeAI 2021)
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
Reading list for adversarial perspective and robustness in deep reinforcement learning.
A project to add scalable state-of-the-art out-of-distribution detection (open set recognition) support by changing two lines of code! Perform efficient inferences (i.e., do not increase inference time) and detection without classification accuracy drop, hyperparameter tuning, or collecting additional data.
A curated list of awesome resources for getting-started-with and staying-in-touch-with Artificial Intelligence Alignment research.
A project to improve out-of-distribution detection (open set recognition) and uncertainty estimation by changing a few lines of code in your project! Perform efficient inferences (i.e., do not increase inference time) without repetitive model training, hyperparameter tuning, or collecting additional data.
Sparse probing paper full code.
Alpha principles for the ethical use of AI and Data Driven Technologies in Ontario | Proposition de principes pour une utilisation éthique des technologies axées sur les données en Ontario
Universal Neurons in GPT2 Language Models
Awesome PrivEx: Privacy-Preserving Explainable AI (PPXAI)
A compilation of AI safety ideas, problems, and solutions.
LAWLIA is an open-source computational legal framework designed to revolutionize legal reasoning and analysis. It combines the power of large language models with a structured syntactical grammar to facilitate precise legal assessments, truth values, and verdicts. LAWLIA is the future of computational jurisprudence