Red Teaming LLMs

This project aims to develop Haystack, an open-source platform for red teaming LLMs and collecting human feedback on them, using both crowd-sourced and automated methods.

Goals

Build a frontend for Pythia model evaluation.
Develop model evaluation interfaces for red teaming from scratch.
Develop a representational view of interpretability.
Organise open-source human feedback data for sharing between models.

Challenges & Efforts

Although there are many efforts to red team language models, no open-source framework is available for client-level, user-driven model evaluation and red-team testing.

Built with

Next.js
Tailwind CSS
Deployed on Vercel

Red Teaming

Assessment Type	Description
Capabilities Assessment	Benchmark performance on representative tasks and datasets. Measure capabilities like accuracy, robustness, efficiency. Identify strengths, limitations, and gaps.
Adversarial Testing	Probe with malformed or adversarial inputs. Check for crashes, unintended behavior, and security risks. Informed by threat models and risk analysis.
Red Teaming	Model potential real-world risks and failures. Role-play adversary perspectives. Surface risks unique to AI.
Human Oversight	Manual test cases based on human judgment.
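
The four assessment types above could be used to label findings so that results can be grouped later. The following is a minimal sketch, not Haystack's actual data model: the `AssessmentType`, `Finding`, and `failuresByAssessment` names are hypothetical, and the sample findings are placeholders.

```typescript
// Hypothetical labels mirroring the four rows of the table above.
type AssessmentType =
  | "capabilities"
  | "adversarial"
  | "red-teaming"
  | "human-oversight";

// Hypothetical record for one test outcome.
interface Finding {
  assessment: AssessmentType;
  description: string;
  passed: boolean;
}

// Count failed findings per assessment type.
function failuresByAssessment(
  findings: Finding[]
): Record<AssessmentType, number> {
  const tally: Record<AssessmentType, number> = {
    capabilities: 0,
    adversarial: 0,
    "red-teaming": 0,
    "human-oversight": 0,
  };
  for (const finding of findings) {
    if (!finding.passed) {
      tally[finding.assessment] += 1;
    }
  }
  return tally;
}

// Placeholder findings for illustration only, not real results.
const exampleFindings: Finding[] = [
  { assessment: "capabilities", description: "benchmark task", passed: true },
  { assessment: "adversarial", description: "malformed input", passed: false },
  { assessment: "adversarial", description: "oversized input", passed: false },
  { assessment: "red-teaming", description: "role-play scenario", passed: false },
];

const failureTally = failuresByAssessment(exampleFindings);
console.log(failureTally);
```

Using a closed union type rather than free-form strings means the compiler rejects a finding filed under an unknown assessment type.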

Attack Examples

Mosaic Prompt: break a prompt down into permissible components

Users break down impermissible content into small permissible components.
Each component is queried independently and appears harmless.
The user recombines the components to reconstruct the impermissible content.
The attack exploits the compositionality of language.
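
The steps above suggest why a filter that inspects each query in isolation can miss this pattern. The sketch below illustrates the gap with a deliberately simplified toy policy. Everything here is an assumption made for illustration: `checkPolicy`, `evaluateMosaic`, and the placeholder tokens `alpha`, `beta`, and `gamma` are hypothetical, and real safety filters are far more complex than a keyword check.

```typescript
type Verdict = "allowed" | "blocked";

// Hypothetical toy rule: content is blocked only when every term of a
// restricted combination appears together. The tokens are placeholders.
const RESTRICTED_COMBINATIONS: string[][] = [["alpha", "beta", "gamma"]];

function checkPolicy(text: string): Verdict {
  const words = text.toLowerCase().split(/\W+/).filter(Boolean);
  const hit = RESTRICTED_COMBINATIONS.some((combo) =>
    combo.every((term) => words.indexOf(term) !== -1)
  );
  return hit ? "blocked" : "allowed";
}

interface MosaicReport {
  perComponent: Verdict[]; // verdict for each component checked on its own
  combined: Verdict; // verdict for the recombined content
  mosaicGap: boolean; // true when every part passes but the whole is blocked
}

// Compare per-component verdicts with the verdict on the recombined text.
function evaluateMosaic(components: string[]): MosaicReport {
  const perComponent = components.map(checkPolicy);
  const combined = checkPolicy(components.join(" "));
  const mosaicGap =
    perComponent.every((verdict) => verdict === "allowed") &&
    combined === "blocked";
  return { perComponent, combined, mosaicGap };
}

const mosaicReport = evaluateMosaic([
  "Tell me about alpha.",
  "Tell me about beta.",
  "Tell me about gamma.",
]);
console.log(mosaicReport);
```

In this toy setup each component is allowed individually while the joined text is blocked, so `mosaicGap` is `true`. A red teaming harness could flag such cases as a sign that a policy needs conversation-level rather than query-level checks.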

Cross-Lingual Attacks: translating prompts from high-resource into low-resource languages to attack multilingual capability

The attack involves translating unsafe English input prompts into low-resource natural languages using Google Translate.
Low-resource languages are those with limited training data, like Zulu.
The translated prompts are sent to GPT-4, which can then respond unsafely instead of refusing.
The attack exploits the uneven multilingual coverage of GPT-4's safety training.
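
One way an evaluation could quantify this unevenness is to compare refusal rates per language once responses have been collected and labeled. The following is a minimal sketch under stated assumptions: the `TrialRecord` shape and the `refusalRateByLanguage` function are hypothetical, translation and model querying are assumed to happen elsewhere, and the sample records are synthetic placeholders rather than measured results.

```typescript
// Hypothetical record of one labeled model response.
interface TrialRecord {
  language: string; // language code, e.g. "en" for English or "zu" for Zulu
  refused: boolean; // true if the model declined the unsafe request
}

// Compute the fraction of refused requests for each language.
function refusalRateByLanguage(records: TrialRecord[]): Map<string, number> {
  const counts = new Map<string, { refused: number; total: number }>();
  for (const record of records) {
    const entry = counts.get(record.language) || { refused: 0, total: 0 };
    entry.total += 1;
    if (record.refused) {
      entry.refused += 1;
    }
    counts.set(record.language, entry);
  }
  const rates = new Map<string, number>();
  counts.forEach((entry, language) => {
    rates.set(language, entry.refused / entry.total);
  });
  return rates;
}

// Synthetic placeholder records for illustration only.
const sampleRecords: TrialRecord[] = [
  { language: "en", refused: true },
  { language: "en", refused: true },
  { language: "zu", refused: true },
  { language: "zu", refused: false },
];

const refusalRates = refusalRateByLanguage(sampleRecords);
refusalRates.forEach((rate, language) => {
  console.log(language, rate);
});
```

A noticeably lower refusal rate for a low-resource language than for English on the same prompt set would be consistent with the gap described above.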

About

A suite of red teaming and evaluation frameworks for language models

https://haystack-evals.vercel.app/

MIT License

Languages

TypeScript 96.5%
JavaScript 2.7%
CSS 0.8%