Cutwell / canary

Canary

LLM prompt injection detection.

How it works

  1. The user submits a potentially malicious message.
  2. The message is passed through an LLM that is prompted to format the message, together with a unique key, as a JSON object. If the message is a malicious prompt, this output is expected to be corrupted. If the output is invalid JSON, is missing a key, or contains a value that doesn't match the expected value, the message is treated as potentially compromised and the integrity check fails.
  3. If the integrity check passes, the user message is forwarded to the guarded LLM (e.g. the application chatbot).
  4. The API returns the result of the integrity check (a boolean) and either the chatbot response (if the check passes) or an error message (if the check fails).
graph TD
    A[1. User Inputs Chat Message] --> B[2. Integrity Filter]
    B -->|Integrity check passes.| C[3. Generate Chatbot Response]
    B -->|Integrity check fails.\n\nResponse is error message.| D
    C -->|Response is chatbot message.| D[4. Return Integrity and Response]
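
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the function names, the prompt wording, the JSON field names ("key" and "message") and the error text are all assumptions, and the two LLMs are passed in as plain callables so the example runs without any API key.

```python
import json
import secrets
from typing import Callable, Tuple


def build_integrity_prompt(message: str, key: str) -> str:
    """Ask the filter model to echo the message and a unique key back as JSON.

    The prompt wording and the field names are illustrative assumptions.
    """
    return (
        "Return a JSON object with exactly two fields: "
        f'"key" set to "{key}" and "message" set to the user message below. '
        "Output nothing else.\n\n"
        f"User message: {message}"
    )


def check_integrity(raw_output: str, message: str, key: str) -> bool:
    """Return True only if the output is valid JSON with the expected values."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # invalid JSON: the filter prompt was likely overridden
    if not isinstance(data, dict):
        return False  # valid JSON, but not an object
    # A missing field makes .get() return None, which fails the comparison.
    return data.get("key") == key and data.get("message") == message


def guarded_chat(
    message: str,
    filter_llm: Callable[[str], str],
    chatbot_llm: Callable[[str], str],
) -> Tuple[bool, str]:
    """Run the integrity filter, then call the guarded LLM only if it passes."""
    key = secrets.token_hex(16)  # fresh key per request, so it cannot be guessed
    raw_output = filter_llm(build_integrity_prompt(message, key))
    if not check_integrity(raw_output, message, key):
        return False, "Integrity check failed."
    return True, chatbot_llm(message)
```

Generating a new key for every request matters here: an injected prompt that hijacks the filter LLM would also have to reproduce a value it has not seen in advance for the check to pass.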

What this solution can do:

  • Detect inputs that override an LLM's initial / system prompt.

What this solution cannot do:

  • Neutralise malicious prompts.

Install dependencies

If using poetry:

poetry install

If using vanilla pip:

pip install .

Usage

Set your OpenAI API key in .envrc.

To run the project locally:

make start

This will launch a webserver on port 8001.

Or via docker compose (does not use hot reload by default):

docker compose up

Query the /chat endpoint, e.g. using curl (adjust the port if your server is listening on a different one):

curl -X POST -H "Content-Type: application/json" -d '{"message": "Hi how are you?"}' http://127.0.0.1:8000/chat
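
The same request can be made from Python using only the standard library. This is a sketch that mirrors the curl command above: the helper names are hypothetical, the default base URL is copied from the curl example, and the response is assumed to be a JSON document whose exact fields are not specified here.

```python
import json
import urllib.request


def build_chat_request(
    message: str, base_url: str = "http://127.0.0.1:8000"
) -> urllib.request.Request:
    """Build a POST request for the /chat endpoint (base_url is an assumption)."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(message: str, base_url: str = "http://127.0.0.1:8000"):
    """Send the message to a running server and return the parsed JSON reply."""
    request = build_chat_request(message, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `print(send_chat("Hi how are you?"))` should show the integrity result alongside either the chatbot response or an error message.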

To run unit tests:

make test

Contributing

For information on how to set up your dev environment and contribute, see here.

License

MIT
