protectai / llm-guard

The Security Toolkit for LLM Interactions

Home Page: https://llm-guard.com/

Stream support

boxabirds opened this issue · comments

LLMs are currently too slow in many use cases for guards to be placed at the end of completion
Any real-time use case relies on token streaming, because the wait time for an _x_k-token response can be 10, 20, or even 30 s. A starting UX principle is that a response taking longer than 1 s will disrupt a user's flow, and longer than 10 s and they'll task-switch.

Support for streamed interpretation
I think any practical LLM security solution needs to be streamed. That is, signal detection needs to happen quickly and incrementally, token by token. I had thought that some kind of parallel GAN might be an interesting approach here.
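To make the idea concrete, here is a minimal sketch of what incremental checking could look like: a wrapper around a token stream that runs a cheap detector over the text accumulated so far and cuts the stream off as soon as something trips. The `is_unsafe` check, the token-by-token re-scan, and the cut-off message are all placeholders for illustration, not an existing llm-guard API.

```python
from typing import Iterable, Iterator

def is_unsafe(text: str) -> bool:
    # Placeholder for a fast signal detector (keyword rules, a tiny classifier,
    # a "micro LLM", etc.). It must run in milliseconds to keep up with the stream.
    return "BLOCKLISTED" in text

def guarded_stream(token_stream: Iterable[str]) -> Iterator[str]:
    """Yield tokens as they arrive, re-checking the accumulated output each time."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if is_unsafe(buffer):
            yield "\n[output stopped by guard]"
            return  # stop forwarding tokens to the user
        yield token

# Usage with any iterator of tokens/chunks:
# for token in guarded_stream(model_stream):
#     print(token, end="", flush=True)
```

Note that re-scanning the full buffer on every token is the naive approach; its cost grows with output length, which is exactly where faster, incremental detectors would be needed.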

Describe alternatives you've considered
Currently the guardrails I'm using are essentially prompt-based. There are tools like youai.ai that do a remarkable job of putting guardrails on streamed output: I did a like-for-like comparison of the same prompt in ChatGPT vs. YouAI.ai, and YouAI could not be jailbroken by my basic jailbreak tests, whereas ChatGPT could.

Hey @boxabirds ,
Thanks for raising it. You are definitely right that streaming is required when building customer-facing applications.

Do you know how youai.ai does that? Their website doesn't say. I know that Azure has guardrails with streaming support that we could look at.

We will experiment with some approaches, such as analyzing individual generated logits, to see how accurate they are.

I suspect youai has a prompt-based solution. But I think the innovation will come from systems that can do ultra-fast signal detection. Companion special-purpose micro-LLMs ("small" LMs? 🙂) might be an interesting area to explore. Inference times would need to be in the milliseconds for it to work, I think.

Hey @boxabirds , here is an example of running LLM Guard with ChatGPT Streaming: https://github.com/protectai/llm-guard/blob/main/examples/openai_streaming.py

It can be further optimized with the asyncio library (running the scanners in parallel with the stream).
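For reference, here is a rough sketch of what that asyncio variant could look like, assuming the `scan_output` helper and `Toxicity` output scanner from llm-guard and the openai>=1.x async client. The model name, the "scan every 200 characters" cadence, and the cut-off behaviour are assumptions made for the example, not the linked example's exact code.

```python
import asyncio

from openai import AsyncOpenAI
from llm_guard import scan_output
from llm_guard.output_scanners import Toxicity

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
scanners = [Toxicity()]

async def guarded_completion(prompt: str, scan_every: int = 200) -> str:
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    buffer, last_scanned = "", 0
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        buffer += delta
        # Scan periodically rather than per token; run the blocking scanners
        # in a worker thread so the event loop keeps draining the stream.
        if len(buffer) - last_scanned >= scan_every:
            last_scanned = len(buffer)
            _, results_valid, _ = await asyncio.to_thread(
                scan_output, scanners, prompt, buffer
            )
            if not all(results_valid.values()):
                print("\n[output stopped by guard]")
                break
    return buffer

# asyncio.run(guarded_completion("Tell me about LLM security."))
```

Scanning the whole accumulated output each time keeps the scanners' context intact, but it gets more expensive as the response grows; truly token-level guards would need scanners that carry incremental state.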