Support inference URLs for models used by scanners
adrien-lesur opened this issue
Is your feature request related to a problem? Please describe.
My understanding of the documentation and the code is that llm-guard will lazy-load the models required by the chosen scanners from Hugging Face. I apologize if this is incorrect.
This is not ideal for consumers like Kubernetes workloads because:
- When `llm-guard` is used as a library:
  - each pod will download the same models, wasting resources;
  - k8s workloads are usually preferred with low resource allocations to allow efficient horizontal scaling.
- With the "usage as API" scenario (a dedicated `llm-guard-api` deployment with more resources), you might still want your `llm-guard-api` deployment to scale too, and you face the same resource optimization issue.
A third option is that you already have the models deployed somewhere in a central place so that the only information required by the scanners would be the inference URL and the authentication.
Describe the solution you'd like
Users who host and run models on a central platform should be able to provide inference URLs and authentication to the scanners, instead of having the scanners lazy-load the models.
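A rough sketch of what this could look like from a consumer's point of view. Note that the `inference_url` and `api_token` parameters below are hypothetical and do not exist in llm-guard today; only the surrounding `PromptInjection` / `scan_prompt` usage reflects the current documented API.

```python
# Hypothetical sketch only: llm-guard does not currently expose these
# parameters. It illustrates pointing a scanner at a centrally hosted
# model instead of lazy-loading it from Hugging Face.
from llm_guard import scan_prompt
from llm_guard.input_scanners import PromptInjection

scanner = PromptInjection(
    # Hypothetical: remote endpoint serving the scanner's model
    inference_url="https://models.internal.example.com/prompt-injection",
    # Hypothetical: token used to authenticate against that endpoint
    api_token="REDACTED",
)

sanitized_prompt, results_valid, results_score = scan_prompt([scanner], "user input here")
```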
Describe alternatives you've considered
The existing possible usages described by the documentation (as a library or as API).
Hey @adrien-lesur, at some point we considered supporting HuggingFace Inference Endpoints, but we learned that it's not widely used.
How would you usually deploy those models? I assume https://github.com/neuralmagic/deepsparse or something.
Hi @asofter,
The models would usually be deployed via vLLM, as documented here for Mistral.
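For context, a vLLM deployment exposes an OpenAI-compatible HTTP API, so in the scenario described above a scanner would only need the endpoint URL and a token. A minimal sketch of such a call, where the URL, token, and model name are placeholders and not part of any existing llm-guard integration:

```python
# Minimal sketch of calling a model served behind vLLM's OpenAI-compatible
# HTTP API. In the requested feature, a scanner would send its inference
# request to such an endpoint instead of loading the model locally.
import requests

INFERENCE_URL = "https://models.internal.example.com/v1/chat/completions"  # placeholder
API_TOKEN = "REDACTED"  # placeholder

response = requests.post(
    INFERENCE_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model name
        "messages": [{"role": "user", "content": "Classify this prompt ..."}],
        "max_tokens": 16,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```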