LlamaEdge/LlamaEdge

The easiest & fastest way to run customized and fine-tuned LLMs locally or on the edge

Home page: https://llamaedge.com/

Feature Request: Create a multi-threaded API server demo

juntao opened this issue

Summary

This approach only works when the available RAM is several times the size of the model, since every instance loads its own copy of the weights. I think we could demo a PoC using TinyLlama.

1. Start several API server instances, each listening on a different port.

2. Start an nginx server to proxy incoming requests to these API servers.

3. The nginx server routes each request to the first API server that does not respond with 503, i.e. the first instance that is not busy (see the sketch after this list).
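
A minimal sketch of the PoC, assuming the stock llama-api-server.wasm from LlamaEdge, a locally downloaded TinyLlama GGUF file, and that a busy instance answers 503 as step 3 implies. The model file name, flags (`--nn-preload`, `--prompt-template`, `--socket-addr`), and ports are illustrative and should be checked against the LlamaEdge docs for your version:

```bash
#!/usr/bin/env bash
# PoC: N llama-api-server instances behind an nginx proxy that retries
# the next upstream whenever an instance answers 503.
set -euo pipefail

MODEL=tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf   # assumed local file name
INSTANCES=4
BASE_PORT=8080

# Step 1: one API server per port. Each instance loads its own copy of
# the model, which is why RAM must be several times the model size.
for i in $(seq 0 $((INSTANCES - 1))); do
  port=$((BASE_PORT + i))
  wasmedge --dir .:. \
    --nn-preload default:GGML:AUTO:"$MODEL" \
    llama-api-server.wasm \
    --prompt-template chatml \
    --socket-addr "0.0.0.0:$port" \
    > "api-server-$port.log" 2>&1 &
done

# Steps 2-3: generate an nginx config whose upstream lists every
# instance. proxy_next_upstream makes nginx skip a 503 and try the next
# server; non_idempotent is needed because chat completions are POST
# requests, which nginx refuses to retry by default. max_fails=0 stops
# nginx from temporarily banning an instance that was merely busy.
{
  echo "events {}"
  echo "http {"
  echo "  upstream llamaedge {"
  for i in $(seq 0 $((INSTANCES - 1))); do
    echo "    server 127.0.0.1:$((BASE_PORT + i)) max_fails=0;"
  done
  cat <<'EOF'
  }
  server {
    listen 9000;
    location / {
      proxy_pass http://llamaedge;
      proxy_next_upstream error timeout http_503 non_idempotent;
    }
  }
}
EOF
} > llamaedge-nginx.conf

# May require sudo depending on nginx's compiled-in log/pid paths.
nginx -c "$PWD/llamaedge-nginx.conf"
```

Clients would then talk only to the proxy, e.g. `curl http://localhost:9000/v1/chat/completions ...`, and nginx would transparently route each request to an instance that is not busy.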
