getzep / zep

Zep: Long-Term Memory for AI Assistants.

Home Page: https://docs.getzep.com

[FEAT] Configurable OpenAI timeouts and retry settings for compatible APIs

esatapedico opened this issue

Is your feature request related to a problem? Please describe.
I'm using LocalAI as an OpenAI-compatible API for self-hosted LLM models, and I've configured its endpoint so that Zep uses it for summarization, intent extraction, and entity extraction.

My local server, however, is not that beefy, and requests to it can take several minutes to complete. When Zep starts calling my API for these tasks, requests time out and retries kick in. Not only do the responses never arrive, but the API also gets overloaded and eventually becomes unusable for a while.
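For context, a minimal sketch of the situation, assuming the sashabaranov/go-openai client pointed at a LocalAI endpoint (the base URL, model name, and 30-second timeout are placeholders, not Zep's actual values): a client-side timeout shorter than the local model's generation time cancels the request even though the server is still busy working on it, and each retry adds more load.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"

	openai "github.com/sashabaranov/go-openai"
)

func main() {
	// Point an OpenAI-compatible client at a local LocalAI endpoint.
	// The base URL and API key here are placeholders for illustration.
	cfg := openai.DefaultConfig("not-needed-for-localai")
	cfg.BaseURL = "http://localhost:8080/v1"

	// A fixed client-side timeout (30s here) cancels any completion that
	// takes longer, even though the server keeps generating in the background.
	cfg.HTTPClient = &http.Client{Timeout: 30 * time.Second}

	client := openai.NewClientWithConfig(cfg)

	resp, err := client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
		Model: "gpt-3.5-turbo", // LocalAI maps this name to a local model
		Messages: []openai.ChatCompletionMessage{
			{Role: openai.ChatMessageRoleUser, Content: "Summarize: ..."},
		},
	})
	if err != nil {
		fmt.Println("request failed (likely a client-side timeout):", err)
		return
	}
	fmt.Println(resp.Choices[0].Message.Content)
}
```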

I see that a retry policy and a timeout are configured for OpenAI calls, but they seem to be hardcoded at the moment, so I couldn't adapt them to my needs.

Describe the solution you'd like
OpenAI timeouts and retries could be made configurable through the config file and environment variables, so that the currently hardcoded values can be overridden (see the sketch below). Whether it would make sense to have different values for different kinds of requests (summarization, intents, embeddings) I don't know; maybe it's simpler if it's just a single setting.
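A minimal sketch of what this could look like, again assuming the sashabaranov/go-openai client. The environment variable names (`ZEP_OPENAI_TIMEOUT_SECONDS`, `ZEP_OPENAI_MAX_RETRIES`, `OPENAI_API_BASE`), the defaults, and the retry helper are hypothetical, not existing Zep configuration:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"strconv"
	"time"

	openai "github.com/sashabaranov/go-openai"
)

// envInt reads an integer from the environment, falling back to a default.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

// withRetry calls fn up to attempts times with a simple linear backoff.
func withRetry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(time.Duration(i+1) * 2 * time.Second)
	}
	return err
}

func main() {
	// Hypothetical settings; a generous timeout default suits slow local models.
	timeoutSecs := envInt("ZEP_OPENAI_TIMEOUT_SECONDS", 300)
	maxRetries := envInt("ZEP_OPENAI_MAX_RETRIES", 2)

	cfg := openai.DefaultConfig(os.Getenv("OPENAI_API_KEY"))
	if base := os.Getenv("OPENAI_API_BASE"); base != "" {
		cfg.BaseURL = base // e.g. http://localhost:8080/v1 for LocalAI
	}
	cfg.HTTPClient = &http.Client{Timeout: time.Duration(timeoutSecs) * time.Second}
	client := openai.NewClientWithConfig(cfg)

	var resp openai.ChatCompletionResponse
	err := withRetry(maxRetries, func() error {
		var callErr error
		resp, callErr = client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
			Model: "gpt-3.5-turbo",
			Messages: []openai.ChatCompletionMessage{
				{Role: openai.ChatMessageRoleUser, Content: "Summarize this conversation: ..."},
			},
		})
		return callErr
	})
	if err != nil {
		fmt.Println("all retries failed:", err)
		return
	}
	fmt.Println(resp.Choices[0].Message.Content)
}
```

A single timeout/retry pair shared by all request types would already address the overload described above; per-task overrides could be layered on later if they turn out to be needed.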

Describe alternatives you've considered
I've turned off intent and entity extraction in an attempt not to overload my API with too many requests in a short period, but unfortunately even a single summarization request can easily take a few minutes in my case. For my use case it's fine if summarization updates take a bit longer, as long as they eventually complete.

Additional context
I understand this doesn't make much sense when consuming the predictable OpenAI API, but since compatible APIs can be used, they may come with very different performance characteristics. I'm falling back to the OpenAI API in Zep for now because I can't use my self-hosted API for it, even though I use it successfully in my application code (but then again, my use case is very lenient about slow responses).

We're refactoring our LLM support with a new release expected late Q1/early Q2. We'll consider making timeouts configurable.