getzep / zep

Zep: Long-Term Memory for AI Assistants.

Home Page: https://docs.getzep.com


[FEAT] Support concurrent updates for Zep users

nicoeiris11 opened this issue · comments

Is your feature request related to a problem? Please describe.
I'm running into a lot of problems in production when updating Zep users from async tasks. I run Celery tasks that process data and save the results to user metadata, and different async tasks can update the same user's metadata at the same time. I'm aware that Zep implements DB advisory locks in the update, as follows:

[screenshot of Zep's advisory-lock code omitted]

However, having no control over this behavior from my app makes my data inconsistent and causes many of my tasks to fail (Zep APIError).

Describe the solution you'd like
I'd like better handling of locks on Zep's side, specifically:

  • A method to check if a user is locked
  • Having the update and aupdate methods wait until a lock is released before performing the operation.
  • A method to force the release of a user lock.

Describe alternatives you've considered
Either have the Zep user update operation wait until the lock is released before updating the user, or provide a method to check whether a user is locked, so the client can wait until the lock is released before performing the next update.
A method to force the release of a lock when something goes wrong on the Zep side would also be useful.

Additional context
Currently, each time I want to perform an update, I do the following (this workaround is not working reliably):

  1. Check whether user_id is in the Redis cache
  2. If present, wait 20 seconds
  3. If still present, raise an exception
  4. If not present, save user_id in the cache
  5. Run zep.update
  6. Remove user_id from the cache

This is the way I found to handle concurrent updates in my Celery async tasks, but it still fails in many cases. Any thoughts? Is there a better solution?

Thank you so much and great work guys!!

Thanks for raising this. We candidly did not design user metadata handling with high concurrency requirements in mind. It would be helpful to understand how you're using user metadata. What data are you storing in the user object?

Hi @danielchalef , thanks for your quick response.

My data flow is the following:

  • A client requests a new user, attaching a text file (the file contains 2 sections to be summarized at the same time).
  • My FastAPI service creates the user in Zep with empty metadata and triggers 2 Celery tasks for that user (each async task does the heavy processing of summarizing one section).
  • At this point, I queue both Celery tasks from the "main" thread and update the user metadata, saving both Celery task IDs.
  • Each Celery async task summarizes its section and, when done, updates the user metadata in Zep with the resulting summary.

So in the user metadata I save:

  1. User info
  2. Celery task 1 ID
  3. Celery task 2 ID
  4. Celery task 1 summary result (section 1 of the file)
  5. Celery task 2 summary result (section 2 of the file)
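Given the metadata shape above (user info plus per-task IDs and summaries), each writer only needs to touch its own keys, so a read-merge-write keeps one task's update from clobbering the others'. A minimal sketch; `get_metadata` and `set_metadata` are hypothetical stand-ins for the Zep client's user read and update calls, not its actual API:

```python
def merge_update(get_metadata, set_metadata, user_id, patch):
    """Read current metadata, merge in `patch`, and write the result back.

    Merging (rather than replacing the whole dict) means a task writing
    task1_summary cannot erase user_info or task2_summary written earlier.
    The read-then-write is still not atomic, so this narrows, but does not
    eliminate, the concurrent-update window.
    """
    current = get_metadata(user_id) or {}
    merged = {**current, **patch}       # patch keys win on conflict
    set_metadata(user_id, merged)
    return merged
```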

What happens is that sometimes both async tasks, or the "main" thread and one of the async tasks, attempt to update at the same time, and the app breaks with an APIError (produced by the advisory lock in the Postgres DB).

Ideally, one of the concurrent updates should wait for the lock to be released. Or, at the very least, the API should provide a method like zep.user(user_id).is_locked() so the client can wait instead of hitting the APIError.
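The proposed check could be wrapped client-side in a small polling helper. To be clear, `is_locked()` is the hypothetical method suggested above (it does not exist in Zep), so the sketch takes it as a plain callable:

```python
import time

def wait_until_unlocked(is_locked, timeout=10.0, poll=0.2):
    """Poll a (hypothetical) is_locked() callable until it returns False.

    Returns True if the lock was released within `timeout` seconds,
    False if we gave up waiting.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_locked():
            return True
        time.sleep(poll)
    return False
```

Usage would then be something like `if wait_until_unlocked(lambda: zep.user(user_id).is_locked()): ...run the update...` if such an endpoint existed.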

Candidate fix in #329

@nicoeiris11 Zep v0.24.0 includes an experimental approach to locking user metadata that should cope better with high-concurrency updates to the same user record. Please try it out and let me know if this fixes your issue. We'll apply this fix more widely if so.

@danielchalef, thank you so much for the quick fix.

I tested v0.24.0 and the number of failed updates decreased significantly, but I still had cases where Zep failed after the 3 retries. Would it be possible to experiment with a retry policy that uses exponential backoff and more time between attempts?
For an environment with heavy load and multiple users being updated concurrently, 200ms doesn't seem to be enough time for locks to be released. What about starting at 5 seconds with exponential backoff?
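The requested policy might look like the wrapper below (sketched client-side for illustration; on the server, Zep would apply the equivalent around its advisory-lock retry). The 5-second base, growth factor, jitter, and exception type are all illustrative assumptions, not Zep's actual values:

```python
import random
import time

def retry_with_backoff(fn, *, base=5.0, factor=2.0, max_attempts=4,
                       retry_on=(RuntimeError,)):
    """Call fn(); on a retryable error, sleep base * factor**attempt
    seconds (plus a little jitter to de-synchronize competing workers)
    and try again, re-raising after max_attempts failures."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                       # out of attempts: surface the error
            delay = base * (factor ** attempt)   # 5s, 10s, 20s, ...
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

The jitter matters under contention: if every blocked worker retries on the same schedule, they all collide with the lock again at the same instant.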

I appreciate your time and dedication to improving the user experience/development of the tool.

Good point. An exponential backoff may work better here. I'm loath to increase the max backoff time beyond ~10 seconds, preferring the client time out and retry. See #330

v0.25.0 - Use Exponential Backoff for Metadata Lock Fails is building. Please let me know your thoughts once you've had a chance to check it out.

@danielchalef thank you so much for releasing a new version with the changes so fast.

I'll let you know as soon as I have updates from the new tag testing.

@danielchalef I want to confirm that with v0.25.0 my load test passes 100%.

There will always be a request-load threshold at which Zep returns timeouts due to the number of concurrent DB operations, but in my latest experiments I no longer saw APIErrors due to locks (only timeouts when I deliberately forced a very high-load scenario).

I want to thank you for your hard work and dedication to maintaining the repo and addressing developers' issues.
Best regards!

@nicoeiris11 Great to hear! If you're using Zep's default docker-compose setup, note that the Postgres instance is not tuned for production use. There is, however, plenty of literature online on sizing and tuning Postgres deployments.

@danielchalef Yes, I'm using docker-compose in production, pointed at the stable tag release.
Any specific recommendations or resources on Postgres tuning to get the most out of this Docker service?

The Postgres website has good guidance on tuning. I've also found this tool useful: https://pgtune.leopard.in.ua/