getzep / zep

Zep: Long-Term Memory for AI Assistants.

Home Page: https://docs.getzep.com


[FEAT] Support concurrent updates for Zep users

nicoeiris11 opened this issue · comments

Is your feature request related to a problem? Please describe.
I'm running into a lot of problems in production when updating Zep users from async tasks. I run Celery tasks that process data and save the results to user metadata, and different async tasks can update the same user's metadata at the same time. I'm aware that Zep implements DB advisory locks in the update, as follows:

[screenshot of Zep's advisory-lock code omitted]

However, having no control over this behavior from my app makes my data inconsistent and causes many of my tasks to fail (Zep APIError).

Describe the solution you'd like
I'd like better handling of locks on Zep's side, specifically:

  • A method to check if a user is locked
  • Having the update and aupdate methods wait until a lock is released before performing the operation.
  • A method to force the release of a user lock.

Describe alternatives you've considered
Either have the Zep user update operation wait until the lock is released before updating the user, or provide a method to check whether a user is locked, so the client can wait until the lock is released before performing the next update.
A method to force the release of a lock when something goes wrong on the Zep side would also be useful.

Additional context
Currently, each time I want to perform an update, I do the following (this workaround is not working reliably):

  1. Check whether user_id is in the Redis cache
  2. If present, wait 20 seconds
  3. If still present, raise an exception
  4. If not present, save user_id in the cache
  5. Run zep.update
  6. Remove user_id from the cache

This is the way I found to handle concurrent updates in my Celery async tasks, but it still fails in many cases. Any thoughts? Is there a better solution?

Thank you so much and great work guys!!

Thanks for raising this. We candidly did not design user metadata handling with high concurrency requirements in mind. It would be helpful to understand how you're using user metadata. What data are you storing in the user object?

Hi @danielchalef , thanks for your quick response.

My data flow is the following:

  • A client requests a new user, attaching a text file (the file contains 2 sections to be summarized at the same time).
  • My FastAPI service creates the user in Zep with empty metadata and triggers 2 Celery tasks for that user (each async task does the heavy processing of summarizing one section).
  • At this point, I queue both Celery tasks from the "main" thread and update the user metadata, saving both Celery task IDs.
  • Each Celery async task summarizes its section and, when done, updates the user metadata in Zep with the resulting summary.

So in the user metadata I save:

  1. User info
  2. Celery task 1 ID
  3. Celery task 2 ID
  4. Celery task 1 summary result (section 1 of the file)
  5. Celery task 2 summary result (section 2 of the file)
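Given the metadata shape above (user info plus per-task IDs and summaries), each writer only needs to touch its own keys, so a read-merge-write keeps one task's update from clobbering the others'. A minimal sketch; `get_metadata` and `set_metadata` are hypothetical stand-ins for the Zep client's user read and update calls, not its actual API:

```python
def merge_update(get_metadata, set_metadata, user_id, patch):
    """Read current metadata, merge in `patch`, and write the result back.

    Merging (rather than replacing the whole dict) means a task writing
    task1_summary cannot erase user_info or task2_summary written earlier.
    The read-then-write is still not atomic, so this narrows, but does not
    eliminate, the concurrent-update window.
    """
    current = get_metadata(user_id) or {}
    merged = {**current, **patch}       # patch keys win on conflict
    set_metadata(user_id, merged)
    return merged
```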

What happens is that sometimes both async tasks, or the "main" thread and one of the async tasks, attempt to update at the same time, and the app breaks with an APIError (produced by the advisory lock in the Postgres DB).

Ideally, one of the concurrent updates should wait for the lock to be released. Or, at the very least, the API should provide a method like zep.user(user_id).is_locked() so the client can wait instead of hitting the APIError.
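The proposed check could be wrapped client-side in a small polling helper. To be clear, `is_locked()` is the hypothetical method suggested above (it does not exist in Zep), so the sketch takes it as a plain callable:

```python
import time

def wait_until_unlocked(is_locked, timeout=10.0, poll=0.2):
    """Poll a (hypothetical) is_locked() callable until it returns False.

    Returns True if the lock was released within `timeout` seconds,
    False if we gave up waiting.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_locked():
            return True
        time.sleep(poll)
    return False
```

Usage would then be something like `if wait_until_unlocked(lambda: zep.user(user_id).is_locked()): ...run the update...` if such an endpoint existed.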

Candidate fix in #329

@nicoeiris11 Zep v0.24.0 includes an experimental approach to locking user metadata that should cope better with high-concurrency updates to the same user record. Please try it out and let me know if this fixes your issue. We'll apply this fix more widely if so.

@danielchalef, thank you so much for the quick fix.

I tested v0.24.0 and the number of failed updates decreased significantly, but I still had cases where Zep failed after the 3 retries. Would it be possible to experiment with a retry policy that uses exponential backoff and more time between attempts?
For an environment with heavy load and multiple users being updated concurrently, 200ms doesn't seem to be enough time for locks to be released. What about starting at 5 seconds with exponential backoff?
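The requested policy might look like the wrapper below (sketched client-side for illustration; on the server, Zep would apply the equivalent around its advisory-lock retry). The 5-second base, growth factor, jitter, and exception type are all illustrative assumptions, not Zep's actual values:

```python
import random
import time

def retry_with_backoff(fn, *, base=5.0, factor=2.0, max_attempts=4,
                       retry_on=(RuntimeError,)):
    """Call fn(); on a retryable error, sleep base * factor**attempt
    seconds (plus a little jitter to de-synchronize competing workers)
    and try again, re-raising after max_attempts failures."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                       # out of attempts: surface the error
            delay = base * (factor ** attempt)   # 5s, 10s, 20s, ...
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

The jitter matters under contention: if every blocked worker retries on the same schedule, they all collide with the lock again at the same instant.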

I appreciate your time and dedication to improving the user experience/development of the tool.

Good point. An exponential backoff may work better here. I'm loath to increase the max backoff time beyond ~10 seconds, preferring the client time out and retry. See #330

v0.25.0 - Use Exponential Backoff for Metadata Lock Fails is building. Please let me know your thoughts once you've had a chance to check it out.

@danielchalef thank you so much for releasing a new version with the changes so fast.

I'll let you know as soon as I have updates from the new tag testing.

@danielchalef I want to confirm that with v0.25.0 my load test passes 100%.

There will always be a request-load threshold at which Zep returns timeouts due to the number of concurrent DB operations, but in my latest experiments I no longer saw APIErrors due to locks (only timeouts when I deliberately forced a very high-load scenario).

I want to thank you for your hard work and dedication to maintaining the repo and addressing developers' issues.
Best regards!

@nicoeiris11 Great to hear! If you're using Zep's default docker-compose setup, note that the Postgres instance is not tuned for production use. There is, however, plenty of literature online on sizing and tuning Postgres deployments.

@danielchalef Yes, I'm using docker-compose in production, pointed at the stable tag release.
Any specific recommendations or resources on Postgres tuning to get the most out of this Docker service?

The Postgres website has good guidance on tuning. I've also found this tool useful: https://pgtune.leopard.in.ua/