minio / minio

The Object Store for AI Data Infrastructure

Home Page: https://min.io/download

Violation of Strict Consistency for HEAD-Object after DELETE-Object on buckets with Bucket Replication

ChristophWersal opened this issue

We are experiencing inconsistent behavior with the HEAD Object operation following a successful DELETE Object operation on buckets where asynchronous bucket replication is enabled. Specifically, in about 25% of cases, the HEAD Object operation still returns a success response even after the object has been deleted. This issue does not occur in buckets without replication. The issue was reproduced with both a PHP and a Deno S3 library. The DELETE/HEAD operations are always performed on the "main" MinIO cluster, from which the bucket is replicated to a "secondary" MinIO cluster.

Expected Behavior

After a successful DELETE Object operation, subsequent HEAD Object operations should not find the object (i.e., should return a 404 Not Found error).

Current Behavior

In approximately 25% of attempts, the HEAD Object operation still succeeds, indicating that the object is still present, despite the DELETE Object operation returning a success status.

When we wait approximately 250 ms after the DELETE Object operation, the HEAD Object operation always behaves as expected in our tests (it returns 404 Not Found).

Steps to Reproduce

  1. Enable bucket replication between two MinIO clusters (main MinIO => secondary MinIO).
  2. Upload an object to the replicated bucket (to the main MinIO).
  3. Delete the object using the DELETE Object API (from the main MinIO).
  4. Immediately perform a HEAD Object operation on the same object (on the main MinIO).

As the issue only occurs in ~25% of the tries in our setup, we use the load testing tool Apache Bench to run 1000 iterations (50 iterations in parallel).

Regression

Only tested our current version of MinIO.

Your Environment

  • Version used: minio version RELEASE.2024-04-18T19-09-19Z
  • Server setup and configuration:
    • two clusters
    • 5 nodes per cluster
    • 10 disks (HDD 16TB each) per node
    • disks encrypted with LUKS, filesystem xfs
  • Operating System and version:
    • Linux, Ubuntu 20.04
    • 5.4.0-176-generic x86_64

Thanks for the detailed report!

Ping @poornas @krisis - I will see if I can reproduce and see where the race is happening.

@ChristophWersal One question. Are the deletes and headobject for a specific version or just for the "top" object - meaning requests for object without version id?

I am trying "mc cp 0.txt myminio/testbucket && mc rm myminio/testbucket/0.txt && mc stat myminio/testbucket/0.txt" on a replicated bucket. This creates a delete marker and doesn't delete the older versions, and from casual testing it doesn't reproduce.

@klauspost We only interact with the "top" object - we do not use the version id at all.

I just used your example with "mc" in a bash loop. I ran it 100 times (not in parallel) and it failed 4 out of 100 times. Maybe calling "mc" from bash is a bit slower than the version I used with TypeScript/Deno or with PHP.

One piece of information I forgot: the tests I ran with Deno went through a load balancer (haproxy) that uses a "leastconn" config to route to any of the five nodes in the main MinIO cluster.

I just ran my Deno test again, this time on the first node of the cluster itself, and I directed the MinIO requests only to this first node. The failure rate then increased to around 74%. It seems that the faster the requests are, the higher the failure rate.

Thanks. I am not going to speculate too much. That said, my guess is that there is a window in the replication process where the request is proxied to the replica, and the replica serves the headobject before the replication has deleted the object.

Inserting time.Sleep(time.Second) at the start of replicateDelete reproduces it reliably.
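To make the guessed window concrete, here is a toy model of it. It is not MinIO code: the cluster and primary types, the gate channel, and the "proxy HEAD to the replica on a local miss" rule are all assumptions taken from the guess above, and versioning and delete markers are ignored. The gate channel plays the role of the delay that the time.Sleep in replicateDelete introduces, but makes the outcome deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// cluster is a toy object namespace: key -> present.
type cluster struct {
	mu   sync.Mutex
	objs map[string]bool
}

func newCluster() *cluster { return &cluster{objs: map[string]bool{}} }

func (c *cluster) has(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.objs[key]
}

func (c *cluster) set(key string, present bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.objs[key] = present
}

// primary models the main cluster with asynchronous delete replication.
type primary struct {
	local   *cluster
	replica *cluster
	gate    chan struct{} // closing it lets queued delete replication run
	wg      sync.WaitGroup
}

func newPrimary() *primary {
	return &primary{local: newCluster(), replica: newCluster(), gate: make(chan struct{})}
}

// Put stores the object and, for simplicity, treats it as already replicated.
func (p *primary) Put(key string) {
	p.local.set(key, true)
	p.replica.set(key, true)
}

// Delete removes the object locally and replicates the delete in the background.
func (p *primary) Delete(key string) {
	p.local.set(key, false)
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		<-p.gate // stands in for the replication delay
		p.replica.set(key, false)
	}()
}

// Head checks locally and, on a miss, asks the replica (assumed proxy behavior).
func (p *primary) Head(key string) bool {
	if p.local.has(key) {
		return true
	}
	return p.replica.has(key)
}

func main() {
	p := newPrimary()
	p.Put("0.txt")
	p.Delete("0.txt")
	fmt.Println("HEAD before delete is replicated:", p.Head("0.txt")) // true (stale)
	close(p.gate)
	p.wg.Wait()
	fmt.Println("HEAD after delete is replicated:", p.Head("0.txt")) // false
}
```

In this model the stale result disappears as soon as the delete reaches the replica, which would fit the observation that waiting roughly 250 ms made the HEAD return 404 consistently.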

We just updated both clusters to the new version "2024-05-07T06-41-25Z". We can confirm that we cannot reproduce the inconsistency anymore. Thanks a lot for the very quick fix.