dgraph-io / badger

Fast key-value DB in Go.

Home Page: https://dgraph.io/badger


[BUG]: Inconsistent conflict behavior when running transactions reading/writing unrelated keys across many goroutines

nick-jones opened this issue · comments

What version of Badger are you using?

v4.1.0 (tested on latest main too)

What version of Go are you using?

go version go1.20.3 darwin/arm64

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, CPU, OS)?

MacBook Pro M2

What steps will reproduce the bug?

When 2 concurrent transactions read a non-existent key and then write to that key, the behavior I currently observe is that one of the transactions will conflict (this, I believe, is expected). For example:

package main

import (
	"log"
	"os"
	"time"

	"github.com/dgraph-io/badger/v4"
	"golang.org/x/sync/errgroup"
)

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
}

func run() error {
	dir, err := os.MkdirTemp(os.TempDir(), "badger")
	if err != nil {
		return err
	}
	defer func() {
		log.Printf("cleaning up %s", dir)
		_ = os.RemoveAll(dir)
	}()

	db, err := badger.Open(badger.DefaultOptions(dir).WithLoggingLevel(badger.ERROR))
	if err != nil {
		return err
	}
	defer func() {
		_ = db.Close()
	}()

	key := []byte("key-1")

	eg := errgroup.Group{}
	eg.Go(func() error {
		return db.Update(func(txn *badger.Txn) error {
			<-time.After(time.Second) // pause long enough to ensure the other goroutine is running
			_, _ = txn.Get(key)
			return txn.Set(key, []byte("value-1"))
		})
	})
	eg.Go(func() error {
		return db.Update(func(txn *badger.Txn) error {
			<-time.After(time.Second) // pause long enough to ensure the other goroutine is running
			_, _ = txn.Get(key)
			return txn.Set(key, []byte("value-2"))
		})
	})
	return eg.Wait()
}

...yields...

$ go run .
2023/06/04 19:26:09 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger3191223375
2023/06/04 19:26:09 Transaction Conflict. Please retry
exit status 1

Another way to observe this behavior is to use 2 manually managed transactions and "inline" the executed steps, i.e. execute them one after the other while both transactions are open. This is generally an easier way to trigger the conflict. For example:

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/dgraph-io/badger/v4"
)

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
}

func run() error {
	dir, err := os.MkdirTemp(os.TempDir(), "badger")
	if err != nil {
		return err
	}
	defer func() {
		log.Printf("cleaning up %s", dir)
		_ = os.RemoveAll(dir)
	}()

	db, err := badger.Open(badger.DefaultOptions(dir).WithLoggingLevel(badger.ERROR))
	if err != nil {
		return err
	}
	defer func() {
		_ = db.Close()
	}()

	key := []byte("key-1")

	tx1 := db.NewTransaction(true)
	defer tx1.Discard()

	tx2 := db.NewTransaction(true)
	defer tx2.Discard()

	_, _ = tx1.Get(key)
	_ = tx1.Set(key, []byte("value-1"))

	_, _ = tx2.Get(key)
	_ = tx2.Set(key, []byte("value-2"))

	if err = tx1.Commit(); err != nil {
		return fmt.Errorf("tx1 failed: %w", err)
	}
	if err = tx2.Commit(); err != nil {
		return fmt.Errorf("tx2 failed: %w", err)
	}
	return nil
}

...outputs...

$ go run .
<snip>
2023/06/04 19:27:41 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger3558899794
2023/06/04 19:27:41 tx2 failed: Transaction Conflict. Please retry
exit status 1

Now, I have a pair of transactions that are slightly more convoluted than the above, but they are expected to conflict with each other for the same reasons. These transactions are executed from different goroutines, so conflicts can happen on a general basis. To simplify things, I've done the same as above: I've taken what would be executed by 2 goroutines concurrently and "inlined" the steps to highlight a scenario where the 2 transactions should conflict (this is the doWork function below). This "inlined" version conflicts as expected. However, if I run those steps across many goroutines at the same time (each operating on completely different keys), occasionally 1 of the executions unexpectedly fails to conflict.

So, with the following code:

package main

import (
	"crypto/rand"
	"errors"
	"fmt"
	"log"
	"math/big"
	"os"
	"sync/atomic"
	"time"

	"github.com/dgraph-io/badger/v4"
	"github.com/google/uuid"
	"golang.org/x/sync/errgroup"
)

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
}

func run() error {
	dir, err := os.MkdirTemp(os.TempDir(), "badger")
	if err != nil {
		return err
	}
	defer func() {
		log.Printf("cleaning up %s", dir)
		_ = os.RemoveAll(dir)
	}()

	db, err := badger.Open(badger.DefaultOptions(dir).WithLoggingLevel(badger.ERROR))
	if err != nil {
		return err
	}
	defer func() {
		_ = db.Close()
	}()

	eg := errgroup.Group{}
	var conflicts uint64
	for i := 0; i < 1_000; i++ {
		i := i
		eg.Go(func() error {
			err := doWork(db, i)
			if errors.Is(err, badger.ErrConflict) {
				atomic.AddUint64(&conflicts, 1)
				return nil
			}
			return fmt.Errorf("unexpected result: err = %v (i = %d)", err, i)
		})
	}
	if err = eg.Wait(); err != nil {
		log.Printf("failed with %d conflicts", conflicts)
		return err
	}

	log.Printf("completed as expected with %d conflicts", conflicts)

	return nil
}

func doWork(db *badger.DB, i int) error {
	delay()

	key1 := fmt.Sprintf("v:%d:%s", i, uuid.NewString())
	key2 := fmt.Sprintf("v:%d:%s", i, uuid.NewString())

	tx1 := db.NewTransaction(true)
	defer tx1.Discard()
	tx2 := db.NewTransaction(true)
	defer tx2.Discard()

	_ = getValue(tx2, key1)
	_ = getValue(tx2, key2)
	_ = getValue(tx1, key1)
	_ = getValue(tx2, key1)
	setValue(tx2, key1, "value1-placeholder")
	setValue(tx2, key2, "value2")

	if err := tx2.Commit(); err != nil {
		return fmt.Errorf("tx2 failed: %w (key1 = %s, key2 = %s)", err, key1, key2)
	}

	setValue(tx1, key1, "value1")
	_ = getValue(tx1, key1)
	setValue(tx1, key1, "updated-value1")

	delay()
	if err := tx1.Commit(); err != nil {
		return fmt.Errorf("tx1 failed: %w (key1 = %s, key2 = %s)", err, key1, key2)
	}
	return nil
}

func getValue(txn *badger.Txn, key string) string {
	val, err := txn.Get([]byte(key))
	if err != nil {
		if errors.Is(err, badger.ErrKeyNotFound) {
			return ""
		}
		panic(err)
	}
	data, err := val.ValueCopy(nil)
	if err != nil {
		panic(err)
	}
	return string(data)
}

func setValue(txn *badger.Txn, key, value string) {
	if err := txn.Set([]byte(key), []byte(value)); err != nil {
		panic(err)
	}
}

func delay() {
	jitter, err := rand.Int(rand.Reader, big.NewInt(100))
	if err != nil {
		panic(err)
	}
	<-time.After(time.Duration(jitter.Int64()) * time.Millisecond)
}

This is the output I get with repeated running:

$ while go run .; do echo "---"; done
2023/06/04 19:58:58 completed as expected with 1000 conflicts
2023/06/04 19:58:58 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger944970449
---
2023/06/04 19:58:58 completed as expected with 1000 conflicts
2023/06/04 19:58:58 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger358100745
---
2023/06/04 19:58:59 completed as expected with 1000 conflicts
2023/06/04 19:58:59 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger373541670
---
2023/06/04 19:59:00 completed as expected with 1000 conflicts
2023/06/04 19:59:00 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger55216248
---
2023/06/04 19:59:00 failed with 999 conflicts
2023/06/04 19:59:00 cleaning up /var/folders/fd/my_tbdw53yj8rn0gb_m7dlm40000gn/T/badger2814289974
2023/06/04 19:59:00 unexpected result: err = <nil> (i = 101)
exit status 1

So essentially 1 execution of the doWork() function failed to conflict, whilst the other 999 (and the 4,000 from the prior runs) conflicted as expected. Note that the random delays seem to help trigger the conditions for this, though it is unclear to me why.

Expected behavior and actual result.

It's unclear to me whether my expectation for these transactions to conflict is reasonable, but it does strike me as odd that the behavior can apparently differ between runs. A failure to conflict can result in lost writes.

Additional information

  • The same behavior is observed when using in-memory mode.
  • The behavior is still observed with a much reduced number of goroutines (e.g. 10).

Thanks for filing a detailed bug, I am looking into it. It seems that when a txn's timestamp (readTs) is zero, it doesn't conflict. I am trying to figure out why that is the case.

This seems like a bug to me: the read watermark used by Badger does not handle the ts=0 case well. I am looking for a solution for it. Thanks again for filing a reproducible bug.