DATA-DOG / go-txdb

Immutable transaction isolated sql driver for golang

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Occasional sql: transaction has already been committed or rolled back errors

samsondav opened this issue · comments

commented

We are seeing transaction has already been committed or rolled back errors if we try to use the database "too quickly" after initializing it, with a cancel context.

Code looks like this:

func init() {
	dbURL := os.Getenv("DATABASE_URL")
	if dbURL == "" {
		log.Fatal("You must provide a DATABASE_URL environment variable")
	}

	txdb.Register("txdb", "postgres", dbURL)
}


func DBWithDefaultContext(db *gorm.DB, fc func(db *gorm.DB) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10 * time.Second)
	defer cancel()
	return fc(db.WithContext(ctx))
}

func Test_example(t *testing.T) {

db, err := sql.Open("txdb", uuid.NewV4().String())
require.NoError(t, err)
t.Cleanup(func() { assert.NoError(t, db.Close()) })
gormDB, err := gorm.Open(postgres.New(postgres.Config{
	Conn: sqlDB,
	DSN:  uuid.NewV4().String(),
}), &gorm.Config{})
require.NoError(t, err)

err = DBWithDefaultContext(gormDB, func(db *gorm.DB) error {
	return db.Order("created_at ASC, address ASC").Where("deleted_at IS NULL").Find(&keys).Error
})

err := db.Exec(`SELECT 1`).Error // ERROR: sql: transaction has already been committed or rolled back
}

Note that this error about "transaction already committed or rolled back" has nothing to do with postgres (actually the transaction never gets closed). The postgres logs look like this:

2021-07-15 10:05:52.719 BST [63393] LOG:  statement: ;
2021-07-15 10:05:52.720 BST [63393] LOG:  statement: BEGIN READ WRITE
2021-07-15 10:05:52.720 BST [63393] LOG:  statement: SELECT * FROM "keys" WHERE deleted_at IS NULL AND "keys"."deleted_at" IS NULL ORDER BY created_at ASC, address ASC
// ENDS HERE

It must be a timing issue because it can be resolved by inserting a sleep or gosched here:

db, err := sql.Open("txdb", uuid.NewV4().String())
require.NoError(t, err)
t.Cleanup(func() { assert.NoError(t, db.Close()) })
runtime.Gosched() // FIXES THE TEST
// or
time.Sleep(1 * time.Millisecond) // FIXES THE TEST
// ... rest of the test code

Note that I was running this test with a parallelism flag of 1 - single threaded mode.

UPDATE:

It seems that calling runtime.Gosched() only sometimes fixes it - not always. Calling db.Exec('SELECT 1') in place of that seems to be a more reliable fix.

I am also getting this since v0.1.4.
runtime.Gosched and time.Sleep do not fix it for me
The only fix for me was to revert to v0.1.3

should be fixed by commit now.
the patch before introduced a chance for deadlock, but it also was necessary to prevent shared resource access. hope this will solve both issues. please reopen in case if you see any issue on v0.1.5

@l3pp4rd sorry but I am still getting the errors even when using v0.1.5. Using v0.1.3 shows no errors.
sql: transaction has already been committed or rolled back

I'm also experiencing it on v0.1.5 also stuck with gorm in our project. Any update on this?

I think I've tracked this down. You can reproduce it by creating a db then issuing an ExecContext. After than first insert is completed, the context that begins the tx is still being used. That insert is over, and the context cancelation shouldnt' affect the db as a whole, but rather the exec. After the first ExecContext is completed successfully, cancel the context used for that exec, and create a new insert with new context and it fails with tx has been rolled back. I'm going to see if I can work this up a bit more. When I add this test to the db_test.go it fails as expected:

failing test
func TestShouldHandleStmtsWithoutContextPollution(t *testing.T) {
	t.Parallel()
	for _, driver := range drivers() {
		t.Run(driver, func(t *testing.T) {
			db, err := sql.Open(driver, "contextpollution")
			if err != nil {
				t.Fatalf(driver+": failed to open a connection, have you run 'make test'? err: %s", err)
			}
			defer db.Close()

			const insertSQL = "INSERT INTO users (username, email) VALUES (?, ?)"

			ctx1, cancel1 := context.WithCancel(context.Background())
			defer cancel1()

			_, err = db.ExecContext(ctx1, insertSQL, "first", "first@foo.com")
			if err != nil {
				t.Fatalf("unexpected error inserting user 1: %s", err)
			}
			cancel1()

			ctx2, cancel2 := context.WithCancel(context.Background())
			defer cancel2()

			_, err = db.ExecContext(ctx2, insertSQL, "second", "second@foo.com")
			if err != nil {
				t.Fatalf("unexpected error inserting user 2: %s", err)
			}
			cancel2()

			rows, err := db.QueryContext(context.Background(), "select username from users")
			if err != nil {
				t.Fatalf("unexpected error querying users: %s", err)
			}
			defer rows.Close()

			var users []string
			for rows.Next() {
				var user string
				err := rows.Scan(&user)
				if err != nil {
					t.Errorf("unexpected scan failure: %s", err)
					continue
				}
				users = append(users, user)
			}
			sort.Strings(users)

			wanted := []string{"first", "second"}

			if len(users) != 2 {
				t.Fatalf("invalid users received; want=%v\tgot=%v", wanted, users)
			}
			for i, want := range wanted {
				if got := users[i]; want != got {
					t.Errorf("invalid user; want=%s\tgot=%s", want, got)
				}
			}
		})
	}
}

submitted PR to address the context pollution issue. Its passing all tests and runs without a hitch locally as well. Fixes the issues I was seeing.

@veqryn, I'm curious if you'll get errors with the branch I just pushed to fix an issue with context handling. Its fixed both gorm and sqlx tests locally.

now that PR #48 has merged, I think its safe to close this issue

@samsondav, I’m going to close this issue. It should be resolved now. Feel free to return if the issue resurfaces on v0.1.6 or later 👍

We're seeing this in our tests. Our testcases each have their own context that gets cancelled when the testcase finishes.

There is a race condition that the whole transaction may be canceled and the next testcase will use the canceled c.tx. Specifically, the select statement in that goroutine may observe either <-done which doesn't do anything or <-ctx.Done() which cancels rootCtx, triggering the error on the next operation.

go-txdb/db_go18.go

Lines 56 to 58 in 5bd9aad

case <-ctx.Done():
// operation was interrupted by context cancel, so we cancel parent as well
c.cancel()

As a temporary fix, we just removed the whole goroutine to prevent the global transaction being rolled back by a single operation's context.

It must be a timing issue because it can be resolved by inserting a sleep or gosched here:

By inserting a delay before the context gets canceled gives the goroutine in beginTxOnce time to select the <-done case but it is not deterministic.