Concurrent transactions on embedded replica connections fail

Question

sveltespot opened this issue 6 months ago

When multiple concurrent transactions are started on separate embedded replica connections, only one of them succeeds. The others fail with the error Err(RemoteSqliteFailure(3, 1, "cannot start a transaction within a transaction")).

The same scenario works fine on remote connections.

Below is a reproducer for this:

use libsql::{Builder, Connection, Result};

#[tokio::main]
async fn main() {
    let db_url = "http://localhost:8080";
    let replica = Builder::new_remote_replica(
        "/tmp/embedded_transaction.db",
        db_url.to_string(),
        String::new(),
    )
    .build()
    .await
    .unwrap();
    let remote = Builder::new_remote(db_url.to_string(), String::new())
        .build()
        .await
        .unwrap();
    let replica_conn_1 = replica.connect().unwrap();
    let replica_conn_2 = replica.connect().unwrap();

    let remote_conn_1 = remote.connect().unwrap();
    let remote_conn_2 = remote.connect().unwrap();

    let remote_task_1 = tokio::task::spawn(async move { db_work(remote_conn_1).await });
    let remote_task_2 = tokio::task::spawn(async move { db_work(remote_conn_2).await });

    let (task_1_res, task_2_res) = tokio::join!(remote_task_1, remote_task_2);
    let remote_task_1_res = task_1_res.unwrap();
    let remote_task_2_res = task_2_res.unwrap();

    // Everything works as expected in case of remote connections.
    assert!(remote_task_1_res.is_ok());
    assert!(remote_task_2_res.is_ok());

    let replica_task_1 = tokio::task::spawn(async move { db_work(replica_conn_1).await });
    let replica_task_2 = tokio::task::spawn(async move { db_work(replica_conn_2).await });

    let (task_1_res, task_2_res) = tokio::join!(replica_task_1, replica_task_2);
    let replica_task_1_res = task_1_res.unwrap();
    let replica_task_2_res = task_2_res.unwrap();

    if replica_task_1_res.is_err() {
        eprintln!("Task 1 failed: {:?}", replica_task_1_res);
    }
    if replica_task_2_res.is_err() {
        eprintln!("Task 2 failed: {:?}", replica_task_2_res);
    }

    // One of these concurrent tasks currently fails. Both tasks should succeed.
    assert!(replica_task_1_res.is_ok());
    assert!(replica_task_2_res.is_ok());
}

async fn db_work(conn: Connection) -> Result<()> {
    let tx = conn.transaction().await?;
    // Some business logic here...
    tokio::time::sleep(std::time::Duration::from_secs(2)).await;
    tx.execute("SELECT 1", ()).await?;
    tx.commit().await?;
    Ok(())
}

sveltespot · Answer 1 · Fri Apr 05 2024 12:17:31 GMT+0800 (China Standard Time)

This is a critical issue for us at the moment, so I would be more than willing to work on it if someone could point me to the places I should look into for this bug.

Jeroen Meeus · Answer 2 · Fri Apr 05 2024 17:07:31 GMT+0800 (China Standard Time)

I noticed that when connecting to a replica, the writer is cloned (I have 0 experience with Rust, so feel free to correct me if I'm wrong): https://github.com/tursodatabase/libsql/blob/main/libsql/src/database.rs#L552. That set off my spidey senses.

sveltespot · Answer 3 · Fri Apr 05 2024 17:26:45 GMT+0800 (China Standard Time)

I was thinking along the same lines, but I suspect the issue is in the conn.writer() function. It gets or constructs the writer from the remote client held in the replication context, and that client is set once during Builder::new_remote_replica(...).build(). I think sharing that one client is the problem. IMO the replication context should instead hold only the details needed to construct the client on demand (when db.connect() is called).
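
To make the hypothesis concrete, here is a minimal toy model of it. This is not libsql's actual code: RemoteClient, Conn, connect_shared, connect_on_demand and overlapping_begins are all hypothetical names made up for illustration, and the model assumes that the remote client tracks a single transaction state. Under that assumption, connections that share one client collide on the second BEGIN, while connections that each build their own client do not.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the remote write client (not a libsql type).
/// Assumption: it tracks one transaction state per client.
#[derive(Default)]
struct RemoteClient {
    in_transaction: bool,
}

impl RemoteClient {
    fn begin(&mut self) -> Result<(), String> {
        if self.in_transaction {
            // Same message as in the reported error, reproduced by the toy model.
            return Err("cannot start a transaction within a transaction".to_string());
        }
        self.in_transaction = true;
        Ok(())
    }

    fn commit(&mut self) {
        self.in_transaction = false;
    }
}

/// Hypothetical connection that forwards writes to a remote client.
struct Conn {
    writer: Arc<Mutex<RemoteClient>>,
}

impl Conn {
    fn begin(&self) -> Result<(), String> {
        self.writer.lock().unwrap().begin()
    }

    fn commit(&self) {
        self.writer.lock().unwrap().commit()
    }
}

/// Suspected current behaviour: one client is built up front and every
/// connection receives a clone of the same handle.
fn connect_shared(n: usize) -> Vec<Conn> {
    let shared = Arc::new(Mutex::new(RemoteClient::default()));
    (0..n).map(|_| Conn { writer: Arc::clone(&shared) }).collect()
}

/// Proposed behaviour: each connect call constructs its own client.
fn connect_on_demand(n: usize) -> Vec<Conn> {
    (0..n)
        .map(|_| Conn { writer: Arc::new(Mutex::new(RemoteClient::default())) })
        .collect()
}

/// Begins a transaction on every connection before committing any of them,
/// which mimics the overlap produced by the sleep in the reproducer.
fn overlapping_begins(conns: &[Conn]) -> Vec<Result<(), String>> {
    let results: Vec<Result<(), String>> = conns.iter().map(|c| c.begin()).collect();
    for (conn, result) in conns.iter().zip(&results) {
        if result.is_ok() {
            conn.commit();
        }
    }
    results
}
```

Calling overlapping_begins(&connect_shared(2)) gives one Ok and one Err, matching the shape of the failure in the reproducer, while overlapping_begins(&connect_on_demand(2)) gives two Ok results. Whether the real writer behaves like this shared client is exactly what needs to be confirmed in database.rs.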

Lucio Franco · Answer 4 · Sat Apr 06 2024 01:16:22 GMT+0800 (China Standard Time)

Thanks for the reproducer. I now have it failing locally in a test as well and will take a look.