r2dbc / r2dbc-spi

Service Provider Interface for R2DBC Implementations

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Statement.add() is Awkward

Michael-A-McMahon opened this issue · comments

I'm finding it a bit awkward to use add() correctly. It requires coding in a branch to have the first set of binds are handled differently, like this:

  /**
   * Binds multiple arrays of {@code values} to a {@code statement} by calling
   * {@link Statement#add()}
   */
  static Publisher<Void> bindMany(
    Statement statement, Publisher<Object[]> values) {

    AtomicBoolean isFirst = new AtomicBoolean(true);
    return Flux.from(values)
      .doOnNext(nextValues -> {
        // Don't call add() if this is the first set of bind values
        if (! isFirst.getAndSet(false))
          statement.add();

        for (int i = 0; i < nextValues.length; i++)
          statement.bind(i, nextValues[i]);
      })
      .then();
  }

I think the SPI would be easier to use and less error prone if it allowed the trailing add(). As a JDBC programmer, I also find the trailing add() to be more intuitive, because this is how JDBC's PreparedStatement.addBatch() works.

With a trailing add(), the code above could be written as:

  /**
   * Binds multiple arrays of {@code values} to a {@code statement} by calling
   * {@link Statement#add()}
   */
  static Publisher<Void> bindMany(
    Statement statement, Publisher<Object[]> values) {
    return Flux.from(values)
      .doOnNext(nextValues -> {
        for (int i = 0; i < nextValues.length; i++)
          statement.bind(i, nextValues[i]);

        statement.add();
      })
      .then();
  }

Am I the only one who thinks this? I'd like to see what others think now, before the 1.0 release solidifies the behavior of add().

With Cloud Spanner, we allowed trailing adds before the specification got clarified to forbid them, and never updated the implementation to be stricter.
It would make sense to me to allow trailing adds for end-user usability. Although with end users not being expected to use SPI directly, this might not be as important.

Although with end users not being expected to use SPI directly, this might not be as important.

I agree that with the target audience being mainly tool vendors, correctness and predictability is more important than usability.

I was the one requesting the specification here: #229

In particular, the comparison with JDBC doesn't hold, because JDBC unambiguously distinguishes between execute() and executeBatch() methods. R2DBC doesn't. Both types of statements are executed with execute(), and as such, there is ambiguity about what the semantics of trailing add() calls could mean. See the edge cases I listed in the above specification request.

I agree the status quo is awkward, but it already was, before this specification request. Forbidding trailing adds was the path to least incompatible changes, as removing awkwardness from the current design would have required more significant re-design (which can still be debated, just wanted to give some background).

There's indeed inconvenience when calling .add() when looping over a set of bindings. A way out could be a different approach to express a binding

statement.addBinding(binder -> binder.bind(0, …))
         .addBinding(binder -> binder.bind(0, …))
         .execute();

which omits the need for calling add altogether. Allowing trailing add() introduces a somewhat inconsistency because you cannot be certain whether the trailing add() creates a new empty binding or whether it leniently falls back to the last properly populated binding.

You could have also something like:

conn.createStatement("UPDATE person SET count = count + 1").add().add().execute()

While the statement is not parametrized, how often does it run? With a lenient approach, it would run only once, with a strict approach it would run twice.

I'm happy to sort out the issue before 1.0 as we seem to have entered a phase where we gather learnings and it makes sense to polish up the SPI.

@mp911de Would this be offered as a convenience default method (fine), or a replacement of the existing API (difficult)? In other words, would addBinding() also be used for non-batch statements?

addBinding (or a better name) would require a breaking change because we cannot distinguish between the first/last call.

There's some tension between "make it convenient to use" and "keep it as is and don't break things".

I think the cost of breaking the SPI would need to be outweighed by the benefit it provides. I clearly see the case for direct SPI users but as you already mentioned, the SPI is intended for libraries/frameworks as users. If we find that we can live with it as things are (although strange to use), then I'd vote to keep things as they are. If we find that we cannot live with it then we need to accept the fact of breaking a pretty core area of query execution code paths.

If we find that we can live with it as things are (although strange to use), then I'd vote to keep things as they are. If we find that we cannot live with it then we need to accept the fact of breaking a pretty core area of query execution code paths.

I think a vote would depend on seeing actual drafts of the improvement, though I currently can't imagine they'll pull their weight.

I think the core problem here is that R2DBC inherited JDBC's flaw of re-using the same type for 2 purposes:

  • Single statement prepared statements
  • Batched prepared statements

In JDBC, I imagine this was done to avoid re-declaring the 500 setInt(...), setString(...) methods twice. R2DBC doesn't have this problem, and in fact already created a separate API for non-prepared statement batches via Connection.createBatch(). I think it would be possible to offer a separate API as well for prepared statement batches, and remove the Statement.add() functionality entirely... Thoughts?

What if we specify that after calling add(), a Statement ignores the current set of binds, unless those are added as well? Would this resolve the ambiguity?

    // Executes 1 INSERT (add is not called)
    connection.createStatement("INSERT INTO t VALUES (0)")
      .execute();

    // Executes 1 INSERT (add is not called)
    connection.createStatement("INSERT INTO t VALUES (?)")
      .bind(0, 0)
      .execute();

    // Executes 1 INSERT (add is called 1 time)
    connection.createStatement("INSERT INTO t VALUES (0)")
      .add()
      .execute();

    // Executes 1 INSERT (add is called 1 time)
    connection.createStatement("INSERT INTO t VALUES (?)")
      .bind(0, 0).add()
      .execute();

    // Executes 2 INSERTs (add is called 2 times)
    connection.createStatement("INSERT INTO t VALUES (0)")
      .add()
      .add()
      .execute();

    // Executes 2 INSERTs (add is called 2 times)
    connection.createStatement("INSERT INTO t VALUES (?)")
      .bind(0, 0).add()
      .bind(0, 1).add()
      .execute();

    // Throws a error: "You forgot to add() your binds!"
    connection.createStatement("INSERT INTO t VALUES (?)")
      .bind(0, 0).add()
      .bind(0, 1).add()
      .bind(0, 2)
      .execute();

@Michael-A-McMahon There's now a different kind of awkwardness (or rather, inconsistency) between these two:

    // Executes 1 INSERT (add is not called)
    connection.createStatement("INSERT INTO t VALUES (?)")
      .bind(0, 0)
      .execute();

    // Throws a error: "You forgot to add() your binds!"
    connection.createStatement("INSERT INTO t VALUES (?)")
      .bind(0, 0).add()
      .bind(0, 1).add()
      .bind(0, 2)
      .execute();

How come a trailing add() is required for 2+ inserts, but not for 1 insert?

The current specification is, in fact, the most logical one if single statements and batched statements are to share the same API. You have to call add() in between two statement bind sets, kinda like a binary operator.

// 1 operand, no + (add) operator
1

// 2 operands, 1 + operator
1 + 2

// 3 operands, 2 + operators
1 + 2 + 3

I don't think anyone would suggest requiring a trailing + in any of the above, let alone requiring it only for 2+ operands?

// Both are legal
1
1 +

// Now, trailing + operators are required
1 + 2 +
1 + 2 + 3 +

The awkwardness criticised in the original comment stems from the fact that this isn't about binary operators, but about a set of chained method calls. If you could Stream.reduce(Statement::add), or rather foldLeft(Statement::add), your bind variables into a statement batch, then maybe it wouldn't feel as awkward, but that's not so simple, because in addition to adding statements to the batch, bind variables have to be bound at the right moment of the reduction/fold.

But perhaps, the underlying idea could be the solution here?

I now understand what has happened: There are at least two mental models of what add() does, and those models are incompatible.

In one model, add() is a mathematical operator, so add() goes between the operands.

In another model (the model that I use) add() pushes a set of binds into a queue, so add() always follows after binds have been set.

https://en.wikipedia.org/wiki/Principle_of_least_astonishment

Under the queue model, it is astonishing that a trailing add() results in an error.
In the mathematical model, it is astonishing if tailing adds were not an error: "a + b + c +"

I'm not sure what to do about this :) What mental model do we think is more intuitive for most programmers?

There are at least two mental models of what add() does, and those models are incompatible.

Yes, and I'm quite impartial to which one is chosen in the end, as long as there is not a single ambiguity left for the not-so-edge-case of executing a single query. Again, my point here is, if this were to be fixed, then the best option I see is to separate the single execution API from the batch API.

Splitting parametrized batches into a different API would change the programming model significantly and break not only drivers but also the calling code. We would reduce confusion greatly by giving the add() method a name that represents what it does:

I'm finished with the current parameter bindings and 
I want to bring the statement into a state where I 
can supply the next parameters until I'm ready to 
execute the statement.

I think add() is already a good name. It adds the current set of binds to the batch, so the verb "add" seems correct. If there is another name that screams "Don't call this after adding your last of binds!", that might be better. Verbs like "concat" or "append" might express that, however I personally like the name as it is.

To me, the most natural behavior of add() would require it to be trailing, as shown in my previous examples. If you have a single set of binds, then you don't call add(). If have (potentially) more than one set of binds, then you always call add(). This seems simple to understand, and is easier to work with as it eliminates the if(isFirst) condition for calling add().

But I think the issues with this idea are:

  1. It means execute() does one of two things. Either it will execute with the current set of binds, or it will ignore the current binds and only execute with added binds. The inconsistency over using the current set of binds might be confusing.
  2. It breaks existing code that has already adopted the behavior of not having a trailing add().

I hope I'm understanding the issues correctly; Please let me know if I'm missing something.

The general SPI approach seems to require rethinking in the context of usability and how parametrized batches should be supported. Right now, we do not have a strong demand to change the SPI as a proper approach requires more thought and we do not want to add something else that we would remove in the future. Rescheduling the ticket for a future release.

commented

Yes, I agree that Statement is awkward. I would appreciate very much if Statement were more convenient to use.
Statement as it is is almost unusable. I created my own class StatementWrapper addressing all the stupid issues of Statement:

  • unclear semantics of add() and awkward boilerplate code
  • awkward handling of possibly nullable values
  • inaccessible number of bindings
class StatementWrapper {

    private Statement statement;
    @Getter
    private int rowCounter = 0;

    private StatementWrapper(@NonNull Connection connection, @NonNull String sql) {
        this.statement = connection.createStatement(sql);
    }

    public static StatementWrapper of(@NonNull Connection connection, @NonNull String sql) {
        return new StatementWrapper(connection, sql);
    }

    /**
     * Unlike {@link Statement#bind(String, Object)}, this method supports both non-{@code null} as well as {@code null} values.
     */
    public <T> StatementWrapper bind(String name, T value, Class<T> type) {
        if (nonNull(value)) {
            statement = statement.bind(name, value);
        } else {
            statement = statement.bindNull(name, type);
        }
        return this;
    }

    /**
     * Unlike {@link Statement#add()}, this method is loop-friendly. 
     */
    public StatementWrapper add() {
        if (rowCounter++ > 0) { // the first call is ignored
            statement = statement.add();
        }
        return this;
    }

    public Publisher<? extends Result> execute() {
        return statement.execute();
    }
}

The usage is then much more straightforward:

public Mono<Integer> insertBooks(Connection connection, List<Book> books) {
    final StatementWrapper statement = StatementWrapper(connection, "INSERT INTO book (author, title) VALUES ($1, $2)");
    books.forEach(book ->
            .add() // the first call is simply ignored
            .bind("$1", book.getAuthor(), String.class)
            .bind("$1", book.getTitle(), String.class)
         );
    return Flux.from(statement.execute())
            .flatMap(Result::getRowsUpdated)
            .reduce(0, Integer::sum);
}