rethinkdb / rethinkdb

The open-source database for the realtime web.

Home Page: https://rethinkdb.com

Execute queries in parallel on a single connection

AtnNn opened this issue

Not to be confused with #2156, which is about evaluating parts of a single query simultaneously.

The JavaScript driver and many community drivers are asynchronous, but they don't get the full benefit of that asynchrony, because the server will only evaluate one query at a time on a single connection.

This also causes pathological behaviour when using changefeeds. Each changefeed adds a possible latency of 500ms to all other queries on the same connection. The only sane way to use changefeeds is to open a new connection for each changefeed. I believe this problem would be better solved by allowing parallel execution of queries on a single connection and not by adding server pools.
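
For concreteness, here is a sketch of that connection-per-changefeed workaround using the official JS driver's documented API (the table name and connection options are illustrative): the feed gets a dedicated connection, so its batching delay cannot add latency to other queries.

var r = require('rethinkdb');

r.connect({host: 'localhost', port: 28015}).then(function(feedConn) {
  // This connection serves only the feed; run all other queries elsewhere.
  return r.table('games').changes().run(feedConn);
}).then(function(cursor) {
  cursor.each(function(err, change) {
    if (err) throw err;
    console.log(change); // {old_val: ..., new_val: ...}
  });
});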

Another disadvantage of the current situation is that listening on an empty changefeed will cause an empty response to be sent every 500ms. This causes useless wake-ups and network traffic.

With this proposal, if a user waits for the result of a query before sending the next query to the server, queries would still be executed sequentially.
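
A sketch of the difference from the client's point of view (assuming a promise-returning run() and an illustrative users table, with r and conn as in the driver examples below): waiting for each result preserves sequential execution, while firing queries without waiting would let the server evaluate them concurrently.

function sequential(conn) {
  // The second query is not sent until the first response arrives.
  return r.table('users').insert({id: 1}).run(conn).then(function() {
    return r.table('users').get(1).run(conn);
  });
}

function concurrent(conn) {
  // Both queries are on the wire before either response returns; under
  // this proposal the server may evaluate them in parallel, in no fixed order.
  return Promise.all([
    r.table('users').insert({id: 2}).run(conn),
    r.table('users').count().run(conn)
  ]);
}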

When this is implemented, we should include a backwards-compatible mode. Some drivers rely on the fact that responses are in the same order as their respective requests.

To avoid complications in the web UI, it may be easier to not make this change for HTTP connections.

This proposal is based on my understanding of the current behaviour; please correct me if I am wrong.

@mlucy @danielmewes Any thoughts?

I don't believe that there exist drivers that rely on responses arriving in the same order as their requests. Which ones do?

@srh I was thinking about community drivers. Over a year ago, if I remember correctly, the Python driver's network code would discard responses until it got one with a matching token.

It's crazy if the Python driver discards responses instead of erroring; it knows the server is sending back garbage! Any driver with a synchronous API shouldn't be sending two queries without waiting for a response to the first, unless the first was a noreply query, or unless it was deliberately designed to do so (which is possible, but in that case it's unlikely they'd mistakenly rely on message ordering).

The Python driver does not have that behaviour anymore.

However the JavaScript driver does send new queries to the server without waiting for previous queries to respond. I believe other community drivers do too. It is the behaviour people are led to expect when there are unique tokens in the protocol.

I don't believe any of those are relying on responses arriving in the same order as requests.

The reason we've hesitated to do this is that in asynchronous languages like JS people sometimes write things like:

table.insert(...).run(conn, callback)
table.filter(...).run(conn, callback)

and expect the filter to see the write performed by the insert.

It might be worth just putting up with that problem, though. Tagging this as RQL_proposal, we should talk about it during the next discussion period.

Dropping the guarantee that queries on a given connection wait for all earlier ones to complete before running would also simplify the design of a connection pool API (#281). More discussion in the next RQL_proposal period, then.

@mlucy, for the JavaScript people out there, I think it is just a matter of encouraging them to leverage Promises or generators for pseudo-synchronous execution.
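
For example, the same pseudo-synchronous flow with promises, assuming run() returns a promise when no callback is given (as the official driver does; the table and filter are illustrative):

r.table('users').insert({name: 'a'}).run(conn)
  .then(function() {
    return r.table('users').filter({name: 'a'}).run(conn);
  })
  .then(function(cursor) { return cursor.toArray(); })
  .then(function(rows) { console.log(rows); })
  .catch(function(err) { console.error(err); });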

and expect the filter to see the write performed by the insert.

As they should. It would be a bug otherwise.

@srh, but written the way @mlucy wrote it, it would be common in Node-land to expect both queries to run with no guaranteed order

As a reference, a correct way to sequence queries in JavaScript would be:

table.insert(...).run(conn, function(error, result){
  if(error || result.first_error){ ... return; }
  table.filter(...).run(conn, callback);
});

exactly, or if you're into generators...

yield table.insert(...).run()
yield table.filter(...).run()

@srh, @thelinuxlich is right. In Node.js the programmer is responsible for making sure the callbacks fire correctly. I would love not to have to open a new connection for every asynchronous database call.

The reason behind the current behavior was the following: If you issue a write and then a read on the same connection, the read should see the write.

Some people felt strongly about that a long time ago, and we adopted the current behavior. I think it's fair to run the queries in an asynchronous fashion, especially since:

  • The guarantee is wonky: the read doesn't see your write if the write fails, and you have no guarantee that another write won't overwrite your first write before the read.
  • If you want to see your write, you can now use returnChanges (see the sketch after this list).
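
A sketch of that returnChanges alternative (returnChanges is the documented insert option; the table name is illustrative): the write itself reports the documents it produced, so no follow-up read, and hence no ordering guarantee, is needed.

r.table('users').insert({name: 'a'}, {returnChanges: true}).run(conn)
  .then(function(result) {
    // result.changes is an array of {old_val, new_val} pairs
    console.log(result.changes[0].new_val);
  });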

The current behavior is also confusing, I think. Users currently cannot build a safe connection pool (one where a query is guaranteed not to be issued on a connection that is already in use) without automatically coercing cursors and forbidding feeds. This is mostly because we run CONTINUE queries under the hood. All the work behind rethinkdbdash's pool went into working around this limitation.

Also, in my opinion it's expected that if you want a synchronous flow for asynchronous operations in Node.js, you must nest calls, use a library like async, or use generators.

@neumino After noticing that my controller methods were getting processed serially, I finally just gave up and started opening new connections for every....single....query.

It almost seems counter-intuitive, but performance shot WAAAAAY up.

Yeah, I'm pretty sure at this point that getting rid of that guarantee is the way to go.

We'll make a complete plan for how to proceed about this after shipping 1.16. As @neumino said, this is also relevant for the question of how to implement connection pools, and also matters for #3298.

Despite a certain potential to expose new bugs, I'm scheduling this for 2.0.

The reason is that having multiple changefeeds open on the same connection (as discussed in #3298 and #3678) isn't practicable without this change.

We should be conservative about when to enable this feature. I suggest two restrictions to this on the server side:

  • Only ever process one request at a time per request token. Make sure that per token, we send back responses in the same order in which the requests arrived. This avoids a whole bunch of potential race conditions (e.g. what if some cursor implementation sends two CONTINUE requests? What if it sends a CONTINUE and then a STOP before receiving the response to the CONTINUE?).
  • Increase the protocol version magic. Only enable concurrent query execution for drivers that send the new magic. That way we can be sure not to break any existing (third party) drivers that are not prepared to handle parallel query execution.

To be clear regarding "Make sure that per token, we send back responses in the same order in which the requests arrived.":
I think it's enough to just keep the lock on the token until we've completely sent the response. I would avoid doing anything more complex (like pipelining) for 2.0.
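
A minimal sketch of that per-token locking, in illustrative JavaScript rather than the actual server code (handleRequest is an invented stand-in for query evaluation): requests sharing a token are chained so only one runs at a time, while requests on different tokens proceed in parallel.

var tokenChains = new Map(); // token -> tail of that token's promise chain

function dispatch(token, handleRequest) {
  var tail = tokenChains.get(token) || Promise.resolve();
  // A request on a token starts only after the previous response on that
  // token has been fully sent; unrelated tokens run concurrently.
  var next = tail.then(function() { return handleRequest(); });
  tokenChains.set(token, next.catch(function() {})); // keep the chain alive on errors
  return next;
}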

I've complained in person that we shouldn't re-use tokens, like, at all; it just invites bugs. If we're bumping the protocol magic anyway we could put something else in like "when you send a CONTINUE request, you supply in addition the token you're going to query next" or alternatively the server sends back a different token, or something similar. The idea being that it should be impossible to send two CONTINUEs to the server before getting a response back from either one, and then trying to make sense of the resulting situation. My intent here is that the server will send back information from one and then go "we already did this" with the other. If the server sends back the token to use for the next request, it's actually impossible to submit two valid read requests on the wire before a response is gotten back from either one.
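
A hypothetical illustration of that handoff (every name here is invented, not the actual wire protocol): each response carries the only token valid for the next request on its stream, so two outstanding CONTINUEs are impossible by construction.

// sendRequest is a made-up helper that writes one request and resolves with its response.
function readNextBatch(conn, stream) {
  return sendRequest(conn, {type: 'CONTINUE', token: stream.nextToken})
    .then(function(response) {
      stream.nextToken = response.nextToken; // the server issues the next valid token
      return response.data; // the old token can never be replayed
    });
}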

That said, I do agree that we should worry about how much of it should be automatically parallelizable. Changefeeds seem like an obvious instance where the default should be YES, parallelize. Presumably if we made an option to .run that said "please run me in parallel" that would work too. Are there any other default-YES situations we should worry about?

I think we should default to parallelization for all queries (with some cap on the number of coroutines we spawn). For asynchronous drivers like JS it makes more sense, because it's entirely plausible people will just be firing off queries while other queries are queued up.

I think we should default to parallelization for all queries

👍
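
The "cap on the number of coroutines" mentioned above amounts to a counting semaphore; here is an illustrative client-side JavaScript stand-in (the real throttling would live in the server, and the limit is arbitrary):

function makeLimiter(limit) {
  var active = 0;
  var waiting = [];

  function release() {
    active--;
    var next = waiting.shift();
    if (next) next(); // wake one queued task, which re-checks the cap
  }

  function acquire() {
    if (active < limit) {
      active++;
      return Promise.resolve();
    }
    // Over the cap: wait for a slot, then re-check (another task may win it).
    return new Promise(function(resolve) { waiting.push(resolve); }).then(acquire);
  }

  return function runLimited(task) {
    return acquire().then(task).then(
      function(value) { release(); return value; },
      function(err) { release(); throw err; }
    );
  };
}

// e.g. var runQuery = makeLimiter(64); runQuery(function() { return q.run(conn); });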

There seems to be general consensus on this; planning to mark it as settled on Monday.

@mlucy, this is done except for #3754, isn't it?

If I open a feed and send a CONTINUE query, I won't get any response until the server sees a change (with the new protocol version, v0_4).
So if no change happens and I want to close the feed, I have to send the STOP query.

My question is:

  • What happens in this case? I seem to be stuck (no response is returned) for the CONTINUE or STOP query.
  • Is the CONTINUE query supposed to return nothing? An error?

One more thing: if I force the STOP query and trigger a change after that, the CONTINUE query returns an empty SUCCESS_SEQUENCE, but the STOP query will throw with something like "Token X not in cache".

@neumino -- that sounds like a bug to me; the STOP query should be interrupting the CONTINUE query. Good catch!