rethinkdb / rethinkdb

The open-source database for the realtime web.

Home Page: https://rethinkdb.com

For ordered sequences, have `changes()` return the new/old position

neumino opened this issue · comments

Somewhat related to #3709.

Typically if I want to build a leaderboard, I'll run something like this:

r.table('users').orderBy({index: 'score'}).limit(10).changes()

And then try to maintain an array of the best users.

The changes I'll get will be like this:

{
    new_val: {id: 14, score: 68},
    old_val: {id: 23, score: 10} 
}

The problem is that while I can remove the last element {id: 23, score: 10} from my array, I have no idea where to insert the new document {id: 14, score: 68} without scanning through my array.
It would be nice if changes() returned the new/old position for ordered sequences. Augmenting the data the way ReQL does for getNearest seems like a good idea to me.

Sounds like a good idea.

One more little thing: would it be possible to return a new type, ORDERED_FEED, for the same reason ATOM_FEED is useful?

In the case of an ordered sequence, I would like to have thinky maintain an array and just emit events when the array is updated.

I added a new issue for the ORDERED_FEED thing, since that's pretty easy. Returning the index is a little bit harder because of squashing.

Consider the case where the row A at position 5 is deleted, and then the row that is now at position 5 is updated from B to B'. At that point the queued-up changes look like [{old_val: A, new_val: nil, old_position: 5}, {old_val: B, new_val: B', old_position: 5, new_position: 5}]. Now if a new row A' is inserted that takes position 5, we squash it onto the first change, and we need to produce [{old_val: A, new_val: A', old_position: 5, new_position: 5}, {old_val: B, new_val: B', old_position: 6, new_position: 6}]. That is, squashing the new change onto the first change requires us to update the positions in the second change.

We could get around that by keeping a snapshot of the pre-batch ordered set around in memory and adding the position tags when the batch is sent, but that would double memory usage in the worst case and take a fair amount of implementation effort.
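To make the squashing problem concrete, here is a toy sketch (plain JavaScript, not the server implementation) of the bookkeeping described above. The function name and the exact change-object shape are made up for the illustration; the point is that squashing an insert onto an earlier queued deletion forces a position update in every change queued after it.

```javascript
// Toy sketch (not the server implementation): a queue of pending changes
// with positions, and what has to happen when an insert is squashed onto
// an earlier queued deletion at the same position.
function squashInsert(queued, newVal, pos) {
  for (var i = 0; i < queued.length; i++) {
    var c = queued[i];
    if (c.new_val === null && c.old_offset === pos) {
      // Squash the insert onto the queued deletion at this position.
      c.new_val = newVal;
      c.new_offset = pos;
      // The insert pushes later rows down by one, so the positions in
      // every change queued after this one must be bumped.
      for (var j = i + 1; j < queued.length; j++) {
        if (queued[j].old_offset !== null && queued[j].old_offset >= pos) {
          queued[j].old_offset += 1;
        }
        if (queued[j].new_offset !== null && queued[j].new_offset >= pos) {
          queued[j].new_offset += 1;
        }
      }
      return queued;
    }
  }
  // No deletion to squash onto: queue the insert as its own change.
  queued.push({old_val: null, new_val: newVal, old_offset: null, new_offset: pos});
  return queued;
}

// The scenario from the comment above: A at position 5 was deleted, then
// B at (the new) position 5 was updated to B'.
var queued = [
  {old_val: "A", new_val: null, old_offset: 5, new_offset: null},
  {old_val: "B", new_val: "B'", old_offset: 5, new_offset: 5}
];
// Inserting A' at position 5 squashes onto the first change and shifts
// the positions recorded in the second one from 5 to 6.
squashInsert(queued, "A'", 5);
```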

For some situations (e.g. r.table("foo").orderBy(...).changes()) it's impractical to compute the old and new positions. So I think it should be an optarg to .changes(), and it should error if it's impractical to compute positions. Also, #3719 might make this unnecessary.

For ordered feeds, could we maybe squash only the last change? I'm actually fine with not squashing things at all at first.

@timmaxw -- why would it be impractical to compute the old and new positions? Isn't that "just" a matter of keeping a count of children on each node of the tree?
While it would require more updates to the tree, it would also make things like orderBy(...).skip(x).limit(y) more efficient, no?

Since we don't currently support r.table("foo").orderBy(...).changes() without a limit(...), I don't think we have to worry too much about that now. In the future I think throwing an error when it's impractical to compute positions will work fine.

For small result sets #3719 might indeed be an alternative. One thing to note, though, is that it doesn't save any computation (not asymptotically, at least) compared to iterating through the current result set on the client and finding the positions of the old and new values that way.
@neumino what do you think about that?
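For reference, the client-side alternative mentioned above looks roughly like this. It's a sketch assuming (made-up names) results ordered by a `score` field descending with primary key `id`: the client scans its own copy of the result set to find both positions, which is O(n) per change.

```javascript
// Sketch of the client-side alternative: derive both positions by scanning
// the result set the client already holds.
function applyChange(docs, change) {
  if (change.old_val != null) {
    var oldPos = docs.findIndex(function (d) { return d.id === change.old_val.id; });
    if (oldPos !== -1) docs.splice(oldPos, 1); // linear scan for the old position
  }
  if (change.new_val != null) {
    var newPos = docs.findIndex(function (d) { return d.score < change.new_val.score; });
    if (newPos === -1) newPos = docs.length;
    docs.splice(newPos, 0, change.new_val);    // linear scan for the new position
  }
  return docs;
}

// A change like the leaderboard example at the top of the thread:
var docs = [{id: 7, score: 90}, {id: 3, score: 75}, {id: 23, score: 10}];
applyChange(docs, {new_val: {id: 14, score: 68}, old_val: {id: 23, score: 10}});
// docs now holds scores 90, 75, 68
```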

For making this work with squash, we could only keep the primary keys of entries around in the pre-batch ordered set to avoid too much memory overhead. Of course that wouldn't make it any easier to implement.

@neumino The idea of adding subtree sizes to the btree has been suggested a few times, mostly in the context of making count() on ranges efficient. The problem is that we would need to keep those counts up to date, which would mean that whenever we insert or delete a document in a leaf, we have to rewrite O(log n) blocks up to the top of the btree. That's impractical. We could keep the counts in the LBA instead, which is more efficient for this sort of thing, but that comes with some memory overhead.
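For illustration, here is the lookup side of the subtree-size idea, simplified to a binary tree rather than RethinkDB's btree (the comment above is about the update cost, which this sketch doesn't show): with sizes cached on each node, the position of a key is an O(log n) walk from the root.

```javascript
// Simplified sketch of the subtree-size idea on a binary tree: with sizes
// cached, computing the position of a key never has to scan the leaves.
function sizeOf(node) { return node === null ? 0 : node.size; }

// rank = number of keys in the tree that are smaller than `key`
function rank(node, key) {
  if (node === null) return 0;
  if (key <= node.key) return rank(node.left, key);
  return sizeOf(node.left) + 1 + rank(node.right, key);
}

// Helper to hand-build nodes with their cached subtree size.
function n(key, left, right) {
  return {key: key, left: left, right: right,
          size: 1 + sizeOf(left) + sizeOf(right)};
}

// A balanced tree over the keys 1..7.
var root = n(4, n(2, n(1, null, null), n(3, null, null)),
                n(6, n(5, null, null), n(7, null, null)));
// rank(root, 5) visits two nodes instead of scanning: keys 1..4 are smaller.
```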

@danielmewes -- I haven't really thought about this enough, and I would guess we probably won't find time to do this in 2.0; could we bump it to subsequent and settle it later?

@mlucy Yes definitely.

FYI, the lack of ordering information in feeds has come up literally every time I've explained .orderBy.limit.changes to a user. As far as prioritization of ReQL issues goes, I think this one is pretty high up there.

I think if we implement #3719 this one becomes less urgent. But we should make sure to do at least one of them for 2.1.

I'm not opposed to this, but there's a lot scheduled for 2.1, so unless it's absolutely vital I'd prefer to just do #3719.

@coffeemug, @danielmewes -- are we actually doing this for 2.1, or do you think we can get away with just doing #3719?

I think #3719 is good enough for 2.1.

We all seem to agree that this is useful, but we would have to discuss the names of the arguments and how to integrate the positions into the events.

Moving out of 2.1.

I agree with @danielmewes -- I think #3719 is sufficient for 2.1.

commenting to register my interest publicly, as well as to subscribe to further updates

I'm also going to comment here for a vote in support 👍

@dalanmiller also mentioned that it would be nice if we could support this for unlimited orderBy changefeeds (e.g. r.table(...).orderBy({index: ...}).changes()).
It sounds like it wouldn't easily work with our current implementation of those, and I can't see an obvious way of doing it without a lot of memory overhead. But we should think about it a bit.

I'd like to understand the use case for that better. I can't easily think of one, and I don't think we should worry about it unless a compelling use case actually exists.

Honestly, I don't think positions are important to support in changefeeds. RethinkDB already maintains the ordered set, just without positions. Positions can be recalculated efficiently in the client.

I think supporting it for non-limited orderBy.changes would mostly improve the consistency of the query language and the available features.
If it's non-trivial to do, I don't think we should do it. But if it's easy, we should.

with changefeeds for ordered queries and joins, I could make a really lean integration with Meteor. 👍

@ccorcos -- just FYI, feeds on ordered queries already work. You can already do things like this: r.table('foo').orderBy({index: 'blah'}).limit(10).changes().

Feeds on joins are coming soon (tm).

But those just give changes to the individual documents, right? Are there any notifications indicating that the order has changed?

@ccorcos -- no, they also include notifications that the order has changed. For example, if two documents flip their order, they'll show up in the feed.

Sorry, to clarify -- currently we don't give information about the actual order because the limited number of documents are easy to sort on the client; but we do give all the information you need to do it.
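A sketch of what "easy to sort on the client" can look like: keep the current window keyed by primary key and re-sort the handful of documents after each change. The ordering by a `score` field, highest first, is made up for the example.

```javascript
// Maintain the limited window client-side: apply a change to a map keyed
// by primary key, then re-sort the (small) set of documents.
function applyAndSort(byId, change) {
  if (change.old_val != null) delete byId[change.old_val.id];
  if (change.new_val != null) byId[change.new_val.id] = change.new_val;
  return Object.keys(byId)
    .map(function (k) { return byId[k]; })
    .sort(function (a, b) { return b.score - a.score; }); // highest score first
}

var byId = {7: {id: 7, score: 90}, 23: {id: 23, score: 10}};
var ordered = applyAndSort(byId, {new_val: {id: 14, score: 68},
                                  old_val: {id: 23, score: 10}});
// ordered is [{id: 7, score: 90}, {id: 14, score: 68}]
```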

This shipped in 2.3.0 (see #5334).

🌿

Is there a standard way, on the consumer side, to convert this feed of old_pos, new_pos into the final array a user would want to display?

I guess another way of asking this is the following question: For r.table().order_by().limit(10).changes(), I'd like to use a callback that takes the final updated array, not individual (each()) elements.

I'm just asking because I imagine lots of people are doing this, and maybe there's a function or a separate NPM module that makes it easy to coalesce into a result array.

@GeoffreyPlitt sounds a bit like this is what you're after? rethinkdb/docs#1137 (comment)

The pseudo-code in rethinkdb/docs#1137 (comment) works for set semantics, where the ordering doesn't matter. For an ordered result with old_offset and new_offset fields, the algorithm will look somewhat like this:

// A resizable array holding the current ordered result set
var documents = [];

// `conn` is an open connection to the server.
feed = r.table(...).orderBy(...).limit(...).changes({includeInitial: true, includeStates: true, includeOffsets: true}).run(conn);

feed.each(function (err, change) {
  if (err) throw err;
  if (change.state === "ready") {
    // We have reached a consistent state of the documents in the table.
    // This might be a good time for rendering the results etc.
    // We simply keep consuming the feed to stay updated.
  }
  if (change.old_offset != null) {
    // Remove the old value at index old_offset.
    // (splice "cuts out" the value, reducing the index of every
    //  subsequent element by 1.)
    documents.splice(change.old_offset, 1);
  }
  if (change.new_offset != null) {
    // Insert the new value into the array at position new_offset.
    documents.splice(change.new_offset, 0, change.new_val);
  }
});
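The remove-then-insert logic above can be exercised without a server by feeding it a hand-written sequence of change objects (the documents here are made up for the example):

```javascript
// Apply a hand-written sequence of offset changes to an initially empty
// array, the same way a real feed consumer would.
function coalesce(changes) {
  var documents = [];
  changes.forEach(function (change) {
    if (change.old_offset != null) documents.splice(change.old_offset, 1);
    if (change.new_offset != null) documents.splice(change.new_offset, 0, change.new_val);
  });
  return documents;
}

var result = coalesce([
  // initial results arrive as inserts carrying only a new_offset
  {new_val: {id: 1, score: 50}, old_offset: null, new_offset: 0},
  {new_val: {id: 2, score: 40}, old_offset: null, new_offset: 1},
  // id 3 overtakes id 2: the old row at offset 1 leaves, the new one
  // enters at offset 0
  {new_val: {id: 3, score: 60}, old_val: {id: 2, score: 40},
   old_offset: 1, new_offset: 0}
]);
// result is [{id: 3, score: 60}, {id: 1, score: 50}]
```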