rethinkdb / rethinkdb

The open-source database for the realtime web.

Home Page: https://rethinkdb.com

Support incremental map/reduce

coffeemug opened this issue

We had this on our radar for a while, but didn't have an issue to track it. Since some people have been asking for an official issue to track, I'm adding this to GitHub.

I'm going to write up a specific proposal a bit later. This is in backlog as it's obviously a medium-term priority feature.

How bad would it be if incremental map/reduce jobs could only be registered on a single table? If we limited ourselves to that this would actually become a much simpler problem to solve in the backend.

Hmm, I have to think about it. It might be sufficient for most real use cases, but at first glance this makes me feel really uneasy. One thing MongoDB does that people find extremely annoying is introduce features that don't work with other features of the database. For example, they have plenty of collection types that cannot be sharded, which makes the user experience really frustrating since it moves the burden from developers to users. People can't just use the features they want and have the confidence that they will work.

(I don't think it necessarily means we should restrict functionality, just that this tradeoff comes with connotations frustrating for users, so we should think carefully before we choose to do it this way)

Like, a single incremental job could only operate on data from a single table? Or that each database could only have one table on which incremental jobs could be registered?

For the use case I had in mind when I asked the HN question that I think prompted this ticket, the former would be acceptable, but the latter wouldn't. I have no idea what other uses people have in mind, though.

@apendleton the former was what I meant. To give people an idea of how much easier it is: I think I could probably do the one-table case in less than a month, while the general case would probably take as many as 4-5 months all told. I think it's a feature on about the same scale as secondary indexes, which took about that long.

I actually think we should ship the one-table case sometime semi-soon (I think post 2.0, probably), gauge people's response to it, and then expand from there. Also, if we had triggers, then the one-table limitation really wouldn't be that bad, because you could write triggers to push data from wherever you want into your single table, where it would get map/reduced. We'd add some sugar on top of that and it could actually be really nice. On top of that, a lot of the features for managing tables are things you actually want for this incremental map/reduce stuff as well. Redundancy will make the computed value more available. Sharding can help it scale better.

@jdoliner -- when you get the chance could you explain the design for each of the options? (i.e. single-table option and multi-table option). I'd like to understand how you envision each version would work and where the factor of four-five difference in complexity comes from. (Obviously not urgent since we aren't doing this now)

@jdoliner yeah, that all sounds awesome. We have a currently-Postgres database that I think I want to eventually replace with something-not-Postgres TBD, and we build aggregates on a whole bunch of tables that are very expensive to compute, and currently recompute everything from scratch on updates (additions, deletions, and changes of records). There's occasionally inter-table stuff, but 90% or more is probably single-table, and if we could change records and get new aggregates without recomputing everything from scratch, that would be a huge boon. I think you're absolutely right, too, that that use case is probably much more common than a complicated multi-table MR situation, and that in the interest of 80%-20% solutions, getting the single-table case out the door early would be totally worthwhile.

@coffeemug actually, having thought about this a bit more, I think the multi-table version of this is less a question of being complicated from an engineering perspective and more a question of being algorithmically untenable. You can imagine that even a fairly simple multi-table mapreduce such as table.eq_join("foo", table2).map(...).reduce(...) is very complicated to keep track of in an incremental way, and in a lot of cases downright impossible. Even a single row change in table2 can conceivably change the value of every single piece of data going into the map/reduce, so there's really just no efficient way to compute an incremental view without basically rerunning the map/reduce for every change to table2. We could maybe make some optimizations that would be more efficient if you had an approximately one-to-one join (which is probably the most common case), but that's going to be a big undertaking that only works in very specific cases, which will be hard to explain to people and will behave very badly when it's used outside of those cases. Furthermore, if people start using arbitrary subexpressions like table.map(lambda x: query_on_table2(x)).reduce(lambda x,y: query_on_table3(x,y)), then all bets are off.
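
As a made-up illustration of why the join case degenerates (table and field names invented), consider summing order amounts converted with a per-user rate pulled in through the join:

// Hypothetical: sum each order's amount multiplied by its user's exchange_rate.
r.table('orders')
  .eqJoin('user_id', r.table('users'))
  .map(function(row) {
    return row('left')('amount').mul(row('right')('exchange_rate'));
  })
  .reduce(function(a, b) { return a.add(b); });

// If one users row changes its exchange_rate, every order belonging to that
// user now contributes a different value to the reduction, so an incremental
// view rooted on `orders` would still have to re-map all of those rows.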

I definitely agree that it's annoying to have 2 features which aren't compatible, but I think the reality is this is a situation where you can't sugarcoat the algorithmic limitations. Doing so is just going to lead to people bumping into the limitations as exponential runtimes, which is clearly a lot worse.

My conclusion here is that the easier thing of having map/reduce jobs rooted on a single table is actually the right thing to do, because it's something I know we can make fast and make into a very useful feature. Also, it's really a very doable thing because almost all the annoying parts of it are already written and "working" for secondary indexes, and the system was designed to be easily extended to support incremental map/reduce. I'll write up a full proposal for this at some point in the near future.

@jdoliner -- this makes a lot of sense. I changed my mind -- I think it's ok to make this feature work on a single table and it's probably ok to never make it work on multiple tables. Actually, we already have precedent where we do best effort on non-deterministic queries, and generally handle them differently from deterministic ones. This would be no different.

Moving to 1.14. We should debate the ReQL aspects of it, in case we decide to do it.

Very roughly:

r.table('users').avg('age').changes()
r.table('users').group('city').avg('age').changes()
r.table('users').group('city').reduce(reduction_fn, antireduction_fn).changes()

#2542 has some discussion of what this should return. I think:

  • We shouldn't persist things on disk for v1. If the query dies, the user reruns it and we recompute the first value.
  • We should come prepackaged with the inverse functions for common aggregators (sketched after this list).
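
To make "inverse functions" concrete, here's a rough sketch (made-up numbers) of the bookkeeping the server could do for avg: keep a running count and sum, apply the aggregator for rows entering the stream and its inverse for rows leaving it.

// Hypothetical state for r.table('users').avg('age').changes():
//   state = {count: 3, sum: 90}                               -> avg = 30
//   insert {age: 50}:   apply   -> {count: 4, sum: 140}       -> avg = 35
//   delete {age: 40}:   inverse -> {count: 3, sum: 100}       -> avg ≈ 33.3
//   change age 20->30:  inverse then apply -> {count: 3, sum: 110} -> avg ≈ 36.7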

That doesn't seem like incremental map reduce to me. I would expect it to involve some kind of persisted thing that you can query on any connection, not something that requires a live changefeed to be open.

@srh yes, that was what I meant when I asked about it on HN last year; it's what Couch has and refers to by that name. You basically register a map/reduce job and its results are kept up to date automatically as the records it ran over are changed/deleted/added to.

For the moment, I'm shooting for something very different with this feature. The spec above would give people the ability to get instantaneous updates to values of many different types of queries. They wouldn't persist on restart (or even on disconnect), but for a variety of reasons, I think that's sufficient for v1. It would require a bunch of infrastructure work, and would leave the door open to later include persistent incremental map/reduce support (where the user would save the query), but I think we should do that separately in future releases. I've opened #2587 to track that.

It's worth noting that doing this without persistence will make it very hard to track changes on large tables unless you're 100% sure the client will never get disconnected.

I think that's ok. We wouldn't market this feature as incremental map/reduce -- we'd market it as instantaneous updates to the result of a query (well, not quite like this, we'd have to find better wording, but you get the idea). Essentially, you pay the price of running a query, and then get any updates in realtime. We'll phrase it in such a way as to not confuse people, and not have them expect things that aren't quite true yet.

We can then deal with large tables in #2587.

Related to #2542.

Talked to @mlucy in person:

  • He convinced me persistence is more important than I thought
  • He's going to write up a proposal

What I find interesting about CouchDB's implementation is that they don't require an inverse reduction function.

Instead they seem to store the intermediate reduction results. For example if your reduction function is (x, y) -> x+y and you have documents [1, 2, 3, 4], they would store the following results:

a: 1 + 2 -> 3
b: 3 + 4 -> 7
c: a + b -> 10

(i.e. build a binary reduction tree and store the intermediate results at each node)

Now if we, let's say, update the first value from 1 to 10, they only have to recompute log_2(n) results:

a': 10 + 2 -> 12
c': a' + b -> 19

This makes it more convenient for the user, since they don't have to come up with an inverse function (which might also be wrong, which we can't detect).

It's definitely more difficult to implement.
I believe right now our reduction tree is heavily unbalanced also? That doesn't matter so much right now (unless the reduction function is extremely expensive), but would have to be changed to work incrementally without an inverse reduction.
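
To make the tree idea concrete, here's a minimal JavaScript sketch (not RethinkDB code, just the data structure) that stores every intermediate result and recomputes only the path from a changed leaf to the root:

function buildTree(leaves, reduceFn) {
  var level = leaves.slice();
  var levels = [level];
  while (level.length > 1) {
    var next = [];
    for (var i = 0; i < level.length; i += 2) {
      next.push(i + 1 < level.length ? reduceFn(level[i], level[i + 1]) : level[i]);
    }
    levels.push(next);
    level = next;
  }
  return levels; // levels[0] = leaves, last level = [root]
}

function updateLeaf(levels, reduceFn, index, newValue) {
  levels[0][index] = newValue;
  for (var d = 1; d < levels.length; d++) {
    index = Math.floor(index / 2);
    var left = levels[d - 1][2 * index];
    var right = levels[d - 1][2 * index + 1];
    levels[d][index] = right === undefined ? left : reduceFn(left, right);
  }
  return levels[levels.length - 1][0]; // new root value
}

// The example above: leaves [1, 2, 3, 4], reduction (x, y) -> x + y.
var levels = buildTree([1, 2, 3, 4], function(x, y) { return x + y; }); // root = 10
updateLeaf(levels, function(x, y) { return x + y; }, 0, 10);            // root = 19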

So, there are advantages to both designs. Here are my thoughts on maintaining a tree:

Pros:

  • Harder for the user to mess up.
  • Works with a wider variety of functions (e.g. min and max).

Cons:

  • It takes up space. Like, O(n) in the size of the table. This can add up quickly.
  • It's more work to implement.
  • It's usually slower.

I would lean toward the inverse solution because it's easier, it scales better, and I would guess most people will be using our aggregators (sum, avg, etc.) rather than their own, and we can provide the inverse functions for them.

For min and max, we can say that we only offer live min and max on an index. (We first need to implement min and max on an index, but we should do that anyway.)

When we eventually make sample an aggregator, we can solve the inverse problem for sample by just making it fast in all cases (if we implement constant-time count by storing counts in our btree nodes, this won't be all that hard).

Alright, here's my proposal for how this should work. I think that long-term you should be able to call changes on any operation which is done entirely on the shards (to be more specific, anything that produces a lazy_datum_stream_t with a bunch of deferred computation). In particular, I think you should be able to call changes on any selection + transformations + optional terminal, including single-row selections (see #2542).

I think that we basically need three interfaces:

  • For a stream selection, you get {old_val: ..., new_val: ...} objects. If a document enters the selection, old_val is nil, and if a document leaves the selection, new_val is nil. (r.table('test').changes() already follows this interface, and can be considered a stream selection on the whole table.)

  • For a single selection, you just get the values of the object. There should be an optarg to indicate whether you want the current value of the object to be the first value in the feed (I think this is usually what you want). If you write r.table('test').get(1).changes(), you shouldn't get this:

    {old_val: {id: 1, a: 1}, new_val: {id: 1, a: 2}}
    {old_val: {id: 1, a: 2}, new_val: {id: 1, a: 3}}
    {old_val: {id: 1, a: 3}, new_val: {id: 1, a: 4}}
    

    but rather just:

    {id: 1, a: 1} # POSSIBLY included based on an optarg
    {id: 1, a: 2}
    {id: 1, a: 3}
    {id: 1, a: 4}
    
  • For a stream selection with a terminal, I think you should get a stream of plain objects representing the value of that aggregation after doing a live map-reduce thing (the same as you get on a single selection). Once again, I think there should be an optarg indicating whether you want the initial value of the aggregation, which you almost always do.


So the following would all be legal:

  • r.table('test').changes() -- supported now.
  • r.table('test').filter(...).changes() -- same format as above
  • r.table('test').get(1).changes(include_first: true) -- stream of plain objects, no new_val/old_val
  • r.table('test').filter(...)['a'].sum().changes(include_first: true) -- same format as above

That's basically three separate but interconnected features. I think that .get(...).changes() should be implemented first, because it's relatively independent of the other two and can get through CR on its own.


I'm not yet sure what to do about persistence. I think persistence is relatively independent of the change-streaming feature, so we should do the change-streaming stuff and then add persistence later.

If we want persistence, I think a good interface would be something like:

r.table('test').filter(...)['a'].sum().persist('sum_a')
r.table('test').persist_list('sum_a')
r.table('test').persist_status('sum_a')
r.table('test').persist_drop('sum_a')
r.table('test').persist_get('sum_a')

r.table('test').persist_get('sum_a').changes()
r.table('test').filter(...)['a'].sum().persist('sum_a').changes() # create and subscribe in one go

Basically persistent aggregations would be sort of like indexes on the table.

Also, I'm not sure yet how to let people specify the reverse of their aggregation. In particular, I'm not sure which of these is uglier:

r.table('test').filter(...).reduce{|a,b| ...}.changes(reverse: lambda {|acc, o| ...})
r.table('test').filter(...).reduce(reverse: lambda {|acc, o| ...}){|a,b| ...}

Our applications often use map-reduce to build complex summary structures with various kinds of aggregates about a particular kind of data, including things that aren't reversible (top-ten lists by multiple criteria, start and stop dates for various kinds of activity, etc.). Aside from being much easier for developers to reason about, not requiring a reverse function makes this feature much more useful, since I think lots of things people use map-reduce for aren't trivially reversible. It also seems like it's not necessarily O(n) storage; there are various decisions you could make about whether to store the results of all intermediate maps and reduces, or only some or all of the reduces, at the expense of having to do a bit more recomputation upon change (e.g., rerunning the map operation on the leaves of the subtree that changed). But I think you could end up with O(log n) storage depending on what you decided there.

You'd have to do O(n/log(n)) work to recompute the root node if you used O(log(n)) storage (with only O(log(n)) stored intermediate results, each one covers on the order of n/log(n) leaves), which would be too slow.

There are definitely advantages to not doing reversible map/reduce.

(Also, there's no reason we can't have both long-term -- we can do reversible map/reduce when possible, and fall back to tree-rebuilding if no reverse function is provided. That still leaves the question of which to implement first, though.)

I'm on board with the overall direction of the thing.

Thoughts/questions on the API:

  • For filters, when do documents enter the change feed? When only the new value passes the filter, only the old value, both, or either?
  • What's the API for saying stream.group(...).avg(...).changes() (EDIT: or rather, what does the user get in this case)?
  • I'm not sure we want to drop new_val/old_val for single document change feeds or aggregation changefeeds (though I see why that's cleaner). Here is an example -- suppose I want to show a ticker for average game scores of various groups of my users, and as it changes I want to show how much they rose or dropped (in absolute values and percentages). Sort of how stocks have red/green down/up arrows. If we don't give people old values, this might be relatively hard to do because when the new value is received, the server may no longer have the old value, and the user would have to care about keeping track of the old value manually. I think that might be quite annoying, and we might be better off just keeping new_val/old_val syntax everywhere.

Thoughts on the implementation:

  • I agree that both a reverse function implementation and storing intermediate values implementation are valuable. I think that we should start with the reverse function implementation because it's (a) efficient, and (b) gives us a chance in hell to ship this soon. We can add the intermediary values implementation later and let the user explicitly switch between the two.
  • wrt where to specify the reverse function, we should consider #1725. For example, if we specify it in changes, how would that work with multiple aggregations? Also, we'll not only need to specify it in the changes command, but also when we persist queries. It seems to me like it's way better to specify it in the reduce command for all these reasons.

Thoughts on query persistence:

  • The direction makes sense, but the overall API seems kind of messy to me. I'd change the name to "views" (which is generally accepted in the db world). I also think it's weird to attach them to tables like that (or may be not?). Could we open a separate issue to discuss this? I think we can hold off on finalizing that API for now, start working on the incremental changefeeds, and discuss the persistence API in parallel (since it doesn't really impact anything else).

For filters, when do documents enter the change feed? When only the new value passes the filter, only the old value, both, or either?

If the old value doesn't match the filter and the new value does, old_val is nil and new_val is the new value. The opposite is true if the old value matches the filter and the new value doesn't. If they both match the filter, you get both old_val and new_val. If neither matches the filter, you don't get a message.
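
As a concrete illustration (made-up documents), suppose the feed is r.table('users').filter(r.row('age').gt(30)).changes():

// age 25 -> 35: the document enters the selection
//   {old_val: null, new_val: {id: 1, age: 35}}
// age 40 -> 20: the document leaves the selection
//   {old_val: {id: 2, age: 40}, new_val: null}
// age 50 -> 55: the document stays in the selection
//   {old_val: {id: 3, age: 50}, new_val: {id: 3, age: 55}}
// age 10 -> 12: the document never matched, so no message is sent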

What's the API for saying stream.group(...).avg(...).changes()

Forgot to specify that. I think you should get a stream of objects such that shallow-merging the original result of the group call with those objects produces the updated group value. For example, if you have r.table('test').group('a').sum('b').changes(), and the initial value is {foo: 10, bar: 20}, and you update a document that contributed 1 to group foo so that it now contributes 12 to a new group baz, you get the document {foo: 9, baz: 12} (the two groups that changed, but not the group bar).

(When I say "object", I mean "whatever the driver turns the group pseudotype into".)
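
In other words, with the values above the client keeps the grouped result up to date by shallow-merging each change document into it:

// initial result of the group call:  {foo: 10, bar: 20}
// change document from the feed:     {foo: 9, baz: 12}
// shallow merge => current value:    {foo: 9, bar: 20, baz: 12}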

I'm not sure we want to drop new_val/old_val for single document change feeds or aggregation changefeeds (though I see why that's cleaner).

You could do this by replacing r.table('test').get(1).changes() with r.table('test').between(1, 1, right_bound: 'closed').changes().

We could also provide an optarg to change the format.

How important do you think this is? I feel like in most cases you don't want the old_val/new_val format for single-row selections, so maybe we should make it possible if you need it but not the default.

wrt where to specify the reverse function, we should consider #1725. For example, if we specify it in changes, how would that work with multiple aggregations? Also, we'll not only need to specify it in the changes command, but also when we persist queries. It seems to me like it's way better to specify it in the reduce command for all these reasons.

Alright, you've sold me. We'll make it an optarg on reduce.

I opened #2613 for discussing persistence.

After talking with Slava, I think we should try having single selections and aggregations return a stream of plain objects rather than old_val/new_val pairs. If it turns out to be confusing, we can switch.

I also think we should definitely have an optarg to return the initial value in those cases. I'd like to propose we call it return_first, return_initial, or return_current.

I'm really happy with the spec 👍

(Also, I agree we need the optarg, and I'd call it return_initial.)

It may sound stupid, but why don't we always return the initial value?

All the use case I can think of require the initial value (like building a dashboard, keeping a table of stats etc.). Or am I missing an important use case?

It may sound stupid, but why don't we always return the initial value?

I think return_initial should default to true. I think we should give people a way to turn it off because technically the initial value isn't a change, and they might only want to see changes.

After thinking about this, one major problem with returning a stream of plain objects is errors. (We currently represent errors with {error: ...} in changes.)

I can think of a few solutions:

  • We have a stream of objects or strings, and errors are just included in the stream as a string.
  • Errors in point streams throw an error in the client rather than being part of the stream. (This isn't so bad, because they should really never happen -- it's easy to fall behind if you call changes on a table, but if you fall 100,000 elements behind a call to changes on a single object then you're probably doing something wrong.)
  • Use the {old_val: ..., new_val: ...} syntax like normal. If we do this, there's the question of how to represent the initial value (which, like @neumino, is what I think most people want):
    • Option one: have the first object be a plain object.
    • Option two: have the first object be an object like {initial_val: ...}.
    • Option three: have the first object be an object like {new_val: ...} (i.e. old_val is missing rather than nil).
    • Option four: have the first object be an object like {old_val: ..., new_val: ...} where old_val and new_val are the same.

I prefer option three, or failing that, option four, because then you can write r.table('test').get(0).changes()['new_val'] to get a stream of plain objects (except for errors, which will produce a runtime error in the client because you try to access a field that doesn't exist).

I don't really like any of those options. @coffeemug, @neumino -- what do you think?

I think return_initial should default to true. I think we should give people a way to turn it off because technically the initial value isn't a change, and they might only want to see changes.

I'm in favor of dropping return_initial altogether. The user can always add .skip(1) and just drop the first value if they don't want to see it. I think that's much more elegant than adding a new optarg.

what do you think?

In single row cases I think we should not be reporting errors at all (wait, hear me out!) In case of a stream changefeed we had to report an error because going over 100k elements means you've missed changes to objects you might never see again. But in case of a datum changefeed, an overfill merely means you've missed some changes in time, but you haven't missed the actual object. I think that once the array gets to 100k elements, we should treat it as a queue and simply start dropping old changes from the array. This seems like perfectly reasonable and unsurprising behavior to me.

In case of .group(f).changes() we can still return an object {error: ...} instead of a group pseudotype, couldn't we? The driver could then determine what to do with that object (probably throw a client-side exception the user could recover from and continue).

Failing all that, I think {old_val: ..., new_val: ...} syntax isn't so bad at all. I understand your reservations about it, but I think it would be a minor issue that we could stomach. I'd return the first value via {old_val: x, new_val: x} -- that seems quite natural to me.

I'd suggest leaving this part of the proposal open until we implement it and play with the feature. I suspect the right path will be much more clearly illuminated then.

Marking as settled after talking to @mlucy. To clarify:

  • We'll drop return_initial and always provide the initial value
  • For t.get(x).changes() we'll discard intermediate state, which eliminates a ton of traffic and need for overflow error handling (lord, is that beautiful). If it turns out people don't want that, we can later add an optarg to turn this behavior off.
  • for .group(f).changes() we can omit intermediate values for the aggregation case, and use {old_val: ..., new_val: ...} for non-aggregate cases

Is this valid?

r.table("timeseries").between(r.now().sub(60), r.now(), {index: "date"}).avg("value").changes()

(I want the average of the point for the last minute)

With our current semantics, that would give you the average value for the minute before you ran the query. r.now means "when the query is received by the server", not "when this chunk of code runs".

Some feedback on this command from the Meteor team:

  • r.table('users').orderBy(r.desc('age')).limit(2).changes() is really important to support and comes up in almost every app, including hello world. @mlucy tells me this is easy to do as long as we require an index in orderBy. There is a question of how to return data here (i.e. how do we indicate that an item has moved in the list, for example).
  • Atomically getting the current resultset + changes is very important.
  • Documenting guarantees is important. For example, if the object's value changes and then changes back to the original value, is the user guaranteed to see the change? We should document guarantees clearly.
  • It's important to clearly specify the relationship between when acks to writes occur and when changes are pushed onto a feed. It's also important to see the ack before the change on a given connection (or to be able to correlate them between different connections).
  • How do I easily and efficiently subscribe to changes of multiple objects or multiple different changes? What happens when I call t.getAll(x, y, z).changes()?

We don't have to solve all these problems right away, but I wanted to document this so we can make incremental improvements over time.

@mlucy -- a few additional questions about the spec.

The point changefeeds will merge unread events (which is great), but can we extend this to range changefeeds too? For example, if I have a feed on a table, and there are multiple things happening to a particular document before I had the opportunity to read it, should/could we merge those events? (when a user reads, I think old_val should be set to the first old values, and new_val should be set to the last new value).

Also, how hard would it be to amend the spec with a merge optarg (defaulting to true). If merge is true, we merge events as specified. If it's false, we report each event. If it's set to an integer value (in milliseconds), we merge events, but only within that window. I think it would be a really useful feature and would make for a much stronger announcement, but I don't want to amend the spec last minute if this is hard to implement. (Also, merge may not be the best of names)

@coffeemug -- I opened #2726 and #2727 to track those. This issue is so big they're likely to be lost forever if they aren't moved into their own.

I'd like to not think about either of those until the changes we've already settled are done, except insofar as the meteor one includes things to keep in mind while implementing the current spec.

Adding a merge optarg isn't incredibly difficult, but it also isn't a trivial fix, it's a new feature that would take development time, so I think it should be its own ReQL proposal (which is where I put it).

Point changefeeds are in next (CR 1803). I'm trying to merge into next in pieces every time there's a completed bit of functionality to reduce the number of merge conflicts.

Any word on the progress of this?

@natew This is definitely coming, but we don't have a specific release for it yet.

I think we'll be able to get this into 2.3 (although as @danielmewes mentioned, there is no ETA yet). Tentatively, I'd expect this 2-4 months from now.

So the current solution for count is to keep track of counts in a separate field?
For example, when a new message arrives in a room, it needs to be added to the count field of the channel table, right? That's where changes would be listening.

@v3ss0n You mean as a work-around until we implement incremental map/reduce?
Yeah, that sounds about right. We're also still planning to implement constant-time count (#152), but that too is a few releases away.

Thanks, I am doing it that way, but that causes 2 writes. Is there any better way?

Interesting. I need to take a look at them.
On first sight, InfluxDB's continuous queries seem more specialized, but would be a very interesting use case.

Any movement on this?

@meenie Not yet. This is probably going to be the next thing after #3997 .

This has actually become really easy, due to the addition of the fold term. We can now formulate this entirely as a set of rewrites.

Given a changefeed of the form:

stream.reduce(f).changes({includeInitial: <II>, includeStates: <IS>})

Assume that for the given f, we know the following properties:

  • <f_BASE> the initial accumulator for f
  • <f_APPLY> a function from the accumulator and an element in the input table to a new accumulator
  • <f_UNAPPLY> the inverse of <f_APPLY> with respect to the accumulator
  • <f_EMIT> generates a result value of the reduction from the current accumulator

Now the query can be rewritten into:

stream.changes({includeInitial: true, includeStates: true}).fold(
  {f_acc: <f_BASE>, is_initialized: false},
  function(acc, el) {
    var f_acc = acc('f_acc');
    var new_f_acc = r.branch(el.hasFields("old_val"), <f_UNAPPLY>(f_acc, el('old_val')), f_acc).do(function(un_f_acc) {
        return r.branch(el.hasFields("new_val"), <f_APPLY>(un_f_acc, el('new_val')), un_f_acc);
      });
    var new_is_initialized = acc('is_initialized').or(el.hasFields('state').and(el('state').eq('ready')));
    return {f_acc: new_f_acc, is_initialized: new_is_initialized};
  },
  {emit: function(old_acc, el, new_acc) {
    var old_f_acc = old_acc('f_acc');
    var new_f_acc = new_acc('f_acc');
    var old_val = <f_EMIT>(old_f_acc);
    var new_val = <f_EMIT>(new_f_acc);
    // We handle the 'ready' state separately below
    var emit_state = r.expr(<IS>).and(el.hasFields('state')).and(r.expr(<II>).not().or(el('state').ne('ready')));
    var emit_update = old_acc('is_initialized').and(old_val.ne(new_val));
    var emit_initial = r.expr(<II>).and(old_acc('is_initialized').not().and(new_acc('is_initialized')));
    return r.branch(
      emit_state, [el],
      emit_update, [{'old_val': old_val, 'new_val': new_val}],
      emit_initial, r.branch(<IS>, [{'new_val': new_val}, {state: "ready"}], [{'new_val': new_val}]),
      []
    );
  }})

For example for count():

  • <f_BASE> = 0
  • <f_APPLY> = function(acc, el) { return acc.add(1); }
  • <f_UNAPPLY> = function(acc, el) { return acc.sub(1); }
  • <f_EMIT> = function(acc) { return acc; }

Or for avg():

  • <f_BASE> = {c: 0, sum: 0}
  • <f_APPLY> = function(acc, el) { return {c: acc('c').add(1), sum: acc('sum').add(el) }; }
  • <f_UNAPPLY> = function(acc, el) { return {c: acc('c').sub(1), sum: acc('sum').sub(el) }; }
  • <f_EMIT> = function(acc) { return acc('sum').div(acc('c')); } (plus some sort of handling for empty input sets that we need to come up with)
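
For illustration, here's roughly what you get by substituting the count() definitions into the rewrite above (a sketch, assuming <II> is true and <IS> is false, so state documents are swallowed):

stream.changes({includeInitial: true, includeStates: true}).fold(
  {f_acc: 0, is_initialized: false},
  function(acc, el) {
    var f_acc = acc('f_acc');
    // Subtract 1 for a removed/old row, add 1 for an added/new row.
    var new_f_acc = r.branch(el.hasFields('old_val'), f_acc.sub(1), f_acc).do(function(un_f_acc) {
        return r.branch(el.hasFields('new_val'), un_f_acc.add(1), un_f_acc);
      });
    var new_is_initialized = acc('is_initialized').or(el.hasFields('state').and(el('state').eq('ready')));
    return {f_acc: new_f_acc, is_initialized: new_is_initialized};
  },
  {emit: function(old_acc, el, new_acc) {
    var old_val = old_acc('f_acc');
    var new_val = new_acc('f_acc');
    var emit_update = old_acc('is_initialized').and(old_val.ne(new_val));
    var emit_initial = old_acc('is_initialized').not().and(new_acc('is_initialized'));
    return r.branch(
      emit_update, [{old_val: old_val, new_val: new_val}],
      emit_initial, [{new_val: new_val}],
      []
    );
  }})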

The main disadvantage of this implementation is that it doesn't distribute the reduction anymore and causes additional network traffic between the shards and the parsing node. This can be improved through a bit of special code.

It would be amazing I think if we could get the slow version into 2.4, at least for the built-in terms (such as count, sum and avg). We can mark it as a "preview" feature, since it doesn't have the full scalability yet that you'd expect from RethinkDB, and then ship the optimized version with 2.5.

@danielmewes The initial reduction would be distributed, so the speed is still there. Subsequent reductions would be very very cheap, so it wouldn't matter if they were distributed or not. This is pretty awesome :).

Right, that would be the optimized version with an initial reduction. The simple one that works entirely as a rewrite to changes.fold would not distribute the initial reduction.

Oh, sorry, I misunderstood. Awesome either way :).

I'm removing the API_settled tag because a bunch of things have changed about the changes() API.

I think for our built-in reductions (count, sum, avg) the API is straight-forward. We basically just allow calling changes on those terms.

The open question is whether we want to support this for arbitrary reductions in 2.4, and if so, how the API of reduce should be extended.

I'm in favor of supporting this for reduce in general, as it's not going to add much extra work as far as I can tell.

The following APIs have been suggested above:

r.table('test').filter(...).reduce{|a,b| ...}.changes(reverse: lambda {|acc, o| ...})
r.table('test').filter(...).reduce(reverse: lambda {|acc, o| ...}){|a,b| ...}

I really don't like the first one, but I like the second one.

I love the second one!
So in 2.4 we can do counts on a changefeed? That will be awesome!

So in 2.4 we can do counts on a changefeed? That will be awesome!

Yes, that's the idea :)
The initial count will run a bit slower in a changefeed for 2.4 compared to running just the count without changes. But we can optimize this in the future to get almost the same performance.

If I could do a changefeed on an aggregate that calculates NPS, that would be amazing. I have data that looks something like this:

const npsData = [
  {
    "component_id": 1,
    "number": 10
  },
  {
    "component_id": 1,
    "number": 10
  },
  {
    "component_id": 2,
    "number": 8
  },
  {
    "component_id": 1,
    "number": 9
  },
  {
    "component_id": 2,
    "number": 2
  },
  ...
];

And my query looks something like this:

r.expr(npsData)
  .group('component_id', 'number').count()
  .ungroup()
  .map((row) => {
    const number = row('group').nth(1);
    const ret = r.expr({
      component_id: row('group').nth(0),
      distribution: [{number: number, total: row('reduction')}],
      total_answers: row('reduction'),
      detractors: 0,
      passives: 0,
      promoters: 0
    });

    return r.branch(
      number.eq(9).or(number.eq(10)),
      ret.merge({promoters: ret('promoters').add(row('reduction'))}),
      number.eq(7).or(number.eq(8)),
      ret.merge({passives: ret('passives').add(row('reduction'))}),
      ret.merge({detractors: ret('detractors').add(row('reduction'))})
    );
  })
  .group('component_id')
  .reduce((left, right) => ({
    component_id: left('component_id'),
    total_answers: left('total_answers').add(right('total_answers')),
    detractors: left('detractors').add(right('detractors')),
    passives: left('passives').add(right('passives')),
    promoters: left('promoters').add(right('promoters')),
    distribution: left('distribution').add(right('distribution')),
  }))
  .do((datum) => {
    const passivesPercentage = datum('passives').div(datum('total_answers')).mul(100);
    const promotersPercentage = datum('promoters').div(datum('total_answers')).mul(100);
    const detractorsPercentage = datum('detractors').div(datum('total_answers')).mul(100);
    return {
      distribution: datum('distribution'),
      passives_percentage: passivesPercentage,
      promoters_percentage: promotersPercentage,
      detractors_percentage: detractorsPercentage,
      score: promotersPercentage.sub(detractorsPercentage)
    };
  })
  .ungroup()
  .map(row => ({
    component_id: row('group'),
    distribution: row('reduction')('distribution'),
    passives_percentage: row('reduction')('passives_percentage'),
    promoters_percentage: row('reduction')('promoters_percentage'),
    detractors_percentage: row('reduction')('detractors_percentage'),
    score: row('reduction')('score')
  }));

Could the above be achieved with this new api?

@meenie It depends on whether or not you can express it as a reduce operation and whether there's an efficient "reverse" function that updates the query result when a document gets removed from the input set.

@danielmewes So I wouldn't be able to use group()? And I'd have to do those counts manually using reduce? And ya, I believe you could reverse the above because you keep track of the distribution.

@meenie You might be able to rewrite the grouping into a reduction, in which case it would work. You would basically maintain an object {group1: group1Value, group2: group2Value, ...} in the reduction. This might become inefficient if there are a lot of groups, because a new object will be constructed every time the reduction function is called.
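
A rough, untested sketch of that rewrite (counting documents per component_id from the earlier example, using a hypothetical table 'nps' that holds those documents): map each document to a single-group object, then merge the partial objects in the reduction.

r.table('nps')
  .map(function(row) {
    // Object keys must be strings, so coerce the group value.
    return r.object(row('component_id').coerceTo('string'), 1);
  })
  .reduce(function(left, right) {
    // Merge two partial results, summing the counts of groups present in both.
    return left.keys().setUnion(right.keys()).fold({}, function(acc, key) {
      return acc.merge(r.object(key,
        left(key).default(0).add(right(key).default(0))));
    });
  });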

@danielmewes: Ya, that makes total sense. For now, we need every bit of efficiency we can get, so I'll be experimenting with rewriting our queries to use changefeeds, but won't utilise this in production until it's at parity speed-wise.

r.table('test').filter(...).reduce{|a,b| ...}.changes(reverse: lambda {|acc, o| ...})
r.table('test').filter(...).reduce(reverse: lambda {|acc, o| ...}){|a,b| ...}

Since the reverse function is only needed because it's a changefeed, it seems like the first one makes sense. But then again it's not clear which function it's reversing if they're separated. I guess it feels kind of wrong to me to require something extra when you're doing .changes vs. a normal query.

(to clarify, I know why we have to do it in this case, but it pulls me towards option 1 over option 2)

I guess it feels kind of wrong to me to require something extra when you're doing .changes vs. a normal query.

The way I think of it is that this is like needing to have the {index: ...} optarg for orderBy if you want to have a changefeed on it. I can see that this is slightly different because the reverse option will not have any effect unless you open a changefeed, but I don't feel like that's a big issue.

I don't like the first syntax because it seems limiting and different from what we do anywhere else.

What if in the future we allow changefeeds on queries that contain multiple reduce operations (for example within a subquery)? Specifying the function in changes would not work for that.

Or what if you have a query that looks like this: tbl.reduce(...).do(...).changes()? Ignoring the fact that we currently don't support do in changefeeds (which we totally should), it becomes much less obvious what the reverse argument to changes actually applies to and how it works. Does it get applied to the value after or before the do?

Yeah, it seems like the best way is to provide the reverse function to the reduce term. I'm assuming if you don't tack on .changes the reverse optarg will just be a no-op (vs. erroring)?

I'm assuming if you don't tack on .changes the reverse optarg will just be a no-op (vs. erroring)?

Yeah that's what I thought. That way you can run the same query with and without .changes.

There's a lot of discussion above. Here's my understanding of the
current proposal:

  • You can write any of these:
    • stream.avg(...).changes()
    • stream.sum(...).changes()
    • stream.count(...).changes()
    • stream.reduce(..., reverse: FUNC).changes()
  • In particular, it doesn't need to be on a selection;
    .map.reduce.changes etc. are legal. (We should probably support
    concat_map.reduce.changes even though we don't yet support
    .concat_map.changes, since it's easy.)

A few other things:

  • Should we support .coerce_to(...).changes()? The most common
    would probably be .coerce_to('array').changes(), where we'd
    re-send the whole array every time it changes. There's an argument
    that coercing to an array is a terminal, so it might be more
    consistent to support it.
  • Should we support .group.reduce.changes? There's no real
    technical limitation, it would be almost as easy as not supporting
    it. If so, should we also take this opportunity to support
    .group.changes?
  • How should we handle reductions over nothing?
    (E.g. r.table('test').avg('foo').changes() when test changes
    from empty to non-empty -- what's old_val?) Currently sum and
    count return 0 on empty streams, while avg and reduce produce
    an error.
    • We should probably use 0 as the "nothing" value for the
      terminals that return it on an empty stream.
    • One option would be to just use nil as the "nothing" value for
      all other terminals.
    • Another option would be to error by default, but to let people
      write e.g. .avg('foo').default(whatever).changes() to specify
      it explicitly.

Also, on the subject of implementation, it probably wouldn't actually be
that hard to do it the efficient way where we do chunks of reductions
on the shards and only ship the aggregates over. It would only speed
up the initial computation, but it would probably speed it up a lot.
(The reason I don't think it would be particularly hard is that we're
already only tracking timestamps on a per-read-transaction basis, so
we wouldn't lose any fidelity if we attached a terminal to the reads
we ship over and got back a pre-aggregated value alongside the
stamps.)

.coerce_to(...).changes() looks like a convenience function; it looks good but should be optional.
What I find exciting is group.changes and group.reduce.changes.

@mlucy Thanks for the summary of the current proposal. That matches what I had in mind for 2.4.

I'd like to add coerce_to(...).changes() from your suggestions to this as it seems trivial to do.

The three extensions that you're suggesting

  1. .coerce_to(...).changes()
  2. .group(...).reduce(...).changes()
  3. more efficient implementation for the initial result

all sound really cool to me.

As far as I can tell, 1 would be easy to implement even as a pure fold-based rewrite, at least for coerceTo('array'). We can just keep the current array in the accumulator. I'm not sure if there are any other types that we allow coercing to from a stream? 'string' maybe? Most likely those would also be easy to support, and we would still maintain the array but then just call a final .coerceTo(...) on the accumulator array before emitting a value. In any case, doing this will be O(n) in the number of results, but that's expected since the output per change is already of that size.
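
For example, a rough sketch of that rewrite for tbl.coerceTo('array').changes() could keep the current array in the fold accumulator, keyed on the primary key (assumed to be id here; ordering, states, and the initial-value bookkeeping from the full rewrite above are ignored):

tbl.changes({includeInitial: true}).fold(
  [],
  function(acc, el) {
    // Drop the old version of the row (if any), then append the new version (if any).
    return r.branch(el.hasFields('old_val'),
                    acc.filter(function(row) { return row('id').ne(el('old_val')('id')); }),
                    acc)
            .do(function(without_old) {
              return r.branch(el.hasFields('new_val'),
                              without_old.append(el('new_val')),
                              without_old);
            });
  },
  {emit: function(old_acc, el, new_acc) { return [new_acc]; }})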

My impression is that 2 (group.reduce.changes) is a bit more involved in terms of having to figure out how to represent added and removed groups in the output stream.

Since we have limited remaining development resources for 2.4 considering the other things we are working on, my suggestion would be that we agree on a minimal proposal, and keep extensions 2 and 3 out of the proposal for now. If we end up having extra time, we can still implement the more efficient algorithm (3) or discuss grouped changefeeds separately.

How should we handle reductions over nothing?

Great question.
My opinion is that we should emit them as the value null for avg and reduce (and 0 for sum and count). Our current changefeeds already use null to indicate the absence of a value. I think this would fit pretty nicely.
Reporting them as errors sounds nice on paper, but I think in practice it will be a much bigger pain for our users to handle.

Also I would like to add that I'm extremely excited about this feature! It's going to be so amazing :-)

Can we have 2 and 3 in 2.5? :D

@v3ss0n I think so :)

@danielmewes -- leaving 2 and 3 for later sounds good to me. I don't think 2's representation would be a particularly involved discussion, though -- I was imagining we'd just emit the entire grouped data every time it changed (so {old_val: {grp: red, grp2: red2}, new_val: {grp: red}}). If we wanted to support plain old .group.changes that would require thinking a little about the format, though.

On coerce_to, I think coerce_to('array') and coerce_to('object') are the only ones that can take a stream.

Marking settled as:

  • You can write any of these:
    • stream.avg(...).changes()
    • stream.sum(...).changes()
    • stream.count(...).changes()
    • stream.reduce(..., reverse: FUNC).changes()
    • stream.coerceTo('array').changes()
    • stream.coerceTo('object').changes()
  • In particular, it doesn't need to be on a selection;
    .map.reduce.changes etc. are legal. (We should probably support
    concat_map.reduce.changes even though we don't yet support
    .concat_map.changes, since it's easy.)

For 2.4 we will implement the slower variant that performs the initial reduction on the parsing node rather than distributing it.

This is gonna be great! Note that supporting .coerce_to('array').changes() solves the use-case I had for #3719, so I would say we probably don't need anything from that proposal anymore. I also prefer these semantics over the optarg from the other proposal.

This is in review 3714, except for coerce_to("array")

As this has been transferred to milestone 2.4-polish (for obvious reasons), I just wanted to emphasize what @danielmewes wrote in this comment as a work-around for the time being. I suggest the following to support the other common aggregation operations:

  • sum(): Very similar to count() but we increment by the value of el instead of 1.
    • <f_BASE> = 0
    • <f_APPLY> = function(acc, el) { return acc.add(el); }
    • <f_UNAPPLY> = function(acc, el) { return acc.sub(el); }
    • <f_EMIT> = function(acc) { return acc; }
  • min():
    • <f_BASE> = Number.POSITIVE_INFINITY
    • <f_APPLY> = function(acc, el) { return r.branch(el.lt(acc), el, acc); }
    • <f_UNAPPLY> = function(acc, el) { return acc; }
    • <f_EMIT> = function(acc) { return acc; }
  • max():
    • <f_BASE> = Number.NEGATIVE_INFINITY
    • <f_APPLY> = function(acc, el) { return r.branch(el.gt(acc), el, acc); }
    • <f_UNAPPLY> = function(acc, el) { return acc; }
    • <f_EMIT> = function(acc) { return acc; }
  • avg(): Similar to what Daniel did, but I would handle empty sets with a one-liner like this:
    • <f_EMIT> = function(acc) { return r.branch(acc('c').ne(0), acc('sum').div(acc('c')), 0); }

Any thoughts on this would be appreciated. 😃