rethinkdb / rethinkdb

The open-source database for the realtime web.

Home Page: https://rethinkdb.com

Support incremental map/reduce

coffeemug opened this issue

We had this on our radar for a while, but didn't have an issue to track it. Since some people have been asking for an official issue to track, I'm adding this to GitHub.

I'm going to write up a specific proposal a bit later. This is in backlog as it's obviously a medium-term priority feature.

How bad would it be if incremental map/reduce jobs could only be registered on a single table? If we limited ourselves to that this would actually become a much simpler problem to solve in the backend.

Hmm, I have to think about it. It might be sufficient for most real use cases, but at first glance this makes me feel really uneasy. One thing MongoDB does that people find extremely annoying is introduce features that don't work with other features of the database. For example, they have plenty of collection types that cannot be sharded, which makes the user experience really frustrating since it moves the burden from developers to users. People can't just use the features they want and have the confidence that they will work.

(I don't think it necessarily means we should restrict functionality, just that this tradeoff comes with connotations frustrating for users, so we should think carefully before we choose to do it this way)

Like, a single incremental job could only operate on data from a single table? Or that each database could only have one table on which incremental jobs could be registered?

For the use case I had in mind when I asked the HN question that I think prompted this ticket, the former would be acceptable, but the latter wouldn't. I have no idea what other uses people have in mind, though.

@apendleton the former was what I meant. To give people an idea of how much easier it is: I think I could probably do the one-table case in less than a month, while the general case would probably take as many as 4-5 months all told. I think it's a feature on about the same scale as secondary indexes, which took about that long.

I actually think we should ship the one-table case sometime semi-soon (I think post 2.0, probably), gauge people's response to it, and then expand from there. Also, if we had triggers, then the one-table limitation really wouldn't be that bad, because you could write triggers to push data from wherever you want into your single table, where it would get map/reduced. We'd add some sugar on top of that and it could actually be really nice. On top of that, a lot of the features for managing tables are things you actually want for this incremental map/reduce stuff as well. Redundancy will make the computed value more available. Sharding can help it scale better.

@jdoliner -- when you get the chance could you explain the design for each of the options? (i.e. single-table option and multi-table option). I'd like to understand how you envision each version would work and where the factor of four-five difference in complexity comes from. (Obviously not urgent since we aren't doing this now)

@jdoliner yeah, that all sounds awesome. We have a currently-Postgres database that I think I want to eventually replace with something-not-Postgres TBD, and we build aggregates on a whole bunch of tables that are very expensive to compute, and currently recompute everything from scratch on updates (additions, deletions, and changes of records). There's occasionally inter-table stuff, but 90% or more is probably single-table, and if we could change records and get new aggregates without recomputing everything from scratch, that would be a huge boon. I think you're absolutely right, too, that that use case is probably much more common than a complicated multi-table MR situation, and that in the interest of 80%-20% solutions, getting the single-table case out the door early would be totally worthwhile.

@coffeemug actually, having thought about this a bit more, I think the multi-table version of this is less a question of being complicated from an engineering perspective and more a question of being algorithmically untenable. You can imagine that even a fairly simple multi-table mapreduce such as table.eq_join("foo", table2).map(...).reduce(...) is very complicated to keep track of in an incremental way, and in a lot of cases downright impossible. Even a single row change in table2 can conceivably change the value of every single piece of data going into the map/reduce, so there's really just no efficient way to compute an incremental view without basically rerunning the map/reduce for every change to table2. We could maybe make some optimizations that would be more efficient if you had an approximately one-to-one join (which is probably the most common case), but that's going to be a big undertaking that only works in very specific cases, which will be hard to explain to people and will behave very badly when it's used outside of those cases. Furthermore, if people start using arbitrary subexpressions like table.map(lambda x: query_on_table2(x)).reduce(lambda x,y: query_on_table3(x,y)), then all bets are off.
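
As a made-up illustration of why the join case degenerates (table and field names invented), consider summing order amounts converted with a per-user rate pulled in through the join:

// Hypothetical: sum each order's amount multiplied by its user's exchange_rate.
r.table('orders')
  .eqJoin('user_id', r.table('users'))
  .map(function(row) {
    return row('left')('amount').mul(row('right')('exchange_rate'));
  })
  .reduce(function(a, b) { return a.add(b); });

// If one users row changes its exchange_rate, every order belonging to that
// user now contributes a different value to the reduction, so an incremental
// view rooted on `orders` would still have to re-map all of those rows.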

I definitely agree that it's annoying to have 2 features which aren't compatible, but I think the reality is this is a situation where you can't sugarcoat the algorithmic limitations. Doing so is just going to lead to people bumping into the limitations as exponential runtimes, which is clearly a lot worse.

My conclusion here is that the easier thing of having map/reduce jobs rooted on a single table is actually the right thing to do, because it's something I know we can make fast and make into a very useful feature. Also, it's really a very doable thing because almost all the annoying parts of it are already written and "working" for secondary indexes, and the system was designed to be easily extended to support incremental map/reduce. I'll write up a full proposal for this at some point in the near future.

@jdoliner -- this makes a lot of sense. I changed my mind -- I think it's ok to make this feature work on a single table and it's probably ok to never make it work on multiple tables. Actually, we already have precedent where we do best effort on non-deterministic queries, and generally handle them differently from deterministic ones. This would be no different.

Moving to 1.14. We should debate the ReQL aspects of it, in case we decide to do it.

Very roughly:

r.table('users').avg('age').changes()
r.table('users').group('city').avg('age').changes()
r.table('users').group('city').reduce(reduction_fn, antireduction_fn).changes()

#2542 has some discussion of what this should return. I think:

  • We shouldn't persist things on disk for v1. If the query dies, the user reruns it and we recompute the first value.
  • We should come prepackaged with the inverse functions for common aggregators (sketched after this list).
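
To make "inverse functions" concrete, here's a rough sketch (made-up numbers) of the bookkeeping the server could do for avg: keep a running count and sum, apply the aggregator for rows entering the stream and its inverse for rows leaving it.

// Hypothetical state for r.table('users').avg('age').changes():
//   state = {count: 3, sum: 90}                               -> avg = 30
//   insert {age: 50}:   apply   -> {count: 4, sum: 140}       -> avg = 35
//   delete {age: 40}:   inverse -> {count: 3, sum: 100}       -> avg ≈ 33.3
//   change age 20->30:  inverse then apply -> {count: 3, sum: 110} -> avg ≈ 36.7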

That doesn't seem like incremental map reduce to me. I would expect it to involve some kind of persisted thing that you can query on any connection, not something that requires a live changefeed to be open.

@srh yes, that was what I meant when I asked about it on HN last year; it's what Couch has and refers to by that name. You basically register a map/reduce job and its results are kept up to date automatically as the records it ran over are changed/deleted/added to.

For the moment, I'm shooting for something very different with this feature. The spec above would give people the ability to get instantaneous updates to values of many different types of queries. They wouldn't persist on restart (or even on disconnect), but for a variety of reasons, I think that's sufficient for v1. It would require a bunch of infrastructure work, and would leave the door open to later include persistent incremental map/reduce support (where the user would save the query), but I think we should do that separately in future releases. I've opened #2587 to track that.

It's worth noting that doing this without persistence will make it very hard to track changes on large tables unless you're 100% sure the client will never get disconnected.

I think that's ok. We wouldn't market this feature as incremental map/reduce -- we'd market it as instantaneous updates to the result of a query (well, not quite like this, we'd have to find better wording, but you get the idea). Essentially, you pay the price of running a query, and then get any updates in realtime. We'll phrase it in such a way as to not confuse people, and not have them expect things that aren't quite true yet.

We can then deal with large tables in #2587.

Related to #2542.

Talked to @mlucy in person:

  • He convinced me persistence is more important than I thought
  • He's going to write up a proposal

What I find interesting about CouchDB's implementation is that they don't require an inverse reduction function.

Instead they seem to store the intermediate reduction results. For example if your reduction function is (x, y) -> x+y and you have documents [1, 2, 3, 4], they would store the following results:

a: 1 + 2 -> 3
b: 3 + 4 -> 7
c: a + b -> 10

(i.e. build a binary reduction tree and store the intermediate results at each node)

Now if we, let's say, update the first value from 1 to 10, they only have to recompute log_2(n) results:

a': 10 + 2 -> 12
c': a' + b -> 19

This makes it more convenient for the user, since they don't have to come up with an inverse function (which might also be wrong, which we can't detect).

It's definitely more difficult to implement.
I believe right now our reduction tree is heavily unbalanced also? That doesn't matter so much right now (unless the reduction function is extremely expensive), but would have to be changed to work incrementally without an inverse reduction.
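
To make the tree idea concrete, here's a minimal JavaScript sketch (not RethinkDB code, just the data structure) that stores every intermediate result and recomputes only the path from a changed leaf to the root:

function buildTree(leaves, reduceFn) {
  var level = leaves.slice();
  var levels = [level];
  while (level.length > 1) {
    var next = [];
    for (var i = 0; i < level.length; i += 2) {
      next.push(i + 1 < level.length ? reduceFn(level[i], level[i + 1]) : level[i]);
    }
    levels.push(next);
    level = next;
  }
  return levels; // levels[0] = leaves, last level = [root]
}

function updateLeaf(levels, reduceFn, index, newValue) {
  levels[0][index] = newValue;
  for (var d = 1; d < levels.length; d++) {
    index = Math.floor(index / 2);
    var left = levels[d - 1][2 * index];
    var right = levels[d - 1][2 * index + 1];
    levels[d][index] = right === undefined ? left : reduceFn(left, right);
  }
  return levels[levels.length - 1][0]; // new root value
}

// The example above: leaves [1, 2, 3, 4], reduction (x, y) -> x + y.
var levels = buildTree([1, 2, 3, 4], function(x, y) { return x + y; }); // root = 10
updateLeaf(levels, function(x, y) { return x + y; }, 0, 10);            // root = 19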

So, there are advantages to both designs. Here are my thoughts on maintaining a tree:

Pros:

  • Harder for the user to mess up.
  • Works with a wider variety of functions (e.g. min and max).

Cons:

  • It takes up space. Like, O(n) in the size of the table. This can add up quickly.
  • It's more work to implement.
  • It's usually slower.

I would lean toward the inverse solution because it's easier, it scales better, and I would guess most people will be using our aggregators (sum, avg, etc.) rather than their own, and we can provide the inverse functions for them.

For min and max, we can say that we only offer live min and max on an index. (We first need to implement min and max on an index, but we should do that anyway.)

When we eventually make sample an aggregator, we can solve the inverse problem for sample by just making it fast in all cases (if we implement constant-time count by storing counts in our btree nodes, this won't be all that hard).

Alright, here's my proposal for how this should work. I think that long-term you should be able to call changes on any operation which is done entirely on the shards (to be more specific, anything that produces a lazy_datum_stream_t with a bunch of deferred computation). In particular, I think you should be able to call changes on any selection + transformations + optional terminal, including single-row selections (see #2542).

I think that we basically need three interfaces:

  • For a stream selection, you get {old_val: ..., new_val: ...} objects. If a document enters the selection, old_val is nil, and if a document leaves the selection, new_val is nil. (r.table('test').changes() already follows this interface, and can be considered a stream selection on the whole table.)

  • For a single selection, you just get the values of the object. There should be an optarg to indicate whether you want the current value of the object to be the first value in the feed (I think this is usually what you want). If you write r.table('test').get(1).changes(), you shouldn't get this:

    {old_val: {id: 1, a: 1}, new_val: {id: 1, a: 2}}
    {old_val: {id: 1, a: 2}, new_val: {id: 1, a: 3}}
    {old_val: {id: 1, a: 3}, new_val: {id: 1, a: 4}}
    

    but rather just:

    {id: 1, a: 1} # POSSIBLY included based on an optarg
    {id: 1, a: 2}
    {id: 1, a: 3}
    {id: 1, a: 4}
    
  • For a stream selection with a terminal, I think you should get a stream of plain objects representing the value of that aggregation after doing a live map-reduce thing (the same as you get on a single selection). Once again, I think there should be an optarg indicating whether you want the initial value of the aggregation, which you almost always do.


So the following would all be legal:

  • r.table('test').changes() -- supported now.
  • r.table('test').filter(...).changes() -- same format as above
  • r.table('test').get(1).changes(include_first: true) -- stream of plain objects, no new_val/old_val
  • r.table('test').filter(...)['a'].sum().changes(include_first: true) -- same format as above

That's basically three separate but interconnected features. I think that .get(...).changes() should be implemented first, because it's relatively independent of the other two and can get through CR on its own.


I'm not yet sure what to do about persistence. I think persistence is relatively independent of the change-streaming feature, so we should do the change-streaming stuff and then add persistence later.

If we want persistence, I think a good interface would be something like:

r.table('test').filter(...)['a'].sum().persist('sum_a')
r.table('test').persist_list('sum_a')
r.table('test').persist_status('sum_a')
r.table('test').persist_drop('sum_a')
r.table('test').persist_get('sum_a')

r.table('test').persist_get('sum_a').changes()
r.table('test').filter(...)['a'].sum().persist('sum_a').changes() # create and subscribe in one go

Basically persistent aggregations would be sort of like indexes on the table.

Also, I'm not sure yet how to let people specify the reverse of their aggregation. In particular, I'm not sure which of these is uglier:

r.table('test').filter(...).reduce{|a,b| ...}.changes(reverse: lambda {|acc, o| ...})
r.table('test').filter(...).reduce(reverse: lambda {|acc, o| ...}){|a,b| ...}

Our applications often use map-reduce to build complex summary structures with various kinds of aggregates about a particular kind of data, including things that aren't reversible (top-ten lists by multiple criteria, start and stop dates for various kinds of activity, etc.). Aside from being much easier for developers to reason about, not requiring a reverse function makes this feature much more useful, since I think lots of things people use map-reduce for aren't trivially reversible. It also seems like it's not necessarily O(n) storage; there are various decisions you could make about whether to store the results of all intermediate maps and reduces, or only some or all of the reduces, at the expense of having to do a bit more recomputation upon change (e.g., rerunning the map operation on the leaves of the subtree that changed). But I think you could end up with O(log n) storage depending on what you decided there.

You'd have to do O(n/log(n)) work to recompute the root node if you used O(log(n)) storage (with only O(log(n)) stored intermediate results, each one covers on the order of n/log(n) leaves), which would be too slow.

There are definitely advantages to not doing reversible map/reduce.

(Also, there's no reason we can't have both long-term -- we can do reversible map/reduce when possible, and fall back to tree-rebuilding if no reverse function is provided. That still leaves the question of which to implement first, though.)

I'm on board with the overall direction of the thing.

Thoughts/questions on the API:

  • For filters, when do documents enter the change feed? When only the new value passes the filter, only the old value, both, or either?
  • What's the API for saying stream.group(...).avg(...).changes() (EDIT: or rather, what does the user get in this case)?
  • I'm not sure we want to drop new_val/old_val for single document change feeds or aggregation changefeeds (though I see why that's cleaner). Here is an example -- suppose I want to show a ticker for average game scores of various groups of my users, and as it changes I want to show how much they rose or dropped (in absolute values and percentages). Sort of how stocks have red/green down/up arrows. If we don't give people old values, this might be relatively hard to do because when the new value is received, the server may no longer have the old value, and the user would have to care about keeping track of the old value manually. I think that might be quite annoying, and we might be better off just keeping new_val/old_val syntax everywhere.

Thoughts on the implementation:

  • I agree that both a reverse function implementation and storing intermediate values implementation are valuable. I think that we should start with the reverse function implementation because it's (a) efficient, and (b) gives us a chance in hell to ship this soon. We can add the intermediary values implementation later and let the user explicitly switch between the two.
  • wrt where to specify the reverse function, we should consider #1725. For example, if we specify it in changes, how would that work with multiple aggregations? Also, we'll not only need to specify it in the changes command, but also when we persist queries. It seems to me like it's way better to specify it in the reduce command for all these reasons.

Thoughts on query persistence:

  • The direction makes sense, but the overall API seems kind of messy to me. I'd change the name to "views" (which is generally accepted in the db world). I also think it's weird to attach them to tables like that (or may be not?). Could we open a separate issue to discuss this? I think we can hold off on finalizing that API for now, start working on the incremental changefeeds, and discuss the persistence API in parallel (since it doesn't really impact anything else).

For filters, when do documents enter the change feed? When only the new value passes the filter, only the old value, both, or either?

If the old value doesn't match the filter and the new value does, old_val is nil and new_val is the new value. The opposite is true if the old value matches the filter and the new value doesn't. If they both match the filter, you get both old_val and new_val. If neither matches the filter, you don't get a message.
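
As a concrete illustration (made-up documents), suppose the feed is r.table('users').filter(r.row('age').gt(30)).changes():

// age 25 -> 35: the document enters the selection
//   {old_val: null, new_val: {id: 1, age: 35}}
// age 40 -> 20: the document leaves the selection
//   {old_val: {id: 2, age: 40}, new_val: null}
// age 50 -> 55: the document stays in the selection
//   {old_val: {id: 3, age: 50}, new_val: {id: 3, age: 55}}
// age 10 -> 12: the document never matched, so no message is sent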

What's the API for saying stream.group(...).avg(...).changes()

Forgot to specify that. I think you should get a stream of objects such that shallow-merging the original result of the group call with those objects produces the updated group value. For example, if you have r.table('test').group('a').sum('b').changes(), and the initial value is {foo: 10, bar: 20}, and you update a document that contributed 1 to group foo so that it now contributes 12 to a new group baz, you get the document {foo: 9, baz: 12} (the two groups that changed, but not the group bar).

(When I say "object", I mean "whatever the driver turns the group pseudotype into".)
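
In other words, with the values above the client keeps the grouped result up to date by shallow-merging each change document into it:

// initial result of the group call:  {foo: 10, bar: 20}
// change document from the feed:     {foo: 9, baz: 12}
// shallow merge => current value:    {foo: 9, bar: 20, baz: 12}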

I'm not sure we want to drop new_val/old_val for single document change feeds or aggregation changefeeds (though I see why that's cleaner).

You could do this by replacing r.table('test').get(1).changes() with r.table('test').between(1, 1, right_bound: 'closed').changes().

We could also provide an optarg to change the format.

How important do you think this is? I feel like in most cases you don't want the old_val/new_val format for single-row selections, so maybe we should make it possible if you need it but not the default.

wrt where to specify the reverse function, we should consider #1725. For example, if we specify it in changes, how would that work with multiple aggregations? Also, we'll not only need to specify it in the changes command, but also when we persist queries. It seems to me like it's way better to specify it in the reduce command for all these reasons.

Alright, you've sold me. We'll make it an optarg on reduce.

I opened #2613 for discussing persistence.

After talking with Slava, I think we should try having single selections and aggregations return a stream of plain objects rather than old_val/new_val pairs. If it turns out to be confusing, we can switch.

I also think we should definitely have an optarg to return the initial value in those cases. I'd like to propose we call it return_first, return_initial, or return_current.

I'm really happy with the spec 👍

(Also, I agree we need the optarg, and I'd call it return_initial.)

It may sound stupid, but why don't we always return the initial value?

All the use case I can think of require the initial value (like building a dashboard, keeping a table of stats etc.). Or am I missing an important use case?

It may sound stupid, but why don't we always return the initial value?

I think return_initial should default to true. I think we should give people a way to turn it off because technically the initial value isn't a change, and they might only want to see changes.

After thinking about this, one major problem with returning a stream of plain objects is errors. (We currently represent errors with {error: ...} in changes.)

I can think of a few solutions:

  • We have a stream of objects or strings, and errors are just included in the stream as a string.
  • Errors in point streams throw an error in the client rather than being part of the stream. (This isn't so bad, because they should really never happen -- it's easy to fall behind if you call changes on a table, but if you fall 100,000 elements behind a call to changes on a single object then you're probably doing something wrong.)
  • Use the {old_val: ..., new_val: ...} syntax like normal. If we do this, there's the question of how to represent the initial value (which, like @neumino, is what I think most people want):
    • Option one: have the first object be a plain object.
    • Option two: have the first object be an object like {initial_val: ...}.
    • Option three: have the first object be an object like {new_val: ...} (i.e. old_val is missing rather than nil).
    • Option four: have the first object be an object like {old_val: ..., new_val: ...} where old_val and new_val are the same.

I prefer option three, or failing that, option four, because then you can write r.table('test').get(0).changes()['new_val'] to get a stream of plain objects (except for errors, which will produce a runtime error in the client because you try to access a field that doesn't exist).

I don't really like any of those options. @coffeemug, @neumino -- what do you think?

I think return_initial should default to true. I think we should give people a way to turn it off because technically the initial value isn't a change, and they might only want to see changes.

I'm in favor of dropping return_initial altogether. The user can always add .skip(1) and just drop the first value if they don't want to see it. I think that's much more elegant than adding a new optarg.

what do you think?

In single row cases I think we should not be reporting errors at all (wait, hear me out!) In case of a stream changefeed we had to report an error because going over 100k elements means you've missed changes to objects you might never see again. But in case of a datum changefeed, an overfill merely means you've missed some changes in time, but you haven't missed the actual object. I think that once the array gets to 100k elements, we should treat it as a queue and simply start dropping old changes from the array. This seems like perfectly reasonable and unsurprising behavior to me.

In case of .group(f).changes() we can still return an object {error: ...} instead of a group pseudotype, couldn't we? The driver could then determine what to do with that object (probably throw a client-side exception the user could recover from and continue).

Failing all that, I think {old_val: ..., new_val: ...} syntax isn't so bad at all. I understand your reservations about it, but I think it would be a minor issue that we could stomach. I'd return the first value via {old_val: x, new_val: x} -- that seems quite natural to me.

I'd suggest leaving this part of the proposal open until we implement it and play with the feature. I suspect the right path will be much more clearly illuminated then.

Marking as settled after talking to @mlucy. To clarify:

  • We'll drop return_initial and always provide the initial value
  • For t.get(x).changes() we'll discard intermediate state, which eliminates a ton of traffic and need for overflow error handling (lord, is that beautiful). If it turns out people don't want that, we can later add an optarg to turn this behavior off.
  • for .group(f).changes() we can omit intermediate values for the aggregation case, and use {old_val: ..., new_val: ...} for non-aggregate cases

Is this valid?

r.table("timeseries").between(r.now().sub(60), r.now(), {index: "date"}).avg("value").changes()

(I want the average of the point for the last minute)

With our current semantics, that would give you the average value for the minute before you ran the query. r.now means "when the query is received by the server", not "when this chunk of code runs".

Some feedback on this command from the Meteor team:

  • r.table('users').orderBy(r.desc('age')).limit(2).changes() is really important to support and comes up in almost every app, including hello world. @mlucy tells me this is easy to do as long as we require an index in orderBy. There is a question of how to return data here (i.e. how do we indicate that an item has moved in the list, for example).
  • Atomically getting the current resultset + changes is very important.
  • Documenting guarantees is important. For example, if the object's value changes and then changes back to the original value, is the user guaranteed to see the change? We should document guarantees clearly.
  • It's important to clearly specify the relationship between when acks to writes occur and when changes are pushed onto a feed. It's also important to see the ack before the change on a given connection (or to be able to correlate them between different connections).
  • How do I easily and efficiently subscribe to changes of multiple objects or multiple different changes? What happens when I call t.getAll(x, y, z).changes()?

We don't have to solve all these problems right away, but I wanted to document this so we can make incremental improvements over time.

@mlucy -- a few additional questions about the spec.

The point changefeeds will merge unread events (which is great), but can we extend this to range changefeeds too? For example, if I have a feed on a table, and there are multiple things happening to a particular document before I had the opportunity to read it, should/could we merge those events? (when a user reads, I think old_val should be set to the first old values, and new_val should be set to the last new value).

Also, how hard would it be to amend the spec with a merge optarg (defaulting to true). If merge is true, we merge events as specified. If it's false, we report each event. If it's set to an integer value (in milliseconds), we merge events, but only within that window. I think it would be a really useful feature and would make for a much stronger announcement, but I don't want to amend the spec last minute if this is hard to implement. (Also, merge may not be the best of names)

@coffeemug -- I opened #2726 and #2727 to track those. This issue is so big they're likely to be lost forever if they aren't moved into their own.

I'd like to not think about either of those until the changes we've already settled are done, except insofar as the meteor one includes things to keep in mind while implementing the current spec.

Adding a merge optarg isn't incredibly difficult, but it also isn't a trivial fix, it's a new feature that would take development time, so I think it should be its own ReQL proposal (which is where I put it).

Point changefeeds are in next (CR 1803). I'm trying to merge into next in pieces every time there's a completed bit of functionality to reduce the number of merge conflicts.

Any word on the progress of this?

@natew This is definitely coming, but we don't have a specific release for it yet.

I think we'll be able to get this into 2.3 (although as @danielmewes mentioned, there is no ETA yet). Tentatively, I'd expect this 2-4 months from now.

So the current solution for count is to keep track of counts in a separate field?
For example, when a new message arrives in a room, it needs to be added to the count field of the channel table, right? That's where changes would be listening.

@v3ss0n You mean as a work-around until we implement incremental map/reduce?
Yeah, that sounds about right. We're also still planning to implement constant-time count (#152), but that too is a few releases away.

Thanks, I am doing it that way, but that causes 2 writes. Is there any better way?

Interesting. I need to take a look at them.
On first sight, InfluxDB's continuous queries seem more specialized, but would be a very interesting use case.

Any movement on this?

@meenie Not yet. This is probably going to be the next thing after #3997 .

This has actually become really easy, due to the addition of the fold term. We can now formulate this entirely as a set of rewrites.

Given a changefeed of the form:

stream.reduce(f).changes({includeInitial: <II>, includeStates: <IS>})

Assume that for the given f, we know the following properties:

  • <f_BASE> the initial accumulator for f
  • <f_APPLY> a function from the accumulator and an element in the input table to a new accumulator
  • <f_UNAPPLY> the inverse of <f_APPLY> with respect to the accumulator
  • <f_EMIT> generates a result value of the reduction from the current accumulator

Now the query can be rewritten into:

stream.changes({includeInitial: true, includeStates: true}).fold(
  {f_acc: <f_BASE>, is_initialized: false},
  function(acc, el) {
    var f_acc = acc('f_acc');
    var new_f_acc = r.branch(el.hasFields("old_val"), <f_UNAPPLY>(f_acc, el('old_val')), f_acc).do(function(un_f_acc) {
        return r.branch(el.hasFields("new_val"), <f_APPLY>(un_f_acc, el('new_val')), un_f_acc);
      });
    var new_is_initialized = acc('is_initialized').or(el.hasFields('state').and(el('state').eq('ready')));
    return {f_acc: new_f_acc, is_initialized: new_is_initialized};
  },
  {emit: function(old_acc, el, new_acc) {
    var old_f_acc = old_acc('f_acc');
    var new_f_acc = new_acc('f_acc');
    var old_val = <f_EMIT>(old_f_acc);
    var new_val = <f_EMIT>(new_f_acc);
    // We handle the 'ready' state separately below
    var emit_state = r.expr(<IS>).and(el.hasFields('state')).and(r.expr(<II>).not().or(el('state').ne('ready')));
    var emit_update = old_acc('is_initialized').and(old_val.ne(new_val));
    var emit_initial = r.expr(<II>).and(old_acc('is_initialized').not().and(new_acc('is_initialized')));
    return r.branch(
      emit_state, [el],
      emit_update, [{'old_val': old_val, 'new_val': new_val}],
      emit_initial, r.branch(<IS>, [{'new_val': new_val}, {state: "ready"}], [{'new_val': new_val}]),
      []
    );
  }})

For example for count():

  • <f_BASE> = 0
  • <f_APPLY> = function(acc, el) { return acc.add(1); }
  • <f_UNAPPLY> = function(acc, el) { return acc.sub(1); }
  • <f_EMIT> = function(acc) { return acc; }

Or for avg():

  • <f_BASE> = {c: 0, sum: 0}
  • <f_APPLY> = function(acc, el) { return {c: acc('c').add(1), sum: acc('sum').add(el) }; }
  • <f_UNAPPLY> = function(acc, el) { return {c: acc('c').sub(1), sum: acc('sum').sub(el) }; }
  • <f_EMIT> = function(acc) { return acc('sum').div(acc('c')); } (plus some sort of handling for empty input sets that we need to come up with)
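
For illustration, here's roughly what you get by substituting the count() definitions into the rewrite above (a sketch, assuming <II> is true and <IS> is false, so state documents are swallowed):

stream.changes({includeInitial: true, includeStates: true}).fold(
  {f_acc: 0, is_initialized: false},
  function(acc, el) {
    var f_acc = acc('f_acc');
    // Subtract 1 for a removed/old row, add 1 for an added/new row.
    var new_f_acc = r.branch(el.hasFields('old_val'), f_acc.sub(1), f_acc).do(function(un_f_acc) {
        return r.branch(el.hasFields('new_val'), un_f_acc.add(1), un_f_acc);
      });
    var new_is_initialized = acc('is_initialized').or(el.hasFields('state').and(el('state').eq('ready')));
    return {f_acc: new_f_acc, is_initialized: new_is_initialized};
  },
  {emit: function(old_acc, el, new_acc) {
    var old_val = old_acc('f_acc');
    var new_val = new_acc('f_acc');
    var emit_update = old_acc('is_initialized').and(old_val.ne(new_val));
    var emit_initial = old_acc('is_initialized').not().and(new_acc('is_initialized'));
    return r.branch(
      emit_update, [{old_val: old_val, new_val: new_val}],
      emit_initial, [{new_val: new_val}],
      []
    );
  }})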

The main disadvantage of this implementation is that it doesn't distribute the reduction anymore and causes additional network traffic between the shards and the parsing node. This can be improved through a bit of special code.

It would be amazing I think if we could get the slow version into 2.4, at least for the built-in terms (such as count, sum and avg). We can mark it as a "preview" feature, since it doesn't have the full scalability yet that you'd expect from RethinkDB, and then ship the optimized version with 2.5.

@danielmewes The initial reduction would be distributed, so the speed is still there. Subsequent reductions would be very very cheap, so it wouldn't matter if they were distributed or not. This is pretty awesome :).

Right, that would be the optimized version with an initial reduction. The simple one that works entirely as a rewrite to changes.fold would not distribute the initial reduction.

Oh, sorry, I misunderstood. Awesome either way :).

I'm removing the API_settled tag because a bunch of things have changed about the changes() API.

I think for our built-in reductions (count, sum, avg) the API is straight-forward. We basically just allow calling changes on those terms.

The open question is whether we want to support this for arbitrary reductions in 2.4, and if so, how the API of reduce should be extended.

I'm in favor of supporting this for reduce in general, as it's not going to add much extra work as far as I can tell.

The following APIs have been suggested above:

r.table('test').filter(...).reduce{|a,b| ...}.changes(reverse: lambda {|acc, o| ...})
r.table('test').filter(...).reduce(reverse: lambda {|acc, o| ...}){|a,b| ...}

I really don't like the first one, but I like the second one.

I love the second one!
So in 2.4 we can do counts on a changefeed? That will be awesome!

So in 2.4 we can do counts on a changefeed? That will be awesome!

Yes, that's the idea :)
The initial count will run a bit slower in a changefeed for 2.4 compared to running just the count without changes. But we can optimize this in the future to get almost the same performance.

If I could do a changefeed on an aggregate that calculates NPS, that would be amazing. I have data that looks something like this:

const npsData = [
  {
    "component_id": 1,
    "number": 10
  },
  {
    "component_id": 1,
    "number": 10
  },
  {
    "component_id": 2,
    "number": 8
  },
  {
    "component_id": 1,
    "number": 9
  },
  {
    "component_id": 2,
    "number": 2
  },
  ...
];

And my query looks something like this:

r.expr(npsData)
  .group('component_id', 'number').count()
  .ungroup()
  .map((row) => {
    const number = row('group').nth(1);
    const ret = r.expr({
      component_id: row('group').nth(0),
      distribution: [{number: number, total: row('reduction')}],
      total_answers: row('reduction'),
      detractors: 0,
      passives: 0,
      promoters: 0
    });

    return r.branch(
      number.eq(9).or(number.eq(10)),
      ret.merge({promoters: ret('promoters').add(row('reduction'))}),
      number.eq(7).or(number.eq(8)),
      ret.merge({passives: ret('passives').add(row('reduction'))}),
      ret.merge({detractors: ret('detractors').add(row('reduction'))})
    );
  })
  .group('component_id')
  .reduce((left, right) => ({
    component_id: left('component_id'),
    total_answers: left('total_answers').add(right('total_answers')),
    detractors: left('detractors').add(right('detractors')),
    passives: left('passives').add(right('passives')),
    promoters: left('promoters').add(right('promoters')),
    distribution: left('distribution').add(right('distribution')),
  }))
  .do((datum) => {
    const passivesPercentage = datum('passives').div(datum('total_answers')).mul(100);
    const promotersPercentage = datum('promoters').div(datum('total_answers')).mul(100);
    const detractorsPercentage = datum('detractors').div(datum('total_answers')).mul(100);
    return {
      distribution: datum('distribution'),
      passives_percentage: passivesPercentage,
      promoters_percentage: promotersPercentage,
      detractors_percentage: detractorsPercentage,
      score: promotersPercentage.sub(detractorsPercentage)
    };
  })
  .ungroup()
  .map(row => ({
    component_id: row('group'),
    distribution: row('reduction')('distribution'),
    passives_percentage: row('reduction')('passives_percentage'),
    promoters_percentage: row('reduction')('promoters_percentage'),
    detractors_percentage: row('reduction')('detractors_percentage'),
    score: row('reduction')('score')
  }));

Could the above be achieved with this new api?

@meenie It depends on whether or not you can express it as a reduce operation and whether there's an efficient "reverse" function that updates the query result when a document gets removed from the input set.

@danielmewes So I wouldn't be able to use group()? And I'd have to do those counts manually using reduce? And ya, I believe you could reverse the above because you keep track of the distribution.

@meenie You might be able to rewrite the grouping into a reduction, in which case it would work. You would basically maintain an object {group1: group1Value, group2: group2Value, ...} in the reduction. This might become inefficient if there are a lot of groups, because a new object will be constructed every time the reduction function is called.
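
A rough, untested sketch of that rewrite (counting documents per component_id from the earlier example, using a hypothetical table 'nps' that holds those documents): map each document to a single-group object, then merge the partial objects in the reduction.

r.table('nps')
  .map(function(row) {
    // Object keys must be strings, so coerce the group value.
    return r.object(row('component_id').coerceTo('string'), 1);
  })
  .reduce(function(left, right) {
    // Merge two partial results, summing the counts of groups present in both.
    return left.keys().setUnion(right.keys()).fold({}, function(acc, key) {
      return acc.merge(r.object(key,
        left(key).default(0).add(right(key).default(0))));
    });
  });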

@danielmewes: Ya, that makes total sense. For now, we need every bit of efficiency we can get, so I'll be experimenting with rewriting our queries to use changefeeds, but won't utilise this in production until it's at parity speed-wise.

r.table('test').filter(...).reduce{|a,b| ...}.changes(reverse: lambda {|acc, o| ...})
r.table('test').filter(...).reduce(reverse: lambda {|acc, o| ...}){|a,b| ...}

Since the reverse function is only needed because it's a changefeed, it seems like the first one makes sense. But then again it's not clear which function it's reversing if they're separated. I guess it feels kind of wrong to me to require something extra when you're doing .changes vs. a normal query.

(to clarify, I know why we have to do it in this case, but it pulls me towards option 1 over option 2)

I guess it feels kind of wrong to me to require something extra when you're doing .changes vs. a normal query.

The way I think of it is that this is like needing to have the {index: ...} optarg for orderBy if you want to have a changefeed on it. I can see that this is slightly different because the reverse option will not have any effect unless you open a changefeed, but I don't feel like that's a big issue.

I don't like the first syntax because it seems limiting and different from what we do anywhere else.

What if in the future we allow changefeeds on queries that contain multiple reduce operations (for example within a subquery)? Specifying the function in changes would not work for that.

Or what if you have a query that looks like this: tbl.reduce(...).do(...).changes()? Ignoring the fact that we currently don't support do in changefeeds (which we totally should), it becomes much less obvious what the reverse argument to changes actually applies to and how it works. Does it get applied to the value after or before the do?

Yeah, it seems like the best way is to provide the reverse function to the reduce term. I'm assuming if you don't tack on .changes the reverse optarg will just be a no-op (vs. erroring)?

I'm assuming if you don't tack on .changes the reverse optarg will just be a no-op (vs. erroring)?

Yeah that's what I thought. That way you can run the same query with and without .changes.

There's a lot of discussion above. Here's my understanding of the
current proposal:

  • You can write any of these:
    • stream.avg(...).changes()
    • stream.sum(...).changes()
    • stream.count(...).changes()
    • stream.reduce(..., reverse: FUNC).changes()
  • In particular, it doesn't need to be on a selection;
    .map.reduce.changes etc. are legal. (We should probably support
    concat_map.reduce.changes even though we don't yet support
    .concat_map.changes, since it's easy.)

A few other things:

  • Should we support .coerce_to(...).changes()? The most common
    would probably be .coerce_to('array').changes(), where we'd
    re-send the whole array every time it changes. There's an argument
    that coercing to an array is a terminal, so it might be more
    consistent to support it.
  • Should we support .group.reduce.changes? There's no real
    technical limitation, it would be almost as easy as not supporting
    it. If so, should we also take this opportunity to support
    .group.changes?
  • How should we handle reductions over nothing?
    (E.g. r.table('test').avg('foo').changes() when test changes
    from empty to non-empty -- what's old_val?) Currently sum and
    count return 0 on empty streams, while avg and reduce produce
    an error.
    • We should probably use 0 as the "nothing" value for the
      terminals that return it on an empty stream.
    • One option would be to just use nil as the "nothing" value for
      all other terminals.
    • Another option would be to error by default, but to let people
      write e.g. .avg('foo').default(whatever).changes() to specify
      it explicitly.

Also, on the subject of implementation, it probably wouldn't actually be
that hard to do it the efficient way where we do chunks of reductions
on the shards and only ship the aggregates over. It would only speed
up the initial computation, but it would probably speed it up a lot.
(The reason I don't think it would be particularly hard is that we're
already only tracking timestamps on a per-read-transaction basis, so
we wouldn't lose any fidelity if we attached a terminal to the reads
we ship over and got back a pre-aggregated value alongside the
stamps.)

.coerce_to(...).changes() looks like a convenience function; it looks good but should be optional.
What I find exciting is group.changes and group.reduce.changes.

@mlucy Thanks for the summary of the current proposal. That matches what I had in mind for 2.4.

I'd like to add coerce_to(...).changes() from your suggestions to this as it seems trivial to do.

The three extensions that you're suggesting

  1. .coerce_to(...).changes()
  2. .group(...).reduce(...).changes()
  3. more efficient implementation for the initial result

all sound really cool to me.

As far as I can tell, 1 would be easy to implement even as a pure fold-based rewrite, at least for coerceTo('array'). We can just keep the current array in the accumulator. I'm not sure if there are any other types that we allow coercing to from a stream? 'string' maybe? Most likely those would also be easy to support, and we would still maintain the array but then just call a final .coerceTo(...) on the accumulator array before emitting a value. In any case, doing this will be O(n) in the number of results, but that's expected since the output per change is already of that size.
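
For example, a rough sketch of that rewrite for tbl.coerceTo('array').changes() could keep the current array in the fold accumulator, keyed on the primary key (assumed to be id here; ordering, states, and the initial-value bookkeeping from the full rewrite above are ignored):

tbl.changes({includeInitial: true}).fold(
  [],
  function(acc, el) {
    // Drop the old version of the row (if any), then append the new version (if any).
    return r.branch(el.hasFields('old_val'),
                    acc.filter(function(row) { return row('id').ne(el('old_val')('id')); }),
                    acc)
            .do(function(without_old) {
              return r.branch(el.hasFields('new_val'),
                              without_old.append(el('new_val')),
                              without_old);
            });
  },
  {emit: function(old_acc, el, new_acc) { return [new_acc]; }})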

My impression is that 2 (group.reduce.changes) is a bit more involved in terms of having to figure out how to represent added and removed groups in the output stream.

Since we have limited remaining development resources for 2.4 considering the other things we are working on, my suggestion would be that we agree on a minimal proposal, and keep extensions 2 and 3 out of the proposal for now. If we end up having extra time, we can still implement the more efficient algorithm (3) or discuss grouped changefeeds separately.

How should we handle reductions over nothing?

Great question.
My opinion is that we should emit them as the value null for avg and reduce (and 0 for sum and count). Our current changefeeds already use null to indicate the absence of a value. I think this would fit pretty nicely.
Reporting them as errors sounds nice on paper, but I think in practice it will be a much bigger pain for our users to handle.

Also I would like to add that I'm extremely excited about this feature! It's going to be so amazing :-)

Can we have 2 and 3 in 2.5? :D

@v3ss0n I think so :)

@danielmewes -- leaving 2 and 3 for later sounds good to me. I don't think 2's representation would be a particularly involved discussion, though -- I was imagining we'd just emit the entire grouped data every time it changed (so {old_val: {grp: red, grp2: red2}, new_val: {grp: red}}). If we wanted to support plain old .group.changes that would require thinking a little about the format, though.

On coerce_to, I think coerce_to('array') and coerce_to('object') are the only ones that can take a stream.

Marking settled as:

  • You can write any of these:
    • stream.avg(...).changes()
    • stream.sum(...).changes()
    • stream.count(...).changes()
    • stream.reduce(..., reverse: FUNC).changes()
    • stream.coerceTo('array').changes()
    • stream.coerceTo('object').changes()
  • In particular, it doesn't need to be on a selection;
    .map.reduce.changes etc. are legal. (We should probably support
    concat_map.reduce.changes even though we don't yet support
    .concat_map.changes, since it's easy.)

For 2.4 we will implement the slower variant that performs the initial reduction on the parsing node rather than distributing it.

This is gonna be great! Note that supporting .coerce_to('array').changes() solves the use-case I had for #3719, so I would say we probably don't need anything from that proposal anymore. I also prefer these semantics over the optarg from the other proposal.

This is in review 3714, except for coerce_to("array")

As this has been transferred to milestone 2.4-polish (for obvious reasons), I just wanted to emphasize what @danielmewes wrote in this comment as a work-around for the time being. I suggest the following to support the other common aggregation operations:

  • sum(): Very similar to count() but we increment by the value of el instead of 1.
    • <f_BASE> = 0
    • <f_APPLY> = function(acc, el) { return acc.add(el); }
    • <f_UNAPPLY> = function(acc, el) { return acc.sub(el); }
    • <f_EMIT> = function(acc) { return acc; }
  • min():
    • <f_BASE> = Number.POSITIVE_INFINITY
    • <f_APPLY> = function(acc, el) { return r.branch(el.lt(acc), el, acc); }
    • <f_UNAPPLY> = function(acc, el) { return acc; }
    • <f_EMIT> = function(acc) { return acc; }
  • max():
    • <f_BASE> = Number.NEGATIVE_INFINITY
    • <f_APPLY> = function(acc, el) { return r.branch(el.gt(acc), el, acc); }
    • <f_UNAPPLY> = function(acc, el) { return acc; }
    • <f_EMIT> = function(acc) { return acc; }
  • avg(): Similar to what Daniel did, but I would handle empty sets with a one-liner like this:
    • <f_EMIT> = function(acc) { return r.branch(acc('c').ne(0), acc('sum').div(acc('c')), 0); }

Any thoughts on this would be appreciated. 😃