microsoft / Trill

Trill is a single-node query processor for temporal or streaming data.

Problem with QueryContainer

nsulikowski opened this issue

See the following code:

        [Fact]
        public void QueryContainer_MultipleOutputs_Test()
        {
            var container = new QueryContainer();

            var asset_subject = new Subject<StreamEvent<IONRecord_Struct>>();
            var assets_input = container.RegisterInput(asset_subject, identifier: "input1");

            var prices_subject = new Subject<StreamEvent<IONRecord_Struct>>();
            var prices_input = container.RegisterInput(prices_subject, identifier: "input2");

            var join = assets_input.Join(
                right: prices_input,
                leftKeySelector: l => l.Id,
                rightKeySelector: r => r.Id,
                resultSelector: (l, r) => new
                {
                    l.Id,
                    l.Data,
                    r_Data = r.Data,
                });
            container.RegisterOutput(join, identifier: "output1").Subscribe(onNext: p => Debug.Print($"{p}"));

            var left_join = assets_input
                .LeftOuterJoin(right: prices_input,
                    leftKeySelector: l => l.Id,
                    rightKeySelector: r => r.Id,
                    outerResultSelector: l => new
                    {
                        l.Id,
                        l.Data,
                        r_Data = string.Empty,
                    },
                    innerResultSelector: (l, r) => new
                    {
                        l.Id,
                        l.Data,
                        r_Data = r.Data
                    });

            //NEXT LINE THROWS !??
            //System.InvalidOperationException: 'Operation is not valid due to the current state of the object.'
            container.RegisterOutput(left_join, identifier: "output2").Subscribe(onNext: p => Debug.Print($"{p}"));

            var process = container.Restore(inputStream: null);
        }

OK, this answer is going to be verbose and yet totally unsatisfying:

This is by design.

That's the unsatisfying part. Here's the verbose part.

The root cause of this issue is that you have two streamable variables (assets_input and prices_input) that are being used in multiple places within your query. Trill follows the same semantics as Rx - each subscription causes an independent chain of subscriptions through the query down to the input sources, and each ends up being a separate stream of actual data flowing through.
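The Rx behavior described above can be seen in a small sketch. It assumes the System.Reactive package and uses plain Rx rather than Trill, so it only illustrates the inherited semantics, not Trill's internals. The names `SubscriptionChainDemo` and `CountSourceRuns` are made up for this example.

```csharp
using System;
using System.Reactive.Disposables;
using System.Reactive.Linq;

public static class SubscriptionChainDemo
{
    // Returns how many times the source's subscribe logic ran
    // after two subscribers attached to the same query variable.
    public static int CountSourceRuns(bool share)
    {
        int sourceRuns = 0;

        // A "cold" source: this lambda runs once per subscription.
        IObservable<int> source = Observable.Create<int>(observer =>
        {
            sourceRuns++;
            observer.OnNext(1);
            observer.OnNext(2);
            observer.OnCompleted();
            return Disposable.Empty;
        });

        if (!share)
        {
            // Reusing the variable: each Subscribe builds its own chain
            // all the way down to the source.
            var doubled = source.Select(x => x * 2);
            doubled.Subscribe(_ => { });
            doubled.Subscribe(_ => { });
        }
        else
        {
            // Publish/Connect shares one upstream subscription
            // between both subscribers.
            var published = source.Select(x => x * 2).Publish();
            published.Subscribe(_ => { });
            published.Subscribe(_ => { });
            published.Connect();
        }

        return sourceRuns;
    }
}
```

Calling `CountSourceRuns(share: false)` runs the source twice, once per subscriber, while `CountSourceRuns(share: true)` runs it once.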

Without the QueryContainer, what results from that situation is stream duplication. It's a little redundant; you have stream messages being duplicated, which will slow down your system, but it's not fatal. However, with a QueryContainer, that situation is bad when having to deal with checkpointing. It messes up our ability to tell what has and what has not been checkpointed yet, and causes some potential race conditions in restoration. In short, reusing a stream variable is "a bad idea" without a QueryContainer and simply not allowed with one.

The proximate solution for you in this case is to do a multicast on assets_input and prices_input, so you can reuse those streams while avoiding the situation stated above. It would look like:

        var assets_multi = assets_input.Multicast(2); // Edited from previous version of comment to have proper parentheses
        var prices_multi = prices_input.Multicast(2); // Edited from previous version of comment to have proper parentheses

        var join = assets_multi[0].Join(prices_multi[0], ...);
        var leftJoin = assets_multi[1].LeftOuterJoin(prices_multi[1], ...);
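For reference, here is a rough sketch of the original test with the multicast applied to both inputs. It assumes the Microsoft.StreamProcessing (Trill) and System.Reactive packages. `Record` is a hypothetical stand-in for `IONRecord_Struct`, whose definition is not shown in the issue, and `MulticastFixSketch` is an illustrative name. Treat it as an outline of the suggested workaround rather than a verified fix.

```csharp
using System;
using System.Reactive.Subjects;
using Microsoft.StreamProcessing;

// Hypothetical stand-in for IONRecord_Struct (not shown in the issue).
public struct Record
{
    public int Id;
    public string Data;
}

public static class MulticastFixSketch
{
    public static QueryContainer Build()
    {
        var container = new QueryContainer();

        var assetSubject = new Subject<StreamEvent<Record>>();
        var assetsInput = container.RegisterInput(assetSubject, identifier: "input1");

        var pricesSubject = new Subject<StreamEvent<Record>>();
        var pricesInput = container.RegisterInput(pricesSubject, identifier: "input2");

        // Split each input into two branches so that no streamable
        // variable is consumed by more than one operator.
        var assetsMulti = assetsInput.Multicast(2);
        var pricesMulti = pricesInput.Multicast(2);

        // Branch 0 of each input feeds the inner join.
        var join = assetsMulti[0].Join(
            right: pricesMulti[0],
            leftKeySelector: l => l.Id,
            rightKeySelector: r => r.Id,
            resultSelector: (l, r) => new { l.Id, l.Data, r_Data = r.Data });
        container.RegisterOutput(join, identifier: "output1")
            .Subscribe(e => Console.WriteLine(e));

        // Branch 1 of each input feeds the left outer join.
        var leftJoin = assetsMulti[1].LeftOuterJoin(
            right: pricesMulti[1],
            leftKeySelector: l => l.Id,
            rightKeySelector: r => r.Id,
            outerResultSelector: l => new { l.Id, l.Data, r_Data = string.Empty },
            innerResultSelector: (l, r) => new { l.Id, l.Data, r_Data = r.Data });
        container.RegisterOutput(leftJoin, identifier: "output2")
            .Subscribe(e => Console.WriteLine(e));

        return container;
    }
}
```

The count passed to `Multicast` has to match the number of places each input is consumed, which is two here: one branch per join.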

That said, this is what we call in the business a really bad user experience. There's no way that you could have figured this out from just the code. So at a minimum, I'm going to go in and see if there is a better way to communicate this issue to the user.

On top of that, there is a design issue. We decided early on to inherit from Rx semantics and have separate subscription chains for every caller. However, it's clear from user feedback that said semantics are confusing. I'm not even sure that Rx is totally happy with what they did. So we may go back and try to change the design to only have a single subscription chain irrespective of how many times a variable is used. That's a major change, so it's not something that will happen in the next few months, but it comes up every now and then as a request and we will seriously consider it.

Thanks James. As always, your explanations are very clear.
BTW, I guess you meant
var assets_multi = assets_input.Multicast(2); //with parentheses
instead of
var assets_multi = assets_input.Multicast[2]; //with square brackets

...there is a design issue. We decided early on to inherit from Rx semantics and have separate subscription chains for every caller. However, it's clear from user feedback that said semantics are confusing. I'm not even sure that Rx is totally happy with what they did. So we may go back and try to change the design to only have a single subscription chain irrespective of how many times a variable is used.

I think I ran into this when starting with Rx and I've certainly spent time on this when teaching others. What would we lose by not having separate subscriptions or why do you think the Rx design went that route in the first place?

@nsulikowski Correct, apologies. I will go back and edit my comment.

@NickDarvey I think the Rx design is based on the semantics of Enumerable. I know that a lot of the work that went into Rx semantics is based on the theory that IObservable is a dual of IEnumerable. A subscription in that case is a "dual" of an enumeration, and each enumeration of an IEnumerable starts at the beginning and is independent of the others. As such, each subscription in essence "begins" at the beginning of the observable (even though in practice the IObservable may not be able to "reset" back to its original state on each subscription).
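The enumeration side of that duality is easy to check with plain C# and no extra packages. In this minimal sketch (`EnumerationDemo` and `CountIteratorRuns` are illustrative names), the same `IEnumerable<int>` variable is enumerated twice, and the iterator body starts over each time:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerationDemo
{
    // Enumerates the same IEnumerable variable twice and reports how
    // often the iterator body started, plus the sum from each pass.
    public static (int Runs, int First, int Second) CountIteratorRuns()
    {
        int iteratorRuns = 0;

        IEnumerable<int> Numbers()
        {
            iteratorRuns++; // executes when an enumeration begins
            yield return 1;
            yield return 2;
        }

        IEnumerable<int> numbers = Numbers();

        // Each Sum() call starts a fresh, independent enumeration.
        int first = numbers.Sum();
        int second = numbers.Sum();

        return (iteratorRuns, first, second);
    }
}
```

Both passes see the full sequence, and the iterator body runs twice. This is the enumerable counterpart of two subscriptions each triggering their own chain down to the source.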

That explains how we got where we are, but concretely, what would we lose if we switched to assuming multicast on each variable reuse? Honestly, I'm not sure. In Rx, there may be something lost, but in the streaming world, I cannot help but think that we'd be better off treating variables as branches in a stream rather than adhering to inherited semantics.