microsoft / Trill

Trill is a single-node query processor for temporal or streaming data.

Flushes in joins issue?

nsulikowski opened this issue

Bug in joins?

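    using System;
    using System.Collections.Generic;
    using System.Reactive.Subjects;
    using Microsoft.StreamProcessing;
    using Xunit;

    public struct IONRecord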
    {
        public string Id;
        public string Data;

        public override string ToString() => $" {{{nameof(Id)}:{Id}, {nameof(Data)}:{Data}}}";
    }

        [Fact]
        public void LeftOuterJoin_FlushProblem_Test()
        {
            var left_subject = new Subject<StreamEvent<IONRecord>>();
            var right_subject = new Subject<StreamEvent<IONRecord>>();

            var left_stream = left_subject.ToStreamable(disorderPolicy: DisorderPolicy.Throw(), flushPolicy: FlushPolicy.FlushOnPunctuation, periodicPunctuationPolicy: null);
            var right_stream = right_subject.ToStreamable(disorderPolicy: DisorderPolicy.Throw(), flushPolicy: FlushPolicy.FlushOnPunctuation, periodicPunctuationPolicy: null);

            var output_events = new List<string>();
            var output_stream = left_stream
                .LeftOuterJoin(right: right_stream,
                               leftKeySelector: l => l.Id,
                               rightKeySelector: r => r.Id,
                               outerResultSelector: l => new
                               {
                                   l_Id = l.Id,
                                   r_Id = (string)null
                               },
                               innerResultSelector: (l, r) => new
                               {
                                   l_Id = l.Id,
                                   r_Id = r.Id
                               });

            var output_observable = output_stream.ToStreamEventObservable(reshapingPolicy: ReshapingPolicy.None);
            output_observable.Subscribe(onNext: se =>
            {
                if (se.IsData) output_events.Add(se.ToString());
            });

            //Start with no output events
            Assert.Equal(0, output_events.Count);

            // No flush after the first left event is OK (I guess): the join is waiting for the sync time to move forward
            left_subject.OnNext(StreamEvent.CreateStart(701, new IONRecord { Id = "c1", Data = "shortdes1" }));
            Assert.Equal(0, output_events.Count);


            // Unexpected: the sync time on the left moved forward, but there is still no flush
            left_subject.OnNext(StreamEvent.CreatePunctuation<IONRecord>(punctuationTime: 702));
            Assert.Equal(0, output_events.Count);

            // Output is flushed only once both left and right have moved forward. I thought that in joins
            // the sync time was supposed to be the max of the left and right sync times
            right_subject.OnNext(StreamEvent.CreatePunctuation<IONRecord>(punctuationTime: 702));
            Assert.Equal(new[] {
                "[Start: 701,{ l_Id = c1, r_Id =  }]",
            }, output_events);
        }
 

Hi @nsulikowski, this appears to be by design. We cannot flush any output from a LeftOuterJoin until time progresses on the right: until the right's sync time advances, we have no way of knowing whether the left events will match events that have yet to be ingressed from the right. Anything output before time progresses on the right could be incorrect.

E.g., if after these two events, we were to output the left start edge at 701, assuming there is no match from the right:

    left_subject.OnNext(StreamEvent.CreateStart(701, new IONRecord { Id = "c1", Data = "shortdes1" }));
    left_subject.OnNext(StreamEvent.CreatePunctuation<IONRecord>(punctuationTime: 702));

then an event comes in from the right at 701:

    right_subject.OnNext(StreamEvent.CreateStart(701, new IONRecord { Id = "c1", Data = "shortdes2" }));

then the outer-projected start edge at 701 that we previously output is now incorrect, because we did not wait for the right side to progress and deliver the matching event.

The left and right streams (can) operate on independent timelines, and the join will buffer events when one side is ahead of the other. So no, the right event at 701 would not be considered out of order.
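To make the buffering concrete, here is a toy model in plain C#. It is not Trill's implementation: the class ToyJoinBuffer and its methods are made up for illustration, and it models only when a left event becomes safe to emit (once both sides' sync times have moved past it), not the matching itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only, NOT Trill's implementation: left events are held back
// until both inputs have advanced past their timestamps.
public sealed class ToyJoinBuffer
{
    private long leftSync = long.MinValue;   // latest time seen on the left
    private long rightSync = long.MinValue;  // latest time seen on the right
    private readonly List<long> pendingLeft = new List<long>();

    // A left event advances the left sync time and is buffered.
    public void OnLeftEvent(long time)
    {
        leftSync = Math.Max(leftSync, time);
        pendingLeft.Add(time);
    }

    // Punctuations only advance the sync time of their own side.
    public void OnLeftPunctuation(long time) => leftSync = Math.Max(leftSync, time);

    public void OnRightPunctuation(long time) => rightSync = Math.Max(rightSync, time);

    // Returns (and removes) the buffered left timestamps that are strictly
    // earlier than the smaller of the two sync times.
    public List<long> Drain()
    {
        long safe = Math.Min(leftSync, rightSync);
        var ready = pendingLeft.Where(t => t < safe).ToList();
        pendingLeft.RemoveAll(t => t < safe);
        return ready;
    }
}
```

Replaying the test from the issue against this model: after OnLeftEvent(701) and OnLeftPunctuation(702), Drain() returns nothing because the right side has not moved; only after OnRightPunctuation(702) does the event at 701 come out.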

If you would like the two to be on the same timeline, you could perhaps broadcast the punctuations to both sides. Or, if both inputs originate from the same stream, you can just multicast that stream to both sides of the join (with a Where filter between the multicast and each join input).
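A minimal sketch of the first suggestion, broadcasting one punctuation to both inputs. The helper PunctuationBroadcast.Broadcast is hypothetical (it is not part of Trill); it relies only on the standard IObserver<T> interface, which Subject<T> implements. The Recorder class is a stand-in observer so the sketch runs without any packages.

```csharp
using System;
using System.Collections.Generic;

public static class PunctuationBroadcast
{
    // Hypothetical helper, not a Trill API: pushes the same item to every target.
    public static void Broadcast<T>(T item, params IObserver<T>[] targets)
    {
        foreach (var target in targets)
        {
            target.OnNext(item);
        }
    }
}

// Stand-in observer used only to demonstrate the helper without Rx or Trill.
public sealed class Recorder<T> : IObserver<T>
{
    public List<T> Items { get; } = new List<T>();
    public void OnNext(T value) => Items.Add(value);
    public void OnError(Exception error) { }
    public void OnCompleted() { }
}
```

In the test from the issue, the two separate punctuation calls could then become a single call, PunctuationBroadcast.Broadcast(StreamEvent.CreatePunctuation<IONRecord>(punctuationTime: 702), left_subject, right_subject), so that both sides of the join advance together.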