Batch Processing : How to withhold output until input terminates

AlgorithmsAreCool opened this issue 5 years ago · comments

Hello, thank you very much for designing and open sourcing this system. I've been reading y'alls papers on Trill, FASTER and Quill for years now.

I have a few questions about how to use it.

Trill for batch processing

I realize that Trill is primarily a streaming data engine, but I also work with a lot of batched data that I would like to query using Trill.

Maybe I don't have the egress side of things set up correctly, but my setup looks like this:

public static async Task EventsPerHour(string rootFolder)
{
    await
        LogExtractor
        .Create()
        .ExtractSingleSiteDirectory(rootFolder)
        .AsObservable()
        .ToTemporalStreamable(
            e => e.Entry.Timestamp.Ticks,
            DisorderPolicy.Drop(TimeSpan.TicksPerHour),
            FlushPolicy.None,
            PeriodicPunctuationPolicy.None())
        .GroupApply(
            e => e.Entry.Timestamp.Hour,
            g => g.Count(),
            (g, v) => new { Hour = g.Key, Count = v })
        .ToStreamEventObservable()
        .ForEachAsync(i => {
            Console.WriteLine(i.Payload);
        });
}

So what I would like to get out of this is a single list of (Hour, Count) pairs.

What I actually get is a lot of incremental updates as the data flows in. To compensate, I made a handler method that tracks all the updates per group and only keeps the last one. It produces correct output, but it seems wasteful to have the engine keep producing output that I'm discarding.
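For readers following along, the "keep only the last update per group" workaround described above might look roughly like the sketch below. It is plain C# with no Trill dependency; the helper name `KeepLastPerKey` and the sample updates are hypothetical and are not taken from the original handler.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Trill): collapse a sequence of incremental
// (Hour, Count) updates into the final count per hour. Later updates for the
// same hour overwrite earlier ones.
static SortedDictionary<int, ulong> KeepLastPerKey(
    IEnumerable<(int Hour, ulong Count)> updates)
{
    var latest = new SortedDictionary<int, ulong>();
    foreach (var (hour, count) in updates)
    {
        latest[hour] = count;
    }
    return latest;
}

// Made-up incremental updates, in the order the engine might emit them.
var updates = new (int Hour, ulong Count)[]
{
    (4, 1UL), (4, 2UL), (5, 1UL), (4, 3UL), (5, 2UL),
};

var finalCounts = KeepLastPerKey(updates);
foreach (var pair in finalCounts)
{
    Console.WriteLine($"Hour = {pair.Key}, Count = {pair.Value}");
}
```

This gives the right answer, but every intermediate update is still computed and then thrown away, which is the waste the question is about.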

Can I tell Trill to withhold output until the input stream terminates? If so, how?

Weakened Discoverability

Also, why are so many things marked [EditorBrowsable(EditorBrowsableState.Never)]? For example, I see people using the three-argument version of GroupApply in examples, but for whatever reason you have GroupSelectorInput<T>.Key marked as never browsable, which makes the result selector function seem useless at first.

Is there a reason this property (and others like it) is hidden?

GroupApply vs Partition+Aggregate+SelectByKey

Are these two constructions equivalent? If so, which should I prefer?

.GroupApply(
    e => e.Entry.Timestamp.Hour,
    g => g.Count(),
    (g, v) => new { Hour = g.Key, Count = v })
//yields type IStreamable<Empty, 'a>

vs.

.Partition(e => e.Entry.Timestamp.Hour)
.Aggregate(g => g.Count())
.SelectByKey((time, key, count) => new { Hour = key, Count = count })
//yields type IStreamable<PartitionKey<int>, 'a>

James Terwilliger · Answer 1 · Sat Jun 01 2019 01:16:57 GMT+0800 (China Standard Time)

My sincere apologies for how long it has taken to comment on this - for some reason, I didn't get a notification that your issue posted. I'm looking at this right now and will give you some guidance within a couple of hours.

James Terwilliger · Answer 2 · Sat Jun 01 2019 01:19:20 GMT+0800 (China Standard Time)

Re: Discoverability

Most of the things marked as not being discoverable are fields, methods, or types that we would rather not be public but have to be in order for code generation to function properly. However, the specific example you give is clearly an oversight - one should be able to access the key directly. I will fix that in the next release.

AlgorithmsAreCool · Answer 3 · Sat Jun 01 2019 01:27:28 GMT+0800 (China Standard Time)

Hey, it isn't your fault about the delay. I tripped some kind of spam filter when i made this issue and Github suspended my account so it was hidden from you. It took them a couple of days to fix it.

Also, there isn't a massive rush on me needing answers. I've been having a lot of fun reading all the papers and working with Trill.

James Terwilliger · Answer 4 · Sat Jun 01 2019 01:33:09 GMT+0800 (China Standard Time)

Ah, the wonderment that is spam filters. I'm glad you're having fun with Trill - it's a blast to work on, too.

Re: Partition

The partition method does something kind of special and magical. I'll try to explain as best I can.

There is a concept within Trill called "partitioned streams". This feature is one way to get around the restriction within Trill that all data must be in order post-ingress. What it allows is for data to follow an independent timeline per partition. For instance, if you have data coming from a collection of sensors, and you want to do a query per sensor (normally done using GroupApply) but each sensor's data may arrive at the processing node with different network lag, partitioned streams allows each sensor to have its data treated as its own timeline. Global disorder policies (e.g., Drop) are then applied on a per-sensor basis rather than globally.
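To make the per-partition timeline idea concrete, here is a toy model in plain C#. It does not use Trill and is not how Trill implements disorder handling; it only assumes a simplified "drop any event older than the newest timestamp seen on its timeline" rule. The function `DropLate` and the sensor data are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Toy model of a drop-late rule. keyOf decides which timeline an event belongs
// to: a constant key models one global timeline, while the sensor id models
// one independent timeline per sensor.
static List<(string Sensor, long Time)> DropLate(
    IEnumerable<(string Sensor, long Time)> arrivals,
    Func<(string Sensor, long Time), string> keyOf)
{
    var highWaterMark = new Dictionary<string, long>();
    var kept = new List<(string Sensor, long Time)>();
    foreach (var e in arrivals)
    {
        var key = keyOf(e);
        if (highWaterMark.TryGetValue(key, out var mark) && e.Time < mark)
        {
            continue; // late relative to its own timeline, so it is dropped
        }
        highWaterMark[key] = e.Time;
        kept.Add(e);
    }
    return kept;
}

// Sensor B lags behind sensor A, but each sensor is in order on its own.
var arrivals = new (string Sensor, long Time)[]
{
    ("A", 10), ("A", 20), ("B", 5), ("A", 30), ("B", 15),
};

var globalKept = DropLate(arrivals, e => "all");
var perSensorKept = DropLate(arrivals, e => e.Sensor);

// Prints "global kept 3, per-sensor kept 5".
Console.WriteLine($"global kept {globalKept.Count}, per-sensor kept {perSensorKept.Count}");
```

With a single timeline, both of sensor B's events look late and are lost; with one timeline per sensor, nothing is dropped because B is only compared against itself.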

The way that you "enable" this feature is by ingressing PartitionedStreamEvents instead of StreamEvents. Alternatively, one can enable this feature by using ToPartitionedStreamable instead of ToTemporalStreamable. In both of these cases, the result is that you end up with a stream of type IStreamable<PartitionKey<K>, P>. The marker "PartitionKey" in the key type of the streamable means you're in partitioned world, the world's strangest theme park.

Now, the method Partition allows the user to introduce partitions in the middle of a query rather than at ingress. This method allows the user to then do temporal operations on the data without worrying about keeping all data in order. For instance, a concrete feature request that we got was to be able to do different windowing on data based on a key. The Partition method allows the user to split the timeline, thus allowing each individual partition to be windowed independently without any fear of misordering. You could then have one partition do a tumbling window on an hour, another partition have a hopping window of 10 minutes with a hop of a minute, and so forth.

A good example of the Partition method in action is the Rules Engine example in our samples repo.

AlgorithmsAreCool · Answer 5 · Sat Jun 01 2019 03:25:16 GMT+0800 (China Standard Time)

That is a nice bit of flexibility to have!

James Terwilliger · Answer 6 · Sat Jun 01 2019 03:45:54 GMT+0800 (China Standard Time)

Re: your initial example

Have you tried using TumblingWindowLifetime(TimeSpan.TicksPerHour) instead of your GroupApply? That should give you one result per hour, only returned when the full hour of data has passed.

AlgorithmsAreCool · Answer 7 · Sat Jun 01 2019 04:10:45 GMT+0800 (China Standard Time)

Hmm, looking at my notes, later versions of my queries did switch to tumbling windows. They seem to work great with one exception:

If there is no data for a time period, the returned data will skip over the empty windows.

As an example, I have a bunch of logs I'm ingressing and filtering for particular events. I'm trying to get per-hour counts of event occurrence. My query looks like this:

await
    LogExtractor
        .Create()
        .ExtractSingleSiteDirectory(siteLogFolder)
        .AsObservable()
        .Where(e => e.Entry.Method == "ScaryEvent")
        .ToTemporalStreamable(
            e => e.Entry.Timestamp.Ticks,
            DisorderPolicy.Drop(TimeSpan.TicksPerMinute),
            flushPolicy: FlushPolicy.FlushOnPunctuation,
            periodicPunctuationPolicy: PeriodicPunctuationPolicy.None()
            )
        .Select(e => Empty.Default) //hopefully saving memory since i only need counts???
        .TumblingWindowLifetime(TimeSpan.TicksPerHour)
        .Count()
        .ToStreamEventObservable()
        .ForEachAsync(evt => {
            var start = ToDateTime(evt.StartTime).Value;
            if (evt.IsData)
            {
                var end = ToDateTime(evt.EndTime);
                Console.WriteLine($"{start} {end}  - {evt.Payload}");
            }
        });

This works, but produces output like this:

5/22/2019 4:00:00 AM 5/22/2019 5:00:00 AM  - 2970
5/22/2019 5:00:00 AM 5/22/2019 6:00:00 AM  - 2750
5/22/2019 6:00:00 AM 5/22/2019 7:00:00 AM  - 35
5/22/2019 7:00:00 AM 5/22/2019 8:00:00 AM  - 1595
5/23/2019 4:00:00 PM 5/23/2019 5:00:00 PM  - 240
5/23/2019 5:00:00 PM 5/23/2019 6:00:00 PM  - 855
5/23/2019 6:00:00 PM 5/23/2019 7:00:00 PM  - 4610

Note the gap between 8 AM on 5/22 and 4 PM on 5/23, where there were no events.
I can manually fill in these empties with a little stateful selector on the egress, but is there a built-in way to get the empty windows too?
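For reference, the manual gap-filling mentioned above could be sketched in plain C# along these lines. This is not a Trill feature; the helper `FillEmptyWindows` and the sample windows are hypothetical, and the sketch assumes results arrive ordered by window start and aligned to the window size.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of Trill): pass window results through and
// insert a zero-count window for every slot missing between two results.
static IEnumerable<(DateTime Start, ulong Count)> FillEmptyWindows(
    IEnumerable<(DateTime Start, ulong Count)> windows, TimeSpan windowSize)
{
    DateTime? expected = null; // start of the next window we expect to see
    foreach (var window in windows)
    {
        while (expected.HasValue && expected.Value < window.Start)
        {
            yield return (expected.Value, 0UL);
            expected = expected.Value + windowSize;
        }
        yield return window;
        expected = window.Start + windowSize;
    }
}

// Made-up hourly results with the 8:00 and 9:00 windows missing.
var results = new (DateTime Start, ulong Count)[]
{
    (new DateTime(2019, 5, 22, 6, 0, 0), 35UL),
    (new DateTime(2019, 5, 22, 7, 0, 0), 1595UL),
    (new DateTime(2019, 5, 22, 10, 0, 0), 240UL),
};

var filled = FillEmptyWindows(results, TimeSpan.FromHours(1)).ToList();
foreach (var w in filled)
{
    Console.WriteLine($"{w.Start:yyyy-MM-dd HH:mm} - {w.Count}");
}
```

The drawback is that this only fills gaps between results that were actually emitted; empty windows before the first result or after the last one stay invisible.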

AlgorithmsAreCool · Answer 8 · Sat Jun 01 2019 04:19:05 GMT+0800 (China Standard Time)

Oh and lastly has the research from this paper made it into the wild yet? I'm super interested in the multiple egress latency selection stuff.

James Terwilliger · Answer 9 · Sat Jun 01 2019 12:40:25 GMT+0800 (China Standard Time)

I'm sorry to say the multiple latency work hasn't made it into the main branch and isn't on the GitHub site either (it is still only in our deprecated internal repo). I'll reach out to Yinan and see if he's willing to at least port his research prototype to GitHub in a branch.

James Terwilliger · Answer 10 · Sat Jun 01 2019 12:42:07 GMT+0800 (China Standard Time)

I believe that there is a way to fill in the missing gaps with nulls - let me try to reach into the deep well of my brain's recycle bin and see if I can page that back.

AlgorithmsAreCool · Answer 11 · Tue Jul 23 2019 05:20:35 GMT+0800 (China Standard Time)

For posterity,

If you set your periodic punctuations to the same interval as your window, Trill will emit punctuation events for the missing intervals.

var query =
    inputObservable
    .ToTemporalStreamable(
        startEdgeExtractor: item => item.Timestamp.Ticks, 
        periodicPunctuationPolicy: PeriodicPunctuationPolicy.Time(TimeSpan.TicksPerHour)
        )
    .TumblingWindowLifetime(TimeSpan.TicksPerHour)
    ...