microsoft / Trill

Trill is a single-node query processor for temporal or streaming data.

Time aware aggregations for temporal streams?

AlgorithmsAreCool opened this issue

Howdy,

I'm trying to use Trill to cluster discrete events into groups.
I want to group together events that occur within 2 minutes of each other.
For example, if I have these event times:

1:00:23PM
1:01:15PM
1:01:20PM
1:25:11PM
1:25:30PM
1:29:55PM

I need to group them into these groups.

Group 1
1:00:23PM
1:01:15PM
1:01:20PM

Group 2
1:25:11PM
1:25:30PM

Group 3
1:29:55PM

Once they are grouped, I need to aggregate them into 3 final events.

I've been trying to work with SessionTimeoutWindow, which does the first step, but then I noticed the grouping and aggregate APIs don't seem to offer time-aware implementations, so I can't operate on time as first-class data.

I realize I could use ToStreamEventObservable and then use normal LINQ to group and aggregate, but I was under the impression that I should operate on Streamables as much as possible first to take advantage of Trill's engine.

Is there a way to implement event-lifetime-aware groupings in Trill?

commented

Hi @AlgorithmsAreCool,

IAggregate does support timestamp-aware aggregations: it passes the timestamp of the current event to the Accumulate/Deaccumulate methods. However, when it is used after a windowing operator such as TumblingWindowLifetime, this timestamp may be modified to the window time rather than the original timestamp. So if you'd like to persist the original timestamp or duration, you could embed those values as part of the payload, e.g.:

inputStream
    .Select((payload, originalTime) => ValueTuple.Create(payload, originalTime))
    .TumblingWindowLifetime(100)
    .Aggregate(...);

or simply include the timestamps as part of the payload during ingress. This would then allow you to access the original timestamp in your IAggregate class implementation.

Hmm, I must have missed this.
But this is just the starting edge time; the duration/end time isn't flowed through this API, correct?

Well, yes and no.

It is absolutely correct that the end time/duration is not directly part of the API. However, with one important exception, it is actually there, hidden in the interface.

The IAggregate<T, S, O> interface has two methods, Accumulate and Deaccumulate, that accept a time argument.

  • For Accumulate, the time argument marks the time at which an element enters consideration for the aggregate. This is almost certainly the same as the "start time" for an element.
  • For Deaccumulate, the time argument marks the time at which an element leaves consideration for the aggregate. This is almost certainly the same as the "end time" for an element.

However, in the particular case you have, the Deaccumulate method won't be much help. That's because of how SessionTimeoutWindow sets event lifetimes. Consider the case of your first group. After the session operation, you will get data that looks like the following:

  • Start edge for 1:00:23PM at 1:00:23PM
  • Start edge for 1:01:15PM at 1:01:15PM
  • Start edge for 1:01:20PM at 1:01:20PM
  • End edges for all three of the above simultaneously at 1:03:20PM, two minutes past the last point (two minutes being your session window timeout).
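
To make the edge layout above concrete, here is a minimal sketch in plain C# that reproduces it for the event times from the question. It does not call Trill and is not how SessionTimeoutWindow is implemented; `SessionSketch` and `Group` are hypothetical names, and treating a gap of exactly the timeout as still belonging to the same session is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: plain C#, only mimics the lifetimes described above.
public static class SessionSketch
{
    // Groups start times (assumed sorted ascending) into sessions.
    // Returns each session's start times plus the end time they all share.
    public static List<(List<TimeSpan> Starts, TimeSpan End)> Group(
        IEnumerable<TimeSpan> sortedStarts, TimeSpan timeout)
    {
        var groups = new List<List<TimeSpan>>();
        foreach (var start in sortedStarts)
        {
            // Assumption: a gap strictly longer than the timeout opens a new session.
            if (groups.Count == 0 || start - groups[groups.Count - 1].Last() > timeout)
            {
                groups.Add(new List<TimeSpan>());
            }
            groups[groups.Count - 1].Add(start);
        }

        // Every event in a session ends together, one timeout after the last start.
        return groups
            .Select(g => (Starts: g, End: g.Last() + timeout))
            .ToList();
    }
}
```

Fed the six times from the question (as times of day) and a two-minute timeout, this yields three sessions, the first of which ends at 13:03:20, matching the end edge listed above.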

The main question I have for you in return is this: how do you want to have the final edge handled in the result? Should it be considered to have the time weight of the session timeout, so two minutes?

Here's roughly what I would do in your situation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;
using Microsoft.StreamProcessing.Aggregates;

// T is the payload type and U the result type; the state maps each start time to its element.
public sealed class SessionTimeWeightedAverage<T, U> : IAggregate<T, Dictionary<long, T>, U>
{
    private long sessionWindowTimeout { get; }

    public SessionTimeWeightedAverage(long timeout) => this.sessionWindowTimeout = timeout;

    public Expression<Func<Dictionary<long, T>>> InitialState() => () => new Dictionary<long, T>();

    public Expression<Func<Dictionary<long, T>, long, T, Dictionary<long, T>>> Accumulate() => (state, time, element) => NewState(state, time, element);

    // Note: Dictionary.Add throws if two elements share the same timestamp.
    private Dictionary<long, T> NewState(Dictionary<long, T> oldState, long time, T element)
    {
        oldState.Add(time, element);
        return oldState;
    }

    // Implementations of Deaccumulate and Difference following the above pattern

    // YourResultComputation is a placeholder for your own logic.
    public Expression<Func<Dictionary<long, T>, U>> ComputeResult() => (state) => YourResultComputation();
}
```

The state manipulation can actually be simpler, depending on the computation you're trying to do. For instance, instead of maintaining a full dictionary, you may be able to keep just the last time value and compute the deltas on each Accumulate call (making the "NewState" method above slightly more complicated but the state simpler). Your YourResultComputation method above would then need to know the session window timeout to determine how the last data element contributes.
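
As a rough illustration of the "keep only the last time value" idea, here is a minimal sketch in plain C#. It is not wired into IAggregate and does not call Trill. `TimeWeightedSketch`, `State`, and the field names are hypothetical. It assumes the payload is a `double`, that elements arrive in time order, that the timeout is positive, and that the last element is weighted by the session timeout, which is only one possible answer to the final-edge question above.

```csharp
using System;

// Hypothetical sketch: running state for a time-weighted average, plain C#.
public static class TimeWeightedSketch
{
    // Holds the last sample and running totals instead of a full dictionary.
    public struct State
    {
        public bool HasValue;
        public long FirstTime;
        public long LastTime;
        public double LastValue;
        public double WeightedSum; // value * time held, for all but the last sample
    }

    // Mirrors the (state, time, element) shape of Accumulate.
    public static State Accumulate(State s, long time, double value)
    {
        if (s.HasValue)
        {
            // The previous value was held from LastTime until now.
            s.WeightedSum += s.LastValue * (time - s.LastTime);
        }
        else
        {
            s.FirstTime = time;
            s.HasValue = true;
        }
        s.LastTime = time;
        s.LastValue = value;
        return s;
    }

    // Assumption: the last sample is held for the full session timeout (> 0).
    public static double ComputeResult(State s, long sessionTimeout)
    {
        if (!s.HasValue) return 0.0;
        double total = s.WeightedSum + s.LastValue * sessionTimeout;
        long duration = (s.LastTime - s.FirstTime) + sessionTimeout;
        return total / duration;
    }
}
```

Deaccumulate and Difference are left out here. In the session case all elements leave at the same end edge, so whether they need real logic depends on how the aggregate is used.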

I hope this helps - please let us know if we can be of more help.

Hmm, I need to play around with this some.

Thank you for your help!

Gonna close this until I can get back around to it.