microsoft / Trill

Trill is a single-node query processor for temporal or streaming data.

Big and slow-moving reference data

NickDarvey opened this issue

I want to combine my input streams with some non-stream-based reference data which is big (many records) and slow-moving (it changes, but nowhere near the speed of my input streams). How do I best approach this with Trill?

I could think of a couple of ways of approaching it, but neither seemed ideal (and may not even be valid, since I'm new to Trill), so I wanted to ask whether there is a recommended solution for such a scenario.

Scenario

I have an input stream of sensor data and a database which stores the relationship between a particular sensor and the operator who owns it.

I take my sensor data input stream, apply some fancy heuristics and determine outliers. I want to notify the operator who owns the sensor of those outliers, so I need to join these events with that database table somehow.

Potential Solution A: Sensor-owner table to a streamable

Lift the rows from my database table to a streamable and use a Trill join operator.

var owners = db.Query<Owner>("select sensor_id, owner_email_address from sensor_owner")
	.ToObservable()
	.ToStreamable()
	.Cache();

sensorStream
	.DoFancyHeuristics()
	.Join(owners, l => l.SensorId, r => r.SensorId, (l, r) => new {
		l.SensorId,
		l.Measurement,
		r.OwnerEmailAddress
	})
	.SendEmailEtc();

However, what if that sensor_owner table is freaking large? (Or we don't have any viable way of scraping the entire set of relationships?)
What about when there are changes made to it? (We could refresh our streamable cache, but we might not be able to do anything smarter than refreshing all of the rows.)

We could instead defer a (specific) relationship query until after one of these outlier events occurs, which brings me to...

Potential Solution B: Streamable to observable and back again

Leave the realm of streamables and use an Rx operator.

sensorStream
	.DoFancyHeuristics()
	.ToStreamEventObservable()
	.Select(e => db
		.Query<Owner>("select sensor_id, owner_email_address from sensor_owner where sensor_id = $1", e.SensorId)
		.Select(owner => new { e.SensorId, e.Measurement, owner.OwnerEmailAddress })
		.ToObservable())
	.Concat()
	.ToStreamable()
	.SendEmailEtc();

Here we're paying the cost of a network call on each outlier because we don't want to (or can't) pay the cost of storing it all in memory.
Leaving and re-entering the realm of streamables seems a little off intuitively; would it actually be fine in practice? One thing that comes to mind: if I'm using the HA feature, I now have two checkpoints to take, so I no longer have a consistent snapshot of my query state.
(I understand in this scenario I could very well not return to streamables, but in similar scenarios I might want to continue composing operators.)

Other solutions

Leaving Trill

I understand I could leave streamables after I've translated my sensor data into outlier events, but the complexity of the query is likely to evolve in ways that would make me want to remain in the realm of streamables. (e.g. now I want to batch these outliers into one notification to operators.)

Higher-order streamables

I think I could achieve this succinctly if there were support for higher-order streamables in the future, but I also understand that such a thing might not be appropriate for, or compatible with, the design of Trill. (e.g. something like Rx's Merge operator is not as useful when you add the strict temporal semantics of Trill, though still possible I think...)

sensorStream
	.DoFancyHeuristics()
	.Select(e => db
		.Query<Owner>("select sensor_id, owner_email_address from sensor_owner where sensor_id = $1", e.SensorId)
		.Select(owner => new { e.SensorId, e.Measurement, owner.OwnerEmailAddress })
		.ToObservable()
		.ToStreamable())
	.Concat()
	.SendEmailEtc();

My suggestion here is to use one of the overloads of the SelectMany operator.

Here's the logic behind it:

  • As you noted, if your slowly-changing reference dataset is extremely large, keeping it in memory is kind of nuts. It would explode the size of the join synopsis within the Trill engine.
  • SelectMany can be used to do join-like operations, where you take each individual row and return anywhere from zero to many rows as a response.
  • SelectMany has an overload that looks like this: stream.SelectMany((time, payload) => YourMethod(time, payload))
  • This overload "lifts" the time of the payload (more precisely, the start time of the payload, but synonymous with "time" for point events) so you can use it in your query logic.
  • Thus, the method YourMethod would be what you use to look up into the reference data. You would use the "time" parameter to determine where in your slowly changing reference data your payload should be validated; a sketch of this pattern follows below.
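
For concreteness, a minimal sketch of that pattern might look like the following. It assumes a synchronous lookup helper (here called LookupOwners) and reuses the placeholder names from your example (db, Owner, sensorStream, DoFancyHeuristics, SendEmailEtc); SensorReading and Notification are just stand-in types.

// Sketch only: LookupOwners is a hypothetical synchronous lookup into the
// reference data, written as a static method so it can be called from the
// query expression.
static IEnumerable<Notification> LookupOwners(long startTime, SensorReading e)
{
	// startTime is the event's start time; for a slowly changing table it could
	// be used to select the version that was valid at that time. Here we simply
	// query the current state.
	return db
		.Query<Owner>("select sensor_id, owner_email_address from sensor_owner where sensor_id = $1", e.SensorId)
		.Select(owner => new Notification(e.SensorId, e.Measurement, owner.OwnerEmailAddress));
}

sensorStream
	.DoFancyHeuristics()
	.SelectMany((time, payload) => LookupOwners(time, payload))
	.SendEmailEtc();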

Let's start there, and if that doesn't meet your needs, let's try something else. Hope that helps.

You know... I think I started down this path after looking at that SelectMany, so I should have started my issue there too.

SelectMany is almost exactly what I'm looking for. I think I immediately went 'nope' because it expects an IEnumerable<T> in return, and I needed to make a network call to the database and was hoping for a signature like:

IStreamable<TResult> SelectMany<TPayload, TResult>(this IStreamable<TPayload> source, Expression<Func<TPayload, Task<IEnumerable<TResult>>>> selector);
// or
IStreamable<TResult> SelectMany<TPayload, TResult>(this IStreamable<TPayload> source, Expression<Func<TPayload, IObservable<TResult>>> selector);
// or
IStreamable<TResult> SelectMany<TPayload, TResult>(this IStreamable<TPayload> source, Expression<Func<TPayload, IStreamable<TResult>>> selector);

so I wouldn't be 'blocking' on IO.

Thinking this through, it would (depending on the flattening semantics, I guess) be blocking the processing pipeline anyway, so why bother releasing the thread? There wouldn't be other work to do.

In my particular scenario, I am working towards supporting hosting standing queries as defined by the user, so there might actually be a tonne of other work to do. (I haven't tested it so it might not be a big deal at all, but the 'don't do IO-bound operations synchronously' mantra has been beaten into me.)
I also understand that might not be a common use case for Trill so it might not be worth investigating.

Intuitively it feels like having support for higher-order streamables with binding and folding operators (SelectMany, and Merge, Concat, Switch, etc.) would be great for composability, but maybe that's not actually the case with Trill's algebra.

The biggest issue with accepting an IObservable or an IStreamable instead of an IEnumerable is that it interferes with the in-order execution model within the engine. We'd need to ensure that whatever asynchronous callback is used is also coupled with the timestamp, and then do ordering of the returned data, and then also have a mechanism for dealing with late-arriving data, etc.

An alternative implementation of the engine with the same API/algebra could probably do what you suggest, but it would need to be natively out-of-order, with all of the perf/throughput/capability tradeoffs therein.

The Join, Antisemijoin, and Union operators do take multiple IStreamable arguments. We're always open to new operators that cover scenarios we may be missing, though.

How large of a database are you talking about? You could theoretically implement a local cache that does the asynchronous calls to refresh from the database.
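
A rough sketch of that cache idea, assuming a background refresh into an in-memory dictionary (QueryAsync and the timer cadence are just stand-ins, and db, Owner, sensorStream, etc. are the placeholders from your example):

// Sketch only: an in-memory copy of the sensor_owner table, refreshed in the
// background so the Trill pipeline never waits on the network.
static class OwnerCache
{
	static readonly ConcurrentDictionary<long, string> OwnersBySensor =
		new ConcurrentDictionary<long, string>();

	// Call periodically (e.g. from a timer) at whatever cadence the DBA is
	// comfortable with; QueryAsync stands in for the data-access layer.
	public static async Task RefreshAsync()
	{
		var rows = await db.QueryAsync<Owner>(
			"select sensor_id, owner_email_address from sensor_owner");
		foreach (var row in rows)
			OwnersBySensor[row.SensorId] = row.OwnerEmailAddress;
	}

	// Synchronous lookup, cheap enough to call from inside the query.
	public static IEnumerable<string> Lookup(long sensorId)
	{
		return OwnersBySensor.TryGetValue(sensorId, out var email)
			? new[] { email }
			: Enumerable.Empty<string>();
	}
}

// Query side: a dictionary lookup per outlier instead of a network call.
sensorStream
	.DoFancyHeuristics()
	.SelectMany(e => OwnerCache.Lookup(e.SensorId)
		.Select(email => new { e.SensorId, e.Measurement, OwnerEmailAddress = email }))
	.SendEmailEtc();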

The biggest issue with accepting an IObservable or an IStreamable instead of an IEnumerable is that it interferes with the in-order execution model within the engine.

The temporal semantics would be interesting with any flattening operator. From an API perspective you could be required to define a disorder policy for the merging/concatenating/switching (just like you do when building a new stream). I'm guessing this might be hell to implement in the engine, though?

IStreamable<T> Merge<T>(this IStreamable<IStreamable<T>> source, DisorderPolicy policy)
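
Purely hypothetically (no such operator exists today), that would let the earlier higher-order example be written as:

// Hypothetical: Merge here is the proposed operator above, not an existing
// Trill API; everything else reuses the placeholders from the earlier example.
sensorStream
	.DoFancyHeuristics()
	.Select(e => db
		.Query<Owner>("select sensor_id, owner_email_address from sensor_owner where sensor_id = $1", e.SensorId)
		.Select(owner => new { e.SensorId, e.Measurement, owner.OwnerEmailAddress })
		.ToObservable()
		.ToStreamable())
	.Merge(DisorderPolicy.Throw())
	.SendEmailEtc();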

How large of a database are you talking about?

The real-life scenario that inspired this question had a relationship table with ~8 million rows. I could very well keep a local copy, but trying to keep it up to date with frequent refreshes might irritate the DBA.

Thank you for providing some options. I think I'm leaning towards using that SelectMany and just doing a blocking call in this case.