Window Operations
JotaBlanco opened this issue
Is your feature request related to a problem? Please describe.
When doing streaming processing, I find myself needing to keep windows of data in memory very, VERY often. For example, if I want to calculate the average speed over the last 5 minutes, I have to keep the last 5 minutes of speed data (speed + timestamp) in memory and update that window on every new message (both appending the new data and trimming out the data that is now older than 5 minutes). I do this so often that I've ended up creating a Python class that I keep reusing in my projects. This is fine, but I'm sure you guys can do it better :)
Describe the solution you'd like
I'd love the SDK to have some window functionality, maybe as a TimeseriesData method. As a first feature, I propose that the window is created with:
- Parameter: the parameter you want to create a window on (speed in my initial example)
- Window size: either in number of messages or time period
In later iterations, if these windows allowed several parameters, they could be used to perform enriching operations (like table joins in batch processing). However, interpolation of nulls and other issues would have to be solved first.
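To make the single-parameter, time-sized window idea concrete, here is a minimal plain-Python sketch (class and method names are made up for illustration, not an actual SDK API). It keeps (timestamp, value) pairs in a deque, trims anything older than the window on every append, and can compute an aggregate like the mean:

```python
from collections import deque


class RollingWindow:
    """Keep (timestamp, value) pairs no older than `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.items = deque()  # (timestamp, value), oldest first

    def append(self, timestamp: float, value: float):
        self.items.append((timestamp, value))
        # Trim entries that have fallen out of the window
        cutoff = timestamp - self.window_seconds
        while self.items and self.items[0][0] <= cutoff:
            self.items.popleft()

    def mean(self) -> float:
        return sum(v for _, v in self.items) / len(self.items)


# 5-minute speed window, timestamps in seconds
window = RollingWindow(window_seconds=300)
window.append(0.0, 10.0)
window.append(100.0, 20.0)
window.append(350.0, 30.0)  # pushes the t=0 sample out of the window
```

This is essentially what my pandas class below does, minus the interpolation and the DataFrame merge.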
Describe alternatives you've considered
I was trying to think whether it was worth doing this using Kafka pointers and some of its characteristics (segments, etc.), but it probably isn't. Even if there were a way to create a rolling window with pointers, you would be querying a big chunk of data every time. Doing it from the client side seems a better idea.
Additional context
I can share my current Python class for reference:
```python
import pandas as pd


class Window:
    def __init__(self, app_abv: str, window_nanoseconds: int,
                 cat_col: list, num_col: list):
        self.app_abv = app_abv
        self.df_window = pd.DataFrame()
        self.window_size = window_nanoseconds
        self.cat_col = cat_col  # categorical columns, filled by forward-fill
        self.num_col = num_col  # numeric columns, filled by interpolation

    def calculate(self, df: pd.DataFrame):
        # Add new data to appliance window
        self._update_mini_df_window(df)
        # Here mathematical operations are done with the updated window,
        # producing "new_col"
        # Fill df with new values
        df = pd.merge(df, self.df_window[["time", "new_col"]],
                      how="left", on=["time"])
        return df

    def _update_mini_df_window(self, df: pd.DataFrame):
        # If df_window is empty, start it with df
        if self.df_window.empty:
            self.df_window = df
        # If df_window has more recent data than what we are getting in
        # real time, something got messed up, maybe with replays :S.
        # Erase any window data newer than the newly received df
        elif self.df_window["time"].iloc[-1] > df["time"].iloc[-1]:
            self.df_window = self.df_window[
                self.df_window["time"] <= df["time"].iloc[-1]]
            self.df_window = pd.concat([self.df_window, df]).drop_duplicates()
            self.df_window = (self.df_window.sort_values("time")
                              .reset_index(drop=True))
            # Interpolate to fill nulls in key columns
            self.df_window[self.cat_col] = self.df_window[self.cat_col].ffill()
            self.df_window[self.num_col] = (
                self.df_window.set_index("time")[self.num_col]
                .interpolate("index").values)
        else:
            # Else, concat data and ensure chronological order
            self.df_window = pd.concat([self.df_window, df]).drop_duplicates()
            self.df_window = (self.df_window.sort_values("time")
                              .reset_index(drop=True))
            # Interpolate to fill nulls in key columns
            self.df_window[self.cat_col] = self.df_window[self.cat_col].ffill()
            self.df_window[self.num_col] = (
                self.df_window.set_index("time")[self.num_col]
                .interpolate("index").values)

        # TRIM WINDOW
        # (Disabled size-based fallback:)
        # bytes_in_a_GB = 1073741824
        # # If df_window is getting big, trim it to the last 1000 rows
        # if self.df_window.memory_usage(deep=True).sum() / bytes_in_a_GB > 0.25:
        #     self.df_window = self.df_window.iloc[-1000:]
        #     # If it is still bigger than 0.25 GB, restart it
        #     if self.df_window.memory_usage(deep=True).sum() / bytes_in_a_GB > 0.25:
        #         self.df_window = self.df_window.iloc[[-1]]

        # Keep only the rows inside the time window
        self.df_window = self.df_window[
            self.df_window["time"]
            > (self.df_window["time"].iloc[-1] - self.window_size)]
        self.df_window["TEMP_timeDelta"] = (
            self.df_window["time"] - self.df_window["time"].shift(1))
```
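Worth noting for comparison: on batch/offline data, pandas already supports time-based rolling windows via `DataFrame.rolling` with a time offset on a datetime index, which handles the trimming automatically. A small sketch with a hypothetical 5-minute speed window:

```python
import pandas as pd

# Speed samples with timestamps (example data)
df = pd.DataFrame(
    {"speed": [10.0, 20.0, 30.0, 40.0]},
    index=pd.to_datetime(
        ["2023-01-01 00:00", "2023-01-01 00:02",
         "2023-01-01 00:04", "2023-01-01 00:08"]
    ),
)

# Rolling mean over the last 5 minutes; the window is trimmed by pandas
df["speed_5min_avg"] = df["speed"].rolling("5min").mean()
```

What's missing for the streaming case is exactly the incremental update on every new message, which is what this feature request is about.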