Window Operations
JotaBlanco opened this issue
Is your feature request related to a problem? Please describe.
When doing streaming processing, I find myself needing to keep windows of data in memory very, VERY often. For example, if I want to calculate the average speed over the last 5 minutes, I have to keep the last 5 minutes of speed data (speed + timestamp) in memory and update that window on every new message (both appending the new data and trimming out the data that is now older than 5 minutes). I do this so often that I've ended up creating a Python class that I keep reusing in my projects. This is fine, but I'm sure you guys can do it better :)
Describe the solution you'd like
I'd love the SDK to have some window functionality, maybe as a TimeseriesData method. As a first feature, I propose that the window is created with:
- Parameter: the parameter you want to create a window on (speed in my initial example)
- Window size: either in number of messages or time period
In later iterations, if these windows allowed several parameters, they could be used to perform enriching operations (like table joins in batch processing). However, interpolation of nulls and other issues would have to be solved first.
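To make the single-parameter, time-sized window idea concrete, here is a minimal plain-Python sketch (class and method names are made up for illustration, not an actual SDK API). It keeps (timestamp, value) pairs in a deque, trims anything older than the window on every append, and can compute an aggregate like the mean:

```python
from collections import deque


class RollingWindow:
    """Keep (timestamp, value) pairs no older than `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.items = deque()  # (timestamp, value), oldest first

    def append(self, timestamp: float, value: float):
        self.items.append((timestamp, value))
        # Trim entries that have fallen out of the window
        cutoff = timestamp - self.window_seconds
        while self.items and self.items[0][0] <= cutoff:
            self.items.popleft()

    def mean(self) -> float:
        return sum(v for _, v in self.items) / len(self.items)


# 5-minute speed window, timestamps in seconds
window = RollingWindow(window_seconds=300)
window.append(0.0, 10.0)
window.append(100.0, 20.0)
window.append(350.0, 30.0)  # pushes the t=0 sample out of the window
```

This is essentially what my pandas class below does, minus the interpolation and the DataFrame merge.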
Describe alternatives you've considered
I was trying to think whether it was worth doing this using Kafka pointers and some of its characteristics (segments, etc.), but it probably isn't. Even if there were a way to create a rolling window with pointers, you would be querying a big chunk of data every time. Doing it from the client side seems a better idea.
Additional context
I can share my current Python class for reference:
```python
import pandas as pd


class Window:
    def __init__(self, app_abv: str, window_nanoseconds: int,
                 cat_col: list, num_col: list):
        self.app_abv = app_abv
        self.df_window = pd.DataFrame()
        self.window_size = window_nanoseconds
        self.cat_col = cat_col  # categorical columns, filled by forward-fill
        self.num_col = num_col  # numeric columns, filled by interpolation

    def calculate(self, df: pd.DataFrame):
        # Add new data to appliance window
        self._update_mini_df_window(df)
        # Here mathematical operations are done with the updated window,
        # producing "new_col"
        # Fill df with new values
        df = pd.merge(df, self.df_window[["time", "new_col"]],
                      how="left", on=["time"])
        return df

    def _update_mini_df_window(self, df: pd.DataFrame):
        # If df_window is empty, start it with df
        if self.df_window.empty:
            self.df_window = df
        # If df_window has more recent data than what we are getting in
        # real time, something got messed up, maybe with replays :S.
        # Erase any window data newer than the newly received df
        elif self.df_window["time"].iloc[-1] > df["time"].iloc[-1]:
            self.df_window = self.df_window[
                self.df_window["time"] <= df["time"].iloc[-1]]
            self.df_window = pd.concat([self.df_window, df]).drop_duplicates()
            self.df_window = (self.df_window.sort_values("time")
                              .reset_index(drop=True))
            # Interpolate to fill nulls in key columns
            self.df_window[self.cat_col] = self.df_window[self.cat_col].ffill()
            self.df_window[self.num_col] = (
                self.df_window.set_index("time")[self.num_col]
                .interpolate("index").values)
        else:
            # Else, concat data and ensure chronological order
            self.df_window = pd.concat([self.df_window, df]).drop_duplicates()
            self.df_window = (self.df_window.sort_values("time")
                              .reset_index(drop=True))
            # Interpolate to fill nulls in key columns
            self.df_window[self.cat_col] = self.df_window[self.cat_col].ffill()
            self.df_window[self.num_col] = (
                self.df_window.set_index("time")[self.num_col]
                .interpolate("index").values)

        # TRIM WINDOW
        # (Disabled size-based fallback:)
        # bytes_in_a_GB = 1073741824
        # # If df_window is getting big, trim it to the last 1000 rows
        # if self.df_window.memory_usage(deep=True).sum() / bytes_in_a_GB > 0.25:
        #     self.df_window = self.df_window.iloc[-1000:]
        #     # If it is still bigger than 0.25 GB, restart it
        #     if self.df_window.memory_usage(deep=True).sum() / bytes_in_a_GB > 0.25:
        #         self.df_window = self.df_window.iloc[[-1]]

        # Keep only the rows inside the time window
        self.df_window = self.df_window[
            self.df_window["time"]
            > (self.df_window["time"].iloc[-1] - self.window_size)]
        self.df_window["TEMP_timeDelta"] = (
            self.df_window["time"] - self.df_window["time"].shift(1))
```
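Worth noting for comparison: on batch/offline data, pandas already supports time-based rolling windows via `DataFrame.rolling` with a time offset on a datetime index, which handles the trimming automatically. A small sketch with a hypothetical 5-minute speed window:

```python
import pandas as pd

# Speed samples with timestamps (example data)
df = pd.DataFrame(
    {"speed": [10.0, 20.0, 30.0, 40.0]},
    index=pd.to_datetime(
        ["2023-01-01 00:00", "2023-01-01 00:02",
         "2023-01-01 00:04", "2023-01-01 00:08"]
    ),
)

# Rolling mean over the last 5 minutes; the window is trimmed by pandas
df["speed_5min_avg"] = df["speed"].rolling("5min").mean()
```

What's missing for the streaming case is exactly the incremental update on every new message, which is what this feature request is about.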