nubank / fklearn

fklearn: Functional Machine Learning

space_time_split_dataset slowness

tatasz opened this issue

def space_time_split_dataset(dataset: pd.DataFrame,

I've been having slowness issues with space_time_split_dataset. It takes about 30-60 minutes to split a dataset of 0.5-1 million rows, which seems excessive (I know that, done manually, the same split takes only a couple of minutes).

Code:

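```python
from fklearn.preprocessing.splitting import space_time_split_dataset

# TRAIN_START_DATE is used below, but its definition is not shown in this snippet.
TRAIN_END_DATE = '2019-04-01'
HOLDOUT_END_DATE = '2019-07-01'

split_fn = space_time_split_dataset(train_start_date=TRAIN_START_DATE,
                                    train_end_date=TRAIN_END_DATE,
                                    holdout_end_date=HOLDOUT_END_DATE,
                                    space_holdout_percentage=.5,
                                    split_seed=42,
                                    space_column="id",
                                    time_column='time')

# split_fn returns four frames; the third one is not used here.
df_train, df_test, _, df_holdout = split_fn(data)
df_train.shape, df_test.shape, df_holdout.shape
```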
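
For comparison, the "manual" split mentioned above could look roughly like the sketch below, which uses only pandas boolean masks. This is not fklearn's implementation: the helper name `manual_space_time_split` is hypothetical, the half-open date intervals are an assumption, and so is the choice to treat every id outside the sampled holdout ids as "in space" (including ids that first appear in the holdout period).

```python
import numpy as np
import pandas as pd


def manual_space_time_split(df, train_start, train_end, holdout_end,
                            space_holdout_percentage, seed,
                            space_column, time_column):
    """Hypothetical helper: space-time split with vectorized boolean masks."""
    time = pd.to_datetime(df[time_column])

    # Assumption: half-open intervals [train_start, train_end) and
    # [train_end, holdout_end).
    in_train_time = (time >= pd.Timestamp(train_start)) & (time < pd.Timestamp(train_end))
    in_holdout_time = (time >= pd.Timestamp(train_end)) & (time < pd.Timestamp(holdout_end))

    # Sample a fraction of the ids seen in the training period as space holdout.
    rng = np.random.RandomState(seed)
    ids = df.loc[in_train_time, space_column].unique()
    n_holdout = int(len(ids) * space_holdout_percentage)
    holdout_ids = rng.choice(ids, size=n_holdout, replace=False)
    in_holdout_space = df[space_column].isin(holdout_ids)

    # Combine the time and space masks into the four partitions.
    train = df[in_train_time & ~in_holdout_space]
    intime_outspace = df[in_train_time & in_holdout_space]
    outtime_inspace = df[in_holdout_time & ~in_holdout_space]
    outtime_outspace = df[in_holdout_time & in_holdout_space]
    return train, intime_outspace, outtime_inspace, outtime_outspace
```

Each step here is a single vectorized pass over the frame, so the cost should grow roughly linearly with the number of rows. Calling it with the same dates, seed, and column names as the snippet above gives a baseline to time against `split_fn(data)`.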