skrub-data / skrub

Prepping tables for machine learning

Home Page:https://skrub-data.org/

DatetimeEncoder is very slow

MarcoGorelli opened this issue · comments

Describe the bug

Looks like the format is being guessed for every single element, twice (once with day first, once with month first).

np.vectorize doesn't speed things up; it's just syntactic sugar for a Python-level loop.

The code below has just 70 thousand rows, but it takes 14 seconds to execute on my laptop.

Steps/Code to Reproduce

from pprint import pprint
import pandas as pd

data = pd.DataFrame({
    'date.utc': pd.date_range('1900-01-01', '2100-01-01', freq='1D').strftime('%Y-%m-%d'),
    'city': 'Paris',
    'value': 3.,
})
print('data shape: ', data.shape)
# Extract our input data (X) and the target column (y)
y = data["value"]
X = data[["city", "date.utc"]]

X

from skrub import to_datetime

X = to_datetime(X)
X.dtypes

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from skrub import DatetimeEncoder

encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),
    (DatetimeEncoder(add_day_of_the_week=True, resolution="minute"), ["date.utc"]),
    remainder="drop",
)

X_enc = encoder.fit_transform(X)
pprint(encoder.get_feature_names_out())

Expected Results

No more than 1 second, probably 😄

Actual Results

The results are correct, just too slow.

Versions

System:
    python: 3.11.6 (main, Oct 23 2023, 22:48:54) [GCC 11.4.0]
executable: /home/marcogorelli/skrub-dev/.venv/bin/python
   machine: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.3.0
          pip: 23.1.2
   setuptools: 65.5.0
        numpy: 1.25.2
        scipy: 1.11.2
       Cython: None
       pandas: 2.1.1
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /home/marcogorelli/skrub-dev/.venv/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/marcogorelli/skrub-dev/.venv/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-5007b62f.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: SkylakeX

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/marcogorelli/skrub-dev/.venv/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: SkylakeX
0.0.1.dev0

Thank you for spotting this, @MarcoGorelli. Indeed, we haven't benchmarked the computational performance yet.
I think subsampling makes sense here and is a very simple solution, @GaelVaroquaux. I'll open a PR.