skrub-data / skrub

Prepping tables for machine learning

Home Page:https://skrub-data.org/

DatetimeEncoder is very slow

MarcoGorelli opened this issue · comments

Describe the bug

Looks like the format is being guessed for every single element, twice (once with day first, once with month first).

np.vectorize doesn't speed things up; it's just syntactic sugar for a Python-level loop.

The code below has just 70 thousand rows, but it takes 14 seconds to execute on my laptop.

Steps/Code to Reproduce

from pprint import pprint
import pandas as pd

data = pd.DataFrame({
    'date.utc': pd.date_range('1900-01-01', '2100-01-01', freq='1D').strftime('%Y-%m-%d'),
    'city': 'Paris',
    'value': 3.,
})
print('data shape: ', data.shape)
# Extract our input data (X) and the target column (y)
y = data["value"]
X = data[["city", "date.utc"]]

X

from skrub import to_datetime

X = to_datetime(X)
X.dtypes

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from skrub import DatetimeEncoder

encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),
    (DatetimeEncoder(add_day_of_the_week=True, resolution="minute"), ["date.utc"]),
    remainder="drop",
)

X_enc = encoder.fit_transform(X)
pprint(encoder.get_feature_names_out())

Expected Results

No more than 1 second, probably 😄

Actual Results

The results are correct, just too slow.

Versions

System:
    python: 3.11.6 (main, Oct 23 2023, 22:48:54) [GCC 11.4.0]
executable: /home/marcogorelli/skrub-dev/.venv/bin/python
   machine: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.3.0
          pip: 23.1.2
   setuptools: 65.5.0
        numpy: 1.25.2
        scipy: 1.11.2
       Cython: None
       pandas: 2.1.1
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /home/marcogorelli/skrub-dev/.venv/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/marcogorelli/skrub-dev/.venv/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-5007b62f.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: SkylakeX

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/marcogorelli/skrub-dev/.venv/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: SkylakeX
0.0.1.dev0

Thank you for spotting this, @MarcoGorelli. Indeed, we haven't benchmarked the computational performance yet.
I think subsampling makes sense here and is a very simple solution, @GaelVaroquaux. I'll open a PR.