BUG DatetimeEncoder fails when a column mixes formats

jeromedockes opened this issue 10 months ago

Describe the bug

The example shown in the DatetimeEncoder's docstring raises an exception.

Steps/Code to Reproduce

from skrub import DatetimeEncoder

enc = DatetimeEncoder()
X = [['2022-10-15'], ['2021-12-25'], ['2020-05-18'], ['2019-10-15 12:00:00']]

enc.fit(X)
enc.transform(X)

Expected Results

The example runs without raising an exception.

Actual Results

ValueError: unconverted data remains when parsing with format "%Y-%m-%d": " 12:00:00", at position 3. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
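
The last two hints in the message can be tried directly on the values from the reproducer. The sketch below calls pandas itself rather than skrub, assumes pandas 2.0 or later (where `format='mixed'` and `format='ISO8601'` are available), and uses hypothetical helper names `parse_per_element` and `parse_iso8601` purely for illustration. It is not a proposed fix for the encoder.

```python
import pandas as pd

# Same values as the reproducer, flattened into a single column.
values = ["2022-10-15", "2021-12-25", "2020-05-18", "2019-10-15 12:00:00"]


def parse_per_element(strings):
    # Hypothetical helper: format="mixed" infers the format of each element
    # individually instead of reusing the format guessed from the first one.
    return pd.to_datetime(strings, format="mixed")


def parse_iso8601(strings):
    # Hypothetical helper: format="ISO8601" accepts ISO 8601 strings that do
    # not all share exactly the same layout (date only, date and time, ...).
    return pd.to_datetime(strings, format="ISO8601")


mixed_result = parse_per_element(values)
iso_result = parse_iso8601(values)
print(mixed_result)
print(iso_result)
```

Of the two, `format='ISO8601'` is the stricter choice, since `format='mixed'` will also accept non-ISO layouts and may need `dayfirst` to disambiguate them.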

Versions

System:
    python: 3.11.5 (main, Aug 25 2023, 13:19:50) [GCC 11.4.0]
executable: /home/jerome/.virtualenvs/df/bin/python
   machine: Linux-6.2.0-32-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.3.0
          pip: 23.2.1
   setuptools: 65.5.0
        numpy: 1.25.2
        scipy: 1.11.2
       Cython: None
       pandas: 2.1.0
   matplotlib: 3.7.3
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 4
         prefix: libgomp
       filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-5007b62f.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Prescott

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Prescott
0.0.1.dev0

Jérôme Dockès · Answer 1 · Wed Sep 20 2023 23:13:13 GMT+0800 (China Standard Time)

(It also fails on columns that have a single format but differing timezones.)
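
For the timezone case, one option at the pandas level is to convert everything to UTC while parsing. This is only a sketch: the input strings are made up for illustration, `parse_to_utc` is a hypothetical name, and whether the encoder should normalize to UTC at all is a design decision that this thread leaves open.

```python
import pandas as pd

# Made-up values: one consistent format, two different UTC offsets.
values = ["2022-10-15 12:00:00+02:00", "2022-10-15 12:00:00-05:00"]


def parse_to_utc(strings):
    # Hypothetical helper: utc=True converts every timestamp to UTC, so the
    # result has a single timezone-aware dtype.
    return pd.to_datetime(strings, utc=True)


utc_result = parse_to_utc(values)
print(utc_result)
```

Converting to UTC discards the original offsets, so any feature that depends on local time (hour of day, for instance) would be computed in UTC instead.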

LeoGrin · Answer 2 · Fri Sep 22 2023 21:26:49 GMT+0800 (China Standard Time)

This seems to be solved accidentally by #743 (though not the differing-timezones error): we no longer use pd.to_datetime, and pd.DatetimeIndex seems to be less strict. But I wonder whether we should actually use pd.to_datetime and be strict, to prevent bad surprises.

Taking a step back, maybe we should add some conversion logic to the DatetimeEncoder. I think it was originally designed as part of the TableVectorizer, and was thus meant to deal with date series that had already been converted. It does work with raw date strings, but this part can probably be improved, most likely by sharing the conversion logic between the TableVectorizer and the DatetimeEncoder. WDYT?
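
To make the idea of shared, strict conversion logic concrete, here is a rough sketch. `convert_datetime_column` is a hypothetical function, not part of skrub, and restricting the input to ISO 8601 strings is an assumption made for the example (it needs pandas 2.0 or later for `format="ISO8601"`).

```python
import pandas as pd


def convert_datetime_column(values):
    """Hypothetical shared helper: strict conversion of one column.

    Already-converted datetime series pass through unchanged; strings must
    all be ISO 8601, otherwise a ValueError is raised.
    """
    series = values if isinstance(values, pd.Series) else pd.Series(values)
    if pd.api.types.is_datetime64_any_dtype(series):
        # Already converted upstream, nothing to do.
        return series
    try:
        # Strict: any element that is not ISO 8601 makes the call fail.
        return pd.to_datetime(series, format="ISO8601")
    except ValueError as exc:
        raise ValueError(f"column is not a valid datetime column: {exc}") from exc


converted = convert_datetime_column(["2022-10-15", "2019-10-15 12:00:00"])
print(converted)
```

Both the TableVectorizer and the DatetimeEncoder could call one helper of this kind, so that raw strings and pre-converted series are handled the same way in both places.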

Jérôme Dockès · Answer 3 · Tue Sep 26 2023 21:27:56 GMT+0800 (China Standard Time)

Great! For sure we can think more about handling columns that contain messy date strings, the respective responsibilities of the DatetimeEncoder and the TableVectorizer, and what support, if any, we want for timezones. Still, as the example now runs without problems, I think we can close this issue once #743 is merged.