BUG DatetimeEncoder fails when a column mixes formats
jeromedockes opened this issue
Describe the bug
The example shown in the DatetimeEncoder docstring raises an exception.
Steps/Code to Reproduce
from skrub import DatetimeEncoder
enc = DatetimeEncoder()
X = [['2022-10-15'], ['2021-12-25'], ['2020-05-18'], ['2019-10-15 12:00:00']]
enc.fit(X)
enc.transform(X)
Expected Results
The example runs without raising an exception.
Actual Results
ValueError: unconverted data remains when parsing with format "%Y-%m-%d": " 12:00:00", at position 3. You might want to try:
- passing `format` if your strings have a consistent format;
- passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
- passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
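The error message's own suggestion can be checked directly in pandas, outside skrub. The snippet below is a minimal sketch, assuming pandas 2.0 or later (where `format="mixed"` is available); the variable names are illustrative and not taken from skrub.

```python
import pandas as pd

# Same values as in the reproduction above, flattened to one column.
values = ["2022-10-15", "2021-12-25", "2020-05-18", "2019-10-15 12:00:00"]

# By default the format is inferred from the first element and then
# applied strictly, so the last value is expected to be rejected.
try:
    pd.to_datetime(values)
    strict_failed = False
except ValueError:
    strict_failed = True

# With format="mixed", the format is inferred for each element individually.
parsed = pd.to_datetime(values, format="mixed")
print(strict_failed)
print(parsed)
```

Per-element inference is more permissive, so it can also accept strings that a strict parse would have flagged as inconsistent.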
Versions
System:
python: 3.11.5 (main, Aug 25 2023, 13:19:50) [GCC 11.4.0]
executable: /home/jerome/.virtualenvs/df/bin/python
machine: Linux-6.2.0-32-generic-x86_64-with-glibc2.35
Python dependencies:
sklearn: 1.3.0
pip: 23.2.1
setuptools: 65.5.0
numpy: 1.25.2
scipy: 1.11.2
Cython: None
pandas: 2.1.0
matplotlib: 3.7.3
joblib: 1.3.2
threadpoolctl: 3.2.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
num_threads: 4
prefix: libgomp
filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None
user_api: blas
internal_api: openblas
num_threads: 4
prefix: libopenblas
filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-5007b62f.3.23.dev.so
version: 0.3.23.dev
threading_layer: pthreads
architecture: Prescott
user_api: blas
internal_api: openblas
num_threads: 4
prefix: libopenblas
filepath: /home/jerome/.virtualenvs/df/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
version: 0.3.21.dev
threading_layer: pthreads
architecture: Prescott
0.0.1.dev0
(also fails with columns that have a single format but diverse timezones)
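For the timezone case, one option pandas offers is to convert everything to UTC while parsing. This is only a sketch of that pandas behavior with made-up values; it is an assumption that this would be an acceptable normalization for the DatetimeEncoder, and the thread does not settle that.

```python
import pandas as pd

# Illustrative values: a single string format, but two different UTC offsets.
values = ["2022-10-15 12:00:00+01:00", "2022-10-15 12:00:00-05:00"]

# utc=True converts every parsed timestamp to UTC, so the result has a
# single timezone-aware dtype instead of mixed offsets.
as_utc = pd.to_datetime(values, utc=True)
print(as_utc)
```

The original offsets are lost after this conversion, so any feature that depends on local time (such as hour of day) would be computed in UTC.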
This seems to be accidentally solved by #743 (though the diverse-timezones error is not), because we no longer use pd.to_datetime, and pd.DatetimeIndex seems to be less strict. But I wonder whether we should actually use pd.to_datetime and be strict, to prevent bad surprises. Taking a step back, maybe we should add some conversion logic to the DatetimeEncoder. I think it was originally designed as a part of the TableVectorizer, and thus meant to deal with already-converted date series. It does work with raw date strings, but this part can probably be improved, for instance by sharing the conversion logic between the TableVectorizer and the DatetimeEncoder. WDYT?
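To make the "shared conversion logic" idea concrete, here is a rough sketch of what a common helper might look like. The function `parse_datetime_column` and its `fmt` parameter are hypothetical and do not exist in skrub; the sketch assumes pandas 2.0 or later for `format="mixed"`.

```python
import pandas as pd


def parse_datetime_column(values, fmt="mixed"):
    """Hypothetical helper that both estimators could call (not skrub API)."""
    series = pd.Series(values)
    # Already-converted columns (the TableVectorizer case) pass through as-is.
    if pd.api.types.is_datetime64_any_dtype(series):
        return series
    # Raw strings (the standalone DatetimeEncoder case) are parsed with
    # pd.to_datetime, which raises ValueError on values it cannot parse.
    return pd.to_datetime(series, format=fmt)


raw = parse_datetime_column(["2022-10-15", "2019-10-15 12:00:00"])
already_converted = parse_datetime_column(raw)
print(raw.dt.year.tolist())
```

Passing an explicit format string through `fmt` instead of `"mixed"` would give the stricter behavior discussed above, where a column that deviates from the expected format raises rather than being silently accepted.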
Great! For sure, we can think more about handling columns containing messy date strings, the respective responsibilities of the DatetimeEncoder and TableVectorizer, and what support, if any, we want for timezones. Still, as the example now runs without problems, I think we can close this issue once #743 is merged.