databricks / koalas

Koalas: pandas API on Apache Spark

Support datetime for std

itholic opened this issue

pandas supports the datetime64 and datetime64tz dtypes for std as of pandas 1.2 (pandas-dev/pandas#37436).

It returns a Timedelta Series, which Koalas currently cannot support.

>>> pdf = pd.DataFrame(
...     {
...         "A": pd.date_range("2020-01-01", periods=3),
...         "B": pd.date_range("2021-01-01", periods=3),
...     }
... )
>>> kdf = ks.from_pandas(pdf)

>>> pdf.std()
A   1 days
B   1 days
dtype: timedelta64[ns]

>>> kdf.std()
Series([], dtype: float64)
import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame(
    {
        "A": pd.date_range("2020-01-01", periods=3),
        "B": pd.date_range("2021-01-01", periods=3),
    }
)
kdf = ks.from_pandas(pdf)

# Calculate the standard deviation after converting Timedelta to numeric
std_result = kdf.select_dtypes(include=["timedelta"]).apply(lambda x: x.dt.total_seconds()).std()
print(std_result)

>>> std_result = kdf.select_dtypes(include=["timedelta"]).apply(lambda x: x.dt.total_seconds()).std()
>>> print(std_result)
Series([], dtype: float64)

It seems the suggested method still returns an empty Series.
By the way, switching from Koalas to the pandas API on Spark is recommended, since Koalas has been migrated into PySpark.
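
For reference, the suggested snippet comes back empty because `select_dtypes(include=["timedelta"])` matches no columns: both "A" and "B" are datetime64, not timedelta. Below is a minimal sketch of the numeric-conversion idea in plain pandas, where it can be checked against `pdf.std()`. The helper name `datetime_std` is hypothetical, and whether the same steps run unchanged on a Koalas or pandas-on-Spark DataFrame is an assumption that has not been verified here.

```python
import pandas as pd


def datetime_std(pdf: pd.DataFrame) -> pd.Series:
    """Sample standard deviation of each datetime64 column, as Timedelta values.

    Hypothetical helper for illustration only; not part of pandas or Koalas.
    """
    # Select datetime columns (not timedelta columns, which this frame lacks).
    # Note: "datetime" covers tz-naive columns only.
    datetime_cols = pdf.select_dtypes(include=["datetime"])
    result = {}
    for name in datetime_cols.columns:
        col = datetime_cols[name]
        # std is unaffected by shifting every value by the same amount, so
        # measure each timestamp as seconds since the column minimum.
        offsets = (col - col.min()).dt.total_seconds()
        # Convert the numeric std (in seconds) back to a Timedelta.
        result[name] = pd.to_timedelta(offsets.std(), unit="s")
    return pd.Series(result)


pdf = pd.DataFrame(
    {
        "A": pd.date_range("2020-01-01", periods=3),
        "B": pd.date_range("2021-01-01", periods=3),
    }
)
print(datetime_std(pdf))
```

On the example frame from this issue, each column holds three consecutive days, so the result should agree with the `1 days` values that `pdf.std()` reports above.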