databricks / koalas

Koalas: pandas API on Apache Spark

select_dtypes with np.number returning []

Ben-Epstein opened this issue

Koalas DataFrames don't seem to recognize np.number as a dtype that columns can match when using select_dtypes.

Recreate:

from pyspark.sql import SparkSession
from databricks import koalas as ks
spark = SparkSession.builder.getOrCreate()

import pandas as pd
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
iris_frame = ks.DataFrame(iris.data, columns = iris.feature_names)

iris_frame.to_pandas().select_dtypes([np.number]).columns
>>> Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

iris_frame.select_dtypes([np.number]).columns
>>> Index([], dtype='object')

When comparing directly, it works as expected:

for dt in iris_frame[:100].dtypes:
    print(dt == np.number, dt)

>>> True float64
>>> True float64
>>> True float64
>>> True float64

Versions:

pandas==1.0.3
koalas==1.6.0
pyarrow==0.16.0

Any ideas or workarounds?

Thanks!

@Ben-Epstein, sorry for the late reply, and thanks for reporting this issue.

Seems like np.number is treated as float64 in pandas.

>>> pd.Series([1, 2, 3], dtype=np.number).dtype
dtype('float64')

But pandas' infer_dtype_from_object(np.number) doesn't return <class 'numpy.float64'>; it returns <class 'numpy.number'>:

>>> from pandas.core.dtypes.cast import infer_dtype_from_object  # pandas-internal helper (pandas 1.x)
>>> infer_dtype_from_object(np.number)
<class 'numpy.number'>

We should fix this. In the meantime, I think you can work around it by using np.float64 directly, as below.

>>> iris_frame.select_dtypes([np.float64]).columns
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

@itholic thanks, that's what I ended up doing. I listed out all of the available numeric classes that fall under np.number and passed those in.
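
For reference, a minimal sketch of that workaround, assuming the iris_frame from the reproduction above (the exact dtype list is an assumption; adjust it to whichever numeric types your columns actually use):

import numpy as np

# Sketch of the workaround described above (assumed dtype list):
# enumerate the concrete numeric dtypes that fall under np.number and
# pass them to select_dtypes explicitly, since koalas 1.6.0 does not
# expand np.number itself. Spark has no unsigned integer types, so only
# signed ints and floats are listed here.
NUMERIC_DTYPES = [np.int8, np.int16, np.int32, np.int64,
                  np.float32, np.float64]

numeric_cols = iris_frame.select_dtypes(NUMERIC_DTYPES).columns

For the iris frame this should return the same four columns as the np.float64 call above.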