databricks / koalas

Koalas: pandas API on Apache Spark

Creating Series with existing Int64Index results in error

amueller opened this issue

from databricks import koalas
series = koalas.Series([0, 1, 2])
true_series = koalas.Series(True, index=series.index)

ValueError: The truth value of a Int64Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This is Koalas 1.8.0 and pandas 1.2.4.

true_series = koalas.Series(True, index=series.index.to_pandas())

works.

Thanks :)

Thanks for the report, @amueller .

As you mentioned in the description, Koalas doesn't allow creating a Series with a Koalas Index.

When creating a Koalas Series, a pandas DataFrame is needed to create the InternalFrame.

So, if Koalas were to allow creating a Series with a Koalas Index, it would have to call to_pandas() internally, which is dangerous because it moves all of the distributed data onto a single node. (Yes, just as you did explicitly in your code.)

For now, we recommend calling to_pandas() explicitly, as you did in your code, and only when you're sure that your data is small enough.
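That advice can be sketched as a small guard that checks the index length before collecting it. This is only an illustration: `to_local_index` and `max_rows` are hypothetical names, not part of Koalas, and the threshold is arbitrary. The only things assumed of the index are that it supports `len()` and `to_pandas()`.

```python
def to_local_index(index, max_rows=100_000):
    """Collect a distributed index to pandas only if it is small enough.

    Hypothetical helper, not a Koalas API. `index` is assumed to support
    len() and to_pandas(); `max_rows` is an arbitrary threshold that should
    be tuned to the memory available on the driver.
    """
    n = len(index)  # for a distributed index this may trigger a count job
    if n > max_rows:
        raise ValueError(
            f"index has {n} rows, above max_rows={max_rows}; refusing to collect"
        )
    return index.to_pandas()  # pulls every index value onto a single node


# Usage with Koalas (needs a running Spark session, so left as a comment):
# from databricks import koalas
# series = koalas.Series([0, 1, 2])
# true_series = koalas.Series(True, index=to_local_index(series.index))
```

The guard does not remove the single-node limitation. It only makes the size assumption explicit, so an unexpectedly large index fails early instead of exhausting driver memory.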

You can find more detail about the Koalas internals in the Koalas internal documentation.

Oh, anyway, Koalas will be ported into PySpark as of Spark 3.2, so this repository is now in maintenance mode only.

I'd recommend using the pandas module in PySpark after the Spark 3.2 release.

You can find more details in SPIP: Support pandas API layer on PySpark.

Thanks for the explanation! It would be great to allow using Koalas indexes. I don't see how to do this now if the index is large. Anyway, closing this since the repository is in maintenance mode.