Creating Series with existing Int64Index results in error

Question

amueller opened this issue 3 years ago

from databricks import koalas
series = koalas.Series([0, 1, 2])
true_series = koalas.Series(True, index=series.index)

ValueError: The truth value of a Int64Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This is with Koalas 1.8.0 and pandas 1.2.4.

true_series = koalas.Series(True, index=series.index.to_pandas())

works.

Thanks :)

Haejoon Lee · Answer 1 · Fri Jun 11 2021 09:45:12 GMT+0800 (China Standard Time)

Thanks for the report, @amueller.

As you mentioned in the description, Koalas doesn't allow creating a Series with a Koalas Index.

When creating a Koalas Series, a pandas DataFrame is needed to build the InternalFrame.

So, if Koalas were to allow creating a Series with a Koalas Index, we would have to call to_pandas() internally, which is dangerous because it moves all of the distributed data onto a single node. (Yes, just like you did explicitly in your code.)

For now, we recommend calling to_pandas() explicitly, as you did in your code, and only when you're sure that your data is small enough.
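To make the "only when the data is small enough" condition explicit, here is a minimal sketch of a guard around to_pandas(). The helper name index_to_pandas_if_small and the MAX_ROWS threshold are hypothetical, not part of Koalas; the sketch only assumes the index object supports len() and to_pandas().

```python
# Hypothetical threshold; choose one that fits the memory of the driver node.
MAX_ROWS = 100_000


def index_to_pandas_if_small(index, max_rows=MAX_ROWS):
    """Return index.to_pandas(), refusing when the index exceeds max_rows.

    Assumes `index` supports len() and to_pandas(). Note that on a
    distributed index, len() itself may need to run a job to count rows.
    """
    n_rows = len(index)
    if n_rows > max_rows:
        raise ValueError(
            f"index has {n_rows} rows, more than the limit of {max_rows}; "
            "not collecting it onto a single node"
        )
    return index.to_pandas()


# Intended use with the code from the question (requires Koalas and Spark):
#   true_series = koalas.Series(True, index=index_to_pandas_if_small(series.index))
```

The guard doesn't lift the limitation; it only turns an accidental collect of a large index into an early, readable error.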

You can find more detail about the Koalas internals in the Koalas internal document.

Haejoon Lee · Answer 2 · Fri Jun 11 2021 09:48:40 GMT+0800 (China Standard Time)

Oh, anyway, Koalas will be ported into PySpark starting with Spark 3.2, so this repository is now in maintenance mode only.

I'd recommend using the pandas module in PySpark after the Spark 3.2 release.

You can find more details in "SPIP: Support pandas API layer on PySpark".

Andreas Mueller · Answer 3 · Sat Jun 12 2021 01:15:48 GMT+0800 (China Standard Time)

Thanks for the explanation! It would be great to allow using Koalas indexes. I don't see how to do it now if the index is large. Anyway, closing here since the repository is in maintenance mode.