databricks / koalas

Koalas: pandas API on Apache Spark

Koalas vs Pandas

psaraogi24 opened this issue

Hi, I recently started switching from pandas DataFrames to Koalas DataFrames.
But when I measured the execution time, I found that Koalas takes almost 6x as long as pandas.

I think I am missing something here. Can I get some help?

Could I also get some sample functions where Koalas would perform better than pandas?

Are you doing any type of sorting or ranking? Some of these operations can take longer because they run across multiple partitions. A complex execution plan is another cause of slowdowns. Check out this best practices page for some examples:
https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html
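
To see whether a sort is adding a cross-partition step, one option is to print the Spark plan before running anything. A minimal sketch, assuming `kdf` is a Koalas DataFrame; the helper name `explain_sorted` is made up for illustration:

```python
def explain_sorted(kdf, column):
    """Print the Spark plan for sorting `kdf` by `column`.

    `kdf` is assumed to be a Koalas DataFrame
    (databricks.koalas.DataFrame); this helper is hypothetical.
    """
    sorted_kdf = kdf.sort_values(by=column)
    # spark.explain() only prints the plan; it does not run the job.
    sorted_kdf.spark.explain()
    return sorted_kdf
```

For a sort, the printed plan will usually include an exchange (shuffle) step, which is the kind of work pandas never has to do on a single machine.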

Thanks for trying Koalas :-)
It's hard to say simply that Koalas is faster or slower than pandas for a specific function.
The performance depends on many factors, such as the amount of data, the size of the cluster, and how you are using the functions in context, as @stepanlavrinenkoteck001 mentioned.
For example, the same function can perform differently depending on the amount of data.
In general, pandas is faster than Koalas when the data is small enough to fit on a single machine.
If you want a more detailed answer, could you share an example where Koalas is 6x slower?
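
For anyone putting together such an example, here is a rough sketch of one way to time the same operation in both libraries. The helper names `time_call` and `compare`, the column names, and the choice of a group-by sum are all made up for illustration. `compare` needs pandas, Koalas, and a working Spark installation.

```python
import time
from statistics import median


def time_call(fn, repeats=5):
    """Median wall-clock seconds for fn() over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def compare(n_rows, repeats=5):
    """Time the same group-by sum in pandas and Koalas for n_rows rows."""
    import pandas as pd
    import databricks.koalas as ks

    pdf = pd.DataFrame({
        "key": [i % 100 for i in range(n_rows)],
        "value": list(range(n_rows)),
    })
    kdf = ks.from_pandas(pdf)

    pandas_s = time_call(lambda: pdf.groupby("key")["value"].sum(), repeats)
    # Koalas builds a Spark plan lazily; to_pandas() forces it to execute,
    # so the timing covers the actual computation.
    koalas_s = time_call(
        lambda: kdf.groupby("key")["value"].sum().to_pandas(), repeats
    )
    return pandas_s, koalas_s
```

Calling `compare` with a few different values of `n_rows` (small and large) should show how the gap changes with data size. No numbers are given here because they depend entirely on the machine and the cluster.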