Joining koalas frame with spark
ysomawar opened this issue · comments
Hello,
I am very new to Koalas and just getting started with it. I am planning to use Koalas with Spark for large-scale data processing.
I am trying to merge two large datasets using Koalas' merge functionality, but I observed that the merge is not performed on Spark; it executes locally, resulting in performance as slow as pandas.
The following is the code block:
import databricks.koalas as ks
from pyspark.sql import SparkSession
#%% Setting up Spark
spark = SparkSession.builder \
.appName('koalas_test') \
.getOrCreate()
# Reading datasets
kdf = ks.read_csv('file_path1') # it creates a Spark task: csv at NativeMethodAccessorImpl.java:0
kdf2 = ks.read_csv('file_path2') # it creates a Spark task: csv at NativeMethodAccessorImpl.java:0
# The type of the above frames is databricks.koalas.frame.DataFrame
# Merging the frames
kdf3 = kdf.merge(kdf2, on='id')
On merge, none of the Spark tasks get created; it appears to merge the frames locally without taking advantage of Spark.
spark version: 3.1.1
Could somebody please advise me on how I can take advantage of Spark for merging the frames (while using any Koalas API)?
Thanks in Advance.
Regards,
Yogesh
I think it should take advantage of Spark, since it directly leverages Spark's join function internally, specifically here:
koalas/databricks/koalas/frame.py
Line 7676 in 07d8462
Btw, I recommend using the pyspark.pandas module in PySpark, since Koalas has been ported into PySpark.