databricks / koalas

Koalas: pandas API on Apache Spark

Predicate Pushdown not Working

Lukas012 opened this issue

Hi all,

Environment: Spark 3.0.2, Koalas 1.8.2, Delta Lake 0.7

I have a Delta table partitioned by the column "PARTITION". Koalas doesn't seem to perform predicate pushdown on it.

  1. Using Spark:
import databricks.koalas as ks
from pyspark.sql.functions import col

my_kdf = ks.read_delta(f"...")
my_df = my_kdf.to_spark()
result_df = my_df.filter((col("PARTITION") == 15) & (col("ID") == 1))
result_df.to_koalas().toPandas()

Takes: 20 seconds

  2. Same with Koalas:
result_kdf = ks.read_delta(f"...")
result_kdf = result_kdf[(result_kdf["PARTITION"] == 15) & (result_kdf["ID"] == 1)]
result_kdf.toPandas()

Takes: 130 seconds (it seems that predicate pushdown isn't applied)

  3. Another attempt with Koalas, applying the two filters one after the other:
my_kdf = ks.read_delta(f"...")
result_kdf = my_kdf[my_kdf["PARTITION"] == 15]
result_kdf = result_kdf[result_kdf["ID"] == 1]
result_kdf.toPandas()

Takes: 20 seconds.

Why does approach 2 take so long?

Thanks!
Best

Why? This problem only occurs in Koalas.
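One way to check whether the partition filter actually reaches the scan in each approach is to compare the physical plans instead of wall-clock times. The sketch below is only an illustration: `plan_text` and `partition_filters` are hypothetical helper names, not part of Koalas or Spark. It assumes that `DataFrame.explain()` prints through Python's stdout and that the file scan node lists its partition predicates in a `PartitionFilters: [...]` entry.

```python
import contextlib
import io
import re


def plan_text(spark_df) -> str:
    """Capture the text that spark_df.explain() prints (assumes it prints via Python stdout)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        spark_df.explain()
    return buf.getvalue()


def partition_filters(plan: str) -> list:
    """Return the contents of every non-empty 'PartitionFilters: [...]' entry in a plan string."""
    found = re.findall(r"PartitionFilters: \[([^\]]*)\]", plan)
    return [entry for entry in found if entry.strip()]


# Hypothetical usage against the DataFrames from the approaches above:
#   partition_filters(plan_text(result_kdf.to_spark()))
# An empty list would suggest that no partition pruning is applied to the scan.
```

If approach 2 yields an empty list while approaches 1 and 3 show a predicate on `PARTITION`, that would support the suspicion that the combined boolean mask is not being turned into a partition filter. The plan output is the authoritative source here; the helper only scans its text.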

@Lukas012 Koalas has been ported into PySpark under the name "pandas API on Spark", and this repository is now in maintenance mode only. You can get faster feedback from the Apache Spark community.

FYI: you can also use your existing Koalas code as-is in Apache Spark by changing the import, as below:

# import databricks.koalas as ks
import pyspark.pandas as ks

... (existing Koalas code)