Cannot return empty dataframe in apply?
lfdversluis opened this issue
I am parsing some data, and in a groupby + apply function I wanted to return an empty dataframe when some criteria are not met. This causes obscure crashes in Koalas. Example:
import pandas as pd
import databricks.koalas as ks
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[8]") \
    .appName("WTA parser") \
    .config("spark.executor.memory", "20G") \
    .config("spark.driver.memory", "8G") \
    .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true") \
    .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true") \
    .getOrCreate()

def toApply(df):
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return pd.DataFrame()
    return df

df = ks.DataFrame({"a": [1, 2], "b": [3, 4]})
df.groupby("a").apply(toApply)
This gives ValueError: can not infer schema from empty or null dataset. Ok sure, let's give it a schema?
I modified it to

from pyspark.sql.types import StructType, StructField, LongType

def toApply(df):
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        schema = StructType([
            StructField('something', LongType(), True),
        ])
        return spark.createDataFrame([], schema)
    return df
and now I get AttributeError: 'DataFrame' object has no attribute 'copy' followed by RecursionError: maximum recursion depth exceeded.
I added a workaround for now, but I think this is interesting behavior :) I am running Spark 3.0.0 and Koalas 1.6.0.
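A minimal sketch of the kind of workaround I mean (not necessarily the exact code): return a zero-row slice of the input instead of a fresh empty DataFrame, so the column index survives.

def toApply(df):
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return df.head(0)  # zero rows, but the columns (and dtypes) are kept
    return df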
For completeness, here is the pandas variant:
def toApply(df):
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return pd.DataFrame()
    return df

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.groupby("a").apply(toApply)
Outputs:
       a    b
a
1 0  1.0  3.0
Hm, which Spark version do you use? It works fine locally for me:
>>> df.groupby("a").apply(toApply)
       a    b
a
1 0  1.0  3.0
Ah, okay. I think you might need to specify the return type in toApply:
def toApply(df):
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return df[:0]  # keep the column index
    return df

df = ks.DataFrame({"a": [1, 2], "b": [3, 4]})
df.groupby("a").apply(toApply)
or:
def toApply(df) -> ks.DataFrame["a": float, "b": float]:
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return df[:0]  # keep the column index
    return df

df = ks.DataFrame({"a": [1, 2], "b": [3, 4]})
df.groupby("a").apply(toApply)
to avoid the type inference process (without a return type hint, Koalas has to execute the function once to infer the schema). See also https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.groupby.GroupBy.apply.html#databricks-koalas-groupby-groupby-apply
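One caveat, if I read the docs right: with the unnamed form of the hint, the column names from the function are lost and Koalas generates positional names (c0, c1, ...) instead. A small sketch:

def toApply(df) -> ks.DataFrame[float, float]:
    # With unnamed hints the output columns become c0, c1, ... positionally.
    return df[:0]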
The problem here seems to be that Koalas requires the column index to be kept in whatever toApply returns.
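The difference is easy to see in plain pandas: a zero-row slice keeps a usable column index, while a freshly constructed empty frame does not (a quick illustration):

import pandas as pd

pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(pdf[:0].columns.tolist())         # ['a', 'b'] -> a schema can be inferred
print(pd.DataFrame().columns.tolist())  # []         -> nothing to infer a schema from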
@HyukjinKwon Thanks for your reply. While returning (a part of) the original dataframe passed to the apply function works (using your suggestion), returning a new one does not.
This does seem to work:
spark = SparkSession.builder \
    .master("local[8]") \
    .appName("WTA parser") \
    .config("spark.executor.memory", "20G") \
    .config("spark.driver.memory", "8G") \
    .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true") \
    .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true") \
    .getOrCreate()

def toApply(df) -> pd.DataFrame["c": int, "d": int]:
    ret_df = pd.DataFrame(columns=["c", "d"])
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return ret_df
    ret_df.loc[len(ret_df.index)] = [0, 1]
    return ret_df

tdf = ks.DataFrame({"a": [1, 2], "b": [3, 4]})
tdf.groupby("a").apply(toApply)
But I noticed an inconsistency in the output:
When using Koalas:

   c  d
0  0  1
When using pandas (just changing ks.DataFrame to pd.DataFrame):

     c  d
a
1 0  0  1
Pandas keeps the groupby index a and also keeps the range index introduced by ret_df (creating a MultiIndex).
Yeah, when you use the type hints, the index is lost; that is a current limitation.
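If you need the two outputs to line up, one option (a sketch on my side, not an official recommendation) is to drop the group index on the pandas side so both produce a plain range index:

pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
result = pdf.groupby("a").apply(toApply).reset_index(drop=True)
#    c  d
# 0  0  1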