databricks / koalas

Koalas: pandas API on Apache Spark

Cannot return empty dataframe in apply?

lfdversluis opened this issue

I am parsing some data, and in a groupby + apply function I want to return an empty dataframe if some criteria are not met. This causes obscure crashes with Koalas. Example:

import pandas as pd
import databricks.koalas as ks
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local[8]") \
        .appName("WTA parser") \
        .config("spark.executor.memory", "20G") \
        .config("spark.driver.memory", "8G") \
        .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true") \
        .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true") \
        .getOrCreate()

def toApply(df):
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return pd.DataFrame()
    return df

df = ks.DataFrame({"a": [1,2], "b": [3, 4]})
df.groupby("a").apply(toApply)

This gives ValueError: can not infer schema from empty or null dataset. Ok sure, let's give it a schema?

I modified it to

from pyspark.sql.types import StructType, StructField, LongType

def toApply(df):
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        schema = StructType([
            StructField('something', LongType(), True),
        ])
        return spark.createDataFrame([], schema)
    return df

and now I get AttributeError: 'DataFrame' object has no attribute 'copy' and RecursionError: maximum recursion depth exceeded.

I added a workaround for now, but I think this is interesting behavior :) I am running Spark 3.0.0 and Koalas 1.6.0.
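(Side note: groupby().apply() in Koalas expects the applied function to return a pandas DataFrame, not a PySpark one, which is presumably what sends the internal copy call into recursion. A minimal sketch of an empty return that sidesteps both errors, reusing the toy frame from above:)

import pandas as pd

def toApply(df):
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        # An empty slice keeps the input's columns and dtypes, so Koalas
        # can still infer a schema; a bare pd.DataFrame() cannot provide one.
        return df.head(0)
    return df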

For completeness, here is the pandas variant:

def toApply(df):
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return pd.DataFrame()
    return df

df = pd.DataFrame({"a": [1,2], "b": [3, 4]})
df.groupby("a").apply(toApply)

Outputs:

     a    b
a
1  1.0  3.0

Hmm, which Spark version do you use? It works fine locally for me:

>>> df.groupby("a").apply(toApply)
       a    b
a
1 0  1.0  3.0

Ah, okay. I think you might need to keep the column index, or specify the return type, in toApply:

def toApply(df):
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return df[:0]  # keep the column index
    return df

df = ks.DataFrame({"a": [1,2], "b": [3, 4]})
df.groupby("a").apply(toApply)

or:

def toApply(df) -> ks.DataFrame["a": float, "b": float]:
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return df[:0]  # keep the column index
    return df

df = ks.DataFrame({"a": [1,2], "b": [3, 4]})
df.groupby("a").apply(toApply)

to avoid the type inference process. See also https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.groupby.GroupBy.apply.html#databricks-koalas-groupby-groupby-apply

The problem here seems to be that Koalas requires the column index to be kept in the frame returned from toApply.
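To make that difference concrete, a quick pandas-only check (illustrative only):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(list(pd.DataFrame().columns))  # [] -- nothing to infer a schema from
print(list(df[:0].columns))          # ['a', 'b'] -- no rows, columns intact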

@HyukjinKwon Thanks for your reply. Returning (a slice of) the original dataframe passed to the apply function works, but returning a newly constructed one does not, even with your suggestion.

This does seem to work:

spark = SparkSession.builder \
        .master("local[8]") \
        .appName("WTA parser") \
        .config("spark.executor.memory", "20G") \
        .config("spark.driver.memory", "8G") \
        .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true") \
        .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true") \
        .getOrCreate()

def toApply(df) -> pd.DataFrame["c": int, "d": int]:
    ret_df = pd.DataFrame(columns=["c", "d"])
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return ret_df
    
    ret_df.loc[len(ret_df.index)] = [0, 1]
    return ret_df

tdf = ks.DataFrame({"a": [1,2], "b": [3, 4]})
tdf.groupby("a").apply(toApply)

But I noticed an inconsistency in the output:
When using Koalas:

   c  d
0  0  1

When using Pandas (just changing ks.DataFrame to pd.DataFrame):

     c  d
a
1 0  0  1

Pandas keeps the groupby index a and also keeps the range index introduced by ret_df (creating a MultiIndex).

Yeah, when you use the type hints, the index is lost; that is currently a limitation.
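If you need the group key in the Koalas result, one possible workaround (a sketch, not an official recommendation) is to carry the key as an ordinary column, since the type hint discards whatever index the function builds:

import pandas as pd
import databricks.koalas as ks

def toApply(df) -> ks.DataFrame["a": int, "c": int, "d": int]:
    if df['a'].iloc[0] > 1:  # Imagine a sanity check here
        return pd.DataFrame(columns=["a", "c", "d"])
    # Carry the group key 'a' as a regular column instead of an index.
    return pd.DataFrame({"a": [df['a'].iloc[0]], "c": [0], "d": [1]})

tdf = ks.DataFrame({"a": [1, 2], "b": [3, 4]})
tdf.groupby("a").apply(toApply)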