Is there a way to use a UDF or lambda in groupby agg?

Question

kylegilde opened this issue 3 years ago

The following code doesn't work: it raises the ValueError shown after it. Thank you!

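import databricks.koalas as ks
from pyspark.sql.functions import pandas_udf

# `spark` is assumed to be an existing SparkSession
@pandas_udf('string')
def as_set(x):
    return str(set(x))
spark.udf.register('as_set', as_set)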


kdf = ks.DataFrame(
    {'a': [1, 2, 2, 4, 5, 6],
     'b': ["one", "one", "one", "two", "two", "two"]},
    index=[10, 20, 30, 40, 50, 60]
)
kdf.groupby(['b']).agg({'a', as_set})

ValueError: aggs must be a dict mapping from column name to aggregate functions (string or list of strings).
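
Going only by the wording of that error message, two things in the call appear to trip the check: {'a', as_set} uses a comma, so Python builds a set rather than a dict, and the aggregate is a function object rather than a string name. The following is a minimal pure-Python sketch of that rule. The helper is_valid_aggs is hypothetical and written for illustration; it is not the Koalas implementation, and as_set here is a plain-Python stand-in for the UDF.

```python
# Hypothetical helper mirroring the rule stated in the error message.
# It is NOT the Koalas source code, only an illustration of the check.
def is_valid_aggs(aggs):
    """Return True if aggs maps column names to a string or a list of strings."""
    if not isinstance(aggs, dict):
        return False
    for funcs in aggs.values():
        if isinstance(funcs, str):
            continue
        if isinstance(funcs, list) and all(isinstance(f, str) for f in funcs):
            continue
        return False
    return True


def as_set(x):
    # Plain-Python stand-in for the UDF in the question
    return str(set(x))


set_literal = {'a', as_set}      # comma: this is a set, as in the question
dict_with_udf = {'a': as_set}    # colon: a dict, but the value is a callable
dict_with_names = {'a': ['min', 'max'], 'b': 'count'}  # strings only

for candidate in (set_literal, dict_with_udf, dict_with_names):
    print(type(candidate).__name__, is_valid_aggs(candidate))
```

Under this reading, changing the comma to a colon would not be enough on its own, because the value would still be a function rather than the name of a built-in aggregate.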

Haejoon Lee · Answer 1 · Thu Dec 09 2021 09:50:34 GMT+0800 (China Standard Time)

Thanks for the report, @kylegilde!

The Koalas project is currently in maintenance mode only, so responses here may be quite delayed.

Koalas is now developed more actively inside PySpark under the name "pandas API on Spark". You can reuse your existing Koalas code by changing the import to: import pyspark.pandas as ks

So if you plan to keep using Koalas, I recommend switching to PySpark! You will also get a quicker response if you report the issue on the Apache Spark JIRA.