Scaling issues with large numbers of columns in pyspark
jamie256 opened this issue · comments
Description
mapInPandas UDFs do not support vector columns natively, so the whylogs pyspark integration converts them before profiling. The profiler also tries to process all columns at once, even when there are thousands of them, which can cause performance and stability issues on larger datasets. See https://github.com/whylabs/whylogs/blob/mainline/python/whylogs/api/pyspark/experimental/profiler.py#L65
Suggestions
Let's limit the transformation of vector columns and instead skip profiling them by default, so that larger datasets which happen to contain sparse vectors are not blocked. Vector processing is important for embeddings use cases, but we can make that scenario opt-in via a config listing the columns of interest.
- Add a default batch size for the number of columns processed at a time; we can use a test dataset to find a reasonable value, using DataFrames with 10,000 columns on a cluster of 12 c6.xlarge instances (8 GB, 4 vCPU each).
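The two suggestions above could be sketched roughly as follows. This is a hypothetical illustration, not the whylogs API: the names `iter_column_batches`, `select_profiling_columns`, and the `vector_columns` opt-in parameter are invented for this sketch, and the batching is shown on plain column-name lists so it stays runnable without a Spark cluster.

```python
# Hypothetical sketch of the proposed behavior:
# 1) skip vector-typed columns unless they are explicitly opted in, and
# 2) process the remaining columns in fixed-size batches instead of all at once.

def select_profiling_columns(schema, vector_columns=None):
    """Return columns to profile.

    `schema` is a mapping of column name -> type name (stand-in for a Spark
    schema). Vector columns are skipped by default and only included when
    listed in the `vector_columns` opt-in set.
    """
    opted_in = set(vector_columns or [])
    selected = []
    for name, dtype in schema.items():
        if dtype == "vector" and name not in opted_in:
            continue  # skip sparse/dense vectors unless explicitly requested
        selected.append(name)
    return selected


def iter_column_batches(columns, batch_size=1000):
    """Yield successive slices of the column list of at most `batch_size`."""
    for start in range(0, len(columns), batch_size):
        yield columns[start : start + batch_size]


# Example: a wide schema with one vector column, profiled in batches of 4.
schema = {f"col_{i}": "double" for i in range(10)}
schema["embedding"] = "vector"

columns = select_profiling_columns(schema)          # "embedding" skipped
batches = list(iter_column_batches(columns, batch_size=4))
```

Each batch would then be passed through the existing mapInPandas profiling path (e.g. via `df.select(batch)`), and the resulting per-batch profiles merged, so peak memory scales with the batch size rather than the total column count.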
- I have reviewed the Guidelines for Contributing and the Code of Conduct.
This issue is stale. Remove stale label or it will be closed tomorrow.