Scaling issues with large numbers of columns in pyspark
jamie256 opened this issue · comments
Description
mapInPandas UDFs do not support vector columns natively, so the whylogs pyspark integration converts them before profiling. The profiler also tries to process all columns at once, even when there are thousands of them, which can cause performance and stability issues on larger datasets. See https://github.com/whylabs/whylogs/blob/mainline/python/whylogs/api/pyspark/experimental/profiler.py#L65
Suggestions
Let's limit the transformation of vector columns and instead skip profiling them by default, so that larger datasets which happen to contain sparse vectors are not blocked. Vector processing is important for embeddings use cases, but we can make that scenario opt-in via a config listing the columns of interest.
- Add a default batch size for the number of columns processed at a time; we can use a test dataset to find a reasonable value, using DataFrames with 10,000 columns on a cluster of 12 c6.xlarge instances (8 GB, 4 vCPU each).
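The two suggestions above could be sketched roughly as follows. This is a hypothetical illustration, not the whylogs API: the names `iter_column_batches`, `select_profiling_columns`, and the `vector_columns` opt-in parameter are invented for this sketch, and the batching is shown on plain column-name lists so it stays runnable without a Spark cluster.

```python
# Hypothetical sketch of the proposed behavior:
# 1) skip vector-typed columns unless they are explicitly opted in, and
# 2) process the remaining columns in fixed-size batches instead of all at once.

def select_profiling_columns(schema, vector_columns=None):
    """Return columns to profile.

    `schema` is a mapping of column name -> type name (stand-in for a Spark
    schema). Vector columns are skipped by default and only included when
    listed in the `vector_columns` opt-in set.
    """
    opted_in = set(vector_columns or [])
    selected = []
    for name, dtype in schema.items():
        if dtype == "vector" and name not in opted_in:
            continue  # skip sparse/dense vectors unless explicitly requested
        selected.append(name)
    return selected


def iter_column_batches(columns, batch_size=1000):
    """Yield successive slices of the column list of at most `batch_size`."""
    for start in range(0, len(columns), batch_size):
        yield columns[start : start + batch_size]


# Example: a wide schema with one vector column, profiled in batches of 4.
schema = {f"col_{i}": "double" for i in range(10)}
schema["embedding"] = "vector"

columns = select_profiling_columns(schema)          # "embedding" skipped
batches = list(iter_column_batches(columns, batch_size=4))
```

Each batch would then be passed through the existing mapInPandas profiling path (e.g. via `df.select(batch)`), and the resulting per-batch profiles merged, so peak memory scales with the batch size rather than the total column count.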
- I have reviewed the Guidelines for Contributing and the Code of Conduct.
This issue is stale. Remove stale label or it will be closed tomorrow.