whylabs / whylogs

An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

Home Page: https://whylogs.readthedocs.io/


Scaling issues with large numbers of columns in pyspark

jamie256 opened this issue · comments

Description

The mapInPandas UDF does not support vector columns natively, so the whylogs pyspark integration converts them before profiling. The profiler also tries to process all columns at once, even when there are thousands of them, which can cause performance and stability issues on larger datasets. See https://github.com/whylabs/whylogs/blob/mainline/python/whylogs/api/pyspark/experimental/profiler.py#L65
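
For reference, a rough repro sketch of the wide-DataFrame case, assuming `collect_dataset_profile_view` from the experimental module is the entry point (the row and column counts are illustrative):

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from whylogs.api.pyspark.experimental import collect_dataset_profile_view

spark = SparkSession.builder.getOrCreate()

# A wide frame: thousands of scalar columns, as in the 10,000-column case.
num_cols = 10_000
df = spark.range(1_000).select(
    "id", *[F.rand(seed=i).alias(f"col_{i}") for i in range(num_cols)]
)

# Profiling pushes every column through a single mapInPandas pass, which is
# where the performance and stability issues show up at this width.
profile_view = collect_dataset_profile_view(df)
```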

Suggestions

Let's limit the transformation of vector columns and instead skip profiling them by default, so that we don't block profiling of larger datasets that happen to contain sparse vectors. Vector processing is important for embedding use cases, but we can require that scenario to be opt-in via config on the columns of interest.
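
A minimal sketch of that default, assuming the caller passes the opted-in columns explicitly (the helper name and the `opt_in_vector_columns` parameter are illustrative, not an existing whylogs API):

```python
from typing import Iterable, Optional

from pyspark.ml.linalg import VectorUDT
from pyspark.sql import DataFrame


def select_profiling_columns(
    df: DataFrame, opt_in_vector_columns: Optional[Iterable[str]] = None
) -> DataFrame:
    """Drop VectorUDT columns unless they are explicitly opted in."""
    opted_in = set(opt_in_vector_columns or [])
    keep = []
    for field in df.schema.fields:
        if isinstance(field.dataType, VectorUDT) and field.name not in opted_in:
            # Skipped by default so sparse vectors don't block profiling;
            # embedding columns can be opted in individually.
            continue
        keep.append(field.name)
    return df.select(*keep)
```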

  • Add a default batch size for the number of columns to process at a time; we can use a test dataset to look for a reasonable value, using data frames with 10,000 columns on a cluster of 12 c6.xlarge instances (8 GB, 4 CPU per instance). A rough sketch of the batching approach is below.

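One way the batching could look on top of the existing experimental API, assuming `collect_dataset_profile_view` is the entry point and that `DatasetProfileView.merge` combines the per-batch results (the batch size of 500 is a placeholder to be tuned with the test dataset above):

```python
from functools import reduce

from pyspark.sql import DataFrame
from whylogs.api.pyspark.experimental import collect_dataset_profile_view


def profile_in_column_batches(df: DataFrame, batch_size: int = 500):
    """Profile the DataFrame a slice of columns at a time and merge the views."""
    views = []
    for start in range(0, len(df.columns), batch_size):
        batch = df.columns[start : start + batch_size]
        # Each pass only sends batch_size columns through mapInPandas.
        views.append(collect_dataset_profile_view(df.select(*batch)))
    # Merge the per-batch views into a single dataset profile view.
    return reduce(lambda left, right: left.merge(right), views)
```
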
This issue is stale. Remove stale label or it will be closed tomorrow.