Implementation of basic PySpark data preprocessing methods

Question

xandaau opened this issue a year ago

To preprocess pandas data and speed up experiments, we have the Preprocessor class and a number of single-purpose base classes in preprocessing.
These methods should also be implemented for Spark dataframes, following the same paradigm we use for the Designer and the Splitter.

At the moment, implementing the following methods is essential:

Aggregation
Outliers removal (robust)
CUPED

Artem K · Answer 1 · Tue Jan 31 2023 23:44:52 GMT+0800 (China Standard Time)

The architecture of the preprocessing classes added in #22 still does not take into account the possibility of implementing PySpark functionality.