MobileTeleSystems / Ambrosia

Ambrosia is a Python library for A/B tests design, split and result measurement

Implementation of basic PySpark data preprocessing methods

xandaau opened this issue

To preprocess pandas data and speed up experiments, we have the Preprocessor class and a number of single-purpose base classes in the preprocessing module.
The same methods should be implemented for Spark DataFrames, following the paradigm we already use for the Designer and the Splitter.

For now, implementing the following methods is essential:

  1. Aggregation
  2. Outliers removal (robust)
  3. CUPED

The architecture of the preprocessing classes added in #22 does not yet take a possible PySpark implementation into account.