`pyspark.DataFrameSchema` doesn't have data synthesis method(s) implemented
kasperjanehag opened this issue
Is your feature request related to a problem? Please describe.
Currently, the pyspark.DataFrameSchema class does not support synthetic data generation. This feature is extremely useful for testing and development purposes, as it allows developers to quickly generate example data that adheres to a specific schema.
Describe the solution you'd like
I would like to see a feature similar to the one provided by the pandera library for synthetic data generation on pandas-based data models. In theory, I guess it could look like:
import pandera as pa
import pyspark.sql.types as T
from pandera.pyspark import DataFrameModel

class ExampleSchema(DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=5)
    product_name: T.StringType() = pa.Field(str_startswith="B")
    price: T.DecimalType(20, 5) = pa.Field()
    description: T.ArrayType(T.StringType()) = pa.Field()
    meta: T.MapType(T.StringType(), T.StringType()) = pa.Field()

ExampleSchema.example(size=3, ...)
This would then generate a DataFrame with 3 rows, where each column adheres to the constraints defined in the schema.
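To make the expected behavior concrete, here is a minimal non-distributed sketch of what such a generator could do for the schema above. Everything here (the `example_rows` helper and its field logic) is hypothetical illustration, not an existing pandera API:

```python
import random
import string
from decimal import Decimal

def example_rows(size=3):
    """Hypothetical generator: produce `size` rows satisfying the constraints
    in ExampleSchema (id > 5, product_name starts with "B", price fits
    DecimalType(20, 5), description is an array of strings, meta a str->str map)."""
    rows = []
    for _ in range(size):
        rows.append({
            "id": random.randint(6, 1000),  # satisfies gt=5
            "product_name": "B" + "".join(
                random.choices(string.ascii_lowercase, k=8)
            ),  # satisfies str_startswith="B"
            "price": Decimal(random.randrange(10**6)) / Decimal(10**5),  # 5 decimal places
            "description": [random.choice(["red", "green", "blue"])],
            "meta": {"source": "synthetic"},
        })
    return rows

rows = example_rows(size=3)
# If a SparkSession were available, these rows could then be lifted into a
# Spark DataFrame with spark.createDataFrame(rows, schema).
```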
Additional context
It's worth noting that the synthetic data generation doesn't necessarily need to use PySpark to generate the data. In most development scenarios, the volume of synthetic data required wouldn't be large enough to necessitate a distributed backend.
If anyone can provide a bit of guidance on how this should be implemented, I would be keen to go down the rabbit hole and implement it! :)