Pyspark module - enable other date formats to be coerced into DateType
Smartitect opened this issue
Problem description
When using the `pandera.pyspark` module, only one date format can be coerced successfully into a `DateType`: the ISO format `yyyy-MM-dd`. Other date formats, such as `dd/MM/yyyy`, are not coerced successfully; instead, coercion yields a `DateType` column containing only nulls.
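The failure mode can be reproduced with plain Python date parsing, independent of Spark and pandera (a minimal illustration only): a fixed pattern matches strings laid out in exactly that format and rejects everything else.

```python
from datetime import datetime

iso = "2022-03-12"
uk = "12/03/2022"

# An ISO-formatted string parses fine against the ISO pattern.
print(datetime.strptime(iso, "%Y-%m-%d").date())  # 2022-03-12

# The same date written dd/MM/yyyy does not match that pattern.
try:
    datetime.strptime(uk, "%Y-%m-%d")
except ValueError:
    print("parse failed")

# An explicit format string is needed to parse it successfully.
print(datetime.strptime(uk, "%d/%m/%Y").date())  # 2022-03-12
```

This is the same behaviour seen below: Spark's default string-to-date coercion only recognises the `yyyy-MM-dd` layout.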
Example where coerce works
```python
dataframe_schema_coerce_date = DataFrameSchema(
    columns={
        "index": Column(
            dtype=IntegerType,
            checks=None,
            nullable=False,
            coerce=True,
            required=True,
        ),
        "date": Column(
            dtype=DateType,
            nullable=False,
            coerce=True,
            required=True,
        ),
    },
)
```
```python
df_to_validate_date_format_1 = spark.createDataFrame(
    [
        (1, "2022-03-12"),
        (2, "2021-01-21"),
        (3, "2022-10-08"),
        (4, "2022-12-28"),
        (5, "2022-09-05"),
    ],
    ["index", "date"],
)

df_1_validated = dataframe_schema_coerce_date.validate(df_to_validate_date_format_1)
df_1_validated.printSchema()
```
Generates the following output:
```
root
 |-- index: integer (nullable = true)
 |-- date: date (nullable = true)
```
Inspecting the data using `df_1_validated.show()` shows that all of the strings have been successfully coerced into dates:
```
+-----+----------+
|index|      date|
+-----+----------+
|    1|2022-03-12|
|    2|2021-01-21|
|    3|2022-10-08|
|    4|2022-12-28|
|    5|2022-09-05|
+-----+----------+
```
Example where this doesn't work
Applying the same dataframe schema as defined above:
```python
df_to_validate_date_format_2 = spark.createDataFrame(
    [
        (1, "12/03/2022"),
        (2, "01/21/2021"),
        (3, "10/08/2022"),
        (4, "12/28/2022"),
        (5, "09/05/2022"),
    ],
    ["index", "date"],
)

df_2_validated = dataframe_schema_coerce_date.validate(df_to_validate_date_format_2)
df_2_validated.show()
```
Yields:
```
+-----+----+
|index|date|
+-----+----+
|    1|null|
|    2|null|
|    3|null|
|    4|null|
|    5|null|
+-----+----+
```
Solution that I'd like
It would be great if you could somehow specify the date pattern (for example, using the datetime pattern syntax adopted by Apache Spark) against which to parse the strings, in order to:
- Validate that the string values are indeed valid dates in accordance with the prescribed date format.
- Successfully coerce the strings into `DateType`.
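One possible shape for this, sketched purely hypothetically (pandera does not currently expose such a parameter on its PySpark `Date` engine type), would be a format argument mirroring Spark's `to_date` pattern strings:

```python
# Hypothetical API sketch only: the "format" parameter below does not
# exist in pandera today; it illustrates the requested feature.
"date": Column(
    dtype=pandera.engines.pyspark_engine.Date(format="dd/MM/yyyy"),
    nullable=False,
    coerce=True,
    required=True,
),
```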
Alternatives I've considered
A subsequent step outside of pandera which converts the dates using native PySpark SQL functions, such as:

```python
from pyspark.sql.functions import col, to_date
import pandera.engines.pyspark_engine

for column in dataframe_schema.columns:
    # Re-parse only the columns that the schema declares as pandera Date types.
    if isinstance(dataframe_schema.columns[column].dtype, pandera.engines.pyspark_engine.Date):
        df = df.withColumn(
            dataframe_schema.columns[column].name,
            to_date(col(dataframe_schema.columns[column].name), "dd/MM/yyyy"),
        )
```