Pyspark module - enable other date formats to be coerced into DateType
Smartitect opened this issue
Problem description
When using the `pandera.pyspark` module, only one date format can be coerced successfully into a `DateType`: the ISO format `yyyy-MM-dd`. Other date formats, such as `dd/MM/yyyy`, are not coerced successfully; instead, coercion yields a `DateType` column containing only nulls.
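The failure mode can be reproduced with plain Python date parsing, independent of Spark and pandera (a minimal illustration only): a fixed pattern matches strings laid out in exactly that format and rejects everything else.

```python
from datetime import datetime

iso = "2022-03-12"
uk = "12/03/2022"

# An ISO-formatted string parses fine against the ISO pattern.
print(datetime.strptime(iso, "%Y-%m-%d").date())  # 2022-03-12

# The same date written dd/MM/yyyy does not match that pattern.
try:
    datetime.strptime(uk, "%Y-%m-%d")
except ValueError:
    print("parse failed")

# An explicit format string is needed to parse it successfully.
print(datetime.strptime(uk, "%d/%m/%Y").date())  # 2022-03-12
```

This is the same behaviour seen below: Spark's default string-to-date coercion only recognises the `yyyy-MM-dd` layout.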
Example where coerce works
```python
dataframe_schema_coerce_date = DataFrameSchema(
    columns={
        "index": Column(
            dtype=IntegerType,
            checks=None,
            nullable=False,
            coerce=True,
            required=True,
        ),
        "date": Column(
            dtype=DateType,
            nullable=False,
            coerce=True,
            required=True,
        ),
    },
)
```
```python
df_to_validate_date_format_1 = spark.createDataFrame(
    [
        (1, "2022-03-12"),
        (2, "2021-01-21"),
        (3, "2022-10-08"),
        (4, "2022-12-28"),
        (5, "2022-09-05"),
    ],
    ["index", "date"],
)

df_1_validated = dataframe_schema_coerce_date.validate(df_to_validate_date_format_1)
df_1_validated.printSchema()
```
Generates the following output:
```
root
 |-- index: integer (nullable = true)
 |-- date: date (nullable = true)
```
Inspecting the data using `df_1_validated.show()` shows that all of the strings have been successfully coerced into dates:
```
+-----+----------+
|index|      date|
+-----+----------+
|    1|2022-03-12|
|    2|2021-01-21|
|    3|2022-10-08|
|    4|2022-12-28|
|    5|2022-09-05|
+-----+----------+
```
Example where this doesn't work
Applying the same dataframe schema as defined above:
```python
df_to_validate_date_format_2 = spark.createDataFrame(
    [
        (1, "12/03/2022"),
        (2, "01/21/2021"),
        (3, "10/08/2022"),
        (4, "12/28/2022"),
        (5, "09/05/2022"),
    ],
    ["index", "date"],
)

df_2_validated = dataframe_schema_coerce_date.validate(df_to_validate_date_format_2)
df_2_validated.show()
```
Yields:
```
+-----+----+
|index|date|
+-----+----+
|    1|null|
|    2|null|
|    3|null|
|    4|null|
|    5|null|
+-----+----+
```
Solution that I'd like
It would be great if you could somehow specify the date pattern (for example, using the datetime pattern syntax adopted by Apache Spark) against which to parse the strings, in order to:
- Validate that the string values are indeed valid dates in accordance with the prescribed date format.
- Successfully coerce the strings into `DateType`.
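One possible shape for this, sketched purely hypothetically (pandera does not currently expose such a parameter on its PySpark `Date` engine type), would be a format argument mirroring Spark's `to_date` pattern strings:

```python
# Hypothetical API sketch only: the "format" parameter below does not
# exist in pandera today; it illustrates the requested feature.
"date": Column(
    dtype=pandera.engines.pyspark_engine.Date(format="dd/MM/yyyy"),
    nullable=False,
    coerce=True,
    required=True,
),
```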
Alternatives I've considered
A subsequent step outside of pandera which converts the dates using native PySpark SQL functions, such as:

```python
from pyspark.sql.functions import col, to_date
import pandera.engines.pyspark_engine

for column in dataframe_schema.columns:
    # Re-parse only the columns that the schema declares as pandera Date types.
    if isinstance(dataframe_schema.columns[column].dtype, pandera.engines.pyspark_engine.Date):
        df = df.withColumn(
            dataframe_schema.columns[column].name,
            to_date(col(dataframe_schema.columns[column].name), "dd/MM/yyyy"),
        )
```