Make date format guess algorithm more robust

kescobo opened this issue 6 years ago · comments

I've been running into a weird date parsing issue, and I can't sort out what the pattern is, though I've managed to narrow it down to an MWE.

The linked csv has 4 rows of dates.

julia> csvread("parse_test.csv")
ERROR: ArgumentError: Month: 27 out of range (1:12)
Stacktrace:
 [1] Date(::Int64, ::Int64, ::Int64) at ./dates/types.jl:204
 [2] tryparsenext(::TextParse.DateTimeToken{Date,DateFormat{Symbol("yyyy/mm/dd"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'}}}}, ::String, ::Int64, ::Int64, ::TextParse.LocalOpts) at /Users/kev/.julia/v0.6/TextParse/src/field.jl:431
 [3] macro expansion at /Users/kev/.julia/v0.6/TextParse/src/util.jl:23 [inlined]
 [4] tryparsenext(::TextParse.Field{Date,TextParse.DateTimeToken{Date,DateFormat{Symbol("yyyy/mm/dd"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'}}}}}, ::String, ::Int64, ::Int64, ::TextParse.LocalOpts) at /Users/kev/.julia/v0.6/TextParse/src/field.jl:569
#...

(the stack trace is super long, let me know if it would be useful to post the whole thing)

There are 3 27s, two in the second row, and one in the last row. If I remove just the last row, it works.

But if I leave the 4th row in and just change the 27 in the last row to a 2, I get the same ERROR: ArgumentError: Month: 27 out of range (1:12).

If I change all the 27s to 2s, I now get ERROR: ArgumentError: Month: 21 out of range (1:12), and again this error goes away if I delete the last row, even though there are no 21s in the last row.

The problem isn't specific to that row, though - this is part of a much larger csv file, and removing only row 4 from that file does not stop the error.

Note - originally posted as issue to CSVFiles.jl, but this error seems to be caused by this package.

David Anthoff · Answer 1 · Mon Mar 18 2019 06:56:16 GMT+0800 (China Standard Time)

The problem is that the column type detection algorithm goes wrong here. It classifies the third column as yyyy/mm/dd, which is clearly wrong.

I think the whole classification logic for date time columns is not very good: as far as I can tell, it essentially classifies based purely on the last of the type detection rows. A better algorithm would choose the date format for a given column based on all of the type detection rows for that column.
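
To make the "based on all rows" idea concrete, here is a minimal sketch that uses only the Dates standard library. It is not TextParse's implementation: the function name `guess_date_format`, the candidate list, and the sample values are all hypothetical. It accepts a format only if every sampled value parses under it, so one ambiguous row cannot decide the column on its own.

```julia
using Dates

# Hypothetical candidate list; a real detector would have its own set.
const CANDIDATE_FORMATS = [
    dateformat"yyyy/mm/dd",
    dateformat"mm/dd/yyyy",
    dateformat"dd/mm/yyyy",
]

# Return the first candidate format that parses every sampled value,
# or `nothing` if the sample is empty or no candidate fits all of it.
function guess_date_format(samples::AbstractVector{<:AbstractString};
                           candidates = CANDIDATE_FORMATS)
    isempty(samples) && return nothing
    for fmt in candidates
        # `tryparse` returns `nothing` for values that do not match the
        # format or that give an out-of-range month or day.
        if all(s -> tryparse(Date, s, fmt) !== nothing, samples)
            return fmt
        end
    end
    return nothing
end

# Made-up sample values, not the contents of the linked csv.
samples = ["1/2/2017", "3/27/2017", "12/5/2016"]
fmt = guess_date_format(samples)
```

With these made-up samples only `mm/dd/yyyy` parses all three values, so that is the format returned. A sample that fits several candidates would still be resolved by list order, so the order of the candidates is itself a design choice.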

I'm changing the title to reflect the todo here: make the type detection algorithm more robust for date time columns.

The workaround for now is to manually specify the date format for the affected columns instead of relying on detection.
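
As a rough illustration of that workaround, the sketch below parses a column's raw strings with an explicitly chosen format rather than a guessed one, again using only the Dates standard library. The `mm/dd/yyyy` format and the sample values are assumptions, since the contents of the csv are not shown in this thread. As far as I know, `csvread` in TextParse accepts a `colparsers` keyword that can take a `dateformat"..."` for a given column, which would be the place to pass such a format; check the package documentation for the exact form.

```julia
using Dates

# Assumed format for illustration; replace with the column's real layout.
const COLUMN_FORMAT = dateformat"mm/dd/yyyy"

# Hypothetical raw values standing in for one csv column.
raw_column = ["1/2/2017", "3/27/2017", "12/5/2016"]

# Parse every value with the explicit format; no guessing is involved,
# and a value that does not fit the format raises an error immediately.
parsed_column = [Date(s, COLUMN_FORMAT) for s in raw_column]
```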

Ali Hamed Moosavian · Answer 2 · Sat Jun 08 2024 00:46:17 GMT+0800 (China Standard Time)

Can I bump this issue? I recently tried to read a .shp file with the GeoDataFrames.read function and got the error:
ERROR: ArgumentError: Month: 0 out of range (1:12)
However, on further investigation, the months in the file range from 3 to 12, and no row has a month of 0.