queryverse / TextParse.jl

A bunch of fast text parsing tools

Make date format guess algorithm more robust

kescobo opened this issue:

I've been running into a weird date parsing issue. I can't work out what the pattern is, but I've managed to nail down an MWE.

The linked csv has 4 rows of dates.

julia> csvread("parse_test.csv")
ERROR: ArgumentError: Month: 27 out of range (1:12)
Stacktrace:
 [1] Date(::Int64, ::Int64, ::Int64) at ./dates/types.jl:204
 [2] tryparsenext(::TextParse.DateTimeToken{Date,DateFormat{Symbol("yyyy/mm/dd"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'}}}}, ::String, ::Int64, ::Int64, ::TextParse.LocalOpts) at /Users/kev/.julia/v0.6/TextParse/src/field.jl:431
 [3] macro expansion at /Users/kev/.julia/v0.6/TextParse/src/util.jl:23 [inlined]
 [4] tryparsenext(::TextParse.Field{Date,TextParse.DateTimeToken{Date,DateFormat{Symbol("yyyy/mm/dd"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'}}}}}, ::String, ::Int64, ::Int64, ::TextParse.LocalOpts) at /Users/kev/.julia/v0.6/TextParse/src/field.jl:569
#...

(the stack trace is super long, let me know if it would be useful to post the whole thing)

There are three 27s in the file: two in the second row and one in the last row. If I remove just the last row, it works.

But if I leave the 4th row in and just change the 27 in the last row to a 2, I get the same ERROR: ArgumentError: Month: 27 out of range (1:12).

If I change all the 27s to 2s, I now get ERROR: ArgumentError: Month: 21 out of range (1:12), and again this error goes away if I delete the last row, even though there are no 21s in the last row.

The problem is not specific to that row: this is part of a much larger csv file, and in that file removing only row 4 does not stop the error.

Note: this was originally posted as an issue on CSVFiles.jl, but the error seems to be caused by this package.

The problem here is that the column type detection algorithm goes wrong. It classifies the third column as yyyy/mm/dd, which is clearly wrong.

I think the whole classification logic for date time columns is not very good: as far as I can tell, it essentially classifies purely based on the last of the rows used for type detection. A better algorithm would choose the date format for a given column based on all of the rows used for type detection for that column.
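A rough sketch of what "based on all rows" could look like, using only the Dates standard library. This is not TextParse's actual detection code: the candidate list, the `parses` and `consistent_formats` helpers, and the sample strings are all made up for illustration, and it assumes Julia 0.7 or later (on 0.6 the module is `Base.Dates`).

```julia
using Dates

# Hypothetical candidate formats; not the list TextParse actually tries.
const CANDIDATE_FORMATS = ["yyyy/mm/dd", "mm/dd/yyyy", "dd/mm/yyyy"]

# True if `s` parses to a valid Date under `fmt`.
# `tryparse` returns `nothing` for out-of-range months or days instead of throwing.
parses(s::AbstractString, fmt::AbstractString) =
    tryparse(Date, s, DateFormat(fmt)) !== nothing

# Keep only the formats that parse every sampled value, not just the last one.
function consistent_formats(samples; candidates=CANDIDATE_FORMATS)
    return filter(fmt -> all(s -> parses(s, fmt), samples), candidates)
end

# Made-up samples: two ambiguous values leave two candidates standing ...
consistent_formats(["01/02/2017", "03/04/2017"])   # ["mm/dd/yyyy", "dd/mm/yyyy"]
# ... and a single value with a 27 in it rules out month-first.
consistent_formats(["01/02/2017", "27/03/2017"])   # ["dd/mm/yyyy"]
```

If more than one format survives, the column is still ambiguous, and a detector would need either more sample rows or a fallback (for example, leaving the column as a string) rather than picking one arbitrarily.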

I'm changing the title to reflect the todo here: make the type detection algorithm more robust for date time columns.

The workaround for now is to manually specify the date format for the columns.

Can I bump this issue? I recently tried to read a .shp file with the GeoDataFrames.read function and got the error:
ERROR: ArgumentError: Month: 0 out of range (1:12)
However, on further investigation, the months in the file range from 3 to 12, and there is no row where the month is 0.