queryverse / CSVFiles.jl

FileIO.jl integration for CSV files

Errors in saving non-standard element types

bkamins opened this issue

Consider the following code:

julia> df = DataFrame(x = [',','\n', ','])
3×1 DataFrame
│ Row │ x    │
│     │ Char │
├─────┼──────┤
│ 1   │ ','  │
│ 2   │ '\n' │
│ 3   │ ','  │

julia> df |> save("test.csv")

julia> println(read("test.csv", String))
"x"
,


,


julia>

The saved file is broken because non-string values are written without quotes.

Here is an extreme example (I am not saying it happens in practice; it just shows that this could be handled better). The code continues from the session above:

julia> DataFrame(d=[df, df]) |> save("test2.csv")

julia> println(read("test2.csv", String))
"d"
3×1 DataFrame
│ Row │ x    │
│     │ Char │
├─────┼──────┤
│ 1   │ ','  │
│ 2   │ '\n' │
│ 3   │ ','  │
3×1 DataFrame
│ Row │ x    │
│     │ Char │
├─────┼──────┤
│ 1   │ ','  │
│ 2   │ '\n' │
│ 3   │ ','  │

Again, the file cannot be read back at all (not even as a string column) because the values are not quoted.

Finally, let us consider a more ordinary scenario, which is again broken because values are not quoted:

julia> df = DataFrame(a=Date("2000-10-10"), b=Date("2000-11-11"))
1×2 DataFrame
│ Row │ a          │ b          │
│     │ Date       │ Date       │
├─────┼────────────┼────────────┤
│ 1   │ 2000-10-10 │ 2000-11-11 │

julia> df |> save("test3.csv", delim="-")

julia> println(read("test3.csv", String))
"a"-"b"
2000-10-10-2000-11-11
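To make the ambiguity concrete, here is a small sketch using only Base Julia (it does not call CSVFiles.jl or TextParse.jl). It splits the data row above on the delimiter and then shows that a quoted version of the same row, written by hand here as an assumption about what a fixed writer might emit, can be separated into two fields again. The regex is deliberately naive and assumes no quote characters inside the fields.

```julia
# The data row written with delim="-" in the example above.
row = "2000-10-10-2000-11-11"

# A reader that splits on the delimiter sees six fields instead of two.
fields = split(row, '-')
println(length(fields))   # 6

# Hand-written quoted variant of the same row (an assumption, not actual CSVFiles.jl output).
quoted_row = "\"2000-10-10\"-\"2000-11-11\""

# Naive extraction of quoted fields; assumes fields contain no embedded quotes.
recovered = [m.captures[1] for m in eachmatch(r"\"([^\"]*)\"", quoted_row)]
println(recovered)
```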

@davidanthoff I am not sure which of the issues above can be fixed, but I wanted you to at least be aware of them.

Thanks for reporting these; they are clearly bugs!

I guess a quick, partial fix would be to always write quotes around Char values (that really seems better in general), and maybe also around dates? Or maybe around every type, except those we know do not need quotes (numbers, plus some other exceptions)?
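The "quote everything except known-safe types" idea could look roughly like the following. This is only a sketch under stated assumptions, not the CSVFiles.jl implementation: `needs_quotes`, `format_field`, and `format_row` are hypothetical names, treating every `Real` as safe is one possible choice of exceptions, and doubling embedded quote characters is the RFC 4180 convention rather than something confirmed for this package.

```julia
using Dates

# Hypothetical policy: real numbers are written bare, every other type is quoted.
needs_quotes(::Real) = false
needs_quotes(::Any) = true

# Hypothetical helper: render one field, doubling any embedded quote characters.
function format_field(x; quotechar::Char = '"')
    s = string(x)
    needs_quotes(x) || return s
    escaped = replace(s, string(quotechar) => string(quotechar)^2)
    return string(quotechar, escaped, quotechar)
end

# Join the formatted fields of one row with the chosen delimiter.
format_row(values; delim = ',') = join((format_field(v) for v in values), delim)

println(format_row((Date(2000, 10, 10), Date(2000, 11, 11)); delim = '-'))
println(format_row((',', '\n', 42)))
```

With such a policy, the Char and Date values from the examples above would be wrapped in quotes, while numbers keep their compact form. Types such as `Missing` would need their own rule, which this sketch does not attempt.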

That is what I thought too. The only problem is that once you quote values, you might need to escape something inside the quotes (as in the last example with dates). When reading the file back, you would then have to unquote the string before trying to parse it, which would introduce computational overhead (and I guess this is what TextParse.jl wants to avoid).
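The extra read-side step can be illustrated with a short sketch. It assumes the doubled-quote escaping convention and uses a hypothetical helper, `unquote_field`; it is not how TextParse.jl works internally. It only shows that an unquote pass allocates a new string per field before the type parser can run.

```julia
using Dates

# Hypothetical helper: strip surrounding quotes and undo doubled-quote escaping.
function unquote_field(s::AbstractString; quotechar::Char = '"')
    if length(s) >= 2 && first(s) == quotechar && last(s) == quotechar
        inner = chop(s; head = 1, tail = 1)
        # `replace` builds a new String; this is the per-field overhead.
        return replace(inner, string(quotechar)^2 => string(quotechar))
    end
    return String(s)
end

# The field must be unquoted first; only then can it be parsed as a Date.
parsed = Date(unquote_field("\"2000-10-10\""))
println(parsed)
```

A parser that works directly on the raw bytes and skips the quote characters could avoid the intermediate string, at the cost of more complicated parsing code.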