ogrodnek / csv-serde

Hive SerDe for CSV

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Specifying nulls

parautenbach opened this issue · comments

Is there a way to indicate a NULL value? The default Hive value is \N (capitalised). When I upload a CSV file containing that value, I simply get the value N.

According the the Hive DDL it is not possible to specify a SerDe and NULL terminator. Is there perhaps a class property for this SerDe that can be set (similar to how separator, quote and escape characters are set)? I could not find such a property in the source code.

It is vital that one can distinguish between NULL and empty values and it would be very helpful if this could be added as a feature.

Hi, there's no property at the moment... I'm wondering how other serdes handle this?....

I guess the problem has more to do with an ambiguity within CSVs, rather than a SerDe. For JSON SerDes this won't exist, as there is a clear distinction. Assuming a three column CSV including header:

A,B,C
foo,,bar

What does column B imply? A NULL or an empty string, assuming the Hive table's column B is of type STRING? I'm not sure what would be appropriate. If the type is anything but STRING, it becomes obvious, as the absence of a value for the source column B would indicate NULL (e.g. if the type was INT).

My suggestion (and I can't think of a better one right now), would be to interpret an empty value (as above) as an empty string when the Hive type is STRING. If the value is \N for any type, return Java null for that column's value (for non-STRING types an empty value would also imply NULL).

The onus would then be on the user loading data to explicitly encode values where it is desired to have NULL instead of empty string by setting the value to \N. It does, unfortunately, burden the user (consumer) with the internals behind the interface, in this case a specific convention used by Hive, which is seldom a good practice. Then again, the SerDe is very specific to Hive and won't be re-usable elsewhere.

I agree with @parautenbach . I have this issue currently. I don't really see the burden .. there needs to be some explicit way to specify what is NULL. Don't mind encoding NULLs as '\N' by default. Maybe this can be an option?

I meet this problem also.

Hey @ogrodnek - I know it's an old issue but are you by any chance planning to introduce such a change?

Hi @lkudyba ,

This serde code was contributed to Hive back in 2014, and is being maintained there.

See: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde and https://issues.apache.org/jira/browse/HIVE-7777

(specific code is: https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/OpenCSVSerde.java )

As an aside, I am wondering if it's possible to handle this using an if conditional?

i.e. for reading select if(v == '\N', null, v), for writing select if(v == null, '\N', v) ?