ogrodnek / csv-serde

Hive SerDe for CSV

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Add a property to skip the first N rows

ccady opened this issue · comments

Many CSV files have a "title row" e.g. :

id,lastname,firstname
"12323","Washington","George"
"12343","Lincoln","Abraham"

It would be nice to be able to set a "linesToSkip" property to skip the first N rows:

row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\",
    "linesToSkip" = 1
)

(I am so not set up to do any Java coding -- If I were, I'd do it myself and submit a patch.)

This functionality cannot be implemented at SerDe level as it controls only the row format. There is no information of current position in file for deserialize() method to do that.

To implement a linesToSkip functionality, we should implement a custom input format (perhaps extending TextInputFormat: http://hadoop.apache.org/docs/r1.2.1/api/index.html).

what about defining the csv-headers in the table's serdeproperties? The serde can then skip the line which matches the header-spec:

create table 
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\",
    "header" = "id,lastname,firstname"
)

As of now I need a step that removes the header for each csv-file which is quite cumbersome: One table and a serde is, without this feature, not enough to parse a csv-file.

EDIT: I forgot that a serde always need to output one row (unless it throws an exception but that is rather ugly).

The problem is that it will make a string comparison for every row in the file, so a performance killer.

The best would be to extend RecordReader and skip desired lines on initalize() method after calling parent's method. If split.getStart() == 0 then we read N dummy lines. Then we extend TextInputFormat to use this new class.

Then we could use it like this :

mapred.skipNLines=3;

CREATE TABLE mytablewithheader(a string, b string, c string)
STORED AS INPUTFORMAT 'com.mypackage.extdtextinputformat.ExtTextInputFormat'"+
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

A good starting point:
http://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/
https://github.com/edwardcapriolo/DualInputFormat/blob/master/pom.xml

Not sure this serde package would be the best place to put this code.

This is solved since Hive 0.13.0

create external table testtable
( name    string
, message string
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties ( 'separatorChar' = ','
                     , 'quoteChar'     = '"'
                     , 'escapeChar'    = '\\' )
stored as textfile
location '/path/to/testtable'
tblproperties ( 'skip.header.line.count' = '1'
              , 'skip.footer.line.count' = '2' );