Add a property to skip the first N rows

Question

Add a property to skip the first N rows

ccady opened this issue 12 years ago · comments

Many CSV files have a "title row" e.g. :

id,lastname,firstname
"12323","Washington","George"
"12343","Lincoln","Abraham"

It would be nice to be able to set a "linesToSkip" property to skip the first N rows:

row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\",
    "linesToSkip" = 1
)

(I am so not set up to do any Java coding -- If I were, I'd do it myself and submit a patch.)

Luis F. M. Rosa · Answer 1 · Thu Jan 30 2014 20:23:51 GMT+0800 (China Standard Time)

This functionality cannot be implemented at SerDe level as it controls only the row format. There is no information of current position in file for deserialize() method to do that.

To implement a linesToSkip functionality, we should implement a custom input format (perhaps extending TextInputFormat: http://hadoop.apache.org/docs/r1.2.1/api/index.html).

Rickard Cardell · Answer 2 · Fri Feb 07 2014 20:27:34 GMT+0800 (China Standard Time)

what about defining the csv-headers in the table's serdeproperties? The serde can then skip the line which matches the header-spec:

create table 
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\",
    "header" = "id,lastname,firstname"
)

As of now I need a step that removes the header for each csv-file which is quite cumbersome: One table and a serde is, without this feature, not enough to parse a csv-file.

EDIT: I forgot that a serde always need to output one row (unless it throws an exception but that is rather ugly).

Luis F. M. Rosa · Answer 3 · Sat Feb 08 2014 00:04:28 GMT+0800 (China Standard Time)

The problem is that it will make a string comparison for every row in the file, so a performance killer.

The best would be to extend RecordReader and skip desired lines on initalize() method after calling parent's method. If split.getStart() == 0 then we read N dummy lines. Then we extend TextInputFormat to use this new class.

Then we could use it like this :

mapred.skipNLines=3;

CREATE TABLE mytablewithheader(a string, b string, c string)
STORED AS INPUTFORMAT 'com.mypackage.extdtextinputformat.ExtTextInputFormat'"+
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

A good starting point:
http://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/
https://github.com/edwardcapriolo/DualInputFormat/blob/master/pom.xml

Not sure this serde package would be the best place to put this code.

Juraj Sofranko · Answer 4 · Wed May 25 2016 01:00:17 GMT+0800 (China Standard Time)

This is solved since Hive 0.13.0

create external table testtable
( name    string
, message string
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties ( 'separatorChar' = ','
                     , 'quoteChar'     = '"'
                     , 'escapeChar'    = '\\' )
stored as textfile
location '/path/to/testtable'
tblproperties ( 'skip.header.line.count' = '1'
              , 'skip.footer.line.count' = '2' );