Select header row number when reading CSV files

Question

JonGretar opened this issue 8 months ago · comments

Jon Gretar Borgthorsson commented 8 months ago

It would be helpful to have a :header_row option when reading CSV files, separate from the :skip_rows option.
It is not uncommon, especially when working with scientific equipment, for the header to be somewhere other than the first row, and for non-data rows to follow it.

As an example, take this eddy covariance data file:

"TOA5","6843","CR3000","6843","CR3000.Std.22","CPU:CA_Flux__GOOD.CR3","24006","ts_Above"
"TIMESTAMP","RECORD","Ux","Uy","Uz","co2","h2o","Ts","press","diag_csat"
"TS","RN","m/s","m/s","m/s","mg/m^3","g/m^3","C","kPa","m/s"
"","","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp"
"2012-06-07 13:00:00.05",111868400,0.468,-0.9077501,0.1785,659.7584,9.530561,28.52527,100.1938,0
"2012-06-07 13:00:00.1",111868401,0.60275,-1.0795,0.283,660.0234,9.492132,28.51141,100.1938,0
....

Here the first row is data about the equipment.
The second row holds the column names.
The third row holds the units.
The fourth row is other metadata.
The data finally starts on the fifth row.

Of course, reading this is not complex: just use skip_rows: 1 and then delete the first two rows of the dataframe (the units row and the other metadata row). But this pattern is so common in scientific data that it might be worth supporting inside the read_csv/2 function.
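To make the workaround concrete, here is a minimal sketch of the same steps using only Python's standard csv module rather than the dataframe library itself. The function name read_with_workaround is hypothetical, and the sample is the part of the file quoted above.

```python
import csv
import io

# The rows of the TOA5 sample quoted in this issue.
SAMPLE = '''"TOA5","6843","CR3000","6843","CR3000.Std.22","CPU:CA_Flux__GOOD.CR3","24006","ts_Above"
"TIMESTAMP","RECORD","Ux","Uy","Uz","co2","h2o","Ts","press","diag_csat"
"TS","RN","m/s","m/s","m/s","mg/m^3","g/m^3","C","kPa","m/s"
"","","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp"
"2012-06-07 13:00:00.05",111868400,0.468,-0.9077501,0.1785,659.7584,9.530561,28.52527,100.1938,0
"2012-06-07 13:00:00.1",111868401,0.60275,-1.0795,0.283,660.0234,9.492132,28.51141,100.1938,0
'''


def read_with_workaround(text):
    """Skip one row, use the next as the header, then drop two metadata rows.

    Hypothetical helper: it mimics the manual steps described above and is
    not part of any library.
    """
    lines = io.StringIO(text)
    next(lines)                     # like skip_rows: 1, discards the equipment row
    reader = csv.DictReader(lines)  # the next row supplies the column names
    records = list(reader)
    return records[2:]              # drop the units row and the "Smp" row


rows = read_with_workaround(SAMPLE)
print(rows[0]["TIMESTAMP"], rows[0]["co2"])
```

All values come back as strings here. The manual steps work, but every caller has to know how many metadata rows follow the header.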

Of course, I would also love to be able to save the units row as a series attribute, but that is a discussion for another issue. 😉

José Valim · Answer 1 · Thu Dec 21 2023 20:16:39 GMT+0800 (China Standard Time)

If this is supported in polars, then 👍 for a PR that adds this.

Jon Gretar Borgthorsson · Answer 2 · Thu Dec 21 2023 21:52:39 GMT+0800 (China Standard Time)

Hmmm

Polars has 'skip_rows_after_header'. I'll take a look at adding that.
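For readers unfamiliar with the idea, a rough sketch of what skipping rows before and after the header means, written with Python's standard csv module. This is not Polars' implementation. The function read_csv_rows is hypothetical and only borrows the option names, and the sample is the file from this issue trimmed to its first three columns.

```python
import csv
import io
from itertools import islice


def read_csv_rows(text, skip_rows=0, skip_rows_after_header=0):
    """Return (header, skipped, data) from CSV text.

    skip_rows: rows discarded before the header row.
    skip_rows_after_header: rows between the header and the first data row.
    This sketch returns them separately instead of throwing them away.
    """
    reader = csv.reader(io.StringIO(text))
    for _ in islice(reader, skip_rows):
        pass                                  # discard rows before the header
    header = next(reader)
    skipped = list(islice(reader, skip_rows_after_header))
    data = [row for row in reader if row]     # ignore blank lines
    return header, skipped, data


# The sample from this issue, trimmed to its first three columns.
sample = (
    '"TOA5","6843","CR3000"\n'
    '"TIMESTAMP","RECORD","Ux"\n'
    '"TS","RN","m/s"\n'
    '"","","Smp"\n'
    '"2012-06-07 13:00:00.05",111868400,0.468\n'
    '"2012-06-07 13:00:00.1",111868401,0.60275\n'
)

header, skipped, data = read_csv_rows(sample, skip_rows=1, skip_rows_after_header=2)
print(header)
print(skipped[0])  # the units row
print(len(data))
```

Returning the skipped rows is a choice made for this sketch only. It keeps the units row available, which is the follow-up wish mentioned earlier in the issue.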