elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir

Home Page: https://hexdocs.pm/explorer

Select header row number when reading CSV files

JonGretar opened this issue

It would be helpful to add a :header_row option for reading CSV files, separate from the :skip_rows option.
It is not uncommon, especially with output from scientific equipment, for the header not to be in the first row and for non-data rows to follow it.

As an example, consider this eddy covariance data file:

"TOA5","6843","CR3000","6843","CR3000.Std.22","CPU:CA_Flux__GOOD.CR3","24006","ts_Above"
"TIMESTAMP","RECORD","Ux","Uy","Uz","co2","h2o","Ts","press","diag_csat"
"TS","RN","m/s","m/s","m/s","mg/m^3","g/m^3","C","kPa","m/s"
"","","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp"
"2012-06-07 13:00:00.05",111868400,0.468,-0.9077501,0.1785,659.7584,9.530561,28.52527,100.1938,0
"2012-06-07 13:00:00.1",111868401,0.60275,-1.0795,0.283,660.0234,9.492132,28.51141,100.1938,0
....
  • The first row is metadata about the equipment.
  • The second row holds the column names.
  • The third row holds the units.
  • The fourth row is other metadata.
  • The data starts on the fifth row.

Of course, reading this is not complex: use skip_rows: 1 and then delete the first two rows of the dataframe. But this pattern is so common in scientific data that it might be worth supporting directly in the read_csv/2 function.
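To make the requested behaviour concrete, here is a minimal sketch in plain Elixir that trims the text before it reaches any CSV reader. This is a different technique from the skip_rows workaround described above. The module HeaderRowSketch and the function strip_preamble/3 are hypothetical names, not part of Explorer, and the sample is the file from this issue shortened to three columns. It splits on line breaks only, so it assumes no quoted field contains a newline.

```elixir
defmodule HeaderRowSketch do
  @moduledoc """
  Hypothetical helper, not part of Explorer: keeps the header row and the
  data rows of a CSV text and drops everything else.
  """

  # header_row: zero-based index of the line that holds the column names.
  # rows_after_header: number of non-data lines directly below the header.
  def strip_preamble(csv_text, header_row, rows_after_header) do
    [header | rest] =
      csv_text
      |> String.split(["\r\n", "\n"], trim: true)
      |> Enum.drop(header_row)

    Enum.join([header | Enum.drop(rest, rows_after_header)], "\n")
  end
end

# Sample from the issue, shortened to three columns for illustration.
raw = """
"TOA5","6843","CR3000"
"TIMESTAMP","RECORD","Ux"
"TS","RN","m/s"
"","","Smp"
"2012-06-07 13:00:00.05",111868400,0.468
"2012-06-07 13:00:00.1",111868401,0.60275
"""

# Header is on line index 1, followed by two non-data lines (units, metadata).
cleaned = HeaderRowSketch.strip_preamble(raw, 1, 2)
IO.puts(cleaned)
```

The cleaned text starts at the TIMESTAMP line and continues with the data rows, so a CSV reader sees an ordinary file with the header in the first row. Because the units row never reaches the parser, it also cannot interfere with type inference on the numeric columns.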

Of course I would also love to be able to save the units row as a series attribute. But that is a discussion for another issue. 😉

If this is supported in polars, then 👍 for a PR that adds this.

Hmmm

Polars has 'skip_rows_after_header'. I'll take a look at adding that.
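For illustration, this is roughly how the call could look once that option is available. It is a sketch under assumptions: that Explorer exposes the Polars option under the same name, :skip_rows_after_header, on Explorer.DataFrame.from_csv!/2, and that the installed version matches the constraint below. The temp file name is made up, and the sample is again the file from this issue shortened to three columns.

```elixir
# Assumption: an Explorer release that exposes :skip_rows_after_header.
Mix.install([{:explorer, "~> 0.8"}])

# Hypothetical temp file holding the shortened sample from this issue.
path = Path.join(System.tmp_dir!(), "ts_above_sample.csv")

File.write!(path, """
"TOA5","6843","CR3000"
"TIMESTAMP","RECORD","Ux"
"TS","RN","m/s"
"","","Smp"
"2012-06-07 13:00:00.05",111868400,0.468
"2012-06-07 13:00:00.1",111868401,0.60275
""")

df =
  Explorer.DataFrame.from_csv!(path,
    # Skip the equipment metadata line so the next line becomes the header.
    skip_rows: 1,
    # Skip the units and metadata lines that sit directly below the header.
    skip_rows_after_header: 2
  )

IO.inspect(Explorer.DataFrame.names(df))
IO.inspect(Explorer.DataFrame.n_rows(df))
```

Combining the two options covers the layout in this issue without a dedicated :header_row option: skip_rows positions the header, and skip_rows_after_header drops the lines between the header and the data.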