vincentarelbundock / Rdatasets

A collection of datasets originally distributed in R packages

Home Page:https://vincentarelbundock.github.io/Rdatasets

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Exporting CSV with parameters row.names = FALSE and na = "" to avoid NA strings and get rid of row.names column?

mskyttner opened this issue · comments

What a nice collection of datasets! Thanks!

I tried loading data using duckdb and noticed that some numerical columns contain "NA" strings in the exported CSV. Those load nicely in R but often not elsewhere. I also saw that the row.names column seems to be included.

For example:

$ duckdb -c "from 'https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv' limit 5"
┌─────────┬─────────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬─────────┬───────┐
│ column0 │ species │  island   │ bill_length_mm │ … │ flipper_length_mm │ body_mass_g │   sex   │ year  │
│  int64  │ varchar │  varchar  │    varchar     │   │      varchar      │   varchar   │ varchar │ int64 │
├─────────┼─────────┼───────────┼────────────────┼───┼───────────────────┼─────────────┼─────────┼───────┤
│       1 │ Adelie  │ Torgersen │ 39.1           │ … │ 181               │ 3750        │ male    │  2007 │
│       2 │ Adelie  │ Torgersen │ 39.5           │ … │ 186               │ 3800        │ female  │  2007 │
│       3 │ Adelie  │ Torgersen │ 40.3           │ … │ 195               │ 3250        │ female  │  2007 │
│       4 │ Adelie  │ Torgersen │ NA             │ … │ NA                │ NA          │ NA      │  2007 │
│       5 │ Adelie  │ Torgersen │ 36.7           │ … │ 193               │ 3450        │ female  │  2007 │
├─────────┴─────────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴─────────┴───────┤
│ 5 rows                                                                             9 columns (8 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Compare the data types for columns bill_length_mm bill_depth_mm flipper_length_mm body_mass_g with the R dataset:

> palmerpenguins::penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
 2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
 3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
 4 Adelie  Torgersen           NA            NA                  NA          NA NA      2007

Could exporting using write.csv(x, na = "", row.names = FALSE) alleviate this issue? For example here.

Hi @mskyttner , thanks for the suggestions.

The na="" argument seems sensible. However, we cannot do row.names=FALSE because the row names are sometimes informative (even essential).

What do you think about about exporting row.names (if those exist) as a named column "rowname" (if the dataset is not a tibble, since with those row.names are not recommended and more rarely used)?

I fiddled with this snippet to try it out:

export_csv <- function(x, f, ...) {

  if (is.data.frame(x) & !tibble::is_tibble(x)) 
    x <- tibble::as_tibble(tibble::rownames_to_column(x, var = "rowname"))
  
  readr::write_csv(x, file = f, na = "", ...)
}

# export some datasets (first one is a tibble, last one has rownames)
palmerpenguins::penguins |> export_csv("/tmp/pp.csv")
datasets::iris |> export_csv("/tmp/iris.csv")
datasets::mtcars |> export_csv("/tmp/mtcars.csv")

The result in that little attempt looked promising, I think, see below (mtcars got a rowname column and column types are not demoted to varchar):

# import using non-R tool
duckdb_read_csv <- function(f) 
  system(sprintf("duckdb -c 'from \"%s\" limit 4;'", f))

> duckdb_read_csv("/tmp/pp.csv")
┌─────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬─────────┬───────┐
│ speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear  │
│ varcharvarchardoubledoubleint64int64varcharint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┼───────┤
│ AdelieTorgersen39.118.71813750male2007 │
│ AdelieTorgersen39.517.41863800female2007 │
│ AdelieTorgersen40.318.01953250female2007 │
│ AdelieTorgersen │                │               │                   │             │         │  2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴─────────┴───────┘
> duckdb_read_csv("/tmp/iris.csv")
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies │
│    doubledoubledoubledoublevarchar │
├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤
│          5.13.51.40.2setosa  │
│          4.93.01.40.2setosa  │
│          4.73.21.30.2setosa  │
│          4.63.11.50.2setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
> duckdb_read_csv("/tmp/mtcars.csv")
┌────────────────┬────────┬───────┬────────┬───────┬────────┬────────┬────────┬───────┬───────┬───────┬───────┐
│    rownamempgcyldisphpdratwtqsecvsamgearcarb  │
│    varchardoubleint64doubleint64doubledoubledoubleint64int64int64int64 │
├────────────────┼────────┼───────┼────────┼───────┼────────┼────────┼────────┼───────┼───────┼───────┼───────┤
│ Mazda RX421.06160.01103.92.6216.460144 │
│ Mazda RX4 Wag21.06160.01103.92.87517.020144 │
│ Datsun 71022.84108.0933.852.3218.611141 │
│ Hornet 4 Drive21.46258.01103.083.21519.441031 │
└────────────────┴────────┴───────┴────────┴───────┴────────┴────────┴────────┴───────┴───────┴───────┴───────┘

Yep, I think it works. With the current website we get:

duckdb -c "from 'https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv' limit 5"
# ┌──────────┬─────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬─────────┬───────┐
# │ rownames │ species │  island   │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │   sex   │ year  │
# │  int64   │ varchar │  varchar  │     double     │    double     │       int64       │    int64    │ varchar │ int64 │
# ├──────────┼─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┼───────┤
# │        1 │ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male    │  2007 │
# │        2 │ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female  │  2007 │
# │        3 │ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female  │  2007 │
# │        4 │ Adelie  │ Torgersen │                │               │                   │             │         │  2007 │
# │        5 │ Adelie  │ Torgersen │           36.7 │          19.3 │               193 │        3450 │ female  │  2007 │
# └──────────┴─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴─────────┴───────┘