vincentarelbundock / Rdatasets

A collection of datasets originally distributed in R packages

Home Page: https://vincentarelbundock.github.io/Rdatasets


Minor discrepancies between subfolder csvs and master sheet

ArthurSpirling opened this issue

Hello @vincentarelbundock -- thanks so much for providing these data.

I did a very quick scan through the data and their documentation. In particular, I was looking for any discrepancies between this main sheet and the names of the data sets themselves (as in name.csv) stored in the subfolders.

Here are some that I found that appear in the data as csvs but are not documented on the sheet. This was very rough and ready, and I might have missed something, but just in case it's helpful for your sweeps:

"aldh2" "apoeapoc" "bomregions2011" "bomregions2012"
"bomsoi2001" "cf" "cnv" "crohn"
"Damian" "fa" "fsnps"
"head.injury" "hla" "inf1"
"jma.cojo" "l51" "lukas" "mao"
"meyer" "mfblong" "mr" "nep499"
"PD"

For example, bomregions2012.csv appears in the DAAG subfolder, but not on that master sheet. And indeed, it has documentation here.
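A check along these lines could be sketched roughly as below. This is only an illustration, not the scan that was actually run: the function name `undocumented_csvs`, the example paths, and the toy index contents are all made up, and names are compared without their package, so two packages shipping a dataset with the same name would be conflated.

```python
from pathlib import PurePosixPath


def undocumented_csvs(csv_paths, indexed_items):
    """Return dataset names that exist as CSV files but are missing from the index.

    csv_paths     -- iterable of file paths such as "csv/DAAG/head.injury.csv"
    indexed_items -- iterable of dataset names listed on the master sheet
    """
    # .stem drops only the final ".csv", so "head.injury.csv" -> "head.injury"
    on_disk = {PurePosixPath(p).stem for p in csv_paths if p.endswith(".csv")}
    return sorted(on_disk - set(indexed_items))


# Toy data (hypothetical): three files on disk, one name on the sheet.
csv_paths = [
    "csv/DAAG/bomregions2012.csv",
    "csv/DAAG/headInjury.csv",
    "csv/DAAG/head.injury.csv",
]
indexed_items = ["headInjury"]

missing = undocumented_csvs(csv_paths, indexed_items)
print(missing)
```

With real data, `csv_paths` would come from listing the subfolders and `indexed_items` from the name column of the master sheet.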

Again, thanks for all this work!

Ah, also, there are entries for both hdma and hmda, both from Ecdat, with seemingly identical descriptions (?) and docs.

Update: DAAG contains both a head.injury.csv and a headInjury.csv, which may be identical? Not sure.

Thanks for the report. Glad the website is useful!

I looked at a few of these and my best guess is this:

  1. My script never calls git rm on anything, so datasets stay there forever. This is important in case someone links to the URL in one of their scripts.
  2. However, the main sheet index is created every time I run the script, and that's based on what is currently available in the packages. I think that also makes sense: If a package maintainer removes a dataset, I may still want to keep permanent links to protect users, but it's probably "polite" to not advertise the dataset anymore.
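The two behaviours described above (files are never deleted, but the index is rebuilt from what packages currently ship) can be modelled in a few lines. This is a toy sketch under that description only, not the actual Rdatasets build script; `sync`, `archive`, and `current` are hypothetical names.

```python
def sync(archive, current):
    """Model one run of an append-only archive with a regenerated index.

    archive -- set of dataset names already stored from earlier runs
    current -- set of dataset names the packages ship right now
    Returns (new_archive, index).
    """
    # Nothing is ever removed, so old URLs keep working.
    new_archive = archive | current
    # The index only advertises what is available in the packages today.
    index = sorted(current)
    return new_archive, index


# Hypothetical state: an earlier run stored both names, and the package
# has since dropped "head.injury".
archive = {"head.injury", "headInjury"}
current = {"headInjury"}

new_archive, index = sync(archive, current)
stale = sorted(new_archive - set(index))  # stored but no longer indexed
print(stale)
```

Anything that ends up in `stale` is exactly the kind of file reported in this issue: still downloadable, but absent from the sheet.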

The few datasets I checked didn't seem to be available in their packages anymore. And in the head.injury case, the DAAG changelog says it was a duplicate and was removed:

https://github.com/cran/DAAG/blob/master/NEWS#L27

Again, I didn't check them all, but my provisional conclusion is that things are probably fine as-is. Makes sense?

Sounds good, thanks very much.