vincentarelbundock / Rdatasets

A collection of datasets originally distributed in R packages

Home Page: https://vincentarelbundock.github.io/Rdatasets

Potential esoph data inaccuracy.

AJFOWLER opened this issue

Hello!

I've been working through some tutorials using the data provided by Tuyns.[1] It is described in some detail both in this Stata bulletin about the population attributable fraction and in these lecture slides.

In these descriptions, the sum of patients should be 975 (200 cases and 775 controls).

However, the esoph data gives different totals:

library(datasets)
dat <- esoph
sum(dat$ncontrols)  # 975
sum(dat$ncases)     # 200

This gives a total of 1175 records rather than 975. I think an error has slipped in whereby ncontrols actually holds the total number of records in each row (the sum of cases and controls). If esoph$ncases is subtracted from esoph$ncontrols, the result is a number of controls (775) that matches the descriptions above.

One thing to add here: I am 99% sure that the esoph data comes from the Tuyns paper, for a couple of reasons, but I can't find the exact reference as I don't have access to the Breslow book quoted in the R documentation.

The reasons are:

  1. The number of combinations is exactly the same (88 combinations of age, alcohol status, tobacco status)

  2. The French department of Ille-et-Vilaine is mentioned in the R documentation of esoph

  3. Counts and features otherwise match precisely, and odds calculated on the corrected data match those provided by tutorials.

A solution to this would be to subtract ncases from ncontrols to provide a total number column, a cases column and a controls column.
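
A minimal sketch of the proposed correction, assuming the shipped ncontrols column really does hold the per-row total. The function name fix_controls and the new column name ntotal are hypothetical, and the toy rows use made-up counts rather than real esoph values:

```r
# Hypothetical helper: reinterpret ncontrols as the row total and
# derive the actual number of controls from it.
fix_controls <- function(dat) {
  # A total can never be smaller than the number of cases it includes.
  stopifnot(all(dat$ncontrols >= dat$ncases))
  dat$ntotal <- dat$ncontrols               # keep the original value as the total
  dat$ncontrols <- dat$ntotal - dat$ncases  # controls = total - cases
  dat
}

# Toy data with made-up counts (not taken from esoph).
toy <- data.frame(ncases = c(1, 4), ncontrols = c(40, 31))
fixed <- fix_controls(toy)
print(fixed)
```

Applied to a copy of esoph with the sums reported above, sum(fix_controls(esoph)$ncontrols) should come to 975 - 200 = 775. It should only be run on a copy of the data that still shows the inflated total, since applying it to already-correct data would undercount the controls.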

Happy to open a PR to do this if helpful. I couldn't find a GitHub repo for the core R datasets code, so I thought it best to open this issue here first, as I see you have that dataset included.

[1] Tuyns, A. J., G. Pequignot, and O. M. Jensen. 1977. Le cancer de l’oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d’alcool et de tabac. Bulletin du Cancer 64: 45–60.

Unfortunately, Rdatasets is a completely independent effort. I have no special way of communicating with the original data/package maintainers. I don't even know who maintains the esoph data. Your best bet might be to look for a way to open a ticket on R-forge somewhere, or to post on one of the R devel mailing lists.

Sorry.

Thanks Vincent, apologies for bothering you with this.

No worries. I'm just sorry I can't do much more. Rdatasets is really just mooching off other people's great work ;)