Problem with Overall N (email reported issue)

Question

mstackhouse opened this issue 7 months ago

Michael Stackhouse commented 7 months ago

Tplyr - Problem with Overall N

I found what I think is a bug in Tplyr. It occurs when a treatment column has subjects in the population data but none of those subjects have adverse events. In that case, those subjects are not included in the Overall column N, so the denominator and the percent calculation are affected.
The attached R script reproduces the problem when run from within the package, using the ADSL and ADAE datasets loaded from the package's vignettes folder.
I subset to females only and created two scenarios in the script:

2 subjects in the high dose column with NO adverse events in the AE data:
If you look at the resulting “big_Ns” dataset it shows 103 as the N for Overall, but shouldn’t it be 105 (sum of all treatment Ns)? The two subjects in the High Dose column are excluded from Overall N and denominator because they didn’t have adverse events.

The same two subjects in the high dose column but this time ONE subject has ONE adverse event in the AE data:
If you look at the resulting “big_Ns_ONE” dataset it shows 105 as the N for Overall. The two subjects in the High Dose column are included in the Overall N and denominator because ONE of them has an adverse event.

library(tidyverse)
library(Tplyr)


load(file = "./vignettes/adsl.Rdata")
load(file = "./vignettes/adae.Rdata")

# Exclude all but two high dose subjects for this example
adsl_sub <- adsl %>% filter(SEX == 'F' & TRT01A != 'Xanomeline High Dose')
two_highdose_F <- adsl %>% filter(USUBJID %in% c('01-701-1034','01-701-1239'))

adsl_subset <- rbind(adsl_sub,two_highdose_F) # Add the two back in

adae <- adae %>%
  # merge TRT01A from adsl
  left_join(adsl[,c("USUBJID", "TRT01A")], by ="USUBJID")

# Exclude AEs for all high dose subjects for this example
adae_subset <- adae %>% filter(TRT01A != 'Xanomeline High Dose')

data <- adae_subset %>%
  mutate(
         # assuming everything is TEAE for this count example
         teae = "TEAE"  )


# Counts with no AEs for the high dose column
tplyr_meta <- data %>% tplyr_table(TRT01A) %>%
  add_total_group("Overall") %>%
  set_pop_data(adsl_subset) %>%
  set_pop_treat_var(TRT01A) %>%
  set_distinct_by(USUBJID) %>%
  set_count_layer_formats(n_counts = f_str("xx (xx.x)", distinct_n, distinct_pct)) %>%
  add_layer(group_count(teae, by = "Participants with TEAE"))

# Since there are no AEs for High Dose subjects in the AE data,
# this Overall count EXCLUDES both of the subjects in the high dose
# column, so the percentages are based on 103 subjects rather than
# 105 subjects, which I think is incorrect.
dat <- build(tplyr_meta, metadata = TRUE)
big_Ns <- header_n(tplyr_meta)





# Now add one AE for ONE of the two high dose subjects back into the AE data
one_highdose_AE_F <- adae %>% filter(USUBJID == "01-701-1239" & AEDECOD == "SKIN ODOUR ABNORMAL")
adae_subset_ONE <- rbind(adae_subset, one_highdose_AE_F)

dataONE <- adae_subset_ONE %>%
  mutate(
         # assuming everything is TEAE for this count example
         teae = "TEAE"  )


# Counts with ONLY ONE AE for ONE subject in the high dose column
tplyr_meta_ONE <- dataONE %>% tplyr_table(TRT01A) %>%
  add_total_group("Overall") %>%
  set_pop_data(adsl_subset) %>%
  set_pop_treat_var(TRT01A) %>%
  set_distinct_by(USUBJID) %>%
  set_count_layer_formats(n_counts = f_str("xx (xx.x)", distinct_n, distinct_pct)) %>%
  add_layer(group_count(teae, by = "Participants with TEAE"))


# This Overall count includes both subjects in the high dose column even though
# only one of them has an AE and thus the percentages are based on 105 total subjects
# which I think is correct.
dat_ONE <- build(tplyr_meta_ONE, metadata = TRUE)
big_Ns_ONE <- header_n(tplyr_meta_ONE)

Michael Stackhouse · Answer 1 · Sat Jan 27 2024 04:48:10 GMT+0800 (China Standard Time)

This is actually not a bug, but rather a consequence of the order of operations and an opportunity to improve the documentation. Calling set_pop_data() before add_total_group() gives the expected Overall N:

tplyr_meta <- data %>% tplyr_table(TRT01A) %>%
  set_pop_data(adsl_subset) %>%
  add_total_group("Overall") %>%
  set_pop_treat_var(TRT01A) %>%
  set_distinct_by(USUBJID) %>%
  set_count_layer_formats(n_counts = f_str("xx (xx.x)", distinct_n, distinct_pct)) %>%
  add_layer(group_count(teae, by = "Participants with TEAE"))

dat <- build(tplyr_meta, metadata = TRUE)
big_Ns <- header_n(tplyr_meta)

The add_total_group() function extracts the treatment levels from the population data at the time it is called, but does all of its additional pre-processing at build time. If add_total_group() is called before set_pop_data(), the population data still points internally to the initial target dataset, which in this case is adae. Because the subset adae has no High Dose records, that level is never captured for the Overall group. By calling add_total_group() after set_pop_data(), the levels are instead pulled from adsl and the results are correct.
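
To make the effect of call order concrete, here is a minimal base R sketch of the mechanism described above. It does not call Tplyr; the toy data frames and the overall_n() helper are made up for illustration and only mimic the idea that the total group is built from whichever treatment levels were visible when it was defined.

```r
# Toy data, made up for illustration (not the package's ADSL/ADAE).
adsl_toy <- data.frame(
  USUBJID = c("S1", "S2", "S3", "S4", "S5"),
  TRT01A  = c("Placebo", "Placebo", "Low Dose", "High Dose", "High Dose"),
  stringsAsFactors = FALSE
)

# No AE rows for the High Dose arm, mirroring the first scenario.
adae_toy <- data.frame(
  USUBJID = c("S1", "S1", "S3"),
  TRT01A  = c("Placebo", "Placebo", "Low Dose"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count population subjects whose arm is among
# the levels that were captured for the total group.
overall_n <- function(pop_data, captured_levels) {
  sum(pop_data$TRT01A %in% captured_levels)
}

# Levels captured while the "population" is still the AE data
# (total group defined before the population data is set).
levels_before <- unique(adae_toy$TRT01A)

# Levels captured once the population data is the subject-level data
# (total group defined after the population data is set).
levels_after <- unique(adsl_toy$TRT01A)

n_before <- overall_n(adsl_toy, levels_before)  # 3: High Dose subjects dropped
n_after  <- overall_n(adsl_toy, levels_after)   # 5: all subjects counted
```

In this toy setup the two High Dose subjects fall out of the total only when the levels come from the AE data, which is the same shape as the 103 versus 105 discrepancy in the report.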

Four years after release, we recognize the flaw in this design. That's why, in the concepts we're drawing up for tplyr2, we want to move all data processing to the build stage: the order of function calls is definitely not intuitive without understanding the internals.
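
As a rough illustration of what deferring everything to build could look like, the following base R sketch uses a toy specification object. All function names here (new_spec, set_pop, add_total, build_header_n) are hypothetical and are not the Tplyr or tplyr2 API; the point is only that when the setters record intent and the build step resolves the data, call order stops mattering.

```r
# Toy data, made up for illustration.
adsl_toy <- data.frame(
  USUBJID = c("S1", "S2", "S3", "S4", "S5"),
  TRT01A  = c("Placebo", "Placebo", "Low Dose", "High Dose", "High Dose"),
  stringsAsFactors = FALSE
)
adae_toy <- data.frame(
  USUBJID = c("S1", "S1", "S3"),
  TRT01A  = c("Placebo", "Placebo", "Low Dose"),
  stringsAsFactors = FALSE
)

# Hypothetical spec: the population defaults to the target data.
new_spec <- function(target, treat_var) {
  list(target = target, pop_data = target,
       treat_var = treat_var, total_group = NULL)
}

# Setters only record what was asked for; they never inspect the data.
set_pop <- function(spec, pop_data) {
  spec$pop_data <- pop_data
  spec
}
add_total <- function(spec, name = "Total") {
  spec$total_group <- name
  spec
}

# All data processing happens here, against the final population data.
build_header_n <- function(spec) {
  arms <- spec$pop_data[[spec$treat_var]]
  counts <- table(arms)
  out <- data.frame(group = names(counts), n = as.integer(counts),
                    stringsAsFactors = FALSE)
  if (!is.null(spec$total_group)) {
    total <- data.frame(group = spec$total_group, n = length(arms),
                        stringsAsFactors = FALSE)
    out <- rbind(out, total)
  }
  out
}

# Total group added before the population data is set...
early <- build_header_n(
  set_pop(add_total(new_spec(adae_toy, "TRT01A"), "Overall"), adsl_toy)
)
# ...and after it. Both orders resolve against adsl_toy at build time.
late <- build_header_n(
  add_total(set_pop(new_spec(adae_toy, "TRT01A"), adsl_toy), "Overall")
)
```

Here early and late are the same data frame, with an Overall count that covers every subject in the toy population data.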