atorus-research / Tplyr

Home Page:https://atorus-research.github.io/Tplyr/

[Question] Adding custom summaries to count layer - Wilson confidence interval for %

Generalized opened this issue · comments

Dear Authors,

I know that custom summaries can be provided on the descriptive layer.

How can I do the same on the counting one?

I need a Wilson confidence interval (binom::binom.wilson()) for each %. To access it I need the current n (events or subjects, depending on scenario) and corresponding denominator. I need it both at the items and total level.
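
For reference, the interval itself needs nothing beyond the count and its denominator. Below is a minimal base-R sketch of the Wilson score interval; `wilson_ci` is a hypothetical helper name (not part of Tplyr or binom), and `binom::binom.wilson()` should give matching bounds.

```r
# Hypothetical helper: Wilson score interval for x successes out of n.
# Vectorised over x and n; returns proportions, not percentages.
wilson_ci <- function(x, n, conf.level = 0.95) {
  z      <- qnorm(1 - (1 - conf.level) / 2)
  p      <- x / n
  denom  <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  data.frame(
    x = x, n = n,
    lower = pmax(0, centre - half),  # clamp tiny floating-point overshoots
    upper = pmin(1, centre + half)
  )
}

# Example: 1 out of 5 and 11 out of 24, printed as percentages
ci <- wilson_ci(c(1, 11), c(5, 24))
print(round(100 * ci[, c("lower", "upper")], 1))
```

Because the helper is vectorised, one call covers a whole column of cells, and both bounds come back from that single call.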

Ideally, it should also support nested (hierarchical) counting, so that the most deeply nested items sum to 100% of their parent total, and each parent group's % is in turn calculated with respect to its own parent. The topmost % is then 100%.
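
To make the denominator rule concrete, here is a small base-R sketch on toy data. It is only an illustration of the percentages I mean, not Tplyr's nested count logic; the data frame `ae` and all column names other than SOC and PT are made up for the example. It counts events; for subject counts the rows would first be de-duplicated by subject.

```r
# Toy event-level data (hypothetical, smaller than the data in this issue)
ae <- data.frame(
  SOC = c("SOC1", "SOC1", "SOC1", "SOC2", "SOC2"),
  PT  = c("PT1-1", "PT1-1", "PT1-2", "PT2-1", "PT2-2"),
  stringsAsFactors = FALSE
)

# Inner level: count PT within SOC; the denominator is the parent SOC total
pt <- aggregate(n ~ SOC + PT, data = transform(ae, n = 1), FUN = sum)
pt$parent_n <- ave(pt$n, pt$SOC, FUN = sum)
pt$pct      <- 100 * pt$n / pt$parent_n

# Outer level: SOC totals; the denominator is the grand total
soc <- aggregate(n ~ SOC, data = pt, FUN = sum)
soc$parent_n <- sum(soc$n)
soc$pct      <- 100 * soc$n / soc$parent_n

print(pt)
print(soc)
```

Each `n` / `parent_n` pair is exactly what a CI function would need at that level.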

My data:

data <-
structure(list(PatID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 
2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4), SOC = c("SOC1", "SOC1", 
"SOC1", "SOC1", "SOC2", "SOC2", "SOC2", "SOC3", "SOC3", "SOC1", 
"SOC1", "SOC1", "SOC2", "SOC5", "SOC5", "SOC10", "SOC11", "SOC11", 
"SOC1", "SOC1", "SOC1", "SOC1", "SOC2", "SOC2"), PT = c("PT1-1", 
"PT1-1", "PT1-2", "PT1-3", "PT2-1", "PT2-1", "PT2-2", "PT3-3", 
"PT3-4", "PT1-1", "PT1-1", "PT1-2", NA, "PT5-10", "PT5-2", "PT10-2", 
"PT11-1", "PT11-2", "PT1-1", "PT1-1", "PT1-2", "PT1-3", "PT2-1", 
"PT2-2"), Visit = c(1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 3, 1, 
1, 1, 1, 3, 1, 1, 1, 1, 1, 2), ARM = c("SoC", "SoC", "SoC", "SoC", 
"SoC", "SoC", "SoC", "SoC", "SoC", "Arm1", "Arm1", "Arm1", "Arm1", 
"Arm1", "Arm1", "Arm1", "Arm1", "Arm1", "SoC", "SoC", "SoC", 
"SoC", "SoC", "SoC"), SIDE = c("LEFT", "LEFT", "LEFT", "RIGHT", 
"RIGHT", "RIGHT", "RIGHT", "RIGHT", "LEFT", "RIGHT", "RIGHT", 
"LEFT", "LEFT", "LEFT", "LEFT", "LEFT", "RIGHT", "RIGHT", "LEFT", 
"LEFT", "RIGHT", "LEFT", "RIGHT", "RIGHT"), Recovered = c("YES", 
"YES", "YES", "YES", "NO", NA, NA, NA, "NO", "NO", NA, NA, "NO", 
"NO", NA, "YES", NA, "YES", NA, "YES", "YES", "YES", "NO", "NO"
), data = c(1, 1, 1, NA, 5, 5, 5, 5, 1, 10, 10, 20, 20, 20, 20, 
20, 10, 10, 1, 1, 5, 1, 5, 5)), row.names = c(NA, 24L), class = "data.frame")
> data %>% 
   unite("ARM_SIDE", c(ARM, SIDE)) %>% 
   tplyr_table(ARM_SIDE) %>% 
   add_total_group() %>% 
   add_layer(
     layer = group_count(SOC)) %>% 
   build() 

# A tibble: 6 × 8
  row_label1 var1_Arm1_LEFT var1_Arm1_RIGHT var1_SoC_LEFT var1_SoC_RIGHT var1_Total    ord_layer_index ord_layer_1
  <chr>      <chr>          <chr>           <chr>         <chr>          <chr>                   <int>       <dbl>
1 SOC1       " 1 ( 20.0%)"  " 2 ( 50.0%)"   " 6 ( 85.7%)" " 2 ( 25.0%)"  "11 ( 45.8%)"               1           1
2 SOC10      " 1 ( 20.0%)"  " 0 (  0.0%)"   " 0 (  0.0%)" " 0 (  0.0%)"  " 1 (  4.2%)"               1           2
3 SOC11      " 0 (  0.0%)"  " 2 ( 50.0%)"   " 0 (  0.0%)" " 0 (  0.0%)"  " 2 (  8.3%)"               1           3
4 SOC2       " 1 ( 20.0%)"  " 0 (  0.0%)"   " 0 (  0.0%)" " 5 ( 62.5%)"  " 6 ( 25.0%)"               1           4
5 SOC3       " 0 (  0.0%)"  " 0 (  0.0%)"   " 1 ( 14.3%)" " 1 ( 12.5%)"  " 2 (  8.3%)"               1           5
6 SOC5       " 2 ( 40.0%)"  " 0 (  0.0%)"   " 0 (  0.0%)" " 0 (  0.0%)"  " 2 (  8.3%)"               1           6

Where should I put the procedure calculating Wilson score CI? Ideally I want to merge it with the current cell content after the "\n" character (will be formatted later via flextable package).
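
As an illustration of the cell content I am after, here is a sketch that builds the merged string outside Tplyr from a count and its denominator. The data frame `cells` and its column names are hypothetical, and the Wilson bounds are computed inline with the standard score-interval formula rather than through binom.

```r
# Hypothetical per-cell counts; in practice these would be the same
# numerators and denominators that are used for the percentages.
cells <- data.frame(
  SOC   = c("SOC1", "SOC10", "SOC11"),
  n     = c(11, 1, 2),
  total = c(24, 24, 24),
  stringsAsFactors = FALSE
)

# Wilson score bounds, computed once per cell (95% level)
z      <- qnorm(0.975)
p      <- cells$n / cells$total
denom  <- 1 + z^2 / cells$total
centre <- (p + z^2 / (2 * cells$total)) / denom
half   <- z * sqrt(p * (1 - p) / cells$total + z^2 / (4 * cells$total^2)) / denom

# "n (pct%)" on the first line, "lower ; upper" after the line break
cells$cell <- sprintf("%2d (%5.1f%%)\n%4.1f ; %4.1f",
                      as.integer(cells$n), 100 * p,
                      100 * (centre - half), 100 * (centre + half))
cat(cells$cell[1], "\n")
```

The embedded "\n" is left in the string so that a later formatting step can turn it into a line break inside the cell.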

Another option is to obtain a 2-row layout (it's quite common) like below:

  row_label1 Stat     var1_Arm1_LEFT var1_Arm1_RIGHT var1_SoC_LEFT var1_SoC_RIGHT var1_Total    ord_layer_index ord_layer_1
  <chr>      <chr>    <chr>          <chr>           <chr>         <chr>          <chr>                   <int>       <dbl>
  SOC1       "n (%)"  " xx.x ; xx.x" " 2 ( 50.0%)"   " 6 ( 85.7%)" " 2 ( 25.0%)"  "11 ( 45.8%)"               1           1
  SOC1       "95% CI" "xx.x ; xx.x"  "xx.x ; xx.x"   etc.

  SOC10      "n (%)"  " xx.x ; xx.x" " 2 ( 50.0%)"   " 6 ( 85.7%)" " 2 ( 25.0%)"  "11 ( 45.8%)"               1           1
  SOC10      "95% CI" "xx.x ; xx.x"  "xx.x ; xx.x"   etc.
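
The two-row layout can be sketched in base R by stacking the two statistics and sorting so that each label's rows stay together. The data frame `wide`, its columns `n_pct` and `ci`, and the placeholder CI strings are all hypothetical; only one result column is shown.

```r
# Hypothetical pre-formatted strings for one column; CI values are placeholders
wide <- data.frame(
  row_label1 = c("SOC1", "SOC10"),
  n_pct      = c("11 ( 45.8%)", " 1 (  4.2%)"),
  ci         = c("xx.x ; xx.x", "xx.x ; xx.x"),
  stringsAsFactors = FALSE
)

# Stack the two statistics into one long table
long <- rbind(
  data.frame(row_label1 = wide$row_label1, Stat = "n (%)",
             var1_Total = wide$n_pct, stringsAsFactors = FALSE),
  data.frame(row_label1 = wide$row_label1, Stat = "95% CI",
             var1_Total = wide$ci, stringsAsFactors = FALSE)
)

# Keep the original label order, with "n (%)" above "95% CI" within a label
stat_order <- match(long$Stat, c("n (%)", "95% CI"))
long <- long[order(match(long$row_label1, wide$row_label1), stat_order), ]
rownames(long) <- NULL
print(long)
```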

OK, let's start with something simple and obtain the CIs. Then I will think about how to merge them.

I found that I can access all the components used for the calculation of % (numerator and denominator).
Let's use it:

> data %>% 
+   unite("ARM_SIDE", c(ARM, SIDE)) %>% 
+   tplyr_table(ARM_SIDE) %>% 
+   add_total_group() %>% 
+   add_layer(
+     layer = group_count(SOC) %>% 
+       set_distinct_by(PatID) %>% 
+         set_format_strings(f_str("xx.x ; xx.x", binom::binom.wilson(x = distinct, n=distinct_total)$lower, 
+                                  binom::binom.wilson(x = distinct, n=distinct_total)$upper))) %>% 
+   build() 
Error: In `f_str` all values submitted via `...` must be variable names.

Failed. I cannot pass computed results directly; only a closed set of recognized variable names is accepted.
It's also highly inefficient to call the same function multiple times, so let's try with(), just to check:

> data %>% 
+   unite("ARM_SIDE", c(ARM, SIDE)) %>% 
+   tplyr_table(ARM_SIDE) %>% 
+   add_total_group() %>% 
+   add_layer(
+     layer = group_count(SOC) %>% 
+       set_distinct_by(PatID) %>% 
+         set_format_strings( with(binom::binom.wilson(x = distinct, n=distinct_total), f_str("xx.x ; xx.x", lower, upper)))) %>% 
+   build() 
Error: In `set_format_string` entry 1 is not an `f_str` object. All assignmentes made within `set_format_string` must be made using the function `f_str`. See the `f_str` documentation.

OK, this doesn't get me any further.


EDIT: I tried set_custom_summaries (as inefficient as before), but didn't succeed. It works only with descriptive layers.

> data %>% 
+   unite("ARM_SIDE", c(ARM, SIDE)) %>% 
+   tplyr_table(ARM_SIDE) %>% 
+   add_total_group() %>% 
+   add_layer(
+     layer = group_count(SOC) %>% 
+       set_distinct_by(PatID) %>% 
+         set_custom_summaries(
+             Wilson_LO = binom::binom.wilson(x = distinct, n=distinct_total)$lower,
+             Wilson_HI = binom::binom.wilson(x = distinct, n=distinct_total)$upper) %>% 
+         set_format_strings(f_str("xx.x ; xx.x", Wilson_LO, Wilson_HI))) %>% 
+   build() 
Error in `assert_inherits_class()`:
! Argument `e` does not inherit "desc_layer". Classes: tplyr_layer, count_layer, environment
Run `rlang::last_error()` to see where the error occurred.
Warning message:
In max(trc$indices) : no non-missing arguments to max; returning -Inf

Is there any other place where I can put the calculation of the CI and re-use the returned lower/upper bounds of the CI?


EDIT: I tried defining my own "add-in" that adds the necessary variables (ones f_str can understand) to the environment, but the package guards against external functions, and its environment is locked against adding them.

> f <- function (e, x) 
 {
   env_bind(e, xx = 2*x)
   e
 }

> data %>% 
   unite("ARM_SIDE", c(ARM, SIDE)) %>% 
   tplyr_table(ARM_SIDE) %>% 
   add_total_group() %>% 
   add_layer(
     layer = group_count(SOC) %>% 
       set_distinct_by(PatID) %>% 
       f(2) %>% 
       set_format_strings(f_str("xx.x", xx))) %>% 
   build() 
Error: Functions called within `add_layer` must be part of `Tplyr`

OK, so I tried to "hack" Tplyr and add my own function so that it's recognized by the engine, but it turned out the protection goes further:

# A little dummy function, only to check if I can provide my own new variable (e.g. holding the lower or upper CI bound).
# If I could do this, then I could generate any kind of summary I need and merge it with the rest of the layer.

f <- function (e, x) 
{
  env_bind(e, xx = 2*x)
  e
}

rlang::env_unlock(environment(Tplyr::add_layer))  #unlocking the Tplyr environment
ns <- environment(Tplyr::add_layer)
assign("f", f, envir = ns) 
environment(f) <- ns  # some of these calls may be redundant, just for "safety"
assignInNamespace("f", f, ns = ns) 
namespaceExport(ns, "f")

# Now let's try using it with Tplyr

> data %>% 
   unite("ARM_SIDE", c(ARM, SIDE)) %>% 
   tplyr_table(ARM_SIDE) %>% 
   add_total_group() %>% 
   add_layer(
     layer = group_count(SOC) %>% 
       set_distinct_by(PatID) %>% 
       f(2) %>%  # some number just to trace the result
       set_format_strings(f_str("xx.x", xx))) %>% 
   build() 
Error: f_str for n_counts in a count_layer can only be n, pct, distinct, distinct_pct, total, or distinct_total

Good! Now f() is part of Tplyr and passes the getNamespaceExports("Tplyr") checking.
Bad! The names of variables that can be used with this layer are fixed.

Putting it here also seems nonsensical, as the numerators and denominators aren't known at this level.

It could be on the level of f_str(), combined via paste() or sprintf():

data %>% 
  unite("ARM_SIDE", c(ARM, SIDE)) %>% 
  tplyr_table(ARM_SIDE) %>% 
  add_total_group() %>% 
  add_layer(
    layer = group_count(SOC) %>% 
      set_distinct_by(PatID) %>% 
      set_format_strings(f_str( paste("xx.x", 
                                      round(f1(distinct_n, distinct_total)$lower, 2)), n))) %>% 
  build() 

But distinct_n and distinct_total aren't known inside the paste function...

f_str won't allow for other variables, set_format_strings won't allow for anything else than f_str.

And here my "creativity" ends. Briefly - while the numerical summary is extensible, the categorical one is not. Would you consider enabling this as well?

This should be done in a more general way: allow any custom function that takes n, distinct n, and the appropriate denominators, and produces custom columns. It's very common to add various CIs (for %, or for their differences, like Miettinen-Nurminen), p-values of various tests, effect sizes of various kinds (for differences in proportions) along with their own bootstrapped CIs, and so on. So if there were a common mechanism letting users supply their own custom functions that return custom columns, reusable within the count layer (and maybe also the shift layer), it would be really great!
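
To illustrate the kind of mechanism I mean, here is a rough base-R sketch. Nothing in it is Tplyr API: `add_count_summaries`, the `counts` data frame and its column names are all hypothetical, and a real implementation would also need to pass the distinct counts and their denominators.

```r
# Hypothetical interface: each custom summary is a function of (n, total)
# that returns one value per cell; each result becomes a new column.
add_count_summaries <- function(counts, ...) {
  fns <- list(...)
  stopifnot(length(fns) > 0, !is.null(names(fns)), all(names(fns) != ""))
  for (nm in names(fns)) {
    counts[[nm]] <- fns[[nm]](counts$n, counts$total)
  }
  counts
}

# Hypothetical per-cell counts and denominators
counts <- data.frame(SOC = c("SOC1", "SOC2"), n = c(11, 6), total = c(24, 24),
                     stringsAsFactors = FALSE)

out <- add_count_summaries(
  counts,
  pct     = function(n, total) 100 * n / total,
  wald_se = function(n, total) sqrt((n / total) * (1 - n / total) / total)
)
print(out)
```

The new column names (`pct`, `wald_se` here) are what a format string would then need to be allowed to refer to.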

To make it absolutely clear: I am not asking to turn Tplyr into a statistical suite. That would be nonsense, as it's just for tabular summaries. I mean enabling users to provide their own summaries instead.