question: How to simulate correlated difference scores?

maciejkos opened this issue 2 years ago

Hi,

First off — thank you for creating this package! Would you be able to answer my question about simulating correlated difference scores?

To perform power analysis, I want to simulate two sets of scores with their differences correlated.

There are two tests, x and y, administered at time points 1 and 2:

x1 and x2 are scores on test x administered at times 1 and 2, respectively
y1 and y2 are scores on test y administered at times 1 and 2, respectively

Given some arbitrary values of x1 and some correlations between:

x1 and x2,
x1 and y1,

I want to simulate data where (x2 - x1) and (y2 - y1) are correlated, with the correlation equal to some corr_deltas.

Could you please tell me if this is possible to do using your package and how to go about it?

Thanks!
— Maciej

Lisa DeBruine · Answer 1 · Tue Mar 15 2022 03:40:41 GMT+0800 (China Standard Time)

Could you send your source data parameters in a table like the one below? The x1:y_diff columns are the correlation matrix, and the last two columns are the means and SDs. Put NA for anything you don't know. I think the x_diff and y_diff correlations might be an emergent property of the x1, x2, y1, and y2 correlations, but I need to check with some real data.

     n    var    x1   x2    y1   y2 x_diff y_diff mean   sd
1 1000     x1  1.00 0.60  0.40 0.20  -0.45  -0.22    0 1.00
2 1000     x2  0.60 1.00  0.20 0.40   0.45   0.22    0 1.00
3 1000     y1  0.40 0.20  1.00 0.60  -0.22  -0.45    0 1.00
4 1000     y2  0.20 0.40  0.60 1.00   0.22   0.45    0 1.00
5 1000 x_diff -0.45 0.45 -0.22 0.22   1.00   0.50    0 0.89
6 1000 y_diff -0.22 0.22 -0.45 0.45   0.50   1.00    0 0.89

Lisa DeBruine · Answer 2 · Tue Mar 15 2022 03:46:56 GMT+0800 (China Standard Time)

For example, the following code always results in the exact same correlation between x_diff and y_diff for any given correlation matrix for x1:y2 (set empirical = TRUE to make sure each sample has the exact mean, sd, and r parameters). So there must be a systematic relationship between the parameters of x1:y2 and the parameters of the difference scores.

library(faux)

r <- c(.6, .4, .2,
           .2, .4,
               .6)

dat <- rnorm_multi(
  n = 1000,
  varnames = c("x1", "x2", "y1", "y2"),
  r = r,
  empirical = TRUE
)

dat$x_diff <- dat$x2 - dat$x1
dat$y_diff <- dat$y2 - dat$y1

check_sim_stats(dat)
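
The systematic relationship can be written out with the standard covariance rules for linear combinations. This is a sketch added for illustration rather than something taken from the faux documentation; the symbols \sigma, \rho, and r_diff are notation introduced here.

```latex
\begin{aligned}
\operatorname{Var}(x_2 - x_1) &= \sigma_{x_1}^2 + \sigma_{x_2}^2 - 2\,\rho_{x_1 x_2}\,\sigma_{x_1}\sigma_{x_2} \\
\operatorname{Cov}(x_2 - x_1,\; y_2 - y_1) &= \operatorname{Cov}(x_2, y_2) - \operatorname{Cov}(x_2, y_1) - \operatorname{Cov}(x_1, y_2) + \operatorname{Cov}(x_1, y_1) \\
r_{\text{diff}} &= \frac{\operatorname{Cov}(x_2 - x_1,\; y_2 - y_1)}{\sqrt{\operatorname{Var}(x_2 - x_1)\,\operatorname{Var}(y_2 - y_1)}}
\end{aligned}

% Worked example with the correlation matrix above (all SDs equal to 1,
% so each covariance equals the corresponding correlation):
\begin{aligned}
\operatorname{Var}(x_2 - x_1) &= \operatorname{Var}(y_2 - y_1) = 1 + 1 - 2(0.6) = 0.8 \\
\operatorname{Cov}(x_2 - x_1,\; y_2 - y_1) &= 0.4 - 0.2 - 0.2 + 0.4 = 0.4 \\
r_{\text{diff}} &= \frac{0.4}{\sqrt{0.8 \times 0.8}} = 0.5
\end{aligned}
```

These values agree with the table in the previous answer, where x_diff and y_diff correlate at 0.50 and each has an SD of about 0.89 (the square root of 0.8).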

Maciej Kos · Answer 3 · Tue Mar 15 2022 05:48:56 GMT+0800 (China Standard Time)

Thank you for responding so quickly, Lisa!

I'll start by giving some additional information and answer your question next.

Context

The tests:

Test x is our gold standard. Test y is a novel test, which hasn't been used yet, and we know nothing about it.
Both tests are administered at times 1 and 2, six months apart, to the same sample of 40 participants.

Difference scores:

I expect that test x will yield higher scores at time 2 because participants will undergo medical treatment. I don't know how large the improvement will be. Qualitatively, clinicians say that it will be "very large."
If test y works correctly, it will also yield higher scores at time 2.

The study objective is to check whether improvements measured by test x correlate with improvements measured by test y.

The study hasn't started yet. I want to analyze how much power I will have given the sample size, different potential effect sizes, and different properties of test y, which we know nothing about. For this reason, I want to simulate the data with the difference scores correlated.

You asked: "Could you send your source data parameters like this?"


# Test x is our gold standard. 
# Results of test x could have these parameters. 

mu_x1 = 5
sd_x1 = 2.5
delta_x1_x2 = 3 # we expect to see a higher score at time 2
mu_x2 = mu_x1 + delta_x1_x2 # we expect to see a higher score at time 2
sd_x2 = sd_x1
corr_x1_x2 = 0.8 # test x is our gold standard so we expect high correlation between measurements at 1 and 2

# Test y is a novel test. We don't know anything about it.
# For simulation, let's assume that:

mu_y1 = mu_x1
sd_y1 = sd_x1 # = 2.5
corr_x1_y1 = 0.8

# I want to find y2 such that r(x2-x1,y2-y1) = corr_deltas

library(faux)
library(dplyr) # for %>% and mutate()

set.seed(321)
df <- rnorm_multi(
  varnames = "x1", n = 100, vars = 1, mu = mu_x1, sd = sd_x1) %>% 
  mutate(x2 = rnorm_pre(x1, mu = mu_x2, sd = sd_x2, r = corr_x1_x2), # x2 correlates with x1
         y1 = rnorm_pre(x1, mu = mu_y1, sd = sd_y1, r = corr_x1_y1), # y1 correlates with x1
         y2 = NA,  # placeholder: y2 should correlate with y1, with r(x2-x1,y2-y1) = corr_deltas
         x_diff = x2 - x1,
         y_diff = y2 - y1)

check_sim_stats(df)

    n    var    x1   x2    y1 x_diff y_diff mean   sd
1 100     x1  1.00 0.75  0.81  -0.20     NA 5.02 2.39
2 100     x2  0.75 1.00  0.58   0.50     NA 7.77 2.70
3 100     y1  0.81 0.58  1.00  -0.19     NA 6.83 2.57
4 100 x_diff -0.20 0.50 -0.19   1.00     NA 2.75 1.81
5 100 y_diff    NA   NA    NA     NA      1   NA   NA

write.csv(file = "source_data_parameters.csv", x = as.data.frame(check_sim_stats(df))) # file attached

source_data_parameters.csv

I could provide additional parameters for y2 (see below), but I am not sure whether it would be possible to find values of y2 such that r(x_diff, y_diff) = corr_deltas, as you hinted at in your reply.

delta_y1_y2 = 3
mu_y2 = mu_y1 + delta_y1_y2
sd_y2 = sd_x1 # =2.5
corr_y1_y2 = 0.9

I was hoping I could do something like the following, but I couldn't get it to work.

corr_deltas = 0.9

rnorm_multi(
  varnames = "x1", n = 100,vars = 1, mu = mu_x1, sd = sd_x1,) %>% 
  mutate(x2 = rnorm_pre(x1, mu = mu_x2, sd = sd_x2, r = corr_x1_x2), # x2 correlates with x1
         y1 = rnorm_pre(x1, mu = mu_y1, sd = sd_x1, r = corr_x1_y1), # y1 correlates with x1
         y2 = rnorm_pre(x = c(x2-x1,y2-y1), r = corr_deltas) # find y2 such that r(x2-x1,y2-y1) = corr_deltas
         )

Thank you again for your help!
— Maciej

Maciej Kos · Answer 4 · Tue Mar 15 2022 13:11:17 GMT+0800 (China Standard Time)

After some fiddling, I think I have done it. It does feel a bit like a hack, so I will need to think about it more. I might also need to add some more parameters.

I would appreciate your feedback, Lisa.

Thank you for your time,
— Maciej


# Test x is our gold standard. 
# Results of test x could have these parameters. 

mu_x1 = 5
sd_x1 = 2.5
delta_x1_x2 = 3 # we expect to see a higher score at time 2
mu_x2 = mu_x1 + delta_x1_x2 # we expect to see a higher score at time 2
sd_x2 = sd_x1
corr_x1_x2 = 0.8 # test x is our gold standard so we expect high correlation between measurements at 1 and 2

# Test y is a novel test. We don't know anything about it.
# For simulation, let's assume that:

mu_y1 = mu_x1
sd_y1 = sd_x1 + 1 # = 3.5

corr_x1_y1 = 0.8

corr_deltas = 0.9876
df <- 
  rnorm_multi(
    varnames = c("x1", "y1"), n = 40, vars = 2, mu = c(mu_x1, mu_y1), sd = c(sd_x1,sd_y1), r = (corr_x1_y1), empirical = TRUE) %>%
  mutate(
    x2 = rnorm_pre(x1, mu = mu_x1 + 3, sd = sd_x2, r = corr_x1_x2, empirical = TRUE),
    delta_x = x2 - x1,
    delta_y_correlated_with_delta_x = rnorm_pre(delta_x, 
                                                r = corr_deltas, empirical = TRUE),
    y2 = y1 + delta_y_correlated_with_delta_x,
    delta_y = y2 - y1,
  )

cor.test(df$delta_x, df$delta_y, method = "pearson")

	Pearson's product-moment correlation

data:  df$delta_x and df$delta_y
t = 38.779, df = 38, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9765110 0.9934713
sample estimates:
   cor 
0.9876

check_sim_stats(df)

   n                             var    x1    y1   x2 delta_x delta_y_correlated_with_delta_x   y2 delta_y mean   sd
1 40                              x1  1.00  0.80 0.80   -0.32                           -0.34 0.73   -0.34    5 2.50
2 40                              y1  0.80  1.00 0.66   -0.22                           -0.26 0.96   -0.26    5 3.50
3 40                              x2  0.80  0.66 1.00    0.32                            0.28 0.77    0.28    8 2.50
4 40                         delta_x -0.32 -0.22 0.32    1.00                            0.99 0.06    0.99    3 1.58
5 40 delta_y_correlated_with_delta_x -0.34 -0.26 0.28    0.99                            1.00 0.02    1.00    0 1.00
6 40                              y2  0.73  0.96 0.77    0.06                            0.02 1.00    0.02    5 3.38
7 40                         delta_y -0.34 -0.26 0.28    0.99                            1.00 0.02    1.00    0 1.00

Lisa DeBruine · Answer 5 · Tue Mar 15 2022 19:02:18 GMT+0800 (China Standard Time)

Here's one attempt (your parameters are at the top). I set empirical = TRUE to make it easier to check that you set the simulation parameters right (they should be exactly the same in check_sim_stats), but you can change it to FALSE (the default) to make those the population parameters for simulations.

The pattern is to simulate x1 and x2, calculate x_diff, simulate y1 in relation to x1, simulate y_diff in relation to x_diff, and calculate y2 from y1 + y_diff.

library(faux)
library(tidyverse)

# Test x is our gold standard. 
# Results of test x could have these parameters. 

mu_x1 = 5
sd_x1 = 2.5
delta_x1_x2 = 3 # we expect to see a higher score at time 2
mu_x2 = mu_x1 + delta_x1_x2 # we expect to see a higher score at time 2
sd_x2 = sd_x1
corr_x1_x2 = 0.8 # test x is our gold standard so we expect high correlation between measurements at 1 and 2

# Test y is a novel test. We don't know anything about it.
# For simulation, let's assume that:

mu_y1 = mu_x1
sd_y1 = sd_x1 # = 2.5
corr_x1_y1 = 0.8

# I want to find y2 such that r(x2-x1,y2-y1) = corr_deltas

# additional needed parameters
corr_deltas = .5
mu_y_diff = 3 
sd_y_diff = 1.5
# setting empirical = TRUE treats mu, sd, and r as sample, not population, parameters
empirical = TRUE

df <- rnorm_multi(
  n = 100,
  varnames = c("x1", "x2"),  
  mu = c(mu_x1, mu_x2), 
  sd = c(sd_x1, sd_x2), 
  r = corr_x1_x2,
  empirical = empirical) %>% 
  mutate(
    x_diff = x2 - x1,
    y1 = rnorm_pre(x1, mu = mu_y1, sd = sd_y1, 
                   r = corr_x1_y1, empirical = empirical), 
    # mu_y_diff and sd_y_diff are named so they cannot clash with the y_diff column
    y_diff = rnorm_pre(x_diff, mu = mu_y_diff, sd = sd_y_diff,
                       r = corr_deltas, empirical = empirical),
    y2 = y1 + y_diff
  ) %>%
  select(x1, x2, y1, y2, x_diff, y_diff)

check_sim_stats(df)
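
For the power analysis itself, the simulation can be repeated many times and the proportion of significant tests counted. Below is a minimal base R sketch, not part of faux: `sim_power()` and its arguments are hypothetical names introduced here, and the difference scores are drawn as standardized normal variables, on the assumption that only their correlation matters (Pearson's r does not depend on the means or SDs). The data-generating step could be swapped for the `rnorm_multi()` pipeline above with `empirical = FALSE`.

```r
# Hypothetical helper (not part of faux): estimate the power to detect a
# correlation between two sets of difference scores by simulation.
sim_power <- function(n, rho, n_sims = 2000, alpha = 0.05) {
  p_values <- replicate(n_sims, {
    x_diff <- rnorm(n)
    # Build y_diff so that its population correlation with x_diff is rho
    y_diff <- rho * x_diff + sqrt(1 - rho^2) * rnorm(n)
    cor.test(x_diff, y_diff, method = "pearson")$p.value
  })
  # Power = proportion of simulated studies with p < alpha
  mean(p_values < alpha)
}

set.seed(321)
rhos <- c(0.3, 0.5, 0.7)  # assumed candidate values of corr_deltas
power_by_rho <- sapply(rhos, function(rho) sim_power(n = 40, rho = rho))
names(power_by_rho) <- rhos
power_by_rho
```

Running this for several values of `rho` and `n` gives a rough picture of how power changes with the assumed correlation between the difference scores and with the sample size.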

Lisa DeBruine · Answer 6 · Tue Mar 15 2022 19:05:50 GMT+0800 (China Standard Time)

Oops, I didn't see your reply when I posted my response. I think we solved it basically the same way.

Maciej Kos · Answer 7 · Wed Mar 16 2022 10:11:12 GMT+0800 (China Standard Time)

Brilliant - thank you!
— M.