question: How to simulate correlated difference scores?
maciejkos opened this issue
Hi,
First off — thank you for creating this package! Would you be able to answer my question about simulating correlated difference scores?
To perform power analysis, I want to simulate two sets of scores with their differences correlated.
There are two tests, x and y, administered at time points 1 and 2:
- x1 and x2 are scores on test x administered at times 1 and 2, respectively
- y1 and y2 are scores on test y administered at times 1 and 2, respectively
Given some arbitrary values of x1 and some correlations between:
- x1 and x2,
- x1 and y1,
I want to simulate data where (x2 - x1) and (y2 - y1) are correlated, with the correlation equal to some corr_deltas.
Could you please tell me if this is possible to do using your package and how to go about it?
Thanks!
— Maciej
Could you send your source data parameters like this? The x1:y_diff columns are the correlation matrix, and the last two columns hold the means and SDs. Put NA for anything you don't know. I think the x_diff and y_diff correlations might be an emergent property of the x1, x2, y1, and y2 correlations, but I need to check with some real data.
     n    var    x1    x2    y1    y2 x_diff y_diff mean   sd
1 1000     x1  1.00  0.60  0.40  0.20  -0.45  -0.22    0 1.00
2 1000     x2  0.60  1.00  0.20  0.40   0.45   0.22    0 1.00
3 1000     y1  0.40  0.20  1.00  0.60  -0.22  -0.45    0 1.00
4 1000     y2  0.20  0.40  0.60  1.00   0.22   0.45    0 1.00
5 1000 x_diff -0.45  0.45 -0.22  0.22   1.00   0.50    0 0.89
6 1000 y_diff -0.22  0.22 -0.45  0.45   0.50   1.00    0 0.89
For example, the following code always results in the exact same correlation between x_diff and y_diff for any given correlation matrix for x1:y2 (set empirical = TRUE to make sure each sample has the exact mean, sd, and r parameters). So there must be a systematic relationship between the parameters of x1:y2 and the parameters of the difference scores.
library(faux)

# upper triangle of the x1:y2 correlation matrix, row by row
r <- c(.6, .4, .2,
           .2, .4,
               .6)

dat <- rnorm_multi(
  n = 1000,
  varnames = c("x1", "x2", "y1", "y2"),
  r = r,
  empirical = TRUE
)

dat$x_diff <- dat$x2 - dat$x1
dat$y_diff <- dat$y2 - dat$y1

check_sim_stats(dat)
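The systematic relationship can be written down directly: the covariance of two difference scores is a linear combination of the entries of the x1:y2 covariance matrix. Here is a minimal sketch in base R, assuming the variable order x1, x2, y1, y2. Note that diff_cor is a hypothetical helper written for illustration, not a faux function.

```r
# Correlation between x_diff = x2 - x1 and y_diff = y2 - y1 implied by a
# 4x4 correlation matrix (assumed variable order: x1, x2, y1, y2).
# diff_cor is a hypothetical helper, not part of faux.
diff_cor <- function(r_mat, sds = rep(1, 4)) {
  # covariance matrix from the correlations and SDs
  sigma <- diag(sds) %*% r_mat %*% diag(sds)
  # contrast vectors that pick out the two difference scores
  a <- c(-1, 1, 0, 0)  # x2 - x1
  b <- c(0, 0, -1, 1)  # y2 - y1
  cov_ab <- as.numeric(t(a) %*% sigma %*% b)
  var_a  <- as.numeric(t(a) %*% sigma %*% a)
  var_b  <- as.numeric(t(b) %*% sigma %*% b)
  cov_ab / sqrt(var_a * var_b)
}

# same correlations as in the example above
r_mat <- matrix(c(
  1.0, 0.6, 0.4, 0.2,
  0.6, 1.0, 0.2, 0.4,
  0.4, 0.2, 1.0, 0.6,
  0.2, 0.4, 0.6, 1.0
), nrow = 4, byrow = TRUE)

diff_cor(r_mat)  # 0.5, the x_diff/y_diff entry in the table above
```

With unit SDs this reduces to (r_x1y1 + r_x2y2 - r_x1y2 - r_x2y1) / sqrt((2 - 2 * r_x1x2) * (2 - 2 * r_y1y2)), so a target correlation between the difference scores constrains the four cross-test correlations jointly rather than being a free parameter.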
Thank you for responding so quickly, Lisa!
I'll start with some additional context and then answer your question.
Context
The tests:
- Test x is our gold standard. Test y is a novel test, which hasn't been used yet, and we know nothing about it.
- Both tests are administered at times 1 and 2, six months apart, to the same sample of 40 participants.
Difference scores:
- I expect that test x will yield higher scores at time 2 because participants will undergo medical treatment. I don't know how large the improvement will be. Qualitatively, clinicians say that it will be "very large."
- If test y works correctly, it will also yield higher scores at time 2.
The study objective is to check whether improvements measured by test x correlate with improvements measured by test y.
The study hasn't started yet. I want to analyze how much power I will have given the sample size, different potential effect sizes, and different properties of test y, which we know nothing about. For this reason, I want to simulate the data with the difference scores correlated.
"Could you send your source data parameters like this?"
# Test x is our gold standard.
# Results of test x could have these parameters.
mu_x1 = 5
sd_x1 = 2.5
delta_x1_x2 = 3 # we expect to see a higher score at time 2
mu_x2 = mu_x1 + delta_x1_x2 # we expect to see a higher score at time 2
sd_x2 = sd_x1
corr_x1_x2 = 0.8 # test x is our gold standard so we expect high correlation between measurements at 1 and 2
# Test y is a novel test. We don't know anything about it.
# For simulation, let's assume that:
mu_y1 = mu_x1
sd_y1 = sd_x1 # = 2.5
corr_x1_y1 = 0.8
# I want to find y2 such that r(x2-x1,y2-y1) = corr_deltas
library(faux)
library(dplyr)  # for %>% and mutate()

set.seed(321)
df <- rnorm_multi(varnames = "x1", n = 100, vars = 1, mu = mu_x1, sd = sd_x1) %>%
  mutate(x2 = rnorm_pre(x1, mu = mu_x2, sd = sd_x2, r = corr_x1_x2), # x2 correlates with x1
         y1 = rnorm_pre(x1, mu = mu_y1, sd = sd_y1, r = corr_x1_y1), # y1 correlates with x1
         y2 = NA, # placeholder: y2 should correlate with y1, with r(x2-x1, y2-y1) = corr_deltas
         x_diff = x2 - x1,
         y_diff = y2 - y1)
check_sim_stats(df)
    n    var    x1   x2    y1 x_diff y_diff mean   sd
1 100     x1  1.00 0.75  0.81  -0.20     NA 5.02 2.39
2 100     x2  0.75 1.00  0.58   0.50     NA 7.77 2.70
3 100     y1  0.81 0.58  1.00  -0.19     NA 6.83 2.57
4 100 x_diff -0.20 0.50 -0.19   1.00     NA 2.75 1.81
5 100 y_diff    NA   NA    NA     NA      1   NA   NA
write.csv(file = "source_data_parameters.csv", x = as.data.frame(check_sim_stats(df))) # file attached
I could provide additional parameters for y2 (see below), but I am not sure whether it would be possible to find values of y2 such that r(x_diff, y_diff) = corr_deltas, as you hinted in your reply.
delta_y1_y2 = 3
mu_y2 = mu_y1 + delta_y1_y2
sd_y2 = sd_x1 # =2.5
corr_y1_y2 = 0.9
I was hoping I could do something like the following, but I couldn't get it to work.
corr_deltas = 0.9
rnorm_multi(
varnames = "x1", n = 100,vars = 1, mu = mu_x1, sd = sd_x1,) %>%
mutate(x2 = rnorm_pre(x1, mu = mu_x2, sd = sd_x2, r = corr_x1_x2), # x2 correlates with x1
y1 = rnorm_pre(x1, mu = mu_y1, sd = sd_x1, r = corr_x1_y1), # y1 correlates with x1
y2 = rnorm_pre(x = c(x2-x1,y2-y1), r = corr_deltas) # find y2 such that r(x2-x1,y2-y1) = corr_deltas
)
Thank you again for your help!
— Maciej
After some fiddling, I think I have done it. It does feel a bit like a hack, so I will need to think about it more. I might also need to add some more parameters.
I would appreciate your feedback, Lisa.
Thank you for your time,
— Maciej
# Test x is our gold standard.
# Results of test x could have these parameters.
mu_x1 = 5
sd_x1 = 2.5
delta_x1_x2 = 3 # we expect to see a higher score at time 2
mu_x2 = mu_x1 + delta_x1_x2 # we expect to see a higher score at time 2
sd_x2 = sd_x1
corr_x1_x2 = 0.8 # test x is our gold standard so we expect high correlation between measurements at 1 and 2
# Test y is a novel test. We don't know anything about it.
# For simulation, let's assume that:
mu_y1 = mu_x1
sd_y1 = sd_x1 + 1 # = 3.5
corr_x1_y1 = 0.8
corr_deltas = 0.9876
df <- rnorm_multi(
  n = 40, vars = 2, varnames = c("x1", "y1"),
  mu = c(mu_x1, mu_y1), sd = c(sd_x1, sd_y1),
  r = corr_x1_y1, empirical = TRUE
) %>%
  mutate(
    x2 = rnorm_pre(x1, mu = mu_x2, sd = sd_x2, r = corr_x1_x2, empirical = TRUE),
    delta_x = x2 - x1,
    # no mu or sd given here, so rnorm_pre's defaults apply (mu = 0, sd = 1)
    delta_y_correlated_with_delta_x = rnorm_pre(delta_x, r = corr_deltas, empirical = TRUE),
    y2 = y1 + delta_y_correlated_with_delta_x,
    delta_y = y2 - y1
  )
cor.test(df$delta_x, df$delta_y, method = "pearson")
Pearson's product-moment correlation
data: df$delta_x and df$delta_y
t = 38.779, df = 38, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9765110 0.9934713
sample estimates:
cor
0.9876
check_sim_stats(df)
   n                             var    x1    y1   x2 delta_x delta_y_correlated_with_delta_x   y2 delta_y mean   sd
1 40                              x1  1.00  0.80 0.80   -0.32                           -0.34 0.73   -0.34    5 2.50
2 40                              y1  0.80  1.00 0.66   -0.22                           -0.26 0.96   -0.26    5 3.50
3 40                              x2  0.80  0.66 1.00    0.32                            0.28 0.77    0.28    8 2.50
4 40                         delta_x -0.32 -0.22 0.32    1.00                            0.99 0.06    0.99    3 1.58
5 40 delta_y_correlated_with_delta_x -0.34 -0.26 0.28    0.99                            1.00 0.02    1.00    0 1.00
6 40                              y2  0.73  0.96 0.77    0.06                            0.02 1.00    0.02    5 3.38
7 40                         delta_y -0.34 -0.26 0.28    0.99                            1.00 0.02    1.00    0 1.00
Here's one attempt (your parameters are at the top). I set empirical = TRUE to make it easier to check that you set the simulation parameters right (they should be exactly the same in check_sim_stats), but you can change it to FALSE (the default) to make those the population parameters for simulations.
The pattern is to simulate x1 and x2, calculate x_diff, simulate y1 in relation to x1, simulate y_diff in relation to x_diff, and calculate y2 from y1 + y_diff.
library(faux)
library(tidyverse)
# Test x is our gold standard.
# Results of test x could have these parameters.
mu_x1 = 5
sd_x1 = 2.5
delta_x1_x2 = 3 # we expect to see a higher score at time 2
mu_x2 = mu_x1 + delta_x1_x2 # we expect to see a higher score at time 2
sd_x2 = sd_x1
corr_x1_x2 = 0.8 # test x is our gold standard so we expect high correlation between measurements at 1 and 2
# Test y is a novel test. We don't know anything about it.
# For simulation, let's assume that:
mu_y1 = mu_x1
sd_y1 = sd_x1 # = 2.5
corr_x1_y1 = 0.8
# I want to find y2 such that r(x2-x1,y2-y1) = corr_deltas
# additional needed parameters
corr_deltas = .5
y_diff = 3
y_diff_sd = 1.5
# setting empirical = TRUE treats mu, sd, and r as sample, not population, parameters
empirical = TRUE
df <- rnorm_multi(
  n = 100,
  varnames = c("x1", "x2"),
  mu = c(mu_x1, mu_x2),
  sd = c(sd_x1, sd_x2),
  r = corr_x1_x2,
  empirical = empirical
) %>%
  mutate(
    x_diff = x2 - x1,
    y1 = rnorm_pre(x1, mu = mu_y1, sd = sd_y1,
                   r = corr_x1_y1, empirical = empirical),
    # mu = y_diff below resolves to the scalar defined above (3),
    # because the y_diff column does not exist yet at this point
    y_diff = rnorm_pre(x_diff, mu = y_diff, sd = y_diff_sd,
                       r = corr_deltas, empirical = empirical),
    y2 = y1 + y_diff
  ) %>%
  select(x1, x2, y1, y2, x_diff, y_diff)
check_sim_stats(df)
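For the power analysis itself, the same pattern could be wrapped in a function and repeated with empirical = FALSE. This is only a sketch under some assumptions: sim_p is a hypothetical helper, the 1000 replications and alpha = .05 are arbitrary choices, and n = 40 comes from the planned sample size. y1 and y2 are left out because the test on the difference scores uses only x_diff and y_diff. I have not checked how closely rnorm_pre with empirical = FALSE reproduces the sampling distribution of r, so treat the estimate as approximate.

```r
library(faux)

# Hypothetical helper (not part of faux): simulate one study and return the
# p-value for the correlation between the two difference scores.
sim_p <- function(n = 40, corr_deltas = 0.5) {
  x <- rnorm_multi(
    n = n,
    varnames = c("x1", "x2"),
    mu = c(5, 8),        # mu_x1, mu_x2 from the parameters above
    sd = c(2.5, 2.5),    # sd_x1, sd_x2
    r = 0.8,             # corr_x1_x2
    empirical = FALSE    # population parameters, so each sample varies
  )
  x_diff <- x$x2 - x$x1
  # y_diff mean and SD (3 and 1.5) are the assumed values used above
  y_diff <- rnorm_pre(x_diff, mu = 3, sd = 1.5,
                      r = corr_deltas, empirical = FALSE)
  cor.test(x_diff, y_diff)$p.value
}

set.seed(8675309)
p_values <- replicate(1000, sim_p(n = 40, corr_deltas = 0.5))

# estimated power: proportion of simulated studies with p < .05
power <- mean(p_values < 0.05)
power
```

Looping this over a grid of corr_deltas values (and, if needed, different n) would give a rough power curve for the planned study.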
Oops, I didn't see your reply when I posted my response. I think we solved it basically the same way.
Brilliant - thank you!
— M.