yizenglistat / rvcm4gt

Regularized Bayesian varying coefficient regression for group testing data

RVCM4GT

This repository contains R code (for reproducibility) along with simulation results for "Regularized Bayesian varying coefficient regression models for group testing data". Our model estimates an individual-level regression model from group testing data and captures the age-varying effects of covariates on chlamydia risk, with variable selection. To relate the available information, we consider

$$ \text{logit}(\text{Pr}(\widetilde Y_i=1\mid \boldsymbol x_i, u_i))=\underbrace{\psi_0(u_i)+\sum_{d=1}^p x_{id}\psi_d(u_i)}_{\text{Age-varying Effects}} + \underbrace{\sum_{\ell=1}^L r_\ell(i)\gamma_\ell}_{\text{Random Effect}} \quad\text{for }i=1,\ldots,N, $$

where $\widetilde Y_i$ is the hidden chlamydia status of the $i$th patient, $u_i$ is the patient's age, $\boldsymbol x_i=(x_{i1},\ldots,x_{ip})^\top$ are covariates, and $\psi_d(u_i)=\delta_{1d}(\alpha_d+\delta_{2d}\beta_d(u_i))$ for binary indicators $\delta_{1d},\delta_{2d}$; see the paper for more details. In short, our stochastic search variable selection categorizes each covariate into one of three groups:

  • $\delta_{1d}=0\longrightarrow$ insignificant effects.
  • $\delta_{1d}=1$
    • $\delta_{2d}=0\longrightarrow$ age-independent effects.
    • $\delta_{2d}=1\longrightarrow$ age-varying effects.
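
The three groups above follow directly from the form of $\psi_d(u)$. The sketch below evaluates $\psi_d(u)=\delta_{1d}(\alpha_d+\delta_{2d}\beta_d(u))$ for each indicator combination. The helper names `psi_d` and `classify_effect`, the age grid, and the choice of `beta_fun` are hypothetical and chosen only for illustration; they are not taken from the repository.

```r
# Hypothetical helper: evaluate psi_d(u) = delta1 * (alpha + delta2 * beta(u)).
psi_d <- function(u, delta1, delta2, alpha, beta_fun) {
  delta1 * (alpha + delta2 * beta_fun(u))
}

# Hypothetical helper: map the indicator pair to one of the three groups.
classify_effect <- function(delta1, delta2) {
  if (delta1 == 0) {
    "insignificant"
  } else if (delta2 == 0) {
    "age-independent"
  } else {
    "age-varying"
  }
}

u <- seq(0, 1, length.out = 5)       # illustrative (rescaled) ages
beta_fun <- function(u) sin(pi * u)  # illustrative varying component, not from the paper

psi_d(u, delta1 = 0, delta2 = 1, alpha = -1, beta_fun = beta_fun)  # zero for every u
psi_d(u, delta1 = 1, delta2 = 0, alpha = -1, beta_fun = beta_fun)  # constant at alpha
psi_d(u, delta1 = 1, delta2 = 1, alpha = -1, beta_fun = beta_fun)  # changes with u

classify_effect(1, 0)
```

Note that when $\delta_{1d}=0$ the value of $\delta_{2d}$ is irrelevant, which is why the first group needs only one indicator.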

To reproduce the results in the paper, we provide implementation details as follows.

username@login001 ~$ git clone git@github.com:yizenglistat/rvcm4gt.git
username@login001 ~$ cd rvcm4gt

In addition, to protect the privacy of the Iowa SHL group testing data, we provide a simulated (fake) Iowa group testing data set (under /data/simulated_fake_data.csv) for illustration. The code runs successfully on the fake data set, as shown below.

[Image: the code running on the simulated data set]

Arguments

# A demo example that runs 500 repetitions on one machine.
task_id <- 1
nreps <- 500
Ns <- c(3000, 5000)
pool_sizes <- c(5, 10)
model_names <- c("m1", "m2")
testings <- c("AT", "DT", "IT")
N_test <- 600
sigma <- 0.5
  • task_id

The machine (task) id, for example 1,...,100 when running on a cluster. With 100 nodes, each node runs 5 simulations independently, for a total of 500 repetitions.

  • nreps

The number of repetitions.

  • Ns

A vector of sample sizes.

  • pool_sizes

A vector of pool sizes.

  • model_names

A vector of model names. Different model names correspond to different sets of varying functions.

  • testings

A vector of testing protocols: AT (array testing), DT (Dorfman testing), or IT (individual testing).

  • N_test

Number of knot values used in inference for the estimated varying functions.

  • sigma

True standard deviation of the random effect.
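
To make the arguments concrete, the sketch below shows one way the 500 repetitions could be split across cluster tasks and how the vector arguments expand into simulation settings. It is an assumption about the bookkeeping, not a copy of `main.r`: the variable `n_tasks`, the indexing of `rep_ids`, and the full crossing of all arguments are hypothetical.

```r
# Arguments as in the demo above.
nreps       <- 500
Ns          <- c(3000, 5000)
pool_sizes  <- c(5, 10)
model_names <- c("m1", "m2")
testings    <- c("AT", "DT", "IT")

# Assumption: 100 cluster tasks share the repetitions equally.
n_tasks <- 100
task_id <- 1                      # would differ on each node (1, ..., 100)
reps_per_task <- nreps / n_tasks  # 5 repetitions per node

# Hypothetical indexing: the repetitions handled by this task.
rep_ids <- seq((task_id - 1) * reps_per_task + 1, task_id * reps_per_task)

# One row per combination of sample size, pool size, model, and protocol.
settings <- expand.grid(
  N = Ns,
  pool_size = pool_sizes,
  model = model_names,
  testing = testings,
  stringsAsFactors = FALSE
)

rep_ids
nrow(settings)
head(settings)
```

The full crossing treats IT like the pooled protocols; since individual testing does not use pools, the actual scripts may handle that case differently.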

Reproduce

After setting up the environment (requirement.txt) and arguments, one should be able to run the following code in R to reproduce simulation results in the paper.

# R version 3.6.0+

> source('main.r')

After collecting the .RData files under output/, one should be able to reproduce the results. The following demo figure and demo table show that the $\textcolor{red}{\textbf{age-varying effects}}$ (shown in $\textcolor{red}{\textbf{red}}$), the significant but $\textcolor{blue}{\textbf{age-independent effects}}$ (shown in $\textcolor{blue}{\textbf{blue}}$), and the $\textbf{insignificant effects}$ (shown in $\textbf{black}$) can all be correctly identified and estimated; see details in the paper.
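
The collection step can be sketched as follows. This is a minimal illustration, assuming each file under output/ was written with `save()`; the helper name `collect_rdata` is hypothetical, and the names of the objects stored in each file are not documented here, so the sketch simply returns whatever each file contains.

```r
# Hypothetical helper: load every .RData file in a directory into its own
# environment and return the stored objects as a list named by file.
collect_rdata <- function(dir = "output") {
  files <- list.files(dir, pattern = "\\.RData$", full.names = TRUE)
  results <- lapply(files, function(f) {
    env <- new.env()
    load(f, envir = env)  # restores the saved objects into env
    as.list(env)
  })
  names(results) <- basename(files)
  results
}

results <- collect_rdata("output")
length(results)  # number of result files found
```

Loading each file into a separate environment keeps objects with the same name in different files from overwriting one another.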

figure

| Parameter | Summary | IT | DT ($c=5$) | AT ($c=5$) | DT ($c=10$) | AT ($c=10$) |
|---|---|---|---|---|---|---|
| $\psi_5(u)=0$ | IP | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 |
| $\psi_6(u)=0$ | IP | 0.007 | 0.007 | 0.008 | 0.007 | 0.007 |
| $\textcolor{blue}{\psi_1(u)=-1.0}$ | Bias(CP95) | -0.008(0.962) | -0.015(0.940) | -0.004(0.942) | -0.034(0.912) | -0.013(0.932) |
| $\textcolor{blue}{\psi_1(u)=-1.0}$ | SSD(ESE) | 0.066(0.069) | 0.072(0.072) | 0.068(0.067) | 0.095(0.083) | 0.077(0.074) |
| $\textcolor{blue}{\psi_3(u)=-0.5}$ | Bias(CP95) | -0.007(0.942) | -0.008(0.936) | -0.002(0.938) | -0.012(0.938) | -0.001(0.938) |
| $\textcolor{blue}{\psi_3(u)=-0.5}$ | SSD(ESE) | 0.064(0.063) | 0.067(0.063) | 0.067(0.062) | 0.072(0.068) | 0.066(0.065) |
| | Bias(CP95) | 0.047(0.950) | 0.044(0.952) | 0.039(0.964) | 0.059(0.924) | 0.048(0.960) |
| | SSD(ESE) | 0.062(0.079) | 0.062(0.080) | 0.058(0.078) | 0.067(0.084) | 0.062(0.081) |
| | Bias(CP95) | | -0.032(0.902) | 0.000(0.954) | -0.038(0.914) | 0.002(0.964) |
| | SSD(ESE) | | 0.081(0.053) | 0.024(0.021) | 0.084(0.062) | 0.041(0.034) |
| | Bias(CP95) | | -0.008(0.974) | -0.001(0.988) | -0.019(0.972) | -0.006(0.996) |
| | SSD(ESE) | | 0.021(0.024) | 0.016(0.017) | 0.050(0.046) | 0.021(0.033) |
| | Bias(CP95) | | 0.002(0.990) | 0.000(0.920) | -0.014(0.992) | -0.010(0.990) |
| | SSD(ESE) | | 0.011(0.014) | 0.012(0.011) | 0.032(0.052) | 0.027(0.033) |
| | Bias(CP95) | | 0.000(0.966) | -0.003(0.974) | -0.003(0.922) | -0.003(0.966) |
| | SSD(ESE) | | 0.008(0.007) | 0.012(0.013) | 0.011(0.009) | 0.012(0.011) |
| Cost | AVGtest | 5000 | 2943.15 | 2971.33 | 3567.84 | 2943.73 |
| Savings | Percent | 0.00% | 41.14% | 40.57% | 28.64% | 41.12% |

IP: the inclusion probability of any significant effect, i.e., $\alpha_d$ or $\beta_d(u)$.

IPF: the inclusion probability of the age-independent effect, i.e., $\alpha_d$ only.

IPV: the inclusion probability of the age-varying effect, i.e., $\beta_d(u)$ only.
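
As a quick arithmetic check on the Cost and Savings rows of the table above, the savings percentage appears to be the reduction in the average number of tests relative to individual testing, where IT uses one test per individual ($N=5000$). The sketch below recomputes it from the reported averages; the vector names are hypothetical labels for the table columns.

```r
# Average number of tests per protocol, copied from the AVGtest row.
N <- 5000
avg_tests <- c(IT = 5000, DT_c5 = 2943.15, AT_c5 = 2971.33,
               DT_c10 = 3567.84, AT_c10 = 2943.73)

# Savings relative to testing every individual once.
savings_pct <- 100 * (1 - avg_tests / N)
round(savings_pct, 2)
```

The recomputed values agree with the Savings row to within about 0.01 percentage points; small differences are expected because the reported averages are themselves rounded.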

Authors

yizenglistat, Harrindy, Joshua M. Tebbs

License

This project is licensed under the MIT License - see the License file for details.


Languages

R 93.0%, C++ 7.0%