ExchangeLeastsq.jl is a module to perform sparse model selection in least squares regression without shrinkage.
From a Julia prompt, type

```julia
Pkg.clone("https://github.com/klkeys/ExchangeLeastsq.jl")
```
The workhorse of ExchangeLeastsq.jl is the function `exlstsq`, which requires only two arguments:

* `x`, the statistical design matrix
* `y`, the response vector

The function `exlstsq` returns a sparse matrix `betas` of estimated models:

```julia
betas = exlstsq(x, y)
```
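For concreteness, here is a hedged sketch in base Julia of the kind of inputs `exlstsq` expects. The simulated design matrix, sparse coefficient vector, and noise level are illustrative assumptions, and the final call is commented out because it requires the package:

```julia
# Simulate a least squares problem whose true coefficient vector is sparse.
# (Illustrative assumption; any numeric design matrix and response work.)
n, p, k = 100, 25, 3          # samples, predictors, true model size
x = randn(n, p)               # statistical design matrix
b = zeros(p); b[1:k] .= 1.0   # sparse true coefficients: only k nonzero entries
y = x * b + 0.1 * randn(n)    # response vector with a little noise

# betas = exlstsq(x, y)       # requires ExchangeLeastsq.jl
```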
Optional arguments with defaults include:

* `v = ELSQVariables(x, y)` is a container object of temporary arrays for `exlstsq`
* `models` is the integer vector of model sizes to test. It defaults to `collect(1:p)`, where `p = min(20, size(x,2))`.
* `window = maximum(models)` is the window size of active predictors that `exlstsq` uses when searching through active predictors. Generally, a smaller value of `window` means that `exlstsq` sifts through fewer active models, thereby increasing speed and sacrificing accuracy.
* `max_iter = 100` is the maximum number of iterations that `exlstsq` will take in any inner loop
* `tol = 1e-6` is the convergence tolerance
* `quiet = true` controls output to the console. Setting `quiet = false` causes `exlstsq` to print all inner loop information.
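As a sketch of how these defaults fit together, the following base Julia snippet computes them by hand for a hypothetical 100 × 50 design matrix. The commented keyword-style call is an assumption about the calling convention, not a documented signature:

```julia
# Reproduce the documented defaults by hand (no package required).
x = randn(100, 50)
p = min(20, size(x, 2))    # default caps the largest tested model size at 20
models = collect(1:p)      # model sizes to test: 1, 2, ..., 20
window = maximum(models)   # default search window over active predictors

# A call with non-default settings might look like this (hypothetical keyword syntax):
# betas = exlstsq(x, y, models=models, window=10, max_iter=200, quiet=false)
```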
ExchangeLeastsq.jl is best used to obtain the ideal model size to predict `y`. It furnishes a crossvalidation routine for this purpose.
ExchangeLeastsq.jl makes use of `SharedArray`s to enable crossvalidation in a multicore shared memory environment. Users can perform q-fold crossvalidation for a vector `models` of model sizes by calling

```julia
cv_output = cv_exlstsq(x, y)
```
Important optional arguments include:

* `models = collect(1:min(20,p))`, with `p = size(x,2)`, is the `Int` vector of model sizes to test.
* `q = 5` is the number of crossvalidation folds
* `folds` controls the fold structure. The default `RegressionTools.cv_get_folds(y, q)` distributes data to `q` folds as evenly as possible.
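The even fold assignment can be sketched in base Julia as follows; `even_folds` is a hypothetical stand-in for `RegressionTools.cv_get_folds`, whose actual implementation may differ:

```julia
using Random

# Assign n observations to q folds as evenly as possible:
# cycle the labels 1, 2, ..., q over the observations, then shuffle.
# (Hypothetical sketch; not the actual RegressionTools routine.)
function even_folds(n::Integer, q::Integer)
    labels = [mod1(i, q) for i in 1:n]
    return shuffle(labels)
end

f = even_folds(103, 5)
counts = [count(==(j), f) for j in 1:5]   # fold sizes differ by at most 1
```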
Here `cv_output` is an `ELSQCrossvalidationResults` container object with the following fields:

* `mses` contains the vector of mean squared errors
* `k` is the best crossvalidated model size
* `b` and `bidx` contain the coefficients and indices, respectively, of the best model of size `k`.
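A minimal sketch of the relationship between `mses` and `k`, assuming `k` is the tested model size that minimizes the mean squared error; the `mses` values here are fabricated for illustration:

```julia
# Fabricated crossvalidation MSEs, one per tested model size.
mses = [4.1, 2.9, 1.3, 1.5, 1.8]
models = collect(1:5)
k = models[argmin(mses)]   # best crossvalidated model size
```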
ExchangeLeastsq.jl interfaces with the PLINK.jl package to enable GWAS analysis. PLINK.jl furnishes both multicore and GPU interfaces; the multicore environment makes heavy use of `SharedArray` interfaces.
For genotype data housed in a `PLINK.BEDFile` object `x` and a `SharedVector` `y`, the function call to `exlstsq` is unchanged:

```julia
output = exlstsq(x, y)
```
However, the call to `cv_exlstsq` changes dramatically in order to accommodate `SharedArray` computing. Most users should use the call

```julia
cv_output = cv_exlstsq("PATH_TO_BEDFILE.bed", "PATH_TO_COVARIATES.txt", "PATH_TO_Y.bin")
```

Note the file extensions! The first file path points to the BED file, the second points to the covariates stored in a delimited text file, and the last points to the response variable `y` stored as a binary file.
PLINK.jl also ships with a GPU interface for GWAS analysis. The GPU environment uses OpenCL.jl wrappers to port the computational bottleneck `x' * y` to the GPU. PLINK.jl automatically loads the OpenCL kernels into the variables `PLINK.gpucode64` (for 64-bit kernels) and `PLINK.gpucode32` (for 32-bit kernels). Assuming that a suitable device is available, the calls from ExchangeLeastsq.jl to use GPUs are

```julia
output = exlstsq(x, y, PLINK.gpucode64)
cv_output = cv_exlstsq("PATH_TO_BEDFILE.bed", "PATH_TO_COVARIATES.txt", "PATH_TO_RESPONSE.bin", PLINK.gpucode64)
```