CCGill / LRpadding

A simple Rcpp(Eigen) package to implement 'padded logistic regression'

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

LRpadding

Travis build status

LRpadding is a very short term project to implement a simple padding functionality for logistic regression, allowing the concatenation of arbitrarily many zeros on both the predictors and the response. This is part of a much larger project, this code is entirely for testing and development.

Without an intercept, padding in this way would result in no effect (simply multiplying the likelihood by a constant), so we assume we include an intercept as the first column of the model matrix.

Installation

You can install the current version of LRpadding from GitHub with:

remotes::install_github('CCGill/LRpadding')

We make use of the excellent RcppNumerical package (http://cran.r-project.org/package=RcppNumerical), and a simple example version of the RcppNumerical function fastLR, which is built on the L-BFGS algorithm for unconstrained minimization problems based on the LBFGS++ library.

Example

This is an example demonstrating that a logistic_reg implementation on padded response/model agrees with the implementation of the padding. It is based on a unit test for this package. Note that the basic_logistic_reg function is currently identical to the example version of fastLR from the RcppNumerical package, returning only the coefficients, and logistic_reg is identical to the full fastLR function.

We also show that our result closely matches the corresponding glm call.

library(LRpadding)
## basic example code
set.seed(42)
    
    n = 20000
    p = 100L
    x = matrix(rnorm(n * p), n)
    x = cbind(rep(1,n),scale(x,scale = F,center = T)) ##attach the intercept
    beta = runif(p+1)
    xbeta = c(x %*% beta)
    probs = 1 / (1 + exp(-xbeta))
    y = rbinom(n, 1, probs)
    testpadding = 40000L
  
    padded_x <- rbind(x,t(matrix( rep.int(c(1,0), c(1,p)), p+1, testpadding)))
    padded_y <- c(y,rep(0,testpadding))
    
    res1 <- logistic_reg(padded_x, padded_y)
    res2 <- padded_logistic_reg(x,y,padding = testpadding)
    max(abs(res1$coefficients - res2$coefficients)) ## identical results
#> [1] 3.166356e-13

    res1$loglikelihood
#> [1] -13321.19
    res2$loglikelihood
#> [1] -13321.19
    
    res3 <- basic_logistic_reg(padded_x,padded_y)
    res4 <- basic_padded_logistic_reg(x,y,padding = testpadding)
    max(abs(res3 - res4)) ## identical results
#> [1] 2.273737e-12

    max(abs(res1$coefficients - res3))
#> [1] 0.0006703128
    max(abs(res2$coefficients - res4))
#> [1] 0.0006703128

A second example, for time.

    set.seed(42)
    system.time(res1 <- logistic_reg(padded_x, padded_y)$coefficients )
#>    user  system elapsed 
#>   0.137   0.001   0.138
 
    system.time(res2 <- padded_logistic_reg(x,y,padding = testpadding)$coefficients)
#>    user  system elapsed 
#>   0.042   0.001   0.042

    system.time(res3 <- glm.fit(padded_x,padded_y,family = binomial())$coefficients)
#>    user  system elapsed 
#>   6.170   0.307   6.550

    max(abs(res1 - res2)) ## identical results
#> [1] 3.166356e-13

    max(abs(res2 - res3))
#> [1] 6.953141e-05

About

A simple Rcpp(Eigen) package to implement 'padded logistic regression'


Languages

Language:R 51.2%Language:C++ 46.9%Language:Makefile 1.9%