cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium

Home Page: https://cumc.github.io/xqtl-protocol/

Non-conformable arrays error from susie_rss via rpy2 but not via native R

hsun3163 opened this issue

When running susie_rss through rpy2, the following error occurs:

>>> susie_result = susie.susie_rss(ss_qc.z.to_numpy(),LD,max_iter = 4)
R[write to console]: WARNING: Providing the sample size (n), or even a rough estimate of n, is highly recommended. Without n, the implicit assumption is n is large (Inf) and the effect sizes are small (close to zero).

R[write to console]: HINT: For large R or large XtX, consider installing the Rfast package for better performance.

R[write to console]: Error in Xty - s$XtXr : non-conformable arrays

R[write to console]: In addition:
R[write to console]: Warning message:

R[write to console]: In (function (package, help, pos = 2, lib.loc = NULL, character.only = FALSE,  :
R[write to console]:

R[write to console]:  library '/usr/lib/R/site-library' contains no packages

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/rpy2/robjects/functions.py", line 204, in __call__
    return (super(SignatureTranslatedFunction, self)
  File "/opt/conda/lib/python3.8/site-packages/rpy2/robjects/functions.py", line 127, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/rpy2/rinterface_lib/conversion.py", line 45, in _
    cdata = function(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/rpy2/rinterface.py", line 817, in __call__
    raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in Xty - s$XtXr : non-conformable arrays

However, running susie_rss on the same data in native R completes without errors and prints only the following warnings:

WARNING: Providing the sample size (n), or even a rough estimate of n, is highly recommended. Without n, the implicit assumption is n is large (Inf) and the effect sizes are small (close to zero).

Warning message in susie_suff_stat(XtX = R, Xty = z, n = 2, yty = 1, scaled_prior_variance = prior_variance, :
“IBSS algorithm did not converge in 100 iterations!
                  Please check consistency between summary statistics and LD matrix.
                  See https://stephenslab.github.io/susieR/articles/susierss_diagnostic.html”

One hint about the error is the message

library '/usr/lib/R/site-library' contains no packages

which may indicate that dependencies are missing from the R library that rpy2 uses for susieR.

commented

Could you make sure this works in R directly, when you call susieR::susie_rss from R?

Yes, the second message is the direct output of susie_rss.

The error actually occurs in the call s = update_each_effect_ss(XtX, Xty, s, estimate_prior_variance, estimate_prior_method, check_null_threshold) inside susie_suff_stat.

I believe the non-conformability is caused by

      # Remove lth effect from fitted values.
      s$XtXr = s$XtXr - XtX %*% (s$alpha[l,] * s$mu[l,])

Since the issue occurs even with max_iter = 1, the problem should lie in the input s passed to update_each_effect_ss.

Mimicking s$XtXr = s$XtXr - XtX %*% (s$alpha[l,] * s$mu[l,]) in Python produces a result for which Xty - s$XtXr can be computed successfully:

>>> ss_qc.z - (XtXr - LD @ (s_final[0][0] * s_final[1][0]))
0       1.520408
1       0.718750
2       0.268789
3       1.578947
4      -0.094105
          ...
9336    0.323810
9337   -1.522989
9338    0.342857
9339   -2.011353
9340    0.626412
Length: 9341, dtype: float64
>>>

When the manually generated s, LD, and z are passed to the problematic update_each_effect_ss function, a different error occurs. This error is raised at a later point than the original one, which means execution gets past the original error.

>>> s_update = susie.update_each_effect_ss(XtX = LD, Xty = ss_qc.z.to_numpy(),s_init =  s_final)
R[write to console]: Error in s$mu[l, ] <- res$mu : replacement has length zero

R[write to console]: In addition:
R[write to console]: Warning message:

R[write to console]: In w * prior_weights :
R[write to console]:

R[write to console]:  longer object length is not a multiple of shorter object length

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/rpy2/robjects/functions.py", line 204, in __call__
    return (super(SignatureTranslatedFunction, self)
  File "/opt/conda/lib/python3.8/site-packages/rpy2/robjects/functions.py", line 127, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/rpy2/rinterface_lib/conversion.py", line 45, in _
    cdata = function(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/rpy2/rinterface.py", line 817, in __call__
    raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in s$mu[l, ] <- res$mu : replacement has length zero

>>>
>>> s_update = susie.update_each_effect_ss(XtX = LD, Xty = ss_qc.z.to_numpy(),s_init =  s_final,estimate_prior_variance = True)
R[write to console]: Error in if (neg.loglik.logscale(lV, betahat = betahat, shat2 = shat2,  :
  missing value where TRUE/FALSE needed

R[write to console]: In addition:
R[write to console]: There were 50 or more warnings (use warnings() to see the first 50)
R[write to console]:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/rpy2/robjects/functions.py", line 204, in __call__
    return (super(SignatureTranslatedFunction, self)
  File "/opt/conda/lib/python3.8/site-packages/rpy2/robjects/functions.py", line 127, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/rpy2/rinterface_lib/conversion.py", line 45, in _
    cdata = function(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/rpy2/rinterface.py", line 817, in __call__
    raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in if (neg.loglik.logscale(lV, betahat = betahat, shat2 = shat2,  :
  missing value where TRUE/FALSE needed

>>>

Therefore, the error must originate in this part of susie_suff_stat:

function (XtX, Xty, yty, n, X_colmeans = NA, y_mean = NA, maf = NULL, 
    maf_thresh = 0, L = 10, scaled_prior_variance = 0.2, residual_variance = NULL, 
    estimate_residual_variance = TRUE, estimate_prior_variance = TRUE, 
    estimate_prior_method = c("optim", "EM", "simple"), check_null_threshold = 0, 
    prior_tol = 1e-09, r_tol = 1e-08, prior_weights = NULL, null_weight = 0, 
    standardize = TRUE, max_iter = 100, s_init = NULL, coverage = 0.95, 
    min_abs_corr = 0.5, tol = 0.001, verbose = FALSE, track_fit = FALSE, 
    check_input = FALSE, refine = FALSE, check_prior = FALSE, 
    n_purity = 100, ...) 
{
    args <- list(...)
    if (any(is.element(names(args), c("bhat", "shat", "R", "var_y")))) 
        stop("susie_suff_stat no longer accepts inputs bhat, shat, R or var_y; ", 
            "these inputs are now accepted by susie_rss instead")
    estimate_prior_method = match.arg(estimate_prior_method)
    if (missing(n)) 
        stop("n must be provided")
    if (n <= 1) 
        stop("n must be greater than 1")
    missing_XtX = c(missing(XtX), missing(Xty), missing(yty))
    if (all(missing_XtX)) 
        stop("Please provide all of XtX, Xty, yty, n")
    if (ncol(XtX) > 1000 & !requireNamespace("Rfast", quietly = TRUE)) 
        warning_message("For large R or large XtX, consider installing the ", 
            "Rfast package for better performance.", style = "hint")
    if (ncol(XtX) != length(Xty)) 
        stop(paste0("The dimension of XtX (", nrow(XtX), " by ", 
            ncol(XtX), ") does not agree with expected (", length(Xty), 
            " by ", length(Xty), ")"))
    if (!is_symmetric_matrix(XtX)) {
        warning_message("XtX is not symmetric; forcing XtX to be symmetric by ", 
            "replacing XtX with (XtX + t(XtX))/2")
        XtX = XtX + t(XtX)
        XtX = XtX/2
    }
    if (!is.null(maf)) {
        if (length(maf) != length(Xty)) 
            stop(paste("The length of maf does not agree with expected", 
                length(Xty)))
        id = which(maf > maf_thresh)
        XtX = XtX[id, id]
        Xty = Xty[id]
    }
    if (any(is.infinite(Xty))) 
        stop("Input Xty contains infinite values")
    if (!(is.double(XtX) & is.matrix(XtX)) & !inherits(XtX, "CsparseMatrix")) 
        stop("Input XtX must be a double-precision matrix, or a sparse matrix")
    if (anyNA(XtX)) 
        stop("Input XtX matrix contains NAs")
    if (anyNA(Xty)) {
        warning_message("NA values in Xty are replaced with 0")
        Xty[is.na(Xty)] = 0
    }
    if (check_input) {
        semi_pd = check_semi_pd(XtX, r_tol)
        if (!semi_pd$status) 
            stop("XtX is not a positive semidefinite matrix")
        proj = check_projection(semi_pd$matrix, Xty)
        if (!proj$status) 
            warning_message("Xty does not lie in the space of the non-zero eigenvectors ", 
                "of XtX")
    }
    if (is.numeric(null_weight) && null_weight == 0) 
        null_weight = NULL
    if (!is.null(null_weight)) {
        if (!is.numeric(null_weight)) 
            stop("Null weight must be numeric")
        if (null_weight < 0 || null_weight >= 1) 
            stop("Null weight must be between 0 and 1")
        if (is.null(prior_weights)) 
            prior_weights = c(rep(1/ncol(XtX) * (1 - null_weight), 
                ncol(XtX)), null_weight)
        else prior_weights = c(prior_weights * (1 - null_weight), 
            null_weight)
        XtX = cbind(rbind(XtX, 0), 0)
        Xty = c(Xty, 0)
    }
    p = ncol(XtX)
    if (standardize) {
        dXtX = diag(XtX)
        csd = sqrt(dXtX/(n - 1))
        csd[csd == 0] = 1
        XtX = t((1/csd) * XtX)/csd
        Xty = Xty/csd
    }
    else csd = rep(1, length = p)
    attr(XtX, "d") = diag(XtX)
    attr(XtX, "scaled:scale") = csd
    if (length(X_colmeans) == 1) 
        X_colmeans = rep(X_colmeans, p)
    if (length(X_colmeans) != p) 
        stop("The length of X_colmeans does not agree with number of variables")
    s = init_setup(0, p, L, scaled_prior_variance, residual_variance, 
        prior_weights, null_weight, yty/(n - 1), standardize)
    s$Xr = NULL
    s$XtXr = rep(0, p)
    if (!missing(s_init) && !is.null(s_init)) {
        if (!inherits(s_init, "susie")) 
            stop("s_init should be a susie object")
        if (max(s_init$alpha) > 1 || min(s_init$alpha) < 0) 
            stop("s_init$alpha has invalid values outside range [0,1]; please ", 
                "check your input")
        s_init = susie_prune_single_effects(s_init)
        num_effects = nrow(s_init$alpha)
        if (missing(L)) {
            L = num_effects
        }
        else if (min(p, L) < num_effects) {
            warning_message(paste("Specified number of effects L =", 
                min(p, L), "is smaller than the number of effects", 
                num_effects, "in input SuSiE model. The initialized SuSiE model will have", 
                num_effects, "effects."))
            L = num_effects
        }
        s_init = susie_prune_single_effects(s_init, min(p, L), 
            s$V)
        s = modifyList(s, s_init)
        s = init_finalize(s, X = XtX)
        s$XtXr = s$Xr
        s$Xr = NULL
    }
    else s = init_finalize(s)
    elbo = rep(as.numeric(NA), max_iter + 1)
    elbo[1] = -Inf
    tracking = list()
    bhat = (1/attr(XtX, "d")) * Xty
    shat = sqrt(s$sigma2/attr(XtX, "d"))
    z = bhat/shat
    zm = max(abs(z[!is.nan(z)]))
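
One detail of this listing is worth spelling out: the only dimension check compares ncol(XtX) with length(Xty), and R's length() counts elements without looking at the dim attribute. The Python sketch below models that behaviour. The helper names r_length and passes_dimension_check are invented for illustration, and None stands for "no dim attribute". Under this model a plain vector, a 1-d array, and a p x 1 matrix all pass the check, so a difference in the shape of Xty would not be caught at this point.

```python
from math import prod

def r_length(dim, n_values):
    """Model of R's length(): the element count, regardless of dim."""
    return n_values if dim is None else prod(dim)

def passes_dimension_check(xtx_dim, xty_dim, xty_n_values):
    """Model of the `ncol(XtX) != length(Xty)` test in the listing."""
    return xtx_dim[1] == r_length(xty_dim, xty_n_values)

p = 9341  # number of z-scores in the session above

# Three ways the same p values could arrive as Xty.
shapes = {"plain vector": None, "1-d array": (p,), "p x 1 matrix": (p, 1)}
results = {
    name: passes_dimension_check((p, p), dim, p)
    for name, dim in shapes.items()
}
print(results)
```

All three entries pass, which is consistent with the error surfacing only later, in the arithmetic on Xty.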

To narrow down the problem further, the following custom function was created:

susie_test = function (XtX, Xty, yty, n, X_colmeans = NA, y_mean = NA, maf = NULL, 
    maf_thresh = 0, L = 10, scaled_prior_variance = 0.2, residual_variance = NULL, 
    estimate_residual_variance = TRUE, estimate_prior_variance = TRUE, 
    estimate_prior_method = "EM", check_null_threshold = 0, 
    prior_tol = 1e-09, r_tol = 1e-08, prior_weights = NULL, null_weight = 0, 
    standardize = TRUE, max_iter = 100, s_init = NULL, coverage = 0.95, 
    min_abs_corr = 0.5, tol = 0.001, verbose = FALSE, track_fit = FALSE, 
    check_input = FALSE, refine = FALSE, check_prior = FALSE, 
    n_purity = 100, ...) 
{
    null_weight = NULL
    p = ncol(XtX)
    csd = rep(1, length = p)
    attr(XtX, "d") = diag(XtX)
    attr(XtX, "scaled:scale") = csd
    if (length(X_colmeans) == 1) 
        X_colmeans = rep(X_colmeans, p)
    if (length(X_colmeans) != p) 
        stop("The length of X_colmeans does not agree with number of variables")
    s = susieR:::init_setup(0, p, L, scaled_prior_variance, residual_variance, 
        prior_weights, null_weight, yty/(n - 1), standardize)
    s$Xr = NULL
    s$XtXr = rep(0, p)
    s = susieR:::init_finalize(s)
    return(list(s,XtX,Xty,estimate_prior_variance, 
            estimate_prior_method, check_null_threshold))}

The output of this function reproduces the error we encountered in rpy2, while it runs successfully in R.

Carrying out the computation manually on the rpy2 output also succeeds. Does this indicate that the problem lies in how update_each_effect_ss is called through rpy2?

>>> susie_test_result[2] - (susie_test_result[0][10] - susie_test_result[1] @ (susie_test_result[0][0][0] * susie_test_result[0][1][0]))
array([ 1.52040816,  0.71875   ,  0.26878868,  1.57894737, -0.09410548,
       -0.43453237,  0.18144611,  0.34202899,  0.22622108,  0.05868545,
       -0.6559792 , -0.64009662, -1.69830508, -0.82915718, -0.85172414,
       -0.01032258, -0.56972112, -1.359375  , -1.52794118, -0.51415094,
       -0.59751037, -0.6875    , -1.67244367, -1.32340161,  1.23364486,
       -0.08539326,  1.88541667, -0.1981982 , -0.15646941,  0.12709966,
       -0.04102564,  0.72669221, -0.97077409, -1.08931186, -1.33023256,
       -0.92170022, -0.23389495, -1.18941504,  1.87763713, -0.7159383 ,
       -1.09298999,  0.23650794,  1.12117347, -0.51175407, -0.60337553,
       -0.95657277,  0.70100503,  1.54150198,  0.25      ,  1.44086022,
       -1.10892388, -0.73384615,  1.5375817 , -2.1669506 ,  1.20645161,
       -0.70711297, -1.00528053, -0.62869198, -0.51045179, -2.79203899,
        0.28806584, -2.0787472 , -0.281493  , -0.50179856, -0.9838565 ,
        1.65384615, -0.57119342, -1.09428951, -0.22052067, -0.99426934,
        0.09616775,  0.06650831,  0.90880503, -1.74428274, -1.00176367,
       -0.30382294, -0.1563981 ,  1.73062731,  0.41119221, -0.28993435,
       -1.01976285, -0.50399202,  0.15348837,  1.48341232,  1.01708279,
       -1.80382775, -0.02060606, -0.45958796,  1.82857143, -0.37007874,
        1.48221344,  0.96875   , -0.96486486,  1.58695652,  0.22026432,
       -1.88942892, -0.78695652,  0.14841849,  1.14718019,  1.45810398])
commented

@hsun3163 the function call is just susieR::susie_rss, right? My understanding is that the same input makes the rpy2 call fail but not the native R call. Then the only possible source of the error is the input data type (not even the data format). I wonder whether adding something like as.matrix to the relevant lines in the susieR package, to force the data types to agree, would make the issue go away. I am thinking the fix might belong in susieR itself. It is helpful to identify where the issue comes from through the rpy2 interface, so we can fix the R code.

Yes, the cause of this issue is that, for some reason, the z vector is treated as an array rather than a matrix when it comes from rpy2. After as.matrix(z), the error no longer occurs. I think this unification should be done as far upstream as possible. But let me first verify that an array cannot be subtracted from a matrix in R.
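
The dimension rule behind this can be sketched in Python. This is a simplified model, not R's actual implementation: it assumes z reaches R from rpy2 as a one-dimensional array (a dim attribute of length 1), uses None for a plain vector without dim, and ignores R's length and recycling rules. The function name r_conformable is made up for this example.

```python
def r_conformable(dim_x, dim_y):
    """Simplified model of R's check for element-wise arithmetic.

    None stands for a plain vector (no dim attribute). When both
    operands carry a dim attribute, the attributes must be identical.
    """
    if dim_x is None or dim_y is None:
        # A plain vector takes part without any dim comparison.
        return True
    return tuple(dim_x) == tuple(dim_y)

p = 9341  # number of z-scores in the session above

native_z = None      # z as a plain R vector (native R session)
rpy2_z = (p,)        # z as a 1-d array (assumed for the rpy2 call)
matrix_z = (p, 1)    # z after as.matrix(z)
XtXr = (p, 1)        # XtX %*% v yields a p x 1 matrix

print(r_conformable(native_z, XtXr))   # True
print(r_conformable(rpy2_z, XtXr))     # False
print(r_conformable(matrix_z, XtXr))   # True
```

In this model only the 1-d array case fails, which matches the pattern seen so far: native R works, rpy2 fails, and as.matrix(z) makes the error go away.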

As a by-product of this investigation, I have figured out how to use rpy2 as a wrapper for arbitrary R code, so I should be able to fix this at the input level and also add other modifications of the result within the R context. But let me verify that as well.

  1. It is verified that an array and a matrix are not conformable, TIL.
array(ss_qc$z)-as.matrix(ss_qc$z)
Error in array(ss_qc$z) - as.matrix(ss_qc$z): non-conformable arrays
Traceback:
  2. Using the following wrapper fixes the issue. I believe it is best to solve this outside of the susieR package, because there may be many places that need unification and this issue seems to be caused only by rpy2. TBH I have not seen the array data type used in R before.

This wrapper successfully produces what we need:

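# Coerce z and R to matrices before calling susie_rss, so that inputs that
# arrive from rpy2 as arrays are conformable with the matrices susieR builds.
susie_rss_analysis = function(z, R, n, bhat, shat, var_y, z_ld_weight = 0,
    estimate_residual_variance = FALSE, prior_variance = 50,
    check_prior = TRUE, output_path) {
    # Pass arguments by name: positionally, var_y would be matched to bhat.
    s_result = susieR::susie_rss(as.matrix(z), as.matrix(R), n = n, var_y = var_y,
        z_ld_weight = z_ld_weight,
        estimate_residual_variance = estimate_residual_variance,
        prior_variance = prior_variance, check_prior = check_prior)
    saveRDS(s_result, output_path)
    return(s_result)
}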
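
A wrapper like this can be made callable from Python by evaluating its R source with rpy2. The sketch below shows one minimal way to do that, assuming rpy2, R, and susieR are installed. R_WRAPPER_SOURCE and load_wrapper are names invented for this example, and the R function is a simplified variant that forwards any extra arguments through `...` instead of listing them.

```python
# R source for a small wrapper, kept as a string so rpy2 can evaluate it.
# This is an illustrative variant, not the exact wrapper used in the protocol.
R_WRAPPER_SOURCE = """
function(z, R, ...) {
    # Force both inputs to 2-d matrices before susieR sees them.
    susieR::susie_rss(as.matrix(z), as.matrix(R), ...)
}
"""

def load_wrapper():
    """Evaluate the R source and return it as a Python-callable R function."""
    # Imported lazily so this module can be loaded without R installed.
    import rpy2.robjects as ro
    return ro.r(R_WRAPPER_SOURCE)
```

Calling load_wrapper() needs a working R installation. Passing NumPy arrays to the returned function additionally relies on rpy2's NumPy conversion being active, as it evidently was in the session above.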
commented

I see. Okay, let's do what you propose!