CDPA: Common and Distinctive Pattern Analysis between High-dimensional Datasets

This Python package implements the CDPA method proposed in [1]. See example.py for a usage example. The package requires Python 3.6.3 or above and the lapjv package (pip install lapjv).

Let $Y_k \in \mathbb{R}^{p_k \times n}$ ($k = 1, 2$) be two datasets measured on a common set of $n$ objects, where $p_k$ is the number of variables in the $k$-th dataset. The CDPA method conducts the following decomposition:

where is obtained from the D-CCA method. Specifically,

$X_k$: the signal matrix,

$C_k$: the common-source matrix,

$D_k$: the distinctive-source matrix,

$E_k$: the noise matrix,

: the matrix with zero padding and permutation matrix if necessary,

: the common-pattern matrix rescaled with the magnitude of ,

: the distinctive-pattern matrix.
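
The additive part of this decomposition, which comes from D-CCA [2], can be sketched as follows. The symbols are assumptions inferred from the function's argument and output names (Y_1, X_1_hat, C_1_hat, D_1_hat); $E_k$ for the noise matrix is introduced here for illustration, and the pattern matrices are not shown.

```latex
\[
  Y_k = X_k + E_k = C_k + D_k + E_k, \qquad k = 1, 2.
\]
```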

Use the CDPA function:

X_1_hat, X_2_hat, C_1_hat, C_2_hat, D_1_hat, D_2_hat, r_1_hat, r_2_hat, r_12_hat, \
ccor_hat, ctheta_hat, C_mat_hat, C_mat_neg_hat, pcor_hat, ptheta_hat, P_mat_hat \
= dcca.CDPA(Y_1, Y_2, r_1=None, r_2=None, r_12=None, method=None, P_set = 1, P_mat=None, assignment=1)

X_1_hat, X_2_hat, C_1_hat, C_2_hat, D_1_hat, D_2_hat, r_1_hat, r_2_hat, r_12_hat, \
ccor_hat, ctheta_hat, C_mat_hat, C_mat_neg_hat, pcor_hat, ptheta_hat \
= dcca.CDPA(Y_1, Y_2, r_1=None, r_2=None, r_12=None, method=None, P_set = 0)

with the function parameters:

r_1, r_2: the ranks of the signal matrices $X_1$ and $X_2$.
method: If r_1 and r_2 are None, they are selected automatically: method='ED' or method=None uses the ED method [3], and method='GR' uses the GR method [4].
r_12: the number of nonzero canonical correlations between the two signals. If r_12=None, r_12 is selected automatically by the MDL-IC method [5].
P_set, P_mat, assignment: If P_set=0, no row matching is performed. If P_set=1, the rows are matched by a permutation. In that case, if P_mat=None, the permutation is estimated by the DSPFP method [6], with assignment=1 using a greedy algorithm [6] (fast but less accurate) and assignment=0 using the Jonker-Volgenant algorithm [7]; otherwise, P_mat is taken to be the given permutation matrix.
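
The trade-off behind the assignment parameter can be illustrated on a toy linear assignment problem. This is a minimal sketch, not the package's code: it assumes a cost-minimizing formulation, uses a generic greedy rule that may differ from the one in [6], and replaces the Jonker-Volgenant algorithm [7] with an exhaustive search that is only practical for tiny problems. All function names are hypothetical.

```python
from itertools import permutations


def greedy_assignment(cost):
    """Repeatedly take the cheapest remaining (row, column) pair."""
    n = len(cost)
    pairs = sorted((cost[i][j], i, j) for i in range(n) for j in range(n))
    used_rows, used_cols, match = set(), set(), {}
    for _, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            match[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return [match[i] for i in range(n)]


def optimal_assignment(cost):
    """Exhaustive search over all permutations (stand-in for an exact solver)."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)


def total_cost(cost, cols):
    """Sum of the costs of assigning row i to column cols[i]."""
    return sum(cost[i][j] for i, j in enumerate(cols))


def permutation_matrix(cols):
    """0/1 matrix with a one at (i, cols[i]) for every row i."""
    n = len(cols)
    return [[1 if cols[i] == j else 0 for j in range(n)] for i in range(n)]


# Toy cost matrix on which the greedy rule is trapped by its first cheap pick.
cost = [[1, 2, 9],
        [2, 100, 9],
        [9, 9, 3]]

greedy = greedy_assignment(cost)
exact = optimal_assignment(cost)
print(greedy, total_cost(cost, greedy))
print(exact, total_cost(cost, exact))
print(permutation_matrix(exact))
```

On this matrix the greedy rule commits to the cheapest entry first and is then forced into the expensive one, while the exact search finds a much cheaper matching. An exact solver costs more time on large problems, which is the trade-off the parameter description refers to.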

with output:

X_1_hat, X_2_hat, C_1_hat, C_2_hat, D_1_hat, D_2_hat: the D-CCA matrix estimates [2] of $X_1$, $X_2$, $C_1$, $C_2$, $D_1$, $D_2$.
r_1_hat, r_2_hat, r_12_hat: estimates of r_1, r_2, r_12.
ccor_hat, ctheta_hat: the estimated canonical correlations and associated angles between the latent-factor spaces of the two signals.
C_mat_hat: the estimate for the unscaled common-pattern matrix of and .
C_mat_neg_hat: the estimate for the unscaled common-pattern matrix of and .
ptheta_hat, pcor_hat: the estimated principal angles and their cosines between the respective column spaces of and that are coefficient matrices of and on the common latent factors of and .
P_mat_hat: the estimate of the permutation matrix used for the row matching (returned only when P_set=1).
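
Because the function returns a long tuple whose length depends on P_set, it can help to label the outputs by name. The helper below is a hypothetical convenience wrapper, not part of the package; it only assumes the output order shown in the two calls above.

```python
# Output names in the order shown in the calls above; P_mat_hat comes last
# and is present only when P_set=1.
CDPA_OUTPUT_NAMES = (
    "X_1_hat", "X_2_hat", "C_1_hat", "C_2_hat", "D_1_hat", "D_2_hat",
    "r_1_hat", "r_2_hat", "r_12_hat",
    "ccor_hat", "ctheta_hat", "C_mat_hat", "C_mat_neg_hat",
    "pcor_hat", "ptheta_hat", "P_mat_hat",
)


def label_cdpa_outputs(outputs):
    """Map the returned tuple to a dict keyed by output name (hypothetical helper)."""
    outputs = tuple(outputs)
    n_full = len(CDPA_OUTPUT_NAMES)
    if len(outputs) == n_full:          # P_set=1: P_mat_hat included
        names = CDPA_OUTPUT_NAMES
    elif len(outputs) == n_full - 1:    # P_set=0: no P_mat_hat
        names = CDPA_OUTPUT_NAMES[:-1]
    else:
        raise ValueError(
            "expected %d or %d outputs, got %d"
            % (n_full - 1, n_full, len(outputs))
        )
    return dict(zip(names, outputs))


# Intended use with the package (not executed here):
#   result = label_cdpa_outputs(dcca.CDPA(Y_1, Y_2, P_set=0))
#   print(result["r_12_hat"])

# Demonstration with placeholder values instead of real estimates.
demo = label_cdpa_outputs(range(15))
print(sorted(demo))
```

A dictionary keeps downstream code readable and avoids silent misalignment if one variable is dropped from a 15- or 16-name unpacking.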

[1] Shu, H., & Qu, Z. (2022). CDPA: Common and Distinctive Pattern Analysis between High-dimensional Datasets. Electronic Journal of Statistics, 16(1), 2475–2517.

[2] Shu, H., Wang, X., & Zhu, H. (2020). D-CCA: A Decomposition-based Canonical Correlation Analysis for High-dimensional Datasets. Journal of the American Statistical Association, 115(529), 292-306.

[3] Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics, 92(4), 1004-1016.

[4] Ahn, S. C., & Horenstein, A. R. (2013). Eigenvalue ratio test for the number of factors. Econometrica, 81(3), 1203-1227.

[5] Song, Y., Schreier, P. J., Ramírez, D., & Hasija, T. (2016). Canonical correlation analysis of high-dimensional data with very small sample support. Signal Processing, 128, 449-458.

[6] Lu, Y., Huang, K., & Liu, C. L. (2016). A fast projected fixed-point algorithm for large graph matching. Pattern Recognition, 60, 971-982.

[7] Jonker, R., & Volgenant, A. (1987). A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4), 325-340.
