parklab / MuSiCal

A comprehensive toolkit for mutational signature analysis

De novo signature discovery - very slow

maia-munteanu opened this issue · comments

Hi!

Thank you for developing this tool! I am currently using it for signature refitting and decomposition, but I'd also like to take advantage of the signature discovery module. However, I've been having some issues with runtime. When trying to obtain signatures from a matrix of ~6000 samples, a 24h job only gets through 1 extracted signature, which is very slow compared to other tools, so I'm wondering if I'm doing something wrong when assigning resources to the job.

These are the SLURM options used (from the documentation, I understand that the ncpu value should match -n, not -c):

#SBATCH -J musical
#SBATCH -n 20
#SBATCH -c 1
#SBATCH -t 24:00:00
#SBATCH --mem 20G

conda activate python37_musical
python3 Musical.py

And this is the MuSiCal script:

model = musical.DenovoSig(X, 
                          min_n_components=1, # Minimum number of signatures to test
                          max_n_components=10, # Maximum number of signatures to test
                          init='random', # Initialization method
                          method='mvnmf', # mvnmf or nmf
                          n_replicates=20, # Number of mvnmf/nmf replicates to run per n_components
                          ncpu=20, # Number of CPUs to use
                          max_iter=100000, # Maximum number of iterations for each mvnmf/nmf run
                          bootstrap=True, # Whether or not to bootstrap X for each run
                          tol=1e-8, # Tolerance for claiming convergence of mvnmf/nmf
                          verbose=1, # Verbosity of output
                          normalize_X=False # Whether or not to L1 normalize each sample in X before mvnmf/nmf
                         )
model.fit()

And the musical logs:

Extracting signatures for n_components = 1..................
Selected lambda_tilde = 2. This lambda_tilde will be used for all subsequent mvNMF runs.
/home/mmunteanu/.conda/envs/python37_musical/lib/python3.7/site-packages/musical/mvnmf.py:509: UserWarning: No p-value is smaller than or equal to 0.05. The largest lambda_tilde is selected. Enlarge the search grid of lambda_tilde.
  UserWarning)
n_components = 1, replicate 14 finished.
n_components = 1, replicate 10 finished.
n_components = 1, replicate 9 finished.
n_components = 1, replicate 18 finished.
n_components = 1, replicate 2 finished.
n_components = 1, replicate 1 finished.
n_components = 1, replicate 19 finished.
n_components = 1, replicate 7 finished.
n_components = 1, replicate 12 finished.
n_components = 1, replicate 15 finished.
n_components = 1, replicate 8 finished.
n_components = 1, replicate 6 finished.
n_components = 1, replicate 13 finished.
n_components = 1, replicate 11 finished.
n_components = 1, replicate 17 finished.
n_components = 1, replicate 3 finished.
n_components = 1, replicate 16 finished.
n_components = 1, replicate 4 finished.
n_components = 1, replicate 5 finished.
Time elapsed: 2.79e+04 seconds.
Extracting signatures for n_components = 2..................
Selected lambda_tilde = 0.1. This lambda_tilde will be used for all subsequent mvNMF runs.

Many thanks,
Maia

Hi Maia,

Thank you for raising the issue. There are several things to consider when running signature discovery on a large number of samples (~6000 in your case). But before I elaborate on those, it's good to make sure that parallel calculation is indeed working on your machine:

  • Make sure there are at least 20 CPUs on the compute node.
  • Compare the compute time for -n 20 -c 1, -n 1 -c 20, and -n 1 -c 1 (with ncpu=1) on a small test example. Make sure that -n 20 -c 1 is indeed the preferred setting on your HPC and that the expected speedup relative to the serial job is observed. This is worth checking because there might be system differences.
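
As a quick sanity check before benchmarking (a minimal sketch using only the Python standard library; it assumes a Linux node, which is the usual case on HPC), you can print how many CPUs the job is actually allowed to use. If SLURM only grants the process 1 CPU, ncpu=20 in DenovoSig will not give any speedup:

```python
import os
import multiprocessing

# CPUs physically present on the compute node
print("CPUs on node:", multiprocessing.cpu_count())

# CPUs this process may actually use (respects SLURM cgroup/affinity limits)
usable = len(os.sched_getaffinity(0))
print("CPUs usable by this job:", usable)
```

Run this at the top of the job script; ncpu in DenovoSig should not exceed the second number.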

Now, things to consider when running signature discovery on a large number of samples.

First, try changing the default parameters to speed up the calculation.
The default parameters in DenovoSig are set quite conservatively to ensure convergence of the optimization algorithm. But we have observed empirically that even with more aggressive parameters, we can still achieve fairly good results. To scale the calculation up to a large number of samples (e.g., many thousands), it does make sense to use more aggressive parameters. I've just tested the following parameters on ~6300 TCGA WES samples, and the discovered SBS signatures looked high-quality to me: min_iter=1000, max_iter=10000, conv_test_freq=100, tol=1e-5, n_replicates=20, mvnmf_lambda_tilde_grid=np.array([1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]). These parameters (1) terminate the optimization earlier and (2) use a more coarse-grained grid for selecting the best regularization parameter. For my test with ~6300 samples, the run took 2h10min with 20 CPUs on our HPC, with min_n_components=1 and max_n_components=10. Note that you can always move these parameters back towards the defaults later to check whether that makes any major difference in the results.
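
Putting these together with the constructor call from your script (this is a sketch: min_iter, conv_test_freq, and mvnmf_lambda_tilde_grid are taken from the parameter names listed above, and I'm assuming they are passed directly to DenovoSig alongside the arguments you already use):

```python
import numpy as np
import musical

model = musical.DenovoSig(
    X,
    min_n_components=1,
    max_n_components=10,
    init='random',
    method='mvnmf',
    n_replicates=20,
    ncpu=20,
    # More aggressive convergence settings: terminate each run earlier
    min_iter=1000,
    max_iter=10000,     # down from the conservative 100000
    conv_test_freq=100,
    tol=1e-5,           # looser than the default 1e-8
    # Coarser grid for selecting the mvNMF regularization strength
    mvnmf_lambda_tilde_grid=np.array(
        [1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
    ),
    bootstrap=True,
    verbose=1,
)
model.fit()
```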

Second, try using NMF results to inform mvNMF runs.
DenovoSig with NMF (method='nmf') is fast: for example, with default parameters, it took only 40min for the dataset above with ~6300 samples (again with 20 CPUs). This run will provide an estimate of the best n_components (i.e., number of signatures), or at least a range for it. Then you can run DenovoSig again, this time with mvNMF (method='mvnmf'), setting min_n_components and max_n_components around that estimate. This procedure saves computation time by narrowing the search space for n_components.
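
A sketch of this two-stage procedure (the attribute holding the selected number of signatures is an assumption here; inspect your fitted model for the actual name in your MuSiCal version):

```python
import musical

# Stage 1: fast NMF run over the full range to estimate the number of signatures
nmf_model = musical.DenovoSig(X, min_n_components=1, max_n_components=10,
                              init='random', method='nmf',
                              n_replicates=20, ncpu=20, verbose=1)
nmf_model.fit()

# Selected number of signatures; attribute name is an assumption
k = nmf_model.n_components

# Stage 2: slower mvNMF run, restricted to a narrow window around the estimate
mvnmf_model = musical.DenovoSig(X, min_n_components=max(1, k - 1),
                                max_n_components=k + 1,
                                init='random', method='mvnmf',
                                n_replicates=20, ncpu=20, verbose=1)
mvnmf_model.fit()
```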

Third, think about whether running signature discovery on the entire dataset is the optimal approach.
I don't know the details of your ~6000 samples. But for tumors, that number usually means a cohort composed of many tumor types. If that's the case, it's worth considering whether splitting them up is a more reasonable approach. For example, you can run signature discovery per tumor type. You can also try stratifying samples into distinct groups beforehand with musical.preprocessing.stratify_samples(), which can be run directly on the input matrix.
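
For example (a sketch: only the function name and the fact that it runs on the input matrix are stated above, so the return value shown here is an assumption; check the function's docstring for its actual output):

```python
import musical

# Stratify samples into distinct groups directly from the input matrix.
# The returned structure is an assumption; inspect the docstring of
# musical.preprocessing.stratify_samples for the actual return value.
strata = musical.preprocessing.stratify_samples(X)

# Then run signature discovery separately within each stratum, e.g.:
# for name, X_sub in strata.items():
#     model = musical.DenovoSig(X_sub, min_n_components=1, max_n_components=10,
#                               method='mvnmf', n_replicates=20, ncpu=20)
#     model.fit()
```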

Hope this helps!

Best,
Hu

Let me know if there are any further questions. Closing this issue for now.