KChen-lab / MEDALT

Inference of Minimal Event Distance Aneuploidy Lineage Tree based on single cell copy number profile

Support for 10X / high N

anderswe opened this issue · comments

Hi Fang and Qihan,

Very grateful for your work with MEDALT! Excited to try this out.

Do you have any recommendations for running it using 10X single cell data? i.e. datasets with high N and low read depth?

So far, I'm running out of memory (currently 180gb on our institution's cluster) with anything larger than 2k cells or so.

Thanks!
Anders

Hi, I've come across the same issue: 3 TB of memory is not enough to run on >7k cells. Any chance @anderswe found a solution? Or could @jinzhuangdou comment on this memory issue?

Are you using the latest version? We have made some optimizations for memory usage.

Hi, @jinzhuangdou

I cloned the master branch with git a few weeks ago. Is there a newer version?

@jinzhuangdou Do you think it would help with troubleshooting if I shared the input file (e.g., infercnv output)?

Input file (17K cells; infercnvpy output):
https://drive.google.com/file/d/1Osksu94leVSzvXjlLl1btnfr74mYAhsK/view?usp=drive_link

To read it:
I had to change line 24 of dataTransfer.R to data=read.csv(inputfile,sep="\t",row.names=1) so the file loads without error.
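In case it helps others hitting the same parse error, here is a small sanity check (a sketch, not part of MEDALT) that verifies a file loads as a tab-separated numeric matrix with the first column as row names, i.e. the same shape the read.csv fix expects. The filename in the commented-out call is just the one from this thread; adjust as needed.

```python
# Sanity-check a CNV matrix: tab-separated, first column as row names,
# all remaining columns numeric.
import pandas as pd

def check_cnv_matrix(path: str) -> pd.DataFrame:
    """Load a CNV matrix and fail loudly if any column is non-numeric."""
    data = pd.read_csv(path, sep="\t", index_col=0)
    bad = [c for c in data.columns if not pd.api.types.is_numeric_dtype(data[c])]
    if bad:
        raise ValueError(f"non-numeric columns: {bad[:5]}")
    print(f"{data.shape[0]} rows x {data.shape[1]} columns")
    return data

# check_cnv_matrix("BB18_indexed.txt")  # filename from this thread; adjust as needed
```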

Received, with thanks. We are testing the memory usage with your input data and will let you know once we have some ideas. Thanks!

Hi @jpark27 , could you test the new version that supports Python 3 to assess its memory usage? The main script is SC1_py_sctree.py

python3 SC1_py_sctree.py -P ./ -I ./example/scDNA.CNV.txt -D D -G hg19 -O ./example/outputDNA

Hi, @jinzhuangdou! Thank you so much for the suggestion. I tried the following command with a similar-sized input file on LSF (60 cores, 3 TB memory). However, even after 24 hours it is stuck at step 2/3, as shown below. Do you think it is normal for it to take this long (c.f. the example scRNA.CNV.txt took < 1 min with the same setup), or could something be wrong with my current input file's structure?

python3 SC1_py_sctree.py -P ./ -I ~/BB18_indexed.txt -D R -G hg38 -O ~/outputRNA -W 200

(screenshot: run stalled at step 2/3)

Hi @jpark27 , thank you for the update. It may require a large amount of memory when processing over 10K cells, especially given the iterative construction of the minimum spanning tree across all cells. Could you use hierarchical clustering to identify different branches and then employ MEDALT to build the local branch tree within each cluster? This strategy can significantly reduce memory usage while maintaining the integrity of the analysis.
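The split-then-run strategy could be sketched as follows (a sketch, not MEDALT code: it assumes a tab-separated matrix with genes/bins as rows and cells as columns, and the filenames are hypothetical):

```python
# Sketch: split cells into branches by hierarchical clustering, then run
# MEDALT on each branch separately.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

def split_cells(cnv: pd.DataFrame, n_clusters: int = 10) -> dict:
    """Ward-linkage clustering of cells (columns) on their CNV profiles;
    returns {cluster_id: sub-matrix} ready to write out for MEDALT."""
    Z = linkage(cnv.T.values, method="ward")        # one observation per cell
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return {int(k): cnv.loc[:, labels == k] for k in np.unique(labels)}

# cnv = pd.read_csv("BB18_indexed.txt", sep="\t", index_col=0)  # hypothetical path
# for k, sub in split_cells(cnv).items():
#     sub.to_csv(f"BB18_cluster{k}.txt", sep="\t")  # one MEDALT input per branch
```

Each per-cluster file could then be passed to SC1_py_sctree.py with the same command line as above.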

Hi, @jinzhuangdou! It makes sense that it needs a lot of memory (I've currently set the maximum on our LSF, so I will leave it running and check back in a few days).

That's a good idea; I will split the dataset into small chunks (cluster by cluster) and run MEDALT on each. As I'm not super bioinformatics-savvy, would you recommend any specific tool or Python package for this hierarchical clustering step before running MEDALT?

Hi, @jinzhuangdou! Hope you've been well.
I have been subsetting the input file* into hierarchical clusters as suggested and re-running the analysis, but it still seems stuck at the same step even after a few days with large memory. Could there be a systematic issue with MEDALT handling >16K genes?

#####################################################

now running SC2_RR_dataTransfer.R

#####################################################

16146/16452 genes matched in ref_seq.
saved file: 2_BB45_cluster5_bin_200.csv


Input file (17K cells; 16K genes infercnvpy output):
https://drive.google.com/file/d/1Osksu94leVSzvXjlLl1btnfr74mYAhsK/view?usp=drive_link

Input file2 (0.3K cells; 16K genes infercnvpy output):
https://drive.google.com/file/d/1arTjMpyZuj3s_NBnwglKv3lJj5xoaLsn/view?usp=drive_link