KChen-lab / MEDALT

Inference of Minimal Event Distance Aneuploidy Lineage Tree based on single cell copy number profile

Support for 10X / high N

anderswe opened this issue · comments

Hi Fang and Qihan,

Very grateful for your work with MEDALT! Excited to try this out.

Do you have any recommendations for running it using 10X single cell data? i.e. datasets with high N and low read depth?

So far, I'm running out of memory (currently 180gb on our institution's cluster) with anything larger than 2k cells or so.

Thanks!
Anders

Hi, I've come across the same issue: 3 TB of memory is not enough to run on >7k cells. Any chance @anderswe found a solution? Or could @jinzhuangdou comment on this memory issue?

Are you using the latest version? We have made some optimizations for memory usage.

Hi, @jinzhuangdou

I cloned the master branch with git a few weeks ago. Is there a newer version?

@jinzhuangdou Do you think it would help with troubleshooting if I shared the input file (e.g., infercnv output)?

Input file (17K cells; infercnvpy output):
https://drive.google.com/file/d/1Osksu94leVSzvXjlLl1btnfr74mYAhsK/view?usp=drive_link

To read it:
I had to change line 24 of dataTransfer.R to data=read.csv(inputfile,sep="\t",row.names=1) so the file loads without error.
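In case it helps others hitting the same parse error, here is a small sanity check (a sketch, not part of MEDALT) that verifies a file loads as a tab-separated numeric matrix with the first column as row names, i.e. the same shape the read.csv fix expects. The filename in the commented-out call is just the one from this thread; adjust as needed.

```python
# Sanity-check a CNV matrix: tab-separated, first column as row names,
# all remaining columns numeric.
import pandas as pd

def check_cnv_matrix(path: str) -> pd.DataFrame:
    """Load a CNV matrix and fail loudly if any column is non-numeric."""
    data = pd.read_csv(path, sep="\t", index_col=0)
    bad = [c for c in data.columns if not pd.api.types.is_numeric_dtype(data[c])]
    if bad:
        raise ValueError(f"non-numeric columns: {bad[:5]}")
    print(f"{data.shape[0]} rows x {data.shape[1]} columns")
    return data

# check_cnv_matrix("BB18_indexed.txt")  # filename from this thread; adjust as needed
```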

Received, with thanks. We are testing the memory usage with your input data and will let you know once we have some ideas. Thanks!

Hi @jpark27 , could you test the new version that supports Python 3 to assess its memory usage? The main script is SC1_py_sctree.py

python3 SC1_py_sctree.py -P ./ -I ./example/scDNA.CNV.txt -D D -G hg19 -O ./example/outputDNA

Hi, @jinzhuangdou! Thank you so much for the suggestion. I tried the following command with a similar-sized input file on LSF (60 cores, 3 TB memory). However, even after 24 hours it is stuck at step 2/3, as shown below. Do you think it is normal for it to take this long (c.f. the example scRNA.CNV.txt took < 1 min with the same setup), or could something be wrong with my current input file's structure?

python3 SC1_py_sctree.py -P ./ -I ~/BB18_indexed.txt -D R -G hg38 -O ~/outputRNA -W 200

(screenshot: run stalled at step 2/3)

Hi @jpark27 , thank you for the update. It may require a large amount of memory when processing over 10K cells, especially given the iterative construction of the minimum spanning tree across all cells. Could you use hierarchical clustering to identify different branches and then employ MEDALT to build the local branch tree within each cluster? This strategy can significantly reduce memory usage while maintaining the integrity of the analysis.
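The split-then-run strategy could be sketched as follows (a sketch, not MEDALT code: it assumes a tab-separated matrix with genes/bins as rows and cells as columns, and the filenames are hypothetical):

```python
# Sketch: split cells into branches by hierarchical clustering, then run
# MEDALT on each branch separately.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

def split_cells(cnv: pd.DataFrame, n_clusters: int = 10) -> dict:
    """Ward-linkage clustering of cells (columns) on their CNV profiles;
    returns {cluster_id: sub-matrix} ready to write out for MEDALT."""
    Z = linkage(cnv.T.values, method="ward")        # one observation per cell
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return {int(k): cnv.loc[:, labels == k] for k in np.unique(labels)}

# cnv = pd.read_csv("BB18_indexed.txt", sep="\t", index_col=0)  # hypothetical path
# for k, sub in split_cells(cnv).items():
#     sub.to_csv(f"BB18_cluster{k}.txt", sep="\t")  # one MEDALT input per branch
```

Each per-cluster file could then be passed to SC1_py_sctree.py with the same command line as above.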

Hi, @jinzhuangdou! It makes sense that it needs a lot of memory (I've currently set the maximum on our LSF, so I will leave it running and check back in a few days).

That's a good idea; I will split the dataset into small chunks (cluster by cluster) and run MEDALT on each. As I'm not super bioinformatics-savvy, would you recommend any specific tool or Python package for this hierarchical clustering step before running MEDALT?

Hi, @jinzhuangdou! Hope you've been well.
I have been subsetting the input file* into hierarchical clusters as suggested and re-running the analysis, but it still seems stuck at the same step even after a few days with large memory. Could there be a systematic issue with MEDALT handling >16K genes?

#####################################################

now running SC2_RR_dataTransfer.R

#####################################################

16146/16452 genes matched in ref_seq.
saved file: 2_BB45_cluster5_bin_200.csv


Input file (17K cells; 16K genes infercnvpy output):
https://drive.google.com/file/d/1Osksu94leVSzvXjlLl1btnfr74mYAhsK/view?usp=drive_link

Input file2 (0.3K cells; 16K genes infercnvpy output):
https://drive.google.com/file/d/1arTjMpyZuj3s_NBnwglKv3lJj5xoaLsn/view?usp=drive_link