CSI indexes

Question

MorexV3CAGE opened this issue 8 months ago

Hi,
we wanted to analyze our ATAC-seq data from plants using csaw, but the problem is that the BAM indexes need to be in the CSI format due to the size of the chromosomes. Is there a way to use csaw with these indexes? Or do you maybe plan to make an option to use these indexes in the future?
If you have any suggestions maybe even for a different software that works well with these indexes, I would appreciate it as well.

Thank you!

Aaron Lun · Answer 1 · Fri Oct 20 2023 17:05:39 GMT+0800 (China Standard Time)

I think this is pretty reasonable, but I'm quite busy. Can you make an MRE with some small mock data to help me out?

In particular, I would like to know how far we can go with the current code. You might actually be able to create a Rsamtools::BamFile() instance with index= set to a CSI file and pass that to csaw::windowCounts() or related functions; that CSI file path should then get passed through to the C++ code. Fingers crossed, the latest version of Rhtslib might be able to read the CSI file, in which case everything might just work as-is.

MorexV3CAGE · Answer 2 · Thu Oct 26 2023 13:38:20 GMT+0800 (China Standard Time)

Thank you for your response. I have tried it with Rsamtools::BamFile() and CSI indices, and it has worked so far. But my R session crashes when it gets to filterWindowsLocal. This is due to the size of the dataset: while the filter is being calculated, RAM usage reaches 100GB+, and even though I have 256GB (using Jupyter notebook rstudio), it crashes. I tried it with a smaller portion of the data and it worked, so I guess I will have to split up the process somehow. Or would you have any recommendations regarding this function?
Otherwise, I think, if I don't come across any future problems, the Rsamtools::BamFile() is the best and easiest solution.

Aaron Lun · Answer 3 · Sat Nov 11 2023 16:42:44 GMT+0800 (China Standard Time)

Sorry for the late reply. I've never seen it use so much memory before. I guess if you have super-long chromosomes, it'll create a large matrix to accommodate all of the windows. How long are your chromosomes in total, how many samples do you have, and what window sizes/spacings are you using?

MorexV3CAGE · Answer 4 · Tue Nov 14 2023 19:50:53 GMT+0800 (China Standard Time)

Hi, yeah, I guess it would be due to the size of the dataset. The chromosomes are as follows:
sequence length
Chr1A 601925861
Chr1B 720616616
Chr2A 802176689
Chr2B 824672899
Chr3A 758701763
Chr3B 866600556
Chr4A 772123497
Chr4B 703802483
Chr5A 723594571
Chr5B 742439866
Chr6A 627992934
Chr6B 739292552
Chr7A 753887139
Chr7B 749354956
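
These lengths also show why CSI indexes were needed in the first place: the BAI index format can only address reference sequences up to 2^29 bases (512 Mbp), and every chromosome listed here is longer than that. A minimal Python sketch of the check follows; the helper name needs_csi is hypothetical and used only for illustration.

```python
# Maximum reference sequence length a BAI index can address: 2^29 bases (512 Mbp).
BAI_MAX_LENGTH = 2**29

# Chromosome lengths as posted in this thread.
chrom_lengths = {
    "Chr1A": 601925861, "Chr1B": 720616616,
    "Chr2A": 802176689, "Chr2B": 824672899,
    "Chr3A": 758701763, "Chr3B": 866600556,
    "Chr4A": 772123497, "Chr4B": 703802483,
    "Chr5A": 723594571, "Chr5B": 742439866,
    "Chr6A": 627992934, "Chr6B": 739292552,
    "Chr7A": 753887139, "Chr7B": 749354956,
}


def needs_csi(lengths, limit=BAI_MAX_LENGTH):
    """Return the names of sequences that are too long for a BAI index."""
    return [name for name, length in lengths.items() if length > limit]


too_long = needs_csi(chrom_lengths)
print(f"{len(too_long)} of {len(chrom_lengths)} chromosomes exceed {BAI_MAX_LENGTH:,} bp")
```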

And there are 7 samples with 3 replicates each.
The maximum fragment size for readParam was 200, and the window width for windowCounts was also 200.
We tried increasing this to windows of 2000 for regionCounts, and that is where the RAM usage climbed so high.
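
To put rough numbers on the "large matrix to accommodate all of the windows" idea mentioned earlier in the thread, here is a back-of-the-envelope Python sketch. It is not how csaw allocates memory internally. It assumes a dense matrix with one 4-byte integer per window per library and no windows filtered out, so it is only an upper-bound style estimate. The spacing values are hypothetical, since the spacing actually used was not stated, and the function name is made up for illustration.

```python
import math


def dense_count_matrix_bytes(chrom_lengths, spacing, n_libraries, bytes_per_count=4):
    """Bytes for a dense windows-by-libraries count matrix (assumed layout)."""
    # One window every `spacing` bp along each chromosome.
    n_windows = sum(math.ceil(length / spacing) for length in chrom_lengths)
    return n_windows * n_libraries * bytes_per_count


# Chromosome lengths as posted in this thread (Chr1A ... Chr7B).
chrom_lengths = [
    601925861, 720616616, 802176689, 824672899, 758701763, 866600556,
    772123497, 703802483, 723594571, 742439866, 627992934, 739292552,
    753887139, 749354956,
]

n_libraries = 7 * 3  # 7 samples with 3 replicates each

# Hypothetical spacings, chosen only to show how the estimate scales.
for spacing in (50, 200, 2000):
    gib = dense_count_matrix_bytes(chrom_lengths, spacing, n_libraries) / 2**30
    print(f"spacing {spacing:>5} bp: ~{gib:.1f} GiB")
```

The estimate scales inversely with the spacing, so a count matrix alone is unlikely to tell the whole story; intermediate copies made during filtering would add to it.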

But in the end, we didn't need to use this function, so our analysis is finished. Thanks for the CSI index loading idea.