Read the shifted bam and continue the QC analysis

Question

XiaoyuZhan520 opened this issue 2 years ago · comments

Hello Jianhong,

Thanks for your excellent work!

My BAM file is so large that the analysis consumes a lot of resources.
To reduce the run time, I tried to perform the analysis step by step. This worked well for the bamQC and shiftGAlignmentsList steps, but it failed when I used the shifted BAM as input for the PTscore analysis.

Could you please help with the problem? Many thanks in advance.

bamfile <- 'shifted.bam'
gal1 <- readBamFile(bamfile, tag=tags, which=which, asMates=TRUE, bigFile=FALSE)
pt <- PTscore(gal1, txs)

Error in PTscore(gal1, txs) : is(obj, "GAlignments") is not TRUE

ZHAN Xiaoyu · Answer 1 · Wed Aug 03 2022 16:43:55 GMT+0800 (China Standard Time)

Besides, some of my samples never finish the bamQC step. These are ~30G-40G BAMs, and the bamQC step costs 1,000G to run. Could you give me some suggestions for handling these big files? Thanks!

JIANHONG OU · Answer 2 · Wed Aug 10 2022 20:17:17 GMT+0800 (China Standard Time)

Hi Xiaoyu,

Thank you for trying ATACseqQC to analyze your data. I am sorry for the late reply. Yes, this is a known issue. I will put rewriting the bamQC step on my TODO list. For now, you can use samtools stats to get some information about the file, and use samtools to filter chrM out of your BAM file.

Jianhong.
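For readers who prefer to stay inside R, the chrM filtering suggested above can be sketched with Rsamtools instead of the samtools command line. This swaps in a different tool from the one named in the reply. The function name `drop_contig`, its arguments, and the contig name "chrM" are assumptions for illustration (some references call the mitochondrial contig "MT"), and the input BAM is assumed to be coordinate-sorted and indexed.

```r
# Sketch only: drop one contig (by default chrM) from an indexed BAM with Rsamtools.
# `drop_contig`, `in_bam`, `out_bam` and `contig` are hypothetical names.
drop_contig <- function(in_bam, out_bam, contig = "chrM") {
  # Per-contig summary read from the BAM index (comparable to `samtools idxstats`).
  stats <- Rsamtools::idxstatsBam(in_bam)
  # Keep every real contig except the one to drop; "*" is the unmapped bucket.
  keep <- stats[!(as.character(stats$seqnames) %in% c(contig, "*")) &
                  stats$seqlength > 0, ]
  # One range per retained contig, spanning its full length.
  which <- GenomicRanges::GRanges(
    seqnames = as.character(keep$seqnames),
    ranges = IRanges::IRanges(start = 1, end = keep$seqlength)
  )
  # Write only the reads that overlap the retained contigs to `out_bam`.
  Rsamtools::filterBam(in_bam, out_bam,
                       param = Rsamtools::ScanBamParam(which = which))
}

# Usage (not run): drop_contig("sample.bam", "sample.noChrM.bam")
```

Reads whose mate maps to the dropped contig are not removed by this sketch, so the result may differ slightly from a samtools-based filter.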

RXCoux · Answer 3 · Thu Nov 03 2022 22:54:24 GMT+0800 (China Standard Time)

Hello, thanks for making such a complete and useful package. Sorry to bother you, but I've also been running into this kind of issue. I shifted my BAM files (quite big, 10-20 GB) and would like to calculate their TSS scores. I am doing the following:

bamfile<-'shifted.bam'
gal1 <- readBamFile(bamfile, tag=character(0), asMates=TRUE)

tsse <- TSSEscore(gal1, txs)
tsse$TSSEscore

I keep running into the following error: "Error in TSSEscore(gal1, txs) : is(obj, "GAlignments") is not TRUE"

I checked my shifted BAMs and the reads are shifted (at least the first and last 100). Could you please let me know how a BAM file comes to be considered a GAlignments object, and whether we could force it to be one? Thanks in advance,
Rémi

SergioRodLla · Answer 4 · Thu Dec 22 2022 17:28:23 GMT+0800 (China Standard Time)

Hello, RXCoux. I have the same problem, and from what I understand it comes from the argument asMates=TRUE in the readBamFile() function. If it is set to FALSE, the function stores the output as a GAlignments object instead of a GAlignmentsList, which is what you are getting. I loaded the file with the argument set to FALSE, but the function documentation says that it then interprets the BAM file as single-end reads instead of paired-end. I do not know if this is the right way to go; in any case, TSSEscore() works this way.

Please tell me if this worked for you and your thoughts about this.

Thank you,
Sergio
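The workaround described in this reply can be sketched as a small helper that reads the shifted BAM with asMates=FALSE and checks the class before any scoring. The name `read_shifted_bam` is hypothetical, and the sketch inherits the caveat raised above: the reads are then treated as single-end.

```r
# Sketch only: read an already-shifted BAM as a GAlignments object.
# `read_shifted_bam` is a hypothetical helper, not part of ATACseqQC.
read_shifted_bam <- function(bamfile, which = NULL) {
  # asMates = FALSE is what makes readBamFile() return GAlignments
  # rather than a GAlignmentsList.
  args <- list(bamfile, tag = character(0), asMates = FALSE, bigFile = FALSE)
  if (!is.null(which)) args$which <- which  # optional region restriction
  gal <- do.call(ATACseqQC::readBamFile, args)
  # Fail early with the same condition that PTscore()/TSSEscore() check.
  stopifnot(methods::is(gal, "GAlignments"))
  gal
}

# Usage (not run), assuming the mm10 annotation package is installed:
# txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene::TxDb.Mmusculus.UCSC.mm10.knownGene
# txs  <- GenomicFeatures::transcripts(txdb)
# gal1 <- read_shifted_bam("shifted.bam")
# tsse <- ATACseqQC::TSSEscore(gal1, txs)
# tsse$TSSEscore
```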

RXCoux · Answer 5 · Fri Dec 23 2022 07:18:33 GMT+0800 (China Standard Time)

Hi Sergio, thanks for the tip, it indeed worked and I was able to calculate and plot TSS scores.

I am, however, getting low scores (circa 6-7; I have only tested a subset of my BAMs and need to do it more extensively).

I am also using TxDb.Mmusculus.UCSC.mm10.knownGene, as you mentioned in issue #54, but I was wondering whether processing the BAMs as SE could cause this. Maybe @jianhong could enlighten us?

Thanks again and happy holidays !

SergioRodLla · Answer 6 · Fri Dec 23 2022 17:05:22 GMT+0800 (China Standard Time)

Hi @RXCoux, did you use TSSEscore() with the default arguments? If so, the scores you get may be lower than expected when compared to the ATAC-seq ENCODE standards, as I mentioned in issue #54.

I am also wondering whether the data type the BAM file is read into affects the downstream analyses. In my case, I tested TSSEscore() on only one sample, subsetting the chromosomes as follows:

seqlev <- c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10",
            "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19",
            "chrX", "chrY") ## subsample data
seqinformation <- seqinfo(TxDb.Mmusculus.UCSC.mm10.knownGene)
which <- as(seqinformation[seqlev], "GRanges")
gal <- readBamFile(bam, tag=tags, which=which, asMates=TRUE, bigFile=TRUE)
shiftedBamfile <- file.path(outPath, ".shifted.bam")
gal1 <- shiftGAlignmentsList(gal, outbam=shiftedBamfile)

This is because some chromosomes in TxDb.Mmusculus.UCSC.mm10.knownGene are listed as "_random" or something similar and were not present in my reads, so I filtered them out.

Also, I loaded the BAM file previously stored by
gal1 <- shiftGAlignmentsList(gal, outbam=shiftedBamfile)
using
gal2 <- readBamFile(bam, tag=tags, which=which, asMates=FALSE, bigFile=TRUE)
and computed TSSEscore; gal1 and gal2 gave me the same value, although gal2 is smaller in size and is treated as SE, not PE like my original BAM file before shifting.

Thank you and happy holidays to you too!
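The hand-typed chromosome list in the post above can also be built programmatically. The sketch below uses only base R; the contig names in `all_names` are a toy input chosen for illustration, and the rule "drop any name containing an underscore" is an assumption about how the unplaced and "_random" contigs are named.

```r
# Sketch only: build the mouse chromosome list without typing it out.
seqlev <- paste0("chr", c(1:19, "X", "Y"))

# Alternative: start from a vector of contig names and drop the
# "_random"-style entries by pattern (toy input for illustration).
all_names <- c("chr1", "chr1_GL456210_random", "chrUn_GL456239", "chrM")
standard <- all_names[!grepl("_", all_names)]
```

Note that the pattern-based variant keeps chrM, so it still has to be removed separately if it should be excluded.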

JIANHONG OU · Answer 7 · Fri Dec 23 2022 20:04:04 GMT+0800 (China Standard Time)

Hi Both,
Thank you for the informative discussion. Yes, you can try to clean your input BAM file to decrease the memory cost. The catch is that this makes the score not comparable with the ENCODE standard. If the score is much higher than the cutoff, it will be OK for downstream analysis. If the score is near the cutoff, it will be confusing for people to decide what to do. ENCODE also changed some default parameters for the computation (e.g. changing the range from 1000 to 2000) and introduced one more smoothing step. It is my fault that I did not keep track of these changes, so this value no longer exactly follows the ENCODE standard scripts. I hope this message helps.
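As a rough illustration of the range change mentioned above, TSSEscore() exposes upstream and downstream arguments, so a wider window can be tried as sketched below. This is an assumption about how the ENCODE "range" maps onto those arguments, it does not reproduce the extra smoothing step, and the resulting score should not be read as an ENCODE-equivalent value. The helper name `tsse_with_flank` and its `flank` argument are hypothetical.

```r
# Sketch only: compute the TSS enrichment score with a wider flanking window.
# `tsse_with_flank` and `flank` are hypothetical names.
tsse_with_flank <- function(gal, txs, flank = 2000) {
  # `gal` must be a GAlignments object and `txs` a GRanges of transcripts.
  ATACseqQC::TSSEscore(gal, txs, upstream = flank, downstream = flank)
}

# Usage (not run):
# tsse <- tsse_with_flank(gal1, txs)
# tsse$TSSEscore
```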