nghiavtr / FuSeq

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Missing Fusions

mvheetve opened this issue · comments

Hi,

very satisfied with the speed and overall performance of the tool! We've analyzed 40 samples so far that were previously analyzed with FusionCatcher and for which we've experimentally validated several fusions. For 37/40 samples FuSeq results lead us to the same conclusions as FusionCatcher. For three samples however, we could not find the validated fusions in the FuSeq output or in the RData file. Could you advise on what might be the problem here? Fuseq parameters:
#parameter setting readStrands="RF" chromRef=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y onlyProteinCodingGenes=FALSE maxSharedCount=5e-2 minGeneDist=1e4 minJunctionDist=1e5 maxInvertedFusionCount=0 minMR=2 minNonDupMR=2 maxMRfusionFc=2 maxMRfusionNum=2 sgtMRcount=10 minSR=1 minScore=3 exonBoundary=FALSE keepRData=TRUE exportFasta=TRUE

Hi @mvheetve ,

Sorry for replying late. It is hard to know why three samples do not give the results from your description. If you can send me the output of FuSeq and the list of validated fusions, I will investigate those cases.

Best,
Nghia

Hi @nghiavtr,

thanks for replying. The .RData files for the three samples can be found here:
https://we.tl/t-DCmPUjJkP6
18B5649 should have an ALK (ENSG00000171094) rearrangement, M19-1214 a STIL-TAL (ENSG00000123473-ENSG00000162367) fusion and in M15-2021 we expect to see a BCR-ABL1 (ENSG00000186716-ENSG00000097007) fusion.

In 37/40 samples we identified the same fusion that was previously identified at another site without prior knowledge of the expected result. Just wondering if I'm missing something in these three samples.

Kind regards and thanks,
Mattias

Hi Mattias,

I have a check for the individual samples from the raw output of FuSeq, you can see the codes below. As the results, FuSeq discovered "STIL-TAL" as expected in sample "M19-1214_S15", but with only one supporting split read, therefore it was filtered out in the final results. Two other samples we can not find the expected fusions. I think might be due to the fusions have too low coverages.

I saw FuSeq.params$readStrands of "RF" which indicated the sequencing with strandness. If it is not truly stranded, we should set FuSeq.params$readStrands="UN", then FuSeq can consider also the reads with "FR" direction, thus increasing the sensitivity for low expressed fusions.

I hope this would help. Thanks again for using FuSeq in your research!

Best,
Nghia

##############
load("M19-1214_S15FuSeq_process.RData")
#expect STIL-TAL (ENSG00000123473-ENSG00000162367)
fg="ENSG00000123473-ENSG00000162367"

pick=FuSeq.SR$fusionGene$name12 %in% fg | FuSeq.SR$fusionGene$name21 %in% fg
x=FuSeq.SR$fusionGene[pick,]
x$supportCount #only 1 split read 

pick=FuSeq.MR$feqInfo$fgeList$name12 %in% fg | FuSeq.MR$feqInfo$fgeList$name21 %in% fg
y=FuSeq.MR$feqInfo$fgeList[pick,]
y$supportCount #no mapped reads


##############
load("M15-2021_S11FuSeq_process.RData")
#expect BCR-ABL1 (ENSG00000186716-ENSG00000097007)
fg="ENSG00000186716-ENSG00000097007"

pick=FuSeq.SR$fusionGene$name12 %in% fg | FuSeq.SR$fusionGene$name21 %in% fg
x=FuSeq.SR$fusionGene[pick,]
x$supportCount #no split reads

pick=FuSeq.MR$feqInfo$fgeList$name12 %in% fg | FuSeq.MR$feqInfo$fgeList$name21 %in% fg
y=FuSeq.MR$feqInfo$fgeList[pick,]
y$supportCount #no mapped reads

##############
load("18B5649_S5FuSeq_process.RData")
#expect ALK (ENSG00000171094) rearrangement
g="ENSG00000171094"

pick=FuSeq.SR$fusionGene$gene1 %in% g | FuSeq.SR$fusionGene$gene2 %in% g
x=FuSeq.SR$fusionGene[pick,]
x$supportCount #no split reads

pick=FuSeq.MR$feqInfo$fgeList$gene1 %in% g | FuSeq.MR$feqInfo$fgeList$gene2 %in% g
y=FuSeq.MR$feqInfo$fgeList[pick,]
y$supportCount #no mapped reads

Hi Nghia,

thank you for looking into this. I'll rerun the samples as an unstranded library and see what comes up. Might be that the data is just not in the fastqs, as the independent lab which provided us with the expected outcomes has informed me that these three samples were never sequenced, but fusions were discovered using FISH.

I'll keep you posted!
Mattias

Hi Nghia,

I found five ALK fusion reads in 18B5649 using the "UN" library setting. The output for the other two samples remained the same. I will try properly mapping the reads first in order to determine if they exist at all.

Thank you for the help
Mattias

Thank you and good lucks

Nghia

Hi it's me again :)

We're trying a new capture method and library prep for fusion detection and are therefore rerunning some samples. The STIL-TAL fusion that I asked about earlier in sample M19-1214 (had one split read) again wasn't reported. I fixed this by making not only the gene distance but also the junction distance smaller (10^5 -> 10^4), so that fixes that and it is now clearly reported.

Unfortunately the FUS-ERG we found in 18B11873 and the SS18-SSX2 in 19B18362 were not reported this time (they were before using the previous capture and prep and were validated using fish). The genes/junctions are far apart so that can't be it. For both there are a significant amount of split reads in the data but for some reason FuSeq doesn't report them anymore. For 19B18362 FuSeq does however report SS18-SSX1/3, which have less split reads than SS18-SSX2.

I've put the data of these samples here:
https://wetransfer.com/downloads/3e5e02829ca0336a74c38b4e5b91e8b120200525120455/1288a923a570e49a24bc6ccf5d39a34c20200525120521/8360f6

If you find the time it would be much appreciated if you could uncover why FuSeq doesn't retain them after filtering. I've try different parameters but no luck yet. Hoping you can make more sense of it.

Regards
Mattias

Dear Mattias,

Sorry for the late reply.

Well, I have investigated your samples with the missing fusions. The missing fusions are all discovered by FuSeq at the first step, you can check they appear in FuSeq.SR$myFusionFinal and/or FuSeq.MR$myFusionFinal.

However, I found the main reason for the missing is that one gene has too many partners. This makes the fusions do not pass the strict filters and statistical tests of FuSeq in the post-processing phases.

This, of course, can be solved by relaxing the conditions, however since the numbers of fusions from the output of the less strict filters and tests (FuSeq.SR$myFusionFinal and/or FuSeq.MR$myFusionFinal) are small, I would suggest you using directly fusion from those. Thus, we can increase sensitivity. So, the output is simply the combination of the fusions from them. I prepared a small code to collect the fusions from FuSeq.SR$myFusionFinal and FuSeq.MR$myFusionFinal below, hope it would help.

Best,
Nghia

load("FuSeq_process.RData")
#number of fusion genes from mapped reads and split reads
length(unique(FuSeq.SR$myFusionFinal$name12))
length(unique(FuSeq.MR$myFusionFinal$name21))

#process split read results
x=FuSeq.SR$myFusionFinal
fgeSetAll=paste(x$name12,x$brchposEx5,x$brchposEx3,sep="__")
fgeSet=unique(fgeSetAll)
keepID=NULL
for (i in 1:length(fgeSet)){
  myID=which(fgeSetAll==fgeSet[i])
  keepID=c(keepID,myID[which.max(x$supportCount[myID])])
}
x=x[keepID,]
x=x[order(x$supportCount,decreasing = TRUE),]

#collect results
myFusionFinal.SR=x
myFusionFinal.MR=FuSeq.MR$myFusionFinal

# Finally, myFusionFinal.SR and myFusionFinal.MR can be combined together to get the final fusion list. We also can filter them out by count (>=2) if needed
# Good lucks