Is there a way to get the exact genomic coordinates of the fusion breakpoints (left and right)? The output (FuSeq_WES_SR_fge.txt or FuSeq_WES_SR_fge_fdb) contains a transcript range rather than the exact breakpoint of the fusion genes. Alternatively, how can I get the exon number (rank), or even the intron number, in the output itself?


spKrispy opened this issue 2 years ago · comments

Satya Prakash Khuntia commented 2 years ago

I am asking because I am getting 1300-1500 fusion calls in my samples. The fdb option is working great, but some fusions are significant only if they occur between certain exons. For example, ALK fusions are significant if they occur in exons 18, 19, 20, or 21.

Trung Nghia Vu · Answer 1 · Thu Jun 09 2022 19:48:44 GMT+0800 (China Standard Time)

Hi @spKrispy,

If you use the split-read information, you have the mapping information for the two pieces of each split read, one for the 3' partner gene and the other for the 5' partner gene. Thus, technically, you can get the exact chromosomal position of the breakpoints from the mapping information of the split reads.

I think you can directly use the results in the *.BEDPE file produced by
https://github.com/nghiavtr/FuSeq_WES/blob/main/convert_all_split_reads_bedpe.R, as discussed in issue #2.
For a fusion gene A-B, the breakpoints should be in the columns "end1" (gene A) and "start2" (gene B) of the *.BEDPE file.
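
As a rough illustration of reading those two columns, the sketch below builds a toy BEDPE-like table in base R and pulls "end1" and "start2" for one fusion. The column names follow the ones mentioned above; the coordinates, the fusion names, the one-row-per-fusion layout, and the helper `get_breakpoints` are all assumptions made up for illustration, so the real output of the script may look different.

```r
# Toy BEDPE-like table; every value here is invented for illustration.
bedpe <- data.frame(
  chrom1 = c("chr1", "chr3"),
  start1 = c(1000, 5000),
  end1   = c(1150, 5120),
  chrom2 = c("chr2", "chr4"),
  start2 = c(20300, 70010),
  end2   = c(20450, 70160),
  name   = c("geneA-geneB", "geneC-geneD"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the candidate breakpoints of one fusion,
# taking end1 for the first gene and start2 for the second gene.
get_breakpoints <- function(bedpe, fusion) {
  hit <- bedpe[bedpe$name == fusion, , drop = FALSE]
  data.frame(
    name        = hit$name,
    chrom1      = hit$chrom1,
    breakpoint1 = hit$end1,
    chrom2      = hit$chrom2,
    breakpoint2 = hit$start2,
    stringsAsFactors = FALSE
  )
}

bp <- get_breakpoints(bedpe, "geneA-geneB")
print(bp)
```

Whether these columns hold the true breakpoint depends on how the script aggregates the split reads of a fusion, so it is worth checking a few calls against the read-level data.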

To identify the exon carrying or neighboring the breakpoints, you can use the information from the annotation reference. For example, if using hg38, you can run these commands to build an sqlite file of the annotation (hg38.refGene.sqlite)

# download the hg38 GTF containing the gene annotation
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
gunzip hg38.refGene.gtf.gz
# create the sqlite file
Rscript FuSeq_WES_v1.0.0/createSqlite.R hg38.refGene.gtf hg38.refGene.sqlite

Then, using R, you can get exon information of a gene and compare it to your breakpoint, simply like below

## in R
library(GenomicFeatures)
anntxdb <- loadDb("hg38.refGene.sqlite")
allgenes <- genes(anntxdb, single.strand.genes.only = FALSE)
# get the exon data of all genes
genes.exon.map <- select(anntxdb, keys = names(allgenes), columns = c("GENEID","EXONID","EXONSTART","EXONEND","EXONSTRAND","EXONCHROM"), keytype = "GENEID")

# an example with one breakpoint
mygene <- "ANKRD20A8P"
mychr <- "chr2"
myBr <- 94761039

# keep the exons of this gene on the breakpoint's chromosome
myexonDat <- genes.exon.map[genes.exon.map$GENEID == mygene & genes.exon.map$EXONCHROM == mychr,]
# signed distance from the breakpoint to the start and end of every exon
# (the breakpoint lies inside an exon when dis2start <= 0 and dis2end >= 0)
myexonDat$dis2start <- myexonDat$EXONSTART - myBr
myexonDat$dis2end <- myexonDat$EXONEND - myBr
myexonDat
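
To go from distances to an exon or intron number, one possible next step is sketched below in base R. It assumes the exon table has been reduced to the non-overlapping exons of a single transcript (the table above pools exons from all transcripts of a gene, so ranks are not defined there without that step). The helper `locate_breakpoint` and the toy coordinates are hypothetical. The TxDb should also expose "EXONRANK" and "TXNAME" columns through select(), which may be a more direct route to ranks.

```r
# Toy exon table for one transcript of one gene; coordinates are invented.
exons <- data.frame(
  EXONSTART = c(100, 400, 800),
  EXONEND   = c(200, 500, 900)
)

# Hypothetical helper: label a breakpoint as "exon k" or "intron k".
# Exons are ranked in transcript order, so the order is reversed on the
# minus strand. Intron k is the gap between exon k and exon k + 1.
locate_breakpoint <- function(exons, br, strand = "+") {
  ord <- order(exons$EXONSTART, decreasing = (strand == "-"))
  ex <- exons[ord, , drop = FALSE]
  inside <- which(ex$EXONSTART <= br & br <= ex$EXONEND)
  if (length(inside) > 0) {
    return(paste("exon", inside[1]))
  }
  if (br < min(ex$EXONSTART) || br > max(ex$EXONEND)) {
    return("outside gene")
  }
  # Count the exons that come entirely before the breakpoint in transcript order
  if (strand == "-") {
    k <- sum(ex$EXONSTART > br)
  } else {
    k <- sum(ex$EXONEND < br)
  }
  paste("intron", k)
}

print(locate_breakpoint(exons, 450))        # "exon 2"
print(locate_breakpoint(exons, 300))        # "intron 1"
print(locate_breakpoint(exons, 300, "-"))   # "intron 2"
```

With a label like this per breakpoint, calls can then be filtered to the exons of interest, for example the ALK exons mentioned in the question.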

Best,
Nghia

Johnny Zhang · Answer 2 · Fri Nov 24 2023 11:26:12 GMT+0800 (China Standard Time)


Hi. I tested the script mentioned above (convert_all_split_reads_bedpe.R), and I don't think the columns "end1" of gene A and "start2" of gene B in the *.BEDPE file are the breakpoints of a fusion.

Trung Nghia Vu · Answer 3 · Sun Nov 26 2023 15:44:35 GMT+0800 (China Standard Time)

Dear @yiyinzhang ,

Thank you for using FuSeq_WES.
Theoretically, a split read of a geneA-geneB fusion should contain the breakpoint of the fusion. However, the calculation in the code may be incorrect for complicated cases, for example (I guess) when there are many split reads for the same genes but the breakpoints are highly heterogeneous.
Could you please provide the split-read data of a fusion for which the script does not work, along with your expected breakpoint? Then I can revise and improve the script.
Many thanks!

Best,
Nghia

Johnny Zhang · Answer 4 · Fri Dec 29 2023 18:32:40 GMT+0800 (China Standard Time)

Hi @nghiavtr,

Sorry for my late response.

I examined the code in the script convert_all_split_reads_bedpe.R and found that the block below may not be correct. It simply takes the minimum start position and the maximum end position across all the tx of a fusion name, and these positions are then reported as start1, end1, start2 and end2 for the fusion pair. That does not make sense as a breakpoint estimate.

minStart1=tapply(bedpe$start1,bedpe$name,min)
maxEnd1=tapply(bedpe$end1,bedpe$name,max)
minStart2=tapply(bedpe$start2,bedpe$name,min)
maxEnd2=tapply(bedpe$end2,bedpe$name,max)
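
A small toy case may make the concern concrete. The sketch below applies the same min/max aggregation to four invented split reads of one fusion, three of which agree on "end1" and one of which is an outlier. The data are made up and only the gene-1 columns are shown.

```r
# Toy split-read table: four reads support geneA-geneB; the last is an outlier.
bedpe <- data.frame(
  name   = rep("geneA-geneB", 4),
  start1 = c(1000, 1010, 1005, 700),
  end1   = c(1150, 1150, 1150, 1400),
  stringsAsFactors = FALSE
)

# Same aggregation as in the block above, restricted to gene 1
minStart1 <- tapply(bedpe$start1, bedpe$name, min)
maxEnd1   <- tapply(bedpe$end1, bedpe$name, max)

# Three of the four reads end at 1150, but the aggregate reports 1400
print(c(minStart1 = unname(minStart1), maxEnd1 = unname(maxEnd1)))
print(table(bedpe$end1))
```

The reported interval covers every read, so a single discordant read is enough to move the reported end away from the position most reads support.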

I wrote a script to replace this block of code. The code below finds the most frequently occurring start and end positions across all the tx of a fusion name, and these are then regarded as the breakpoint.

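# Return the most frequent value of x. On ties, which.max() keeps the first
# (smallest) value, so tapply() always gets a single number per fusion.
mode_fun <- function(x){
    counts <- table(x)
    return(as.numeric(names(counts)[which.max(counts)]))
}
# Variable names are kept from the original script, although they now hold modes
minStart1=tapply(bedpe$start1,bedpe$name,mode_fun)
maxEnd1=tapply(bedpe$end1,bedpe$name,mode_fun)
minStart2=tapply(bedpe$start2,bedpe$name,mode_fun)
maxEnd2=tapply(bedpe$end2,bedpe$name,mode_fun)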

I tested this code, and it inferred the breakpoints of gene fusions well in my panel NGS data.
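
For fusions whose breakpoints are highly heterogeneous, the most frequent position may be supported by only a small share of the reads. One possible extension, sketched below in base R, returns that share alongside the mode so weakly supported calls can be flagged. The helper `mode_with_support` and the toy values are hypothetical and are not part of FuSeq_WES.

```r
# Hypothetical helper: most frequent value plus the share of reads supporting it.
mode_with_support <- function(x) {
  counts <- table(x)
  top <- which.max(counts)  # first maximum; ties go to the smallest value
  c(position = as.numeric(names(counts)[top]),
    support  = as.numeric(counts[top]) / length(x))
}

# Consistent reads: three of four agree on 1150
res <- mode_with_support(c(1150, 1150, 1150, 1400))
print(res)

# Heterogeneous reads: every position occurs once, so support is low
res_het <- mode_with_support(c(100, 200, 300, 400))
print(res_het)
```

A low support value suggests the mode is not a reliable breakpoint for that fusion, and the read-level data should be inspected instead.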