nghiavtr / FuSeq

FuSeq is failing to produce results on my simulated reads

unique379r opened this issue · comments

Hi

I am trying to run FuSeq on my simulated data, which is merged with some known fusions generated by EricScript.
Read length = 101, depth = ~47M.
However, I get nothing when reaching the FuSeq.R step; it fails with the following error:

Read length 100 fragLengthMedian 201 fragLengthMean 201.157 fragLengthSd 77.7927 Observed Fragments 47733257 Mapped Fragments 42908645 Total hits 110728134 kmer 31
[FuSeq] -- Extracting equivalence class 
[FuSeq] -- Extracting RR fusion equivalence classes 
[FuSeq] -- Extracting RF fusion equivalence classes 
[FuSeq] -- Extracting FF fusion equivalence classes 
[FuSeq] -- Extracting FR fusion equivalence classes 
[FuSeq] -- Extracting UN fusion equivalence classes 
[2019-10-03 14:31:19.737] [jointLog] [info] Finished exporting fusion-equivalence classes

Number of arguments:  6
List of arguments:  in=MixedBEsim1_feqDir txfasta=/home/keshar/FuSeq/ref_files/Homo_sapiens.GRCh38.cdna.all.clean.fa sqlite=/home/keshar/FuSeq/ref_files/Homo_sapiens.GRCh38.94.sqlite txanno=/home/keshar/FuSeq/ref_files/Homo_sapiens.GRCh38.94.txAnno.RData out=MixedBEsim1_FuseqOut params=/home/keshar/FuSeq/FuSeq_v1.1.2_linux_x86-64/R/params.txt
-----
Parameter settings:
 readStrands= UN
 chromRef=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y
 maxSharedCount= 0.05
 onlyProteinCodingGenes= TRUE
 minGeneDist= 1e+05
 minJunctionDist= 1e+05
 maxInvertedFusionCount= 0
 maxMRfusionFc= 2
 maxMRfusionNum= 2
 sgtMRcount= 10
 minMR= 2
 minNonDupMR= 2
 minSR= 1
 minScore= 3
 exonBoundary= TRUE
 keepRData= TRUE
 exportFasta= FALSE
There is no fragmentDist.txt file, your input sequencing data are probably too small. Stop!

Any clues as to why I get nothing? I understand that "fragmentDist.txt" is not generated by FuSeq, but I have no idea why.

Used commands:

## Extract fusion equivalence classes and split reads
$FuSeqbin/FuSeq -i $Ref/TxIndexer_idx_k31 -l IU \
				-1 $R1 \
				-2 $R2 \
				-p 2 -g $Ref/Homo_sapiens.GRCh38.94.gtf \
				-o "$output"_feqDir


## Discover fusion genes
Rscript $FuSeqR/FuSeq.R in="$output"_feqDir \
						txfasta=$Ref/Homo_sapiens.GRCh38.cdna.all.clean.fa \
						sqlite=$Ref/Homo_sapiens.GRCh38.94.sqlite \
						txanno=$Ref/Homo_sapiens.GRCh38.94.txAnno.RData \
						out="$output"_FuseqOut \
						params=$FuSeqR/params.txt

Simulated Input looks like:

grep -A4 '@ENSG00000103253' Mixed.RSEM.Eric.header.R1.fastq | head
@ENSG00000103253----ENSG00000009950_926_1171_0:0:0_0:0:0_0/1/1
CACAGCCTAGGGTTGTGGCCACCCCGCATCCCCCAGATGGTCCTCTGGCCACCAGACGCCTGTAGACAGAGGTTCTAGGAGGGGGTGACAGTCTTGCCAAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@ENSG00000103253----ENSG00000009950_247_459_0:0:0_0:0:0_1/1/1
TGCCAGGCCCTGGGACTCTGAGCGTCCGTGTCTCTCCCCCGCAACCCATCCTCAGCCGGGGCCGTCCAGACAGCAACAAGACCGAGAACCGGCGTATCACA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@ENSG00000103253----ENSG00000009950_470_682_0:0:0_0:0:0_2/1/1
TTGTGCAGCGTACGGGTTCGGACGTAGTCATCAAACATGTCTCGCATCTGGTCAAAACGCTGGTGTGTGATGGGTACCCCTGTGGCGGGCAGCTGCTGCTG

head Mixed.RSEM.Eric.header.R1.fastq
@0_1_14572_181_246-ENSG00000181617/1
TCCAAGATTTCCATGGTTTAGACGTAATTTTCCTATTCCAATACCTGAATCTGCCCCTACAACTCCCCTTCCTAGCGAAAAGTAAACAAGAAGGAAAAGT
+
IJJJIHADDDDDCEEEB<CCCCJJJIJJIJJJJJJJJJJJIGIJJIIIIJIIIEDDDEGJIJIIHAAEFBDDDDDDDDDEDEECC@EEGGEEEA<5:7:C
@10000000_1_6217_3572_117-ENSG00000130559/1
AGTCTGTGCACCGGGAAGAGTCGTGCGGCAACTCCGGCACCAAGTGCTCCTCCACCCTGCAGGGCGAGGTGAGGGGAACAGGTACAGGGGACAGGTACTT
+
###########################@@:::DDDDFHHHIJJJJJIJIGGIIIIIJJIIJJJIIJJJJJIHFF@:B?CDDDCIIEA>IHDDDFFEE???
@10000001_0_9294_12067_170-ENSG00000151474/1
TTCCAGGGTGCAGAAGGGATTCATATTCCCAGAACGCTTTAAGTGTACACCTGCAGGATAAAGAGATACCGGTTACATTATTAAATGATTCTAGGAATTC

I hope the extra 1 nucleotide in the simulated reads compared to Eric's known-fusion reads, plus the header difference (/1/1 vs. /1), creates no problem for FuSeq.

Rupesh.

Hi Rupesh,

The header of the reads is not the problem. Actually, fragmentDist.txt is generated by FuSeq using the fragment-length information of reads mapped to the wild-type transcripts in the data. If the number of reads mapped to wild-type transcripts is not sufficient, FuSeq will not be able to produce fragmentDist.txt. So I guess the problem might be that the simulated data do not have enough mapped reads for wild-type transcripts. This problem is very hard to encounter in real data.
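The dependency described above suggests a simple pre-flight check before running FuSeq.R. A hedged sketch (the function name and file location are assumptions based on the error message above, not documented FuSeq behavior):

```shell
# Check whether step 1 (FuSeq) produced a non-empty fragmentDist.txt in its
# output directory before launching FuSeq.R; if it is missing, FuSeq.R will
# stop with the "no fragmentDist.txt file" error shown above.
check_fragdist() {
    # $1 = the feq output directory from step 1 (e.g. MixedBEsim1_feqDir)
    if [ -s "$1/fragmentDist.txt" ]; then
        echo "ok: $(wc -l < "$1/fragmentDist.txt") fragment-length bins"
    else
        echo "missing: too few reads uniquely mapped to the transcriptome?"
        return 1
    fi
}
```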

Best,
Nghia

Hi Nghia
Thanks for your clarification, but for the same simulated reads FusionCatcher was able to make calls: out of 43 known fusions, FusionCatcher predicted 36, of which 26 were true fusions. So I wonder why your tool was not able to do the same.

So it may be a quasi-mapping problem, compared to reference-based mapping with Bowtie, BWA, etc.
Any input on this?

By the way, what do these stats mean?

Read length 100 fragLengthMedian 201 fragLengthMean 201.157 fragLengthSd 77.7927 Observed Fragments 47733257 Mapped Fragments 42908645 Total hits 110728134 kmer 31

If you like, I can send you the log directory so you can investigate further.

Oh, it is very weird, because fragmentDist.txt is generated from just a simple counting step.
In my opinion, quasi-mapping is not a big problem, because it simply does the mapping. As long as there is a mapped fragment, it will be counted toward fragmentDist.txt.

In your case, there are a lot of mapped fragments, so it should produce the file. I have no idea why the file was not created. I have only seen this error before for a sample with a very small number of reads. If you do not mind sending the simulated data, I will run FuSeq to check this problem.
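For reference, the overall mapping rate implied by the stats line quoted earlier (Observed Fragments vs. Mapped Fragments) can be checked with a one-liner; the interpretation of those two fields as total and mapped read pairs is an assumption from the log wording:

```shell
# Mapping rate = Mapped Fragments / Observed Fragments,
# using the numbers from the stats line quoted above.
awk 'BEGIN { printf "mapping rate = %.2f%%\n", 42908645 / 47733257 * 100 }'
# -> mapping rate = 89.89%
```

A rate near 90% is consistent with "a lot of mapped fragments", which makes the missing fragmentDist.txt all the more surprising at this point in the thread.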

Nghia

Hi Nghia
I don't mind sending you the fastq files, but they are too big to send via Google Drive. If you have a temporary FTP server, please send me the link via email so I can try to upload them; otherwise I have to ask my IT department to create a link, which they are not likely to do.

Hi Rupesh,

Thank you for sending your simulated data. I have run FuSeq on your data with the default settings and got the same result: no fragmentDist.txt file is created.

From the log of the run I saw a warning:

" [jointLog] [warning] Sailfish saw fewer than 10000 uniquely mapped reads so 200 will be used as the mean fragment length and 80 as the standard deviation for effective length correction"

This is the reason why fragmentDist.txt is not created: FuSeq does not find enough confident reads (here, reads uniquely mapped to the transcripts of the transcriptome, not to fusions) to report the fragment distribution to the file.

I don't know your simulation design, but generally, to avoid this, you simply generate more reads from the transcripts. In real data, such reads are usually there naturally.
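The warning quoted above is the telltale sign of this condition, so scanning the step-1 log for it is a quick way to spot affected samples. A hedged sketch (the function name is an assumption; point it at your run's log file):

```shell
# Count occurrences of the Sailfish low-unique-mapping warning in a FuSeq
# step-1 log. The message appears in logs as both "fewer than" and
# "fewer then", hence the character class. grep -c exits non-zero when the
# count is zero, which shell conditionals can use directly.
low_unique_warning() {
    grep -c 'fewer th[ae]n 10000 uniquely mapped reads' "$1"
}
```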

Best,
Nghia

Hi Nghia
Thanks for your explanation, but I was wondering why my simulated reads fail only with FuSeq, while I got results from EricScript and FusionCatcher.

Simulation design:

  1. I used the RSEM simulator to generate reads using real background information from real sequencing data (expression levels, noise, read length, etc.).
  2. Then, on top of that, I used the EricScript simulator to obtain known fusions.
  3. Finally, I merged both read sets according to their expression levels.
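Step 3 of the design above amounts to concatenating the two fastq sets. A minimal sketch, assuming plain (uncompressed) fastq files with placeholder names:

```shell
# Merge the RSEM background reads with the EricScript fusion reads.
# Plain concatenation preserves R1/R2 pairing as long as both mates are
# appended in the same order for each input set.
cat RSEM_R1.fastq Eric_R1.fastq > Mixed_R1.fastq
cat RSEM_R2.fastq Eric_R2.fastq > Mixed_R2.fastq
```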

Hi Rupesh,

I think your simulation design looks fine, but your final simulated data must have some problem. I tested your simulated data by using Salmon to quantify isoform expression against the hg19 annotation. From the log file (see below), the mapping rate is only 2.3562%, so almost 98% of reads are unmapped, which is very unusual. Is this low mapping rate intentional in your design? If not, there must be some issue in your simulated data.

If you still want to use FuSeq for this data, it is not a big deal: I could revise the tool to produce a fragmentDist.txt file for this kind of data (with a very low mapping rate) to overcome this error.

Nghia

Log from running Salmon on the simulated data:

Observed 47733257 total fragments (47733257 in most recent round)

[2019-10-25 14:09:59.781] [jointLog] [info] Thread saw mini-batch with a maximum of 88.36% zero probability fragments
[2019-10-25 14:10:00.033] [jointLog] [info] Computed 127,493 rich equivalence classes for further processing
[2019-10-25 14:10:00.033] [jointLog] [info] Counted 1,124,691 total reads in the equivalence classes
[2019-10-25 14:10:00.083] [jointLog] [warning] Only 1124691 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.

[2019-10-25 14:10:00.083] [jointLog] [info] Mapping rate = 2.3562%
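Pulling that figure out of a Salmon log can be scripted for batches of samples. A hedged sketch (the function name is an assumption; pass the path of the log containing the lines above):

```shell
# Extract the final "Mapping rate" line from a Salmon log. grep -o keeps
# only the matching fragment; tail keeps the last occurrence in case the
# log reports the rate more than once.
mapping_rate() {
    grep -o 'Mapping rate = [0-9.]*%' "$1" | tail -n 1
}
```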

I am getting a similar error for all my patient samples.

Step 1:
FuSeq -i ... .... ... ...

[2021-02-16 12:13:15.970] [jointLog] [info] Gathered fragment lengths from all threads
[2021-02-16 12:13:15.970] [jointLog] [warning] Sailfish saw fewer then 10000 uniquely mapped reads so 200 will be used as the mean fragment length and 80 as the standard deviation for effective length correction
[2021-02-16 12:13:15.973] [jointLog] [info] Estimating effective lengths
[2021-02-16 12:13:16.115] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2021-02-16 12:13:16.115] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2021-02-16 12:13:16.240] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2021-02-16 12:13:16.240] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2021-02-16 12:13:16.368] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2021-02-16 12:13:16.368] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2021-02-16 12:13:16.497] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2021-02-16 12:13:16.497] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2021-02-16 12:13:16.626] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2021-02-16 12:13:16.626] [jointLog] [info] Counted 0 total reads in the equivalence classes
[2021-02-16 12:13:16.756] [jointLog] [info] Computed 0 rich equivalence classes for further processing
[2021-02-16 12:13:16.756] [jointLog] [info] Counted 0 total reads in the equivalence classes

Step 2:
Rscript $FUSEQ_R/FuSeq.R ..... ... .... ....

Warning: The output directory is already existed, old results will be over written

Parameter settings:
readStrands= UN
chromRef=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y
maxSharedCount= 0.05
onlyProteinCodingGenes= TRUE
minGeneDist= 1e+05
minJunctionDist= 1e+05
maxInvertedFusionCount= 0
maxMRfusionFc= 2
maxMRfusionNum= 2
sgtMRcount= 10
minMR= 2
minNonDupMR= 2
minSR= 1
minScore= 3
exonBoundary= TRUE
keepRData= TRUE
exportFasta= FALSE


Processing mapped reads (MR) from dataset: fuseq_test read strands: UN
Read fusion equivalence classes
Reading fuseq_test/feq_UN.txt
The total number of fusion equivalence classes: 0

Processing split reads (SR) from dataset: fuseq_test read strands: UN
Get split reads ...Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
Calls: processSplitRead -> read.csv -> read.table
In addition: Warning messages:
1: In max(feqR$Feq) : no non-missing arguments to max; returning -Inf
2: In min(fragRg) : no non-missing arguments to min; returning Inf
3: In max(fragRg) : no non-missing arguments to max; returning -Inf
Execution halted

Sailfish seems to be causing some issues. Can you please help me resolve them?

Hi @sentisci,

Your log file shows that no reads are mapped to the transcript references. I guess there must be some problem with the reference indexing process. If not, can you send me a sample of your fastq files? I will have a check.
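As a quick triage step, the zero-read passes shown in the step-1 log above can be counted mechanically. A hedged sketch (the function name is an assumption):

```shell
# Count how many equivalence-class passes in a FuSeq step-1 log reported
# zero reads. If every pass is zero, suspect a broken index or a mismatch
# between the reads and the transcript references rather than the data.
zero_read_passes() {
    grep -c 'Counted 0 total reads in the equivalence classes' "$1"
}
```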

Best,
Nghia