Limitations + Problems

Question

Limitations + Problems

sej917 opened this issue 12 years ago · comments

Hi John,

I recently read the Assemblathon 1 paper published in Genome Research and may be interested in using SimSeq for a project I am working on. In the paper, a few limitations and problems of SimSeq are mentioned. Have these been addressed/fixed?

Any information will be greatly appreciated!

Thanks.

John St. John · Answer 1 · Fri May 25 2012 06:36:32 GMT+0800 (China Standard Time)

Hello,
I haven't yet implemented the fix to get around the problem of the reads being sorted by chromosome on the output. Reads are random within chromosomes, but they are sorted between chromosomes. So that issue hasn't been fixed. One way around this would be to shuffle the order of fastq reads after you generate them. I also haven't implemented the improvement to the error profile where each position's error rate is dependent on the error of the previous position. Each position is still sampled independently of previous and subsequent positions, although there is a position specific model. The issue with a reverse error profile being used was fixed in that I provide a correct error model for download on the GitHub page; the incorrect ones are also available for download but are labeled as being specific to the Assemblathon1. I had also started working on adding in a simulation of adapter sequences showing up at the ends of reads when the fragments are too short, but I don't remember how far along I got with that, and I don't think the finished version of that ever made it to GitHub. The version I have for adding adapters to the simulation will do it on PE reads, but not MP reads.
Let me know if that answers your questions.
-John

On May 24, 2012, at 12:36 PM, sej917 wrote:

Hi John,

I recently read the Assemblathon 1 paper published in Genome Research and may be interested in using SimSeq for a project I am working on. In the paper, a few limitations and problems of SimSeq are mentioned. Have these been addressed/fixed?

Any information will be greatly appreciated!

Thanks.

Reply to this email directly or view it on GitHub:
#4

Yifang Tan · Answer 2 · Tue Dec 22 2015 02:19:33 GMT+0800 (China Standard Time)

Hello John!
I want to simulate 250bp read length from MiSeq, but it seems the read_len for PE reads (MP reads too, but I have not tried yet!) simulation can not be longer than 100bp. I changed the def_read_len to 300 in the "Main.java" source code, but did not work. I appreciate if you could let me know how to change the settings/code to implement my case. Thanks a lot!

Yifang

John St. John · Answer 3 · Tue Dec 22 2015 02:28:54 GMT+0800 (China Standard Time)

Ah, you would have to make a new model that supports longer reads as well,
which means you need sequencing data probably aligned to PhiX from the
instrument you want to emulate.
On Mon, Dec 21, 2015 at 10:19 AM Yifang Tan notifications@github.com
wrote:

Hello John!
I want to simulate 250bp read length from MiSeq, but it seems the read_len
for PE reads (MP reads too, but I have not tried yet!) simulation can not
be longer than 100bp. I changed the def_read_len to 300 in the "Main.java"
source code, but did not work. I appreciate if you could let me know how to
change the settings/code to implement my case. Thanks a lot!

Yifang

—
Reply to this email directly or view it on GitHub
#4 (comment).

Yifang Tan · Answer 4 · Tue Dec 22 2015 03:19:47 GMT+0800 (China Standard Time)

Thanks John!
Then, do you have a model for that, or how to make read error files? I had thought it is simple issue for you to modify, isn't it?
I also noticed a file for 250bp profiles there which looks similar to those two error files for 100bp reads. Is there anything to do with it to create error files for 250bp reads? Thanks!

John St. John · Answer 5 · Tue Dec 22 2015 03:24:52 GMT+0800 (China Standard Time)

It has been about 4 years since I last touched this. I think there is a c
program in one of the subdirectories you could use (after compiling with
Make). The 250bp error profile there would be from a few years ago when
quality was really bad at that read length.
On Mon, Dec 21, 2015 at 11:21 AM Yifang Tan notifications@github.com
wrote:

Thanks John!
Then, do you have a model for that, or how to make read error files? I had
thought it is simple issue for you to modify, isn't it?
I also noticed a file for 250bp profiles there which looks similar to
those two error files for 100bp reads. Is there anything to do with it to
create error files for 250bp reads? Thanks!

—
Reply to this email directly or view it on GitHub
#4 (comment).

Yifang Tan · Answer 6 · Tue Dec 22 2015 04:43:42 GMT+0800 (China Standard Time)

Thanks!
The program is located in cUtils subdfolder. In case somebody comes over in the future I put here two hickups I met, 1) was resolved:

undefined reference for log10() / pow() functions at compiling;
Tried putting the $(LDFLAGS) at the end of the line in makefile:
faSize: $(LIBOBJECTS) $(FASIZEO)
$(CC) $(COPTS) $(FASIZEO) $(LIBOBJECTS) -o $@ $(LDFLAGS)
The bug went away.
Segment fault when I run:
$ samtools view -S ../sample_data_20151221/S01-aln-pe.sam | ./getErrorProfile -ref=../sample_data_20151221/S01_Scaffolds_gt1kb.fasta -readLen=251 -phred33 >error_profile_251bp.txt

[samopen] SAM header is present: 1 sequences.
Segmentation fault

Any help would be appreciated! Yifang

John St. John · Answer 7 · Tue Dec 22 2015 05:19:28 GMT+0800 (China Standard Time)

Is the seg fault in my prog or santools?
On Mon, Dec 21, 2015 at 12:43 PM Yifang Tan notifications@github.com
wrote:

Thanks!
The program is located in cUtils subdfolder. In case somebody comes over
in the future I put here two hickups I met, 1) was resolved:

undefined reference for log10() / pow() functions at compiling;
Tried putting the $(LDFLAGS) at the end of the line in makefile:
faSize: $(LIBOBJECTS) $(FASIZEO)
$(CC) $(COPTS) $(FASIZEO) $(LIBOBJECTS) -o $@ $(LDFLAGS)
The bug went away.

Segment fault when I run:
$ samtools view -S ../sample_data_20151221/S01-aln-pe.sam |
./getErrorProfile -ref=../sample_data_20151221/S01_Scaffolds_gt1kb.fasta
-readLen=251 -phred33 >error_profile_251bp.txt

[samopen] SAM header is present: 1 sequences.
Segmentation fault

Any help would be appreciated! Yifang

—
Reply to this email directly or view it on GitHub
#4 (comment).

Yifang Tan · Answer 8 · Tue Dec 22 2015 05:41:18 GMT+0800 (China Standard Time)

I tried to view the sam file with samtools, which was fiine. So, I think it is with your program.
I tried two ways to feed the sam file:

by stdin:
samtools view -S ./S01-aln-pe.sam | ./getErrorProfile -ref=./S01_Scaffolds_gt1kb.fasta -readLen=251 -phred64 >error_profile_251bp.txt

[samopen] SAM header is present: 1 sequences.
Segmentation fault

by explicit sam file name (headless sam file)
./getErrorProfile -ref=./S01_Scaffolds_gt1kb.fasta -readLen=251 -phred33 -sam=./S01-aln-no-header-pe.sam
Segmentation fault

Did I miss anything?

John St. John · Answer 9 · Tue Dec 22 2015 06:22:37 GMT+0800 (China Standard Time)

Nope. You'll have to do GDB to find out which line the program is crashing
on. Ever used gdb before? I am away from a computer for a few weeks so
unfortunately I can't do that for you.
On Mon, Dec 21, 2015 at 1:41 PM Yifang Tan notifications@github.com wrote:

I tried to view the sam file with samtools, which was fiine. So, I think
it is with your program.
I tried two ways to feed the sam file:

by stdiin:
samtools view -S ./S01-aln-pe.sam | ./getErrorProfile
-ref=./S01_Scaffolds_gt1kb.fasta -readLen=251 -phred64

error_profile_251bp.txt

[samopen] SAM header is present: 1 sequences.
Segmentation fault

by explicit sam file name (headless sam file)
./getErrorProfile -ref=./S01_Scaffolds_gt1kb.fasta -readLen=251 -phred33
-sam=./S01-aln-no-header-pe.sam
Segmentation fault

Did I miss anything?

—
Reply to this email directly or view it on GitHub
#4 (comment).

Yifang Tan · Answer 10 · Tue Dec 22 2015 06:26:37 GMT+0800 (China Standard Time)

I'll try to figure that out, or come back to you later.
Thanks for your program and have great holidays!

UPDATE:
dbg indicated the bug is on line 229: Program received signal SIGSEGV, Segmentation fault. 0x00000000004038a3 in getErrorProfile () at getErrorProfile.c:229 if(mutation[i][pscore][0][0] < pseudo){ //add in pseudocounts for everything other than N, haven't done this already
Maybe this will give you some clue to help me out when you are back. Thanks a lot!