jstjohn / SimSeq

An illumina paired-end and mate-pair short read simulator. This project attempts to model as many of the quirks that exist in Illumina data as possible. Some of these quirks include the potential for chimeric reads, and non-biotinylated fragment pull down in mate-pair libraries . Additionally the program provides the ability to model both site and base specific error, and scripts are provided to train this error model on real datasets. My hope in creating this program is to generate as realistic data as possible to assist in assessing the accuracy of genome assembly tools.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Limitations + Problems

sej917 opened this issue · comments

Hi John,

I recently read the Assemblathon 1 paper published in Genome Research and may be interested in using SimSeq for a project I am working on. In the paper, a few limitations and problems of SimSeq are mentioned. Have these been addressed/fixed?

Any information will be greatly appreciated!

Thanks.

Hello,
I haven't yet implemented the fix to get around the problem of the reads being sorted by chromosome on the output. Reads are random within chromosomes, but they are sorted between chromosomes. So that issue hasn't been fixed. One way around this would be to shuffle the order of fastq reads after you generate them. I also haven't implemented the improvement to the error profile where each position's error rate is dependent on the error of the previous position. Each position is still sampled independently of previous and subsequent positions, although there is a position specific model. The issue with a reverse error profile being used was fixed in that I provide a correct error model for download on the GitHub page; the incorrect ones are also available for download but are labeled as being specific to the Assemblathon1. I had also started working on adding in a simulation of adapter sequences showing up at the ends of reads when the fragments are too short, but I don't remember how far along I got with that, and I don't think the finished version of that ever made it to GitHub. The version I have for adding adapters to the simulation will do it on PE reads, but not MP reads.
Let me know if that answers your questions.
-John

On May 24, 2012, at 12:36 PM, sej917 wrote:

Hi John,

I recently read the Assemblathon 1 paper published in Genome Research and may be interested in using SimSeq for a project I am working on. In the paper, a few limitations and problems of SimSeq are mentioned. Have these been addressed/fixed?

Any information will be greatly appreciated!

Thanks.


Reply to this email directly or view it on GitHub:
#4

Hello John!
I want to simulate 250bp read length from MiSeq, but it seems the read_len for PE reads (MP reads too, but I have not tried yet!) simulation can not be longer than 100bp. I changed the def_read_len to 300 in the "Main.java" source code, but did not work. I appreciate if you could let me know how to change the settings/code to implement my case. Thanks a lot!

Yifang

Ah, you would have to make a new model that supports longer reads as well,
which means you need sequencing data probably aligned to PhiX from the
instrument you want to emulate.
On Mon, Dec 21, 2015 at 10:19 AM Yifang Tan notifications@github.com
wrote:

Hello John!
I want to simulate 250bp read length from MiSeq, but it seems the read_len
for PE reads (MP reads too, but I have not tried yet!) simulation can not
be longer than 100bp. I changed the def_read_len to 300 in the "Main.java"
source code, but did not work. I appreciate if you could let me know how to
change the settings/code to implement my case. Thanks a lot!

Yifang


Reply to this email directly or view it on GitHub
#4 (comment).

Thanks John!
Then, do you have a model for that, or how to make read error files? I had thought it is simple issue for you to modify, isn't it?
I also noticed a file for 250bp profiles there which looks similar to those two error files for 100bp reads. Is there anything to do with it to create error files for 250bp reads? Thanks!

It has been about 4 years since I last touched this. I think there is a c
program in one of the subdirectories you could use (after compiling with
Make). The 250bp error profile there would be from a few years ago when
quality was really bad at that read length.
On Mon, Dec 21, 2015 at 11:21 AM Yifang Tan notifications@github.com
wrote:

Thanks John!
Then, do you have a model for that, or how to make read error files? I had
thought it is simple issue for you to modify, isn't it?
I also noticed a file for 250bp profiles there which looks similar to
those two error files for 100bp reads. Is there anything to do with it to
create error files for 250bp reads? Thanks!


Reply to this email directly or view it on GitHub
#4 (comment).

Thanks!
The program is located in cUtils subdfolder. In case somebody comes over in the future I put here two hickups I met, 1) was resolved:

  1. undefined reference for log10() / pow() functions at compiling;
    Tried putting the $(LDFLAGS) at the end of the line in makefile:
    faSize: $(LIBOBJECTS) $(FASIZEO)
    $(CC) $(COPTS) $(FASIZEO) $(LIBOBJECTS) -o $@ $(LDFLAGS)
    The bug went away.
  2. Segment fault when I run:
    $ samtools view -S ../sample_data_20151221/S01-aln-pe.sam | ./getErrorProfile -ref=../sample_data_20151221/S01_Scaffolds_gt1kb.fasta -readLen=251 -phred33 >error_profile_251bp.txt

[samopen] SAM header is present: 1 sequences.
Segmentation fault

Any help would be appreciated! Yifang

Is the seg fault in my prog or santools?
On Mon, Dec 21, 2015 at 12:43 PM Yifang Tan notifications@github.com
wrote:

Thanks!
The program is located in cUtils subdfolder. In case somebody comes over
in the future I put here two hickups I met, 1) was resolved:

  1. undefined reference for log10() / pow() functions at compiling;
    Tried putting the $(LDFLAGS) at the end of the line in makefile:
    faSize: $(LIBOBJECTS) $(FASIZEO)
    $(CC) $(COPTS) $(FASIZEO) $(LIBOBJECTS) -o $@ $(LDFLAGS)
    The bug went away.
  2. Segment fault when I run:
    $ samtools view -S ../sample_data_20151221/S01-aln-pe.sam |
    ./getErrorProfile -ref=../sample_data_20151221/S01_Scaffolds_gt1kb.fasta
    -readLen=251 -phred33 >error_profile_251bp.txt

[samopen] SAM header is present: 1 sequences.
Segmentation fault

Any help would be appreciated! Yifang


Reply to this email directly or view it on GitHub
#4 (comment).

I tried to view the sam file with samtools, which was fiine. So, I think it is with your program.
I tried two ways to feed the sam file:

  1. by stdin:
    samtools view -S ./S01-aln-pe.sam | ./getErrorProfile -ref=./S01_Scaffolds_gt1kb.fasta -readLen=251 -phred64 >error_profile_251bp.txt

[samopen] SAM header is present: 1 sequences.
Segmentation fault

  1. by explicit sam file name (headless sam file)
    ./getErrorProfile -ref=./S01_Scaffolds_gt1kb.fasta -readLen=251 -phred33 -sam=./S01-aln-no-header-pe.sam
    Segmentation fault

Did I miss anything?

Nope. You'll have to do GDB to find out which line the program is crashing
on. Ever used gdb before? I am away from a computer for a few weeks so
unfortunately I can't do that for you.
On Mon, Dec 21, 2015 at 1:41 PM Yifang Tan notifications@github.com wrote:

I tried to view the sam file with samtools, which was fiine. So, I think
it is with your program.
I tried two ways to feed the sam file:

  1. by stdiin:
    samtools view -S ./S01-aln-pe.sam | ./getErrorProfile
    -ref=./S01_Scaffolds_gt1kb.fasta -readLen=251 -phred64

error_profile_251bp.txt

[samopen] SAM header is present: 1 sequences.
Segmentation fault

  1. by explicit sam file name (headless sam file)
    ./getErrorProfile -ref=./S01_Scaffolds_gt1kb.fasta -readLen=251 -phred33
    -sam=./S01-aln-no-header-pe.sam
    Segmentation fault

Did I miss anything?


Reply to this email directly or view it on GitHub
#4 (comment).

I'll try to figure that out, or come back to you later.
Thanks for your program and have great holidays!

UPDATE:
dbg indicated the bug is on line 229:
Program received signal SIGSEGV, Segmentation fault.
0x00000000004038a3 in getErrorProfile () at getErrorProfile.c:229
if(mutation[i][pscore][0][0] < pseudo){ //add in pseudocounts for everything other than N, haven't done this already

Maybe this will give you some clue to help me out when you are back. Thanks a lot!