jstjohn / SimSeq

An Illumina paired-end and mate-pair short read simulator. This project attempts to model as many of the quirks that exist in Illumina data as possible. Some of these quirks include the potential for chimeric reads and non-biotinylated fragment pull-down in mate-pair libraries. Additionally, the program provides the ability to model both site- and base-specific error, and scripts are provided to train this error model on real datasets. My hope in creating this program is to generate data that is as realistic as possible, to assist in assessing the accuracy of genome assembly tools.

Quality Error Spectrum

sari-khaleel opened this issue:

Hi Again John,
I am trying to manipulate the error rates of simulated errors. I see that you've done that for the SimSeqNBProject using error_profile.txt (https://github.com/jstjohn/SimSeq/blob/master/SimSeqNBProject/test/error_profile.txt).

Can you briefly explain:

  1. How to use error_profile.txt with SimSeq
  2. How the "Mutation Spectrum" and "Quality Histograms" rows work in error_profile.txt

Thanks!
Sari

Hey Sari,
See the /examples directory. The SimSeqNBProject/test directory is old and unused. Now (and also for the Assemblathon) the error profiles look like this:
https://github.com/jstjohn/SimSeq/blob/master/examples/hiseq_mito_default_bwa_mapping_mq10_1.txt

This is the program you can use to make new error profiles from a SAM alignment that you want to model empirically:
https://github.com/jstjohn/SimSeq/blob/master/cUtils/getErrorProfile.c

The first line in the output error profile (you can see it in the example I provided) gives a little information about what the columns mean. Basically, the first column is the read position, the second column is a phred score at that read position, and the remaining columns show the observations of each base in a read conditional on the underlying reference base (i.e. G->A means reference = G and read = A).
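As a rough illustration of that layout, here is a minimal Python sketch that reads such a profile into a dictionary. It assumes tab-separated columns in the order Pos, Phred, then the 20 Reference->Read counts (A->A through T->N), integer counts, and header lines that start with "#". The name parse_error_profile and the sample row are made up for the example and are not part of SimSeq.

```python
from typing import Dict, Iterable, Tuple

REF_BASES = "ACGT"
READ_BASES = "ACGTN"
# Assumed column order: A->A, A->C, A->G, A->T, A->N, C->A, ... T->N
COLUMNS = [f"{ref}->{read}" for ref in REF_BASES for read in READ_BASES]


def parse_error_profile(lines: Iterable[str]) -> Dict[Tuple[int, int], Dict[str, int]]:
    """Return {(position, phred): {"G->A": count, ...}} from profile lines."""
    profile: Dict[Tuple[int, int], Dict[str, int]] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split("\t")
        if len(fields) != 2 + len(COLUMNS):
            raise ValueError(f"expected {2 + len(COLUMNS)} columns, got {len(fields)}")
        position, phred = int(fields[0]), int(fields[1])
        counts = [int(value) for value in fields[2:]]
        profile[(position, phred)] = dict(zip(COLUMNS, counts))
    return profile


# Made-up row: 100 observations for each match, 1 for each mismatch.
demo_counts = ["100" if label[0] == label[-1] else "1" for label in COLUMNS]
demo_lines = [
    "#Pos\tPhred\t" + "\t".join(COLUMNS) + "\n",
    "0\t30\t" + "\t".join(demo_counts) + "\n",
]
demo_profile = parse_error_profile(demo_lines)
print(demo_profile[(0, 30)]["G->A"])  # reference G observed as A
print(demo_profile[(0, 30)]["G->G"])  # reference G observed as G
```

Keying on (position, phred) mirrors the fact that one read position can have several rows, one per phred score.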

If you want to set these by hand, there are some things to keep in mind. First off, the sum of the observations for a given phred score at a given read position defines the probability that that phred score is chosen at that position, relative to the totals of the other phred scores at the same position. For example, if you decided to distribute 10k observations among the substitutions, and did that for each phred score, then each phred score would be sampled with equal probability because they all have 10k observations.
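To make the weighting concrete, here is a small Python sketch of how row totals could translate into phred sampling probabilities at each position. It is only an illustration of the rule described above, not SimSeq's actual sampling code; the function names and the counts are invented for the example.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, int, List[int]]  # (position, phred, observation counts)


def phred_probabilities(rows: Iterable[Row]) -> Dict[int, Dict[int, float]]:
    """Per position, weight each phred score by its total observation count."""
    totals: Dict[int, Dict[int, int]] = defaultdict(dict)
    for position, phred, counts in rows:
        totals[position][phred] = totals[position].get(phred, 0) + sum(counts)
    probabilities: Dict[int, Dict[int, float]] = {}
    for position, by_phred in totals.items():
        grand_total = sum(by_phred.values())
        if grand_total == 0:
            raise ValueError(f"no observations at position {position}")
        probabilities[position] = {q: n / grand_total for q, n in by_phred.items()}
    return probabilities


def sample_phred(weights: Dict[int, float], rng: random.Random) -> int:
    """Draw one phred score according to the given weights."""
    scores = list(weights)
    return rng.choices(scores, weights=[weights[q] for q in scores], k=1)[0]


# 20 count columns per row; 10k observations spread evenly is 500 per column.
even = [500] * 20
rows = [
    (0, 10, even), (0, 20, even), (0, 40, even),  # equal totals at position 0
    (1, 20, [1500] * 20), (1, 40, even),          # 30k versus 10k at position 1
]
probs = phred_probabilities(rows)
print(probs[0])  # three equally likely phred scores
print(probs[1])  # phred 20 is three times as likely as phred 40
print(sample_phred(probs[1], random.Random(0)))
```

At position 0 every phred score has the same total, so each is equally likely; at position 1 the 30k row dominates the 10k row.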

The other thing to keep in mind is that you can't simulate read lengths longer than your model, but you can simulate shorter ones.
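A quick way to check this before simulating is to count the distinct read positions in the profile. The Python sketch below assumes the position column is contiguous (no gaps) and uses hypothetical helper names; it does not depend on whether positions start at 0 or 1.

```python
from typing import Iterable, List


def max_modeled_read_length(positions: Iterable[int]) -> int:
    """Number of distinct read positions present in the profile."""
    return len(set(positions))


def check_read_length(requested: int, positions: List[int]) -> None:
    """Raise if the requested read length is longer than the model covers."""
    limit = max_modeled_read_length(positions)
    if requested > limit:
        raise ValueError(
            f"requested read length {requested} exceeds the {limit} positions in the profile"
        )


# Position column of a made-up profile with several phred rows per position.
position_column = [0, 0, 0, 1, 1, 2, 2, 2]
check_read_length(2, position_column)  # shorter than the model: accepted
check_read_length(3, position_column)  # same length as the model: accepted
try:
    check_read_length(4, position_column)  # longer than the model: rejected
except ValueError as error:
    print(error)
```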

Let me know if this answers your questions!
-John

(In reply to sari-khaleel's message of Oct 8, 2011, 9:50 PM.)

I have a related question. I'm not totally clear on the correct way to modify the errorProfiles that you provided, so I was hoping to create my own. I used the getErrorProfile command that you specified, but I keep getting a segmentation fault on running and I'm not sure why. I can't find much documentation for that command, so I'm not sure if I'm running something wrong or if I need to modify the file I'm using to create the error file.

Any insight on this? How do you use the getErrorProfile command?

Thanks,
-Jeremy Hyrkas

Hey Jeremy. I have been really busy over the past few days.

The header line of the error profile looks like this:

"#Pos\tPhred\tA->A\tA->C\tA->G\tA->T\tA->N\tC->A\tC->C\tC->G\tC->T\tC->N\tG->A\tG->C\tG->G\tG->T\tG->N\tT->A\tT->C\tT->G\tT->T\tT->N\n"

So the error profile actually defines observations in a dataset (although you can fake these to make them more extreme). The first two columns are the position in the read and the phred score at that position. There can be multiple phred scores for each position, so the same position value often appears on several rows. The rest of the columns give the number of times you see each case of Reference_Base->Observed_Base.
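One way to sanity-check a row, whether empirical or hand-edited, is to compare its observed mismatch fraction with the error probability implied by its phred score (P = 10^(-Q/10)). The Python sketch below does that for a single row; the counts are made up and restricted to reference A for brevity, and treating N as an error is an assumption of this example.

```python
from typing import Dict


def empirical_error_rate(counts: Dict[str, int]) -> float:
    """Fraction of observations where the read base differs from the reference."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("row has no observations")
    # Labels look like "A->C": first character is the reference, last is the read.
    mismatches = sum(n for label, n in counts.items() if label[0] != label[-1])
    return mismatches / total


def phred_error_probability(phred: int) -> float:
    """Error probability implied by a phred score: 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


# Made-up counts for one (position, phred = 20) row.
row = {"A->A": 990, "A->C": 4, "A->G": 3, "A->T": 2, "A->N": 1}
print(empirical_error_rate(row))
print(phred_error_probability(20))
```

If the two numbers diverge a lot after hand-editing, the simulated qualities will no longer reflect the simulated error rate, which may or may not be what you want.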

So if you run getErrorProfile without arguments, it should tell you which arguments it expects. The important thing is that the SAM input you give it doesn't have any header information (the lines that start with "@" at the beginning of a SAM file). To get that, either output a SAM file without a header, or pipe in the output of samtools view:

samtools view something.bam | getErrorProfile [required opts] > output.errorProfile
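If you already have a SAM file with a header and would rather not go back through samtools, dropping the "@" lines is easy to do yourself. Here is a minimal Python sketch; strip_sam_header is a hypothetical helper and the SAM records are invented for the example.

```python
from typing import Iterable, Iterator, List


def strip_sam_header(lines: Iterable[str]) -> Iterator[str]:
    """Yield only alignment lines, skipping header lines that start with '@'."""
    for line in lines:
        if not line.startswith("@"):
            yield line


# Made-up SAM content: two header lines followed by one alignment record.
sam_lines: List[str] = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "@SQ\tSN:chrM\tLN:16571\n",
    "read1\t0\tchrM\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
]
body = list(strip_sam_header(sam_lines))
print(len(body))
print(body[0].split("\t")[0])
```

The same function works as a stream filter if you pass it sys.stdin and write the result to sys.stdout. Note that samtools view only prints the header when -h is given, so the pipe shown above should already be header-free.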

Again, though, getErrorProfile will just give you the empirical error profile of a given BAM alignment. If you want to increase the error rate, you need to go through the error profile and add observation counts to the types of error you want to see more of in your output file, for a given reference base, position, and phred score.
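As a sketch of what that editing could look like, the Python function below multiplies the mismatch counts of one row by a factor and leaves the match counts alone. The function name, the factor, and the counts are all made up for illustration; leaving N observations untouched by default is a choice of this example, not something SimSeq prescribes.

```python
from typing import Dict


def inflate_errors(counts: Dict[str, int], factor: float, include_n: bool = False) -> Dict[str, int]:
    """Scale mismatch observations by `factor`; match observations are unchanged."""
    inflated: Dict[str, int] = {}
    for label, n in counts.items():
        ref, read = label[0], label[-1]  # labels look like "G->A"
        is_error = ref != read and (include_n or read != "N")
        inflated[label] = round(n * factor) if is_error else n
    return inflated


# Made-up counts for reference G at one (position, phred) row.
row = {"G->A": 4, "G->C": 2, "G->G": 990, "G->T": 3, "G->N": 1}
print(inflate_errors(row, 5))
print(inflate_errors(row, 5, include_n=True))
```

Two caveats: scaling cannot create an error type whose count is zero (you would have to add counts for it directly), and adding observations raises the row total, which also makes that phred score more likely to be picked at that position.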

Hope this helps!