Quality Error Spectrum

Question

sari-khaleel opened this issue 13 years ago

Hi Again John,
I am trying to manipulate the error rates of simulated errors. I see that you've done that for the SimSeqNBProject using error_profile.txt (https://github.com/jstjohn/SimSeq/blob/master/SimSeqNBProject/test/error_profile.txt).

Can you briefly explain:

How to use error_profile.txt with SimSeq
How the "Mutation Spectrum" and "Quality Histograms" rows work in error_profile.txt

Thanks!
Sari

John St. John · Answer 1 · Sun Oct 09 2011 13:08:08 GMT+0800 (China Standard Time)

Hey Sari,
See the /examples directory. The SimSeqNBProject/test directory is old and unused. Now (and also for the Assemblathon) the error profiles look like this:
https://github.com/jstjohn/SimSeq/blob/master/examples/hiseq_mito_default_bwa_mapping_mq10_1.txt

This is the program that you can use to make new error profiles given a sam alignment you want to empirically model:
https://github.com/jstjohn/SimSeq/blob/master/cUtils/getErrorProfile.c

The first line in the output error profile (you can see it in the example I provided) gives a little information about what the columns mean. Basically, the first column is the read position. The second column is a phred score at that read position, and the remaining columns hold the observation counts of each read base conditional on the underlying reference base (e.g. G->A means reference = G and read = A).
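To make the column layout concrete, here is a minimal Python sketch of reading such a profile. It assumes the data rows are tab-separated integers in the same order as the header quoted later in this thread; the function name parse_profile and the example row are made up for illustration and are not part of SimSeq.

```python
# Column names follow the header line quoted in this thread:
# A->A, A->C, A->G, A->T, A->N, C->A, ... T->N (reference -> read base).
REF_BASES = "ACGT"
READ_BASES = "ACGTN"
COLUMNS = [f"{ref}->{obs}" for ref in REF_BASES for obs in READ_BASES]


def parse_profile(text):
    """Return a list of (position, phred, counts) tuples.

    counts is a dict keyed by 'Ref->Obs'. Assumes tab-separated integer
    fields, which is an assumption about the file format.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split("\t")
        pos, phred = int(fields[0]), int(fields[1])
        counts = dict(zip(COLUMNS, map(int, fields[2:])))
        rows.append((pos, phred, counts))
    return rows


# Made-up example: one header line and one data row.
example = "\n".join([
    "#Pos\tPhred\t" + "\t".join(COLUMNS),
    "0\t30\t" + "\t".join(["100", "1", "2", "1", "0"] + ["0"] * 15),
])
rows = parse_profile(example)
pos, phred, counts = rows[0]
print(pos, phred, counts["A->A"], counts["A->G"])
```

Keying the counts by "Ref->Obs" keeps later edits readable, since you can address a substitution by name rather than by column index.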

If you want to set these by hand, there are some things to keep in mind. First off, the sum of the observations for a given phred score at a given read position defines the probability that this phred score is chosen at that position. For example, if you decided to distribute 10k observations among the substitutions, and did that for each phred score, then each phred score would be sampled with equal probability, because they all have 10k observations.
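The arithmetic behind that rule can be sketched as follows. This is only my reading of the description above, not code taken from SimSeq, and the counts are hypothetical: the probability of a phred score at a position is its row total divided by the total over all rows at that position.

```python
def phred_probabilities(rows_at_position):
    """Map each phred score to its sampling probability at one read position.

    rows_at_position: dict of phred score -> list of observation counts
    (the substitution columns of that row).
    """
    totals = {phred: sum(counts) for phred, counts in rows_at_position.items()}
    grand_total = sum(totals.values())
    return {phred: total / grand_total for phred, total in totals.items()}


# Three phred scores with 10,000 observations each: equal probability.
equal = {2: [2500, 2500, 2500, 2500], 20: [5000, 5000], 40: [10000]}
print(phred_probabilities(equal))

# Unequal row totals: phred 40 is sampled three times as often as phred 20.
skewed = {20: [600, 400], 40: [2900, 100]}
print(phred_probabilities(skewed))
```

Note that how the 10k observations are split among the substitution columns does not affect which phred score is chosen; only the row total does.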

The other thing to keep in mind is that you can't simulate read lengths longer than your model, but you can simulate shorter ones.
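If you edit profiles by hand, a quick sanity check along these lines may help. It is a sketch that assumes every read position covered by the model appears at least once in the first column; the function names are hypothetical.

```python
def model_read_length(positions):
    """Number of distinct read positions listed in the profile's first column."""
    return len(set(positions))


def check_read_length(requested, positions):
    """Raise ValueError if the requested read length exceeds the model."""
    max_len = model_read_length(positions)
    if requested > max_len:
        raise ValueError(
            f"requested read length {requested} exceeds model length {max_len}"
        )
    return requested


# Made-up first column: positions repeat, once per phred score.
positions = [0, 0, 1, 1, 1, 2]
print(model_read_length(positions))
print(check_read_length(2, positions))
```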

Let me know if this answers your questions!
-John

Jeremy Hyrkas · Answer 2 · Sat Oct 29 2011 15:44:55 GMT+0800 (China Standard Time)

I have a related question. I'm not totally clear on the correct way to modify the errorProfiles that you provided, so I was hoping to create my own. I used the getErrorProfile command that you specified, but I keep getting a segmentation fault on running and I'm not sure why. I can't find much documentation for that command, so I'm not sure if I'm running something wrong or if I need to modify the file I'm using to create the error file.

Any insight on this? How do you use the getErrorProfile command?

Thanks,
-Jeremy Hyrkas

John St. John · Answer 3 · Sun Oct 30 2011 00:07:47 GMT+0800 (China Standard Time)

Hey Jeremy. I have been really busy over the past few days. For reference, this is the header line of an error profile:

"#Pos\tPhred\tA->A\tA->C\tA->G\tA->T\tA->N\tC->A\tC->C\tC->G\tC->T\tC->N\tG->A\tG->C\tG->G\tG->T\tG->N\tT->A\tT->C\tT->G\tT->T\tT->N\n"

So the error profile actually records observations in a dataset (although you can fake these to make them more extreme). The first two columns are the position in the read and the phred score at that position. There can be multiple phred scores for each position, so the same position often appears on several consecutive rows. The rest of the columns hold the number of times you see each case of Reference_Base->Observed_Base.

If you run getErrorProfile without arguments, it should tell you which arguments it expects. The important thing is that the SAM input you give it doesn't contain any header information (the lines that start with "@" at the beginning of a SAM file). To get that, either write out a SAM file without a header, or pipe in the output of samtools view, like this: samtools view something.bam | getErrorProfile [required opts] > output.errorProfile
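If you already have a SAM file with a header and would rather not go through samtools, a small filter like the following could drop the header lines first. This is an illustrative sketch; strip_sam_header is a hypothetical helper, not part of SimSeq or samtools, and the SAM records are made up.

```python
import io


def strip_sam_header(lines):
    """Yield only alignment lines, skipping SAM header lines that start with '@'."""
    for line in lines:
        if not line.startswith("@"):
            yield line


# Made-up SAM content: two header lines followed by one alignment record.
sam = io.StringIO(
    "@HD\tVN:1.0\n"
    "@SQ\tSN:chrM\tLN:16571\n"
    "read1\t0\tchrM\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
)
body = list(strip_sam_header(sam))
print(len(body))
```

In a real script you would read from a file or sys.stdin and write the surviving lines to sys.stdout, so the result can be piped into getErrorProfile.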

Again, though, getErrorProfile will just give you the empirical error profile of a given BAM alignment. If you want to increase the error rate, you need to go through the error profile and add observation counts to the types of error you want to see more of in your simulated output, for a given reference base, position, and phred score.
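One way to do that in bulk is sketched below. It assumes a row's counts have been loaded into a dict keyed by "Ref->Obs"; boost_errors is a hypothetical helper, and multiplying every mismatch column by a constant is just one possible way of "adding observations", not something SimSeq prescribes.

```python
def boost_errors(counts, factor):
    """Return a copy of counts with every mismatch column multiplied by factor.

    Columns where the read base equals the reference base (e.g. 'A->A')
    are left unchanged; everything else, including 'X->N', is scaled.
    """
    boosted = {}
    for key, n in counts.items():
        ref, obs = key.split("->")
        boosted[key] = n if ref == obs else n * factor
    return boosted


# Made-up counts for reference base A at one position and phred score.
row = {"A->A": 990, "A->C": 4, "A->G": 4, "A->T": 2, "A->N": 0}
boosted = boost_errors(row, 5)
print(boosted)
```

Keep in mind that this also raises the row total, which, per the earlier answer, makes that phred score more likely to be sampled at that position. If you want to leave the phred distribution alone, rescale the row back to its original total afterwards.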

Hope this helps!