What happens when there is high overlap ?


xapple opened this issue 8 years ago · comments

Lucas Sinclair commented 8 years ago

I was wondering what happens when, instead of the usual case:

we have high overlap that extends past each of the primers. What does PANDAseq do?

I have a dataset like that, where the distance between the two primers is shorter than the read length.

Andre Masella · Answer 1 · Sun May 15 2016 04:38:14 GMT+0800 (China Standard Time)

This has been fixed. It will reconstruct the entire sequence, but stop at the forward and reverse primers.

Lucas Sinclair · Answer 2 · Sun May 15 2016 04:39:06 GMT+0800 (China Standard Time)

OK, that's good to know. Is it fixed in the master branch, or is version 2.8 already OK?

Andre Masella · Answer 3 · Sun May 15 2016 10:26:24 GMT+0800 (China Standard Time)

It's been fixed in 2.8 and later.

Lucas Sinclair · Answer 4 · Sun May 15 2016 19:41:17 GMT+0800 (China Standard Time)

Thanks, good to know.

Chang Y · Answer 5 · Sun May 15 2016 20:19:08 GMT+0800 (China Standard Time)

This is called "readthough". The tool Trimmomatic handles it well.

The latest version of PANDAseq generates only one third as much output as Trimmomatic.

Lucas Sinclair · Answer 6 · Sun May 15 2016 21:12:02 GMT+0800 (China Standard Time)

So you mean you have to activate a special option on the command line ?

Lucas Sinclair · Answer 7 · Mon May 16 2016 03:19:26 GMT+0800 (China Standard Time)

I didn't find the word "readthough" in the pandaseq documentation.

Andre Masella · Answer 8 · Mon May 16 2016 05:55:41 GMT+0800 (China Standard Time)

PANDAseq doesn't require any command line options to deal with this. I believe @yech1990 is referring to the “Trimmomatic” software package.

Lucas Sinclair · Answer 9 · Mon May 16 2016 06:01:26 GMT+0800 (China Standard Time)

I just tried it on one sample of my dataset. The primers are 967F and 1046R, so the peak should be around 120. But look at the output from PANDAseq: it seems to find a much shorter overlap optimal in most cases:

assembled_len_dist.pdf

Lucas Sinclair · Answer 10 · Mon May 16 2016 06:02:09 GMT+0800 (China Standard Time)

Ah, sorry, GitHub only supports displaying PNG, not PDF:

Lucas Sinclair · Answer 11 · Mon May 16 2016 06:03:52 GMT+0800 (China Standard Time)

The reads are 111 base pairs each, so a final sequence length of 220 base pairs means that PANDAseq chose an overlap of only 2 base pairs...
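
The arithmetic behind that estimate can be sketched as follows. This is only an illustration, assuming both reads have the same length and that no bases are trimmed; the function name `implied_overlap` is hypothetical and is not part of PANDAseq. It also covers the case from the original question, where the fragment is shorter than a read and both reads therefore cover it entirely.

```python
def implied_overlap(read_len: int, assembled_len: int) -> int:
    """Overlap (in bases) implied when two reads of read_len merge into assembled_len.

    Assumption: both reads have the same length and nothing is trimmed.
    """
    if read_len <= 0 or assembled_len <= 0:
        raise ValueError("lengths must be positive")
    if assembled_len > 2 * read_len:
        raise ValueError("assembled sequence is longer than both reads combined")
    if assembled_len < read_len:
        # Fragment shorter than a read: each read runs past the end of the
        # fragment, so the overlapping region is the whole fragment.
        return assembled_len
    # Usual case: the bases counted twice are the overlap.
    return 2 * read_len - assembled_len


if __name__ == "__main__":
    for length in (220, 120, 100):
        print(length, implied_overlap(111, length))
```

With 111 bp reads, an assembled length of 220 implies an overlap of 2, whereas the expected length of about 120 would imply an overlap of 102.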

Andre Masella · Answer 12 · Mon May 16 2016 06:04:26 GMT+0800 (China Standard Time)

That might be true anyway if the sequences are repetitive. See if increasing the k-mer table to 4 (-k 4) improves the situation or try increasing the minimum overlap (-o 20).
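
For anyone scripting this, a minimal sketch of how the two suggested options could be added to a PANDAseq call is shown below. The helper `pandaseq_command` and its parameter names are hypothetical; `-k` and `-o` are the options named in the answer above, and `-f` and `-r` are assumed here to be the forward and reverse FASTQ inputs. The helper only builds the argument list and does not run anything.

```python
from typing import List, Optional


def pandaseq_command(
    forward: str,
    reverse: str,
    min_overlap: Optional[int] = None,
    kmer_locations: Optional[int] = None,
) -> List[str]:
    """Build an argument list for pandaseq (hypothetical helper, not part of PANDAseq)."""
    cmd = ["pandaseq", "-f", forward, "-r", reverse]
    if kmer_locations is not None:
        # -k as suggested in the answer above (e.g. -k 4)
        cmd += ["-k", str(kmer_locations)]
    if min_overlap is not None:
        # -o as suggested in the answer above (e.g. -o 20)
        cmd += ["-o", str(min_overlap)]
    return cmd


if __name__ == "__main__":
    # Print the command line instead of executing it.
    print(" ".join(pandaseq_command("fwd.fastq", "rev.fastq", min_overlap=20, kmer_locations=4)))
```

The resulting list can be passed to `subprocess.run` once the file names match your data.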

Lucas Sinclair · Answer 13 · Mon May 16 2016 06:53:52 GMT+0800 (China Standard Time)

Your suggestions did not work. Instead of assembling more reads, it just threw them away. Here:

minimum_overlap = 40
kmer_table_size = 4

Lucas Sinclair · Answer 14 · Mon May 16 2016 07:33:02 GMT+0800 (China Standard Time)

It took much longer on one core, but I used the default command of mothur:

mothur > make.contigs(ffastq=fwd.fastq, rfastq=rev.fastq)

And look at the result ! I think I'm going to go with that. Pity, I thought PANDAseq would be the more powerful option.