neufeld / pandaseq

PAired-eND Assembler for DNA sequences

What happens when there is high overlap?

xapple opened this issue

I was wondering what happens when, instead of the usual case:

small_overlap

We have high overlap, with the reads extending past each of the primers. What does PANDAseq do?

high_overlap

I have a dataset like that, where the distance between the two primers is shorter than the read length.

This has been fixed. It will reconstruct the entire sequence, but stop at the forward and reverse primers.

OK, that's good to know. Is it fixed in the master branch, or is version 2.8 already OK?

It's been fixed in 2.8 and later.

Thanks, good to know.

This is called "read-through". The tool Trimmomatic handles it well.

The latest version of PANDAseq generates only one third of the output that Trimmomatic does.

So you mean you have to activate a special option on the command line?

I didn't find the word "read-through" in the PANDAseq documentation.

PANDAseq doesn't require any command line options to deal with this. I believe @yech1990 is referring to the “Trimmomatic” software package.

I just tried it on one sample of my dataset. The primers are 967F and 1046R, so the peak of the assembled length distribution should be around 120 bp. But look at the output from PANDAseq. It seems to consider a much shorter overlap optimal in most cases:

assembled_len_dist.pdf

Ah sorry, GitHub only supports displaying PNG, not PDF:

assembled_len_dist
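For readers who want to reproduce a length distribution like the one plotted above, here is a minimal sketch in Python. It assumes the assembled sequences are in FASTA format. The function names and the example records are made up for illustration and are not part of PANDAseq or of the original poster's workflow.

```python
from collections import Counter
from typing import Iterable, Iterator


def fasta_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the sequence length of each record in FASTA-formatted lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length  # finish the previous record
            length = 0
        elif line and length is not None:
            length += len(line)  # sequences may span several lines
    if length is not None:
        yield length  # finish the last record


def length_histogram(lines: Iterable[str]) -> Counter:
    """Count how many records have each sequence length."""
    return Counter(fasta_lengths(lines))


# Made-up records, only to show the shape of the result.
example = [">a", "ACGT", "AC", ">b", "ACGTAC", ">c", "AC"]
hist = length_histogram(example)
for length, count in sorted(hist.items()):
    print(length, count)
```

For a real file, an open file handle can be passed instead of the list, for example `length_histogram(open("assembled.fasta"))`, where the file name is hypothetical.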

The reads are 111 base pairs each, so a final sequence length of 220 base pairs means that PANDAseq chose to make only 2 base pairs overlap...
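The arithmetic behind that statement can be written out as a small sketch. It assumes both reads have the same length and that the assembled sequence is simply the two reads merged over their overlap. The function name is hypothetical.

```python
def implied_overlap(read_len: int, assembled_len: int) -> int:
    """Overlap implied when two reads of equal length merge into one sequence."""
    return 2 * read_len - assembled_len


READ_LEN = 111  # read length stated in the thread

# Observed peak at 220 bp: 2 * 111 - 220
print(implied_overlap(READ_LEN, 220))  # 2

# Expected peak around 120 bp: 2 * 111 - 120
print(implied_overlap(READ_LEN, 120))  # 102
```

So an assembly near the expected length of about 120 bp would need an overlap of roughly 102 bp, far more than the 2 bp implied by the observed peak.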

That might be true anyway if the sequences are repetitive. See if increasing the k-mer table to 4 (-k 4) improves the situation or try increasing the minimum overlap (-o 20).
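As a rough sketch of how these two suggestions could be tried, the snippet below builds a PANDAseq command line with the `-k` and `-o` options mentioned above. It assumes the usual `-f`, `-r` and `-w` options for the forward reads, reverse reads and output file. The helper function and the output file name are hypothetical, and the command is only printed, not run.

```python
import shlex


def pandaseq_command(forward, reverse, output, kmers=None, min_overlap=None):
    """Build an argument list for pandaseq; optional settings are added only if given."""
    cmd = ["pandaseq", "-f", forward, "-r", reverse, "-w", output]
    if kmers is not None:
        cmd += ["-k", str(kmers)]  # the -k 4 suggestion
    if min_overlap is not None:
        cmd += ["-o", str(min_overlap)]  # the -o 20 suggestion
    return cmd


cmd = pandaseq_command("fwd.fastq", "rev.fastq", "assembled.fasta",
                       kmers=4, min_overlap=20)
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where PANDAseq is installed.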

Your suggestions did not work: instead of assembling more reads, it just threw them away. Here:

minimum_overlap = 40
kmer_table_size = 4

assembled_len_dist

It took much longer on one core, but I used the default command of mothur:

mothur > make.contigs(ffastq=fwd.fastq, rfastq=rev.fastq)

And look at the result! I think I'm going to go with that. Pity, I thought PANDAseq would be the more powerful option.

assembled_len_dist