marbl / canu

A single molecule sequence assembler for genomes large and small.

Home Page: http://canu.readthedocs.io/

Correcting reads at different depths gives the same results

tianjio opened this issue

Hi,
After running correction with the same parameters (on reads longer than 100 kb), I observed that the corrected output of both a 68X ONT dataset and a 45X ONT dataset drops to approximately 22X~23X coverage. I don't quite understand the reason for this result; could you please help me understand it?

By default, correction targets 40x of corrected coverage based on the genome size you provided, and the output always ends up a bit lower due to trimming during correction. The report output by Canu gives details on read quality, read overlaps, and expected corrected read lengths. If you want to target more data for correction, increase the corOutCoverage parameter from its default of 40 to whatever coverage you would like to target.
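For reference, a correction-only run that raises this target might look like the sketch below. This is a hedged example: the read file, output directory, prefix, and genome size are placeholders rather than values from this issue; -correct, -nanopore, genomeSize, and corOutCoverage are standard Canu options.

    canu -correct \
         -p asm -d asm-correct \
         genomeSize=3.2g \
         corOutCoverage=80 \
         -nanopore reads.fastq.gz

The corrected reads are then written to asm-correct/asm.correctedReads.fasta.gz.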

--                             original      original
--                            raw reads     raw reads
--   category                w/overlaps  w/o/overlaps
--   -------------------- ------------- -------------
--   Number of Reads            1290708         32623
--   Number of Bases       199546206431    3526326367
--   Coverage                    62.358         1.102
--   Median                      137243        104854
--   Mean                        154602        108093
--   N50                         153475        105431
--   Minimum                     100000             0
--   Maximum                     896875        266020
--   
--                                        --------corrected---------   ----------rescued----------
--                             evidence                     expected                     expected
--   category                     reads            raw     corrected            raw     corrected
--   -------------------- -------------  ------------- -------------  ------------- -------------
--   Number of Reads            1314802         680694        680694           2804          2804
--   Number of Bases       202115581569   128633150876   128000057219      325824003     321230356
--   Coverage                    63.161         40.198        40.000          0.102         0.100
--   Median                      136305         171657        170633         117679        114659
--   Mean                        153723         188973        188043         116199        114561
--   N50                         152569         185228        184456         119088        116162
--   Minimum                     100000         131592        131591         100021        100008
--   Maximum                     896875         851143        851130         229161        131016
--   
--                        --------uncorrected--------
--                                           expected
--   category                       raw     corrected
--   -------------------- ------------- -------------
--   Number of Reads             639833        639833
--   Number of Bases        74113557919   54806947791
--   Coverage                    23.160        17.127
--   Median                      114557        107468
--   Mean                        115832         85658
--   N50                         115908        114395
--   Minimum                          0             0
--   Maximum                     896875        896874
--   
--   Maximum Memory          9872844314

The content above is my correction report. For example, if I increase the corOutCoverage parameter to 63, will that raise the coverage of raw reads used for correction to 63X?

Yes. It looks like the remaining reads do have overlaps; they are just shorter and are being excluded from correction. There's no harm in setting the target coverage higher than your input (e.g., 100), since Canu will simply stop adding reads once it has used the entire input set.
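To make that concrete with the numbers from the report above: the "raw reads w/overlaps" row shows 199,546,206,431 bases at 62.358X coverage, which implies a genome size of roughly 199546206431 / 62.358 ≈ 3.2 Gbp. So any corOutCoverage at or above ~62 would pull in essentially all overlapping reads. A hedged re-run sketch (placeholder file and directory names, genome size assumed from the arithmetic above):

    canu -correct \
         -p asm -d asm-correct-100x \
         genomeSize=3.2g \
         corOutCoverage=100 \
         -nanopore reads.fastq.gz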

Thanks. Following your suggestion, I increased the depth of reads used for correction. The run still needs some time, so there are no results yet.
I have another question. I am comparing the correction results of different tools on ONT data from haplotypes of an autopolyploid (homologous polyploid). For Canu, could you suggest some parameters? The accuracy of the ONT data is 96% or above.

Correction will always collapse similar haplotypes, typically those below 1-2% divergence. I wouldn't recommend correction if your final goal is a haplotype-resolved (e.g., diploid) assembly and your divergence is below that. Given the high accuracy of the data, I'd use uncorrected ONT data instead, following the quick-start info here: https://canu.readthedocs.io/en/latest/quick-start.html#assembling-with-multiple-technologies-and-multiple-files which still performs a much more conservative correction of the ONT data that is haplotype preserving. Unfortunately, the corrected reads in this case are in homopolymer-compressed space.
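For illustration, the linked quick-start section demonstrates passing multiple input files and technologies in a single command; a sketch in that style follows. All file names and the genome size here are placeholders, and whether your 96%+ ONT reads should be supplied alone via -nanopore or alongside more accurate reads is an assumption to verify against that page:

    canu -p asm -d asm-multi \
         genomeSize=3.2g \
         -pacbio-hifi accurate_reads.fastq.gz \
         -nanopore ont_reads.part1.fastq.gz ont_reads.part2.fastq.gz

As noted above, when Canu takes this more conservative correction path, the corrected reads it produces are in homopolymer-compressed space.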

Thanks.