Short basecalled R9.4.1 reads using both trained and default models in bonito, but not in guppy

Question

andreaswallberg opened this issue 8 months ago

Dear developers,

We are training our own models for non-model organisms using R9.4.1 data, focusing on data associated with protein-coding genes. When we basecall our data with bonito, using either our own trained model or the default models, the output reads are often only about half as long as the reads guppy produces from the same FAST5 files. We also get many more reads, especially with our own model. Although we are improving the base-level quality of the reads, we are concerned about how short they are.
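For reference, the kind of comparison behind these numbers can be sketched as below. This is only a minimal illustration, not our exact pipeline: it assumes the output of both basecallers is available as plain four-line FASTQ and that read IDs are preserved between the two, and the function names `fastq_lengths` and `compare_lengths` are made up for this example.

```python
import io
import statistics


def fastq_lengths(handle):
    """Return {read_id: sequence length} from a 4-line-per-record FASTQ stream."""
    lengths = {}
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        read_id = header[1:].split()[0]  # drop '@' and any trailing description
        lengths[read_id] = len(seq)
    return lengths


def compare_lengths(lengths_a, lengths_b):
    """Summarise read counts and the per-read length ratio a/b for shared IDs."""
    shared = sorted(set(lengths_a) & set(lengths_b))
    ratios = [lengths_a[r] / lengths_b[r] for r in shared if lengths_b[r] > 0]
    return {
        "reads_a": len(lengths_a),
        "reads_b": len(lengths_b),
        "shared": len(shared),
        "median_ratio": statistics.median(ratios) if ratios else None,
    }


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two basecallers' FASTQ output (made-up data).
    bonito_fq = io.StringIO("@r1\nACGT\n+\nIIII\n")
    guppy_fq = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n")
    print(compare_lengths(fastq_lengths(bonito_fq), fastq_lengths(guppy_fq)))
```

A median ratio well below 1 for shared read IDs would correspond to the shortening we describe, while a higher read count on one side would show up in `reads_a` versus `reads_b`.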

We wonder whether the algorithm applies some form of clipping, e.g. that basecalling proceeds only until the local quality score has dropped by some amount relative to the overall quality of the read, and then terminates at that point.

We have also noticed that older versions of bonito had default settings that seemed better suited to R9 data, while newer versions seem to be adapted to R10 data. Perhaps this also reflects internal changes to the algorithms, such that the current implementation of bonito does not fit legacy R9.4.1 data as well as it used to.

For the legacy R9.4.1 data we are currently exploring, would you recommend using an older version of bonito?