vgl-hub / gfastats

A single fast and exhaustive tool for summary statistics and simultaneous *fa* (fasta, fastq, gfa [.gz]) genome assembly file manipulation.

incorrect scaffold stats for very large scaffolds

mcshane opened this issue

Hi @gf777. We have a plant assembly with scaffolds of length up to 10-11G. See the stats below. Notice that the total scaffold length is smaller than the total contig length. We suspect there is a silent overflow somewhere when the counts exceed the 32-bit integer range (2^32). The contig stats look correct in this case.

+++Summary+++:
# scaffolds: 5326
Total scaffold length: 55785406534
Average scaffold length: 10474165.70
Scaffold N50: 2397495093
Scaffold auN: 2371627289.17
Scaffold L50: 8
Largest scaffold: 4269982256
# contigs: 11730
Total contig length: 103028768190
Average contig length: 8783356.20
Contig N50: 33299704
Contig auN: 41403720.17
Contig L50: 926
Largest contig: 198688677
# gaps in scaffolds: 6393
Total gap length in scaffolds: 1278600
Average gap length in scaffolds: 200.00
Gap N50 in scaffolds: 200
Gap auN in scaffolds: 200.00
Gap L50 in scaffolds: 3197
Largest gap in scaffolds: 200
Base composition (A:C:G:T): 35320906993:16056332857:16054669442:35317470900
GC content %: 31.25
# soft-masked bases: 0
# segments: 11730
Total segment length: 103028768190
Average segment length: 8783356.20
# gaps: 6393
# paths: 5326
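
The numbers in the report are consistent with a 32-bit wraparound, and a quick check makes that visible. The sketch below assumes that the total scaffold length should equal the total contig length plus the total gap length (an assumption about how the statistics relate, not something stated in the thread). Under that assumption, the reported total is short by an exact multiple of 2^32. Note also that the reported largest scaffold, 4269982256, sits just below 2^32 = 4294967296, even though the assembly has scaffolds of 10-11G. The function and constant names are illustrative only.

```cpp
#include <cstdint>

// Values copied from the summary above.
constexpr std::uint64_t kTotalContigLength    = 103028768190ULL;
constexpr std::uint64_t kTotalGapLength       = 1278600ULL;
constexpr std::uint64_t kReportedScaffoldSum  = 55785406534ULL;
constexpr std::uint64_t kTwoPow32             = 1ULL << 32;  // 4294967296

// Assumption: scaffolds are made of contigs plus gaps, so their total
// length should be the sum of the two.
constexpr std::uint64_t expectedScaffoldTotal() {
    return kTotalContigLength + kTotalGapLength;
}

// Bases missing from the reported scaffold total.
constexpr std::uint64_t missingBases() {
    return expectedScaffoldTotal() - kReportedScaffoldSum;
}

// The shortfall is exactly 11 * 2^32, which is what repeated
// wraparound of 32-bit per-scaffold lengths would produce.
static_assert(missingBases() % kTwoPow32 == 0, "shortfall is a multiple of 2^32");
static_assert(missingBases() / kTwoPow32 == 11, "eleven wraparounds in total");
```

The checks run at compile time, so the file only needs to compile (for example with `-std=c++11` or later) to confirm the arithmetic.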

Thank you @mcshane, it should be addressed now. It seems to have originated from the size() function of string, which is bound to return a size_t integer.
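
For readers unfamiliar with this class of bug, here is a minimal sketch of how a sequence length can silently lose its upper bits. The actual gfastats fix is not shown in this thread, so the helper below is hypothetical. It only illustrates the general mechanism: `std::string::size()` returns `std::string::size_type` (typically a 64-bit `size_t` on 64-bit platforms), and storing such a value in a 32-bit unsigned variable keeps the length modulo 2^32 without any warning at run time.

```cpp
#include <cstdint>

// Hypothetical helper, not taken from gfastats: mimics what happens
// when a 64-bit length is stored in a 32-bit unsigned variable.
// Conversion to an unsigned type is well defined in C++ and keeps
// only the low 32 bits (the value modulo 2^32).
inline std::uint32_t storeIn32Bits(std::uint64_t length) {
    return static_cast<std::uint32_t>(length);
}

// Keeping the full 64-bit width avoids the truncation.
inline std::uint64_t storeIn64Bits(std::uint64_t length) {
    return length;
}
```

With this helper, a 10,000,000,000 bp scaffold would be recorded as 1,410,065,408 bp, while any length below 2^32 passes through unchanged. That would explain why the contig statistics in the report look correct. Compiler options such as `-Wconversion` (GCC and Clang) can flag implicit narrowing of this kind at build time.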