isovic / minialign

Minimalistic aligner which uses Minimap for input mapping locations and Edlib for fast bitvector alignment.

ERROR: Reads are not specified in a format which contains quality information. Exiting.

zeeev opened this issue · comments

Probably a stupid mistake?

    /net/eichler/vol8/home/zevk/tools/minialign_isovic/minialign/bin/minialign -t 1 /net/eichler/vol2/eee_shared/assemblies/hg38/ucsc.hg38.no_alts.fasta /net/eichler/vol18/zevk/great_apes/contig_breaking/clint/hi_c_scaffold/output/scaffolds.fasta raw_mini_map_out/q-clint-scaffold.t-hg38.txt > raw_aln_sam/q-clint-scaffold.t-hg38.sam

Ah yes, a leftover from Racon where we require quality values. I'll fix it in a few moments.

Should be fixed now. Do a pull and then run make modules again; there was a potential bug in the codebase.

Huh, wait, I missed something with the quality: a star is missing.

Running now, I'll let you know.

There, all good now. The default rname value was wrong: it was an empty string instead of an asterisk (*).
I also added the edit distance now.
Do a git pull, then make modules; make -j.
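
For readers unfamiliar with the placeholder convention mentioned above: in SAM, a field with no stored value (for example QUAL when the input is FASTA) is written as `*` rather than left empty, and the edit distance is conventionally carried in the optional `NM:i:` tag. The sketch below only illustrates that convention. The function name `sam_record` and its parameters are hypothetical and are not taken from the minialign code.

```python
def sam_record(qname, flag, rname, pos, mapq, cigar, seq,
               qual=None, edit_distance=None):
    """Format one SAM alignment line (illustrative helper, not minialign's code).

    Empty or missing string fields are written as '*', which is the SAM
    placeholder for "no value stored". An empty string would break the
    11-column layout for downstream parsers.
    """
    fields = [
        qname or "*",
        str(flag),
        rname or "*",      # reference name; '*' if unavailable
        str(pos),
        str(mapq),
        cigar or "*",
        "*",               # RNEXT: no mate information
        "0",               # PNEXT
        "0",               # TLEN
        seq or "*",
        qual or "*",       # FASTA input has no qualities, so QUAL is '*'
    ]
    if edit_distance is not None:
        fields.append(f"NM:i:{edit_distance}")  # optional edit-distance tag
    return "\t".join(fields)


if __name__ == "__main__":
    # FASTA-derived query: no quality string is available.
    print(sam_record("scaffold_1", 0, "chr1", 1000, 60, "8M", "ACGTACGT",
                     qual=None, edit_distance=2))
```

With FASTA input, the QUAL column comes out as `*` and the line still has all 11 mandatory columns.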

Looks like it's going better.

    [10:48:20 main] Using PAF for input alignments. (raw_mini_map_out/q-clint-scaffold-100Kb.t-hg38.txt)
    [10:48:20 main] Loading reads.
    [10:50:55 main] Hashing qnames.
    [10:50:55 main] Parsing the overlaps file.
    [10:51:00 AlignOverlaps] Aligning overlap: 9 / 210330 (0.00%), skipped 0 / 210330 (0.00%)

How long does it take for a whole genome alignment? I'm guessing overnight with 10 threads...?

Something to think about: soft clipping is drastically inflating my SAM, since I'm aligning large sequences.

> Running now, I'll let you know.

If you've run it before my last commit, please re-run so you have a correct SAM output.

> How long does it take for a whole genome alignment? I'm guessing overnight with 10 threads...?

Are you talking about the human genome? As a rough estimate, in Racon the Edlib alignment takes about 400 CPU minutes (summed over all threads), without SPOA.
Your estimate sounds reasonable. Let me know how it goes :-)

> Something to think about: soft clipping is drastically inflating my SAM, since I'm aligning large sequences.

That's true, I could add this feature soon.

To elaborate on the earlier point, I think most of the time is spent in I/O. Each contig has at least a couple of alignment records, which means the whole contig sequence is printed for each record.

Could you elaborate a bit more on the I/O part? Are you referring to entire files being loaded into RAM, as opposed to being processed progressively?
Or do you mean that the output is slow because each large contig is written out in full even though only a small part of it is actually mapped and needed, which is where hard clipping would come in handy?
Or even both :-)

I'm aligning a 50 Mb contig to a reference genome. If the contig maps to 10 places, the SAM file holds 50 Mb × 10 of sequence. With hard clipping it would be much smaller.
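
To put rough numbers on this, here is a small back-of-the-envelope sketch. The 50 Mb contig and the 10 mapping locations come from the comment above; the 1 Mb aligned span per location is purely an assumed figure for illustration, and `sam_sequence_bases` is a hypothetical helper, not part of minialign.

```python
def sam_sequence_bases(contig_len, aligned_lens, hard_clip):
    """Estimate how many bases end up in the SEQ column across all records.

    With soft clipping, every record carries the full contig sequence.
    With hard clipping, each record carries only its aligned portion.
    """
    if hard_clip:
        return sum(aligned_lens)
    return contig_len * len(aligned_lens)


if __name__ == "__main__":
    contig_len = 50_000_000            # 50 Mb contig (from the thread)
    aligned_lens = [1_000_000] * 10    # 10 hits; 1 Mb each is an assumption

    soft = sam_sequence_bases(contig_len, aligned_lens, hard_clip=False)
    hard = sam_sequence_bases(contig_len, aligned_lens, hard_clip=True)
    print(f"soft-clipped SEQ bases: {soft:,}")   # 500,000,000
    print(f"hard-clipped SEQ bases: {hard:,}")   # 10,000,000
```

Under these assumptions the soft-clipped output stores 50 times more sequence than the hard-clipped one. The real ratio depends on how much of the contig each record actually aligns.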

So the latter case.

Easy, try it now. Pull and make modules.
There is now a --hard-clip option to, well, hard clip the alignments :-)
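
For anyone curious what hard clipping means at the record level, the following is a minimal sketch of the general idea, not the actual `--hard-clip` implementation (which is not shown in this thread). Leading and trailing `S` (soft clip) CIGAR operations become `H` (hard clip), and the clipped bases are dropped from SEQ. The function name `hard_clip` is hypothetical, and the sketch assumes the CIGAR has no pre-existing `H` operations outside the soft clips.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def hard_clip(cigar, seq):
    """Turn leading/trailing soft clips into hard clips and trim SEQ to match.

    Assumes soft clips, if present, are the first and/or last CIGAR operation.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    start, end = 0, len(seq)

    if ops and ops[0][1] == "S":
        start = ops[0][0]              # drop the clipped prefix from SEQ
        ops[0] = (ops[0][0], "H")
    if len(ops) > 1 and ops[-1][1] == "S":
        end -= ops[-1][0]              # drop the clipped suffix from SEQ
        ops[-1] = (ops[-1][0], "H")

    new_cigar = "".join(f"{n}{op}" for n, op in ops)
    return new_cigar, seq[start:end]


if __name__ == "__main__":
    print(hard_clip("3S4M2S", "AAACGTATT"))
```

The CIGAR still records how much was clipped on each side, so the original query coordinates remain recoverable even though the clipped bases are no longer stored in the record.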