biogo / hts

biogo high throughput sequencing repository

bam: add Merge function

dongweigogo opened this issue · comments

This package is a great piece of Go work for NGS data processing, and I love using it.
Is there an API for sorting bam files, similar to "samtools sort"?

There is no API for sorting. For inputs that fit in memory, the standard library's sort package is enough. For large inputs you will need to write out partial sort results to files and merge those at the end. A generic package that does this is available in "github.com/biogo/biogo/morass", but I would suggest writing your own, as the genericity costs performance.
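A minimal sketch of the first half of that approach (writing sorted partial results to files) might look like the following. It is an illustration only: `record`, `byPos`, `writeSortedRuns`, `readRun` and `removeRuns` are hypothetical names, the record type is a plain struct standing in for *sam.Record, and the runs are stored with encoding/gob rather than as BAM. It assumes Go 1.16 or later for os.CreateTemp.

```go
package main

import (
	"encoding/gob"
	"errors"
	"fmt"
	"os"
	"sort"
)

// record is a hypothetical stand-in for *sam.Record.
type record struct {
	Name string
	Pos  int
}

// byPos implements sort.Interface, ordering records by position.
type byPos []record

func (s byPos) Len() int           { return len(s) }
func (s byPos) Less(i, j int) bool { return s[i].Pos < s[j].Pos }
func (s byPos) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

// writeSortedRuns splits recs into chunks of at most size records, sorts
// each chunk by position and writes it to its own temporary file. It returns
// the file names in chunk order; the caller removes the files when done.
func writeSortedRuns(recs []record, size int) ([]string, error) {
	if size < 1 {
		return nil, errors.New("run size must be positive")
	}
	var names []string
	for start := 0; start < len(recs); start += size {
		end := start + size
		if end > len(recs) {
			end = len(recs)
		}
		// Copy the chunk so the caller's slice is left untouched.
		chunk := append([]record(nil), recs[start:end]...)
		sort.Sort(byPos(chunk))

		f, err := os.CreateTemp("", "run-*.gob")
		if err != nil {
			return names, err
		}
		names = append(names, f.Name())
		err = gob.NewEncoder(f).Encode(chunk)
		if cerr := f.Close(); err == nil {
			err = cerr
		}
		if err != nil {
			return names, err
		}
	}
	return names, nil
}

// readRun loads one sorted run back from disk.
func readRun(name string) ([]record, error) {
	f, err := os.Open(name)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var chunk []record
	err = gob.NewDecoder(f).Decode(&chunk)
	return chunk, err
}

// removeRuns deletes the temporary run files.
func removeRuns(names []string) {
	for _, n := range names {
		os.Remove(n)
	}
}

func main() {
	recs := []record{{"a", 5}, {"b", 1}, {"c", 4}, {"d", 2}, {"e", 3}}
	names, err := writeSortedRuns(recs, 2)
	defer removeRuns(names)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("runs written:", len(names))
}
```

Each run is sorted on its own, so a later merge step only ever needs to hold one record per run in memory.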

Thinking about this some more, there is probably a good justification for providing the most difficult and generally useful primitive component of this task: merging BAMs with a defined sort order. With this addition the general case becomes trivial: read the input in chunks, sort and write each chunk as a partial BAM, and then merge them.

I'm thinking of something like bam.NewMerger(less func(a, b *sam.Record) bool, src ...*bam.Reader) (*Merger, error), where *Merger is an iterator type with Read() (*sam.Record, error) and Header() *sam.Header methods. The less func allows merging in orders other than the queryname and coordinate sorts; it would only be used if it is non-nil and the sort order is unknown in the input BAMs (the sort order fields must agree or NewMerger returns an error). This also allows direct interaction with the read stream, for things like read deduplication, without having to use a pipe or write out the sorted/merged files.
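To make the shape of that proposal concrete, here is a minimal standard-library sketch of a k-way merge over already-sorted sources, driven by a caller-supplied less function. It is not the hts implementation: `record`, `source`, `merger` and `newMerger` are hypothetical stand-ins for *sam.Record, *bam.Reader, *Merger and bam.NewMerger, and the header handling and sort-order checks described above are left out.

```go
package main

import (
	"container/heap"
	"fmt"
	"io"
)

// record is a hypothetical stand-in for *sam.Record.
type record struct {
	Name string
	Pos  int
}

// source is a hypothetical stand-in for a sorted *bam.Reader.
type source struct {
	recs []record
	next int
}

// Read returns the next record, or io.EOF when the source is exhausted.
func (s *source) Read() (record, error) {
	if s.next >= len(s.recs) {
		return record{}, io.EOF
	}
	r := s.recs[s.next]
	s.next++
	return r, nil
}

// item pairs the current head record of a source with that source.
type item struct {
	rec record
	src *source
}

// merger keeps one head record per source in a min-heap ordered by less.
type merger struct {
	less  func(a, b record) bool
	items []item
}

// The next five methods satisfy heap.Interface.
func (m *merger) Len() int           { return len(m.items) }
func (m *merger) Less(i, j int) bool { return m.less(m.items[i].rec, m.items[j].rec) }
func (m *merger) Swap(i, j int)      { m.items[i], m.items[j] = m.items[j], m.items[i] }
func (m *merger) Push(x interface{}) { m.items = append(m.items, x.(item)) }
func (m *merger) Pop() interface{} {
	n := len(m.items) - 1
	it := m.items[n]
	m.items = m.items[:n]
	return it
}

// newMerger primes the heap with the first record of each non-empty source.
func newMerger(less func(a, b record) bool, srcs ...*source) *merger {
	m := &merger{less: less}
	for _, s := range srcs {
		if r, err := s.Read(); err == nil {
			m.items = append(m.items, item{rec: r, src: s})
		}
	}
	heap.Init(m)
	return m
}

// Read returns the smallest remaining record across all sources, or io.EOF.
func (m *merger) Read() (record, error) {
	if len(m.items) == 0 {
		return record{}, io.EOF
	}
	out := m.items[0].rec
	if r, err := m.items[0].src.Read(); err == nil {
		// Refill the head from the same source and restore heap order.
		m.items[0].rec = r
		heap.Fix(m, 0)
	} else {
		// The source is exhausted; drop it from the heap.
		heap.Pop(m)
	}
	return out, nil
}

func main() {
	a := &source{recs: []record{{"r1", 1}, {"r4", 7}}}
	b := &source{recs: []record{{"r2", 3}, {"r3", 5}, {"r5", 9}}}
	m := newMerger(func(x, y record) bool { return x.Pos < y.Pos }, a, b)
	for {
		r, err := m.Read()
		if err == io.EOF {
			break
		}
		fmt.Println(r.Name, r.Pos)
	}
}
```

Because the merged stream is exposed through Read, a caller can filter or deduplicate records as they arrive instead of writing the merged result out first.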

Does this sound reasonable for your use?

Thanks, that's probably what I want. And it sounds like it is based on the sort package.

@dongweigogo Please take a look at https://github.com/biogo/hts/tree/merger, in particular https://github.com/biogo/hts/blob/merger/bam/merger_example_test.go which shows how the API is used.

Does this fit your use?

This does fit most of my uses, thanks again!
BTW, in some other cases sorting by read ID is also needed, so I think a sortByID-like function could be added in the future. Also, Go 1.8 introduces a new sort.Slice function that makes sorting more convenient; I'm wondering whether it could help here.

The query name sort is already provided here. But the idea is to provide the basics. If you have a sort less function, you can either build a type satisfying sort.Interface or use the new sort.Slice (with its performance cost - I would avoid it for uses here because of this) together with the example I linked to. The reason for providing the example and not a user-facing function is that there are too many options, so I leave that to the user.
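For comparison, the two options look roughly like this with the standard library alone. This is a sketch using a hypothetical `record` struct in place of *sam.Record; `byName`, `sortByNameInterface` and `sortByNameSlice` are illustrative names, not part of hts.

```go
package main

import (
	"fmt"
	"sort"
)

// record is a hypothetical stand-in for *sam.Record.
type record struct {
	Name string
	Pos  int
}

// byName satisfies sort.Interface, ordering records by query name.
type byName []record

func (s byName) Len() int           { return len(s) }
func (s byName) Less(i, j int) bool { return s[i].Name < s[j].Name }
func (s byName) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

// sortByNameInterface sorts in place through a sort.Interface type.
func sortByNameInterface(recs []record) {
	sort.Sort(byName(recs))
}

// sortByNameSlice sorts in place with sort.Slice (Go 1.8 and later), which
// needs only a less closure and no named type.
func sortByNameSlice(recs []record) {
	sort.Slice(recs, func(i, j int) bool { return recs[i].Name < recs[j].Name })
}

func main() {
	a := []record{{"q3", 1}, {"q1", 2}, {"q2", 3}}
	b := append([]record(nil), a...)
	sortByNameInterface(a)
	sortByNameSlice(b)
	fmt.Println(a)
	fmt.Println(b)
}
```

Both calls produce the same order here. The sort.Interface version takes a few more lines but gives the sort a statically typed Swap, which is the trade-off behind the recommendation above.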

I may add *sam.Record.LessByNameAndCoordinate as

func (r *Record) LessByNameAndCoordinate(other *Record) bool {
    if r.Name < other.Name {
        return true
    }
    return r.Name == other.Name && r.LessByCoordinate(other)
}

since it allows a BAM to be sorted by pairs, which makes paired-end read deduplication easy. That said, the function above is trivial for users to write themselves given what already exists (really only less-by-coordinate is needed, since it has to deal with some SAM spec weirdness).

I will finish up the PR over the next few days and then it should be ready to merge. At the moment, the bam.Merger code is not tested.

OK, you're right, for tasks like merging, performance is a big thing to consider.