biogo / hts

biogo high throughput sequencing repository

sam: renaming references, read groups and programs

kortschak opened this issue · comments

The work on the bam.Merger API has exposed some limitations of the Header API. The limitations are not harmful in the normal consume-a-bam-or-sam-file-to-read-some-data kind of use, but they get in the way of mutating bam and sam files efficiently. This kind of mutation is potentially needed when merging bams.

Background

There are two broad cases for merging:

  1. re-merging a collection of sub-sorted bams to get an overall sort and
  2. joining together a collection of files from distinct origins.

In the first case the headers of the input files all match, so there is no work to do. In the second case references, read groups and programs may need to be added to the header. Since read groups and programs must have unique identifiers (the ID field in both), collisions must be handled, either by accepting that the read group or program is already recorded or by de-colliding the identifiers. We cannot know which is correct in general, so we should leave this up to the user.

The code I have at present (not yet merged into the merger branch) blithely renames all read groups and programs to "<old-id>|<n>", where <old-id> is the previous ID and <n> is the index into the list of headers being merged. It is then up to the user to delete all the read groups/programs that are not needed, and to accept that "<old-id>" is unrecoverable because we do not allow name changes in any sensible kind of way. This is horrible.
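To make the renaming scheme concrete, here is a minimal sketch of the "<old-id>|<n>" rewrite. It works on plain string slices, not on the real header types, and the function name suffixIDs is made up for illustration.

```go
package main

import "fmt"

// suffixIDs is a hypothetical helper, not part of biogo/hts. Each element
// of headers holds the read group (or program) IDs of one input header.
// Every ID is rewritten as "<old-id>|<n>", where n is the index of the
// header in the list being merged.
func suffixIDs(headers [][]string) [][]string {
	out := make([][]string, len(headers))
	for n, ids := range headers {
		out[n] = make([]string, len(ids))
		for i, id := range ids {
			out[n][i] = fmt.Sprintf("%s|%d", id, n)
		}
	}
	return out
}

func main() {
	// Two inputs that both use the ID "rg1".
	renamed := suffixIDs([][]string{{"rg1", "rg2"}, {"rg1"}})
	fmt.Println(renamed) // [[rg1|0 rg2|0] [rg1|1]]
}
```

Every ID is rewritten whether or not it collides, which is why the original "rg1" cannot be restored without some way to rename.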

What I'm proposing here is that users should be able to change the names of read groups, programs and, while we are here, references. Currently there is no way to add this API to these types given the structure of the types representing these data; changing the name of one of these will break an invariant that is depended on. There are two possible approaches.

  1. Add a name changing API to Header in the form of RenameX(old, new string) error where X is {Reference|ReadGroup|Program} and non-nil errors are returned if old doesn't exist or new already does, or
  2. Add a name setting API to each type in the form of SetName(n string) error where a non-nil error is returned if a name n already exists.
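As a rough sketch of what option 1 could look like for read groups: the Header and ReadGroup types below are toy stand-ins with a single name-keyed map, not the real structures, and only the RenameX(old, new string) error shape is taken from the proposal.

```go
package main

import "fmt"

// ReadGroup and Header are simplified, hypothetical stand-ins.
type ReadGroup struct{ name string }

type Header struct {
	rgs map[string]*ReadGroup // read groups keyed by their current name
}

// RenameReadGroup changes the name of the read group called oldName to
// newName. It returns a non-nil error if oldName does not exist or if
// newName is already taken (which includes oldName == newName).
func (h *Header) RenameReadGroup(oldName, newName string) error {
	rg, ok := h.rgs[oldName]
	if !ok {
		return fmt.Errorf("read group %q does not exist", oldName)
	}
	if _, dup := h.rgs[newName]; dup {
		return fmt.Errorf("read group %q already exists", newName)
	}
	delete(h.rgs, oldName)
	rg.name = newName
	h.rgs[newName] = rg
	return nil
}

func main() {
	h := &Header{rgs: map[string]*ReadGroup{
		"a": &ReadGroup{name: "a"},
		"b": &ReadGroup{name: "b"},
	}}
	fmt.Println(h.RenameReadGroup("a", "c")) // <nil>
	fmt.Println(h.RenameReadGroup("b", "c")) // collision with existing "c"
	fmt.Println(h.RenameReadGroup("x", "y")) // "x" does not exist
}
```

RenameReference and RenameProgram would follow the same pattern, so this option adds three such methods to Header.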

The first option adds noise to the API and ties behaviour that seems like it should belong to the type to Header instead. The second option requires that the types know who owns them (this is allowed in the current invariants) by adding a *Header field to each type so that the owner's seen maps can be updated.
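A minimal sketch of option 2, again with toy types: the field names owner and seenRG, and the NewReadGroup and AddReadGroup helpers, are assumptions made for this example. The point is only that SetName can keep the owner's seen map consistent because the read group holds a back-pointer to its Header.

```go
package main

import (
	"errors"
	"fmt"
)

// Header is a hypothetical stand-in that tracks which names are in use.
type Header struct {
	rgs    []*ReadGroup
	seenRG map[string]struct{}
}

// ReadGroup records the Header that owns it; owner is nil until added.
type ReadGroup struct {
	name  string
	owner *Header
}

func NewReadGroup(name string) *ReadGroup { return &ReadGroup{name: name} }

func (rg *ReadGroup) Name() string { return rg.name }

// AddReadGroup takes ownership of rg, rejecting duplicate names.
func (h *Header) AddReadGroup(rg *ReadGroup) error {
	if rg.owner != nil {
		return errors.New("read group already has an owner")
	}
	if _, dup := h.seenRG[rg.name]; dup {
		return fmt.Errorf("read group %q already exists", rg.name)
	}
	if h.seenRG == nil {
		h.seenRG = make(map[string]struct{})
	}
	h.seenRG[rg.name] = struct{}{}
	h.rgs = append(h.rgs, rg)
	rg.owner = h
	return nil
}

// SetName renames rg. If rg is owned, the owner's seen map is checked
// and updated so that names stay unique within the Header.
func (rg *ReadGroup) SetName(n string) error {
	if rg.owner == nil || rg.name == n {
		rg.name = n
		return nil
	}
	if _, exists := rg.owner.seenRG[n]; exists {
		return fmt.Errorf("read group %q already exists", n)
	}
	delete(rg.owner.seenRG, rg.name)
	rg.owner.seenRG[n] = struct{}{}
	rg.name = n
	return nil
}

func main() {
	h := &Header{}
	a, b := NewReadGroup("a"), NewReadGroup("b")
	h.AddReadGroup(a)
	h.AddReadGroup(b)
	fmt.Println(a.SetName("b")) // rejected: "b" is in use
	fmt.Println(a.SetName("c")) // <nil>
	fmt.Println(b.SetName("a")) // <nil>: "a" was freed by the rename
}
```

In this sketch an unowned read group can be renamed freely, and renaming to the current name is treated as a no-op; whether the real API should do either is a separate decision.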

@brentp Do you have a preference? I am leaning toward option 2.

I agree 2 seems better.
What about the API of SetName(n string, h *Header) error ?
Seems to have obvious benefits, e.g. if h is nil then there's no check and the user has full control of passing around the struct to any Header as appropriate.

There is already information in each of the types that ties them to an individual *Header. Allowing the user to lie about who owns the type is just asking for bug reports.