matsengrp / linearham

A Bayesian Phylo-HMM for B cell receptor sequence analysis

Home Page: http://matsengrp.github.io/linearham

Designing a Phylo object

matsen opened this issue · comments

I'd like to start getting to the nitty-gritty of how the objects will look for our phyloHMM implementation.

First, let's assume that there is a "Joe" class (named in honor of Felsenstein) that wraps all of the libpll data structures.

I propose a "Phylo" object:

  • Contains a Joe object with the current tree and the complete MSA (with all the germline bases that could align to the query sequences)
  • Contains information about the germline genes under consideration and their alignments to the MSA
  • Has methods so that we can ask for per-site likelihoods (and their derivatives?) for a specific gene

The idea then is that to calculate a marginal likelihood, we set up the corresponding Smooshables with the per-site likelihoods calculated by the Phylo object, then run the usual marginal likelihood algorithm. We can do whatever we want to the Joe object (modify tree and branch lengths, etc) and ask for such a re-calculation. This way the germline-encoded Smooshables only need to know what gene and range of that gene they represent; NTI Smooshables know what range of the MSA they represent.

I also propose that all Smooshishs have a dirty flag. In this way the calculation could proceed:

  1. Modify tree somehow
  2. Tell the Joe to update itself
  3. Mark every Smooshish in the Pile (recursing up through Chains, etc.) as dirty
  4. Recalculate. If a Smooshable is dirty, it asks for its per-site likelihoods from the Phylo object, and if a Chain object is dirty it just recalculates based on its prev and curr (asking them to recalculate as well, which they do if they are dirty).
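The four steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class shapes, the `Value`/`MarkDirty` names, the use of a single `double` in place of the real probability matrices, and the `std::function` standing in for "ask the Phylo object for per-site likelihoods" are all hypothetical, not linearham's actual interfaces.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical base class: caches a value and recomputes it only when dirty.
class Smooshish {
 public:
  virtual ~Smooshish() = default;

  // Mark this object (and anything it is built from) as needing recalculation.
  void MarkDirty() {
    dirty_ = true;
    MarkChildrenDirty();
  }

  // Recalculate if dirty, then mark clean; otherwise return the cached value.
  double Value() {
    if (dirty_) {
      value_ = Recalculate();
      dirty_ = false;
    }
    return value_;
  }

 protected:
  virtual double Recalculate() = 0;
  virtual void MarkChildrenDirty() {}

 private:
  bool dirty_ = true;
  double value_ = 0.0;
};

// A leaf: when dirty, it asks its data source (a stand-in for the Phylo
// object) for a fresh likelihood.
class Smooshable : public Smooshish {
 public:
  explicit Smooshable(std::function<double()> fetch)
      : fetch_(std::move(fetch)) {}

 protected:
  double Recalculate() override { return fetch_(); }

 private:
  std::function<double()> fetch_;
};

// A Chain recalculates from its prev and curr, which in turn recalculate
// themselves only if they are dirty.
class Chain : public Smooshish {
 public:
  Chain(std::shared_ptr<Smooshish> prev, std::shared_ptr<Smooshish> curr)
      : prev_(std::move(prev)), curr_(std::move(curr)) {}

 protected:
  double Recalculate() override { return prev_->Value() * curr_->Value(); }
  void MarkChildrenDirty() override {
    prev_->MarkDirty();
    curr_->MarkDirty();
  }

 private:
  std::shared_ptr<Smooshish> prev_;
  std::shared_ptr<Smooshish> curr_;
};
```

With this shape, a second call to `Value()` on a clean Chain never touches the data source, and a `MarkDirty()` on the outermost Chain after a tree change forces every leaf underneath it to refetch exactly once.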

I haven't carefully thought out what to do with indels.

A way of compressing sites of an alignment:

x-001

A possible way forward:

x-000

As per our discussion about alignment compression, I think we can get away with using an STL map in the following way:
phylo storage
For any germline, we can input a (base, rate, MSA pos.) triple into the map and extract the xMSA likelihood index. We can avoid all the complicated loops needed to compute the v_i vectors shown above; we'll just have to loop through the bases and rates vectors contained in the PhyloGermline object.

(Although I'm not sure whether libpll uses per-site rate scalers the way we want.)

Yes, this looks good as a way to build things up, and it might be simpler. However, I would advocate this map as an intermediate step-- in the end we are going to want to be able to specify the gene and quickly get all of the indices of the xMSA that correspond to that gene. In the design you propose that's going to require O(n) fetches from the map. This probably isn't a big deal, but it would seem tidier to get the v_i vectors once and store them.

Also, can you provide a little more detail about how the xMSA is going to be built up? Once the map is there, we know how long it's going to be and then can copy the corresponding information in. It would be nice to have an "ordered hash map" for this, though that doesn't appear in the STL, so perhaps keep two data structures around: the xMSA map and a vector holding the keys in the order they were added (just push_back every key).
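A minimal sketch of the "map plus ordered key vector" idea, which also stores the v_i vector for a gene once so the map need not be queried again. Everything here is an assumption for illustration: the `SiteKey` layout, the `XmsaIndex` and `GermlineColumns` names, and the encoding of bases as integers are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical key: (base, rate, MSA position).
using SiteKey = std::tuple<int, double, int>;

// Insertion-ordered index: the map gives fast key -> column look-up, and the
// vector remembers the order in which columns were first added.
class XmsaIndex {
 public:
  // Return the xMSA column for `key`, appending a new column on first sight.
  std::size_t GetOrAdd(const SiteKey& key) {
    auto result = index_.emplace(key, keys_.size());
    if (result.second) keys_.push_back(key);
    return result.first->second;
  }

  // Keys in column order; the xMSA can be built column by column from this.
  const std::vector<SiteKey>& keys() const { return keys_; }

 private:
  std::map<SiteKey, std::size_t> index_;
  std::vector<SiteKey> keys_;
};

// Compute the v_i vector (xMSA column indices) for one germline gene whose
// first base aligns to MSA position `msa_start`.
std::vector<std::size_t> GermlineColumns(XmsaIndex& index,
                                         const std::vector<int>& bases,
                                         const std::vector<double>& rates,
                                         int msa_start) {
  assert(bases.size() == rates.size());
  std::vector<std::size_t> columns;
  columns.reserve(bases.size());
  for (std::size_t i = 0; i < bases.size(); ++i) {
    columns.push_back(index.GetOrAdd(
        SiteKey{bases[i], rates[i], msa_start + static_cast<int>(i)}));
  }
  return columns;
}
```

Two genes that share a (base, rate, position) triple end up pointing at the same xMSA column, which is the compression being discussed.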

Regarding rates, I asked about per-site rates on the Stamatakis lab Slack in the summer and they said it was on the roadmap, but we can't wait for it here. I propose using Gamma distributed rates for now (no per-site rates), but writing the code in such a way that we can use per-site rates when that's available. So how are we going to generalize over these two cases? We could use templates (simple examples here), but I'd be happy to hear something simpler.

And as we discussed, it'd be great to sketch out how we are going to abstract this indexing code so that it's not all closely woven into the rest of the code and can be tested separately.

With regards to the OOP design of EmissionData, I'm going to abstract it a little and see if we can get some feedback from @bcclaywell . I think we can use a static cast.

In this case imagine we have two groups of classes: D for data and P for processing. The role of P objects is to process D objects.

Now, each of these groups of classes comes in two flavors: D1 and D2, then P1 and P2. P1 objects are meant to process objects of type D1, and likewise for P2 and D2. There is some commonality between each of these groups, so we'd like for P1 and P2 to inherit from a common class P, and likewise for the D objects. P then will have a method f that takes in an object of type D. There is some processing, however, that is class-dependent, which is done by calling another method g, the behavior of which differs between P1 and P2. Thus, I'd think that g would be implemented differently in the two subclasses P1 and P2.

However, we will be calling g from f, which only knows the data as a type D. My thought is that g can take an object of type D, but then it can do a static cast to the appropriate type: P1 will cast to a type D1, and P2 to a type D2. I think this will work? If we have an object of class P1, we can invoke f on something of type D1, which will get ingested by f as something of class D and then statically cast back to D1 by g.

The other design that I proposed upthread is to just have an abstract class X, with subclasses X1 and X2 that take the needed input for a D and one of the P variants.

E.g. the constructor for an X1 would take a P1 and then the input required for a D1. I think this would be simpler.

In this case the X would be an Emission class as in the picture.

  • The SimpleEmission constructor would take an Eigen::VectorXi and a SimpleGermline
  • The PhyloEmission constructor would take a Phylo object and a PhyloGermline

This way the two types would be tied together nicely. Each would implement an EmissionVector with the same signature, which is enabled because the data is already in there-- all that's needed for the call is some indices.

@dunleavy005 -- can you have a think about this before Brian has to read my whole dialog above? I think this is the simpler way forward.

Re: Map - I agree, we can store the v_i vectors as we initially loop over germlines and MSA positions in the map creation stage, then we'd never have to access the map again, outside of creating the actual xMSA.

Re: xMSA Build-up - Yes, completely forgot that if we added an element (i.e. [{A, r1}, 0]) into the map, we're not guaranteed it'll be in position 0. So yeah, I agree we should just keep the ordered vector of keys, build the xMSA column-by-column from that, and then have the xMSA map available for efficient index look-up (instead of using .find() on the key vector, which is O(n)).

Re: rates - I'll have to think a little more about rates. Off the top of my head, the likelihood under gamma-distributed rates is just a linear combination "\sum_r w_r * P(data | r)", where "r" ranges over the rates from the discretized gamma approximation for a given shape parameter and "w_r" is the weight of that rate category (equal weights in the usual discretization). Per-site rates could fit in here by summing over the single "r" picked (i.e. a vector of 1 rate with weight 1, instead of a vector of 4 rates). Potentially each MSA site could have a vector of rates associated with it, and in our case we could make them vectors of 1?
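One way to read the "vector of 1 rate" idea without templates is sketched below. It assumes each rate carries a mixture weight (equal weights for the discrete gamma case, weight 1 for a fixed per-site rate). `SiteRates`, `GammaRates`, `FixedRate`, and `SiteLikelihood` are hypothetical names, and the callback stands in for whatever computes P(data | r); none of this is libpll's API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical per-site rate model: a list of rates with matching weights.
struct SiteRates {
  std::vector<double> rates;
  std::vector<double> weights;  // assumed to sum to 1
};

// Discrete gamma case: K category rates, each with weight 1/K.
SiteRates GammaRates(const std::vector<double>& category_rates) {
  const double w = 1.0 / static_cast<double>(category_rates.size());
  return SiteRates{category_rates,
                   std::vector<double>(category_rates.size(), w)};
}

// Per-site rate case: a "vector of 1" rate with weight 1.
SiteRates FixedRate(double rate) {
  return SiteRates{std::vector<double>{rate}, std::vector<double>{1.0}};
}

// Mixture likelihood for one site: sum over r of w_r * P(data | r).
// `cond_like` is a stand-in for the conditional likelihood given a rate.
double SiteLikelihood(const SiteRates& site,
                      const std::function<double(double)>& cond_like) {
  assert(site.rates.size() == site.weights.size());
  double total = 0.0;
  for (std::size_t i = 0; i < site.rates.size(); ++i) {
    total += site.weights[i] * cond_like(site.rates[i]);
  }
  return total;
}
```

Both cases go through the same `SiteLikelihood` call, so the calling code would not need to know which rate model is in use.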

Re: EmissionData Design 1 - I'm a bit hesitant to static cast, without any flags asserting it's correct. I used static cast in the GermlineGene class, but I carry a type string that ensures I'm static casting properly, and I think here it seems a bit hacky to carry a type string around. (it's kind of what I'm doing already) (Another option is dynamic casting, but that has extra overhead, and probably not worth it?)

Re: EmissionData Design 2 - I think the way you framed it, it seems we'll have a BaseEmission object per BaseGermline object. Wouldn't it be better to just let BaseEmission::EmissionVector take in a Germline object as input and only let the constructor take in "data" (i.e. Eigen::VectorXi or Phylo)?

If we assume we want one Emission object, I think the current approach and the one presented here are reciprocals of each other. For instance, I believe you're proposing something like BaseEmission::EmissionVector(BaseGermline, indices). Right now, we (essentially) have BaseGermline::EmissionVector(BaseEmission /* changed from EmissionData */, indices).

To me, it doesn't really matter, but since all the Match Matrix stuff exists within Germline, why not keep it there? I'm most likely misinterpreting what you're proposing, but I think you want an interface that only takes in indices and spits out an emission probability vector, and it seems like we can't do that unless we copy all the Germline object pointers into separate BaseEmission objects?

Re rates: you are right that mathematically we should be able to extract this. It's just not conveniently exposed by libpll.

Re: EmissionData Design 1: I think that you are right, but I'll think about it tomorrow.

Re: EmissionData Design 2: sticking with the idea of one Emission object per Germline object, don't we need something like that anyway for #31?

Yes, I believe it came up (via #32) (unique Germline <----> unique Smooshable), as we needed some means of obtaining the viterbi germline bases from viterbi_idx indices (i.e. we need Smooshables to remember which Germline they came from). But here, we'd have multiple Emission objects that'd consist of (duplicated) data pointers and the different Germline pointers.

If I'm not mistaken, the reason for putting a Germline pointer in Emission is to have something like:

Emission::EmissionVector(indices);

without having to explicitly specify Germline as an argument to have a simple interface? Was there another reason for the multiple Emission objects?

Re motivation, the main issue is not the interface as you describe in the previous comment, it's because I don't like basically re-implementing dynamic casting with a string comparison. Having to ask what type something is every time we want to call a member function is just painful.

I think a good solution is described in this blog post. Take a look at "Multiple dispatch in C++ with the visitor pattern".

The idea applied here would be that we have a pure virtual Emission class (analogous to the EmissionData class) with two subclasses, SimpleEmission and PhyloEmission, with their relevant data. Now, analogous to the implementation of the Intersect method in the example, EmissionVector is implemented in each of the subclasses of Germline so that it takes the corresponding type of Emission: SimpleGermline::EmissionVector takes a SimpleEmission, for example. Mixing types leads to a failure.
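A rough sketch of that double-dispatch arrangement, under stated assumptions: the `EmitFor` method name is hypothetical, a string tag stands in for the real emission probability vector, and "failure" is modeled as a thrown `std::logic_error`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

class SimpleGermline;
class PhyloGermline;

// Emission side: one overload per concrete Germline type. The defaults throw,
// so mixing types (e.g. PhyloEmission with SimpleGermline) fails loudly.
class Emission {
 public:
  virtual ~Emission() = default;
  virtual std::string EmitFor(const SimpleGermline&) const {
    throw std::logic_error("Emission/Germline type mismatch");
  }
  virtual std::string EmitFor(const PhyloGermline&) const {
    throw std::logic_error("Emission/Germline type mismatch");
  }
};

// Germline side: general code calls EmissionVector through the base class.
class Germline {
 public:
  virtual ~Germline() = default;
  virtual std::string EmissionVector(const Emission& emission) const = 0;
};

// Each subclass passes its own static type back to the Emission; that second
// virtual call is what selects the matching implementation.
class SimpleGermline : public Germline {
 public:
  std::string EmissionVector(const Emission& emission) const override {
    return emission.EmitFor(*this);
  }
};

class PhyloGermline : public Germline {
 public:
  std::string EmissionVector(const Emission& emission) const override {
    return emission.EmitFor(*this);
  }
};

class SimpleEmission : public Emission {
 public:
  using Emission::EmitFor;  // keep the other overload visible
  std::string EmitFor(const SimpleGermline&) const override {
    return "simple";  // placeholder for the match-matrix look-up
  }
};

class PhyloEmission : public Emission {
 public:
  using Emission::EmitFor;
  std::string EmitFor(const PhyloGermline&) const override {
    return "phylo";  // placeholder for per-site phylogenetic likelihoods
  }
};
```

No type strings or casts appear anywhere: the two virtual calls recover both concrete types, and an unsupported pairing falls through to the throwing default.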

I'll just write out the motivation for this here to work it out in my head. We have a general concept of a germline sequence, and there is some computation that happens there that is common to both a phyloHMM and a simple HMM. However, some details of the computation differ. We implement those differences as virtual methods that get defined in subclasses. In particular, the differing methods take different data, but because these methods are called from the general Germline class, we have to be able to specialize the method calls to the type of data.

I think this seems reasonable, but I'll keep thinking on it.

(Very minor point: I suggest Emission rather than BaseEmission because we give the special sub-cases their own names. Same for BaseGermline.)

Two more notes:

We can replace this with a single method, say Emission::Length:

  // Compute the length of the read (SimpleGermline) or MSA (PhyloGermline).
  int seq_size;
  if (emission_data.data_type() == "simple") {
    seq_size = emission_data.simple()->size();
  } else {
    assert(emission_data.data_type() == "phylo");
    seq_size = emission_data.phylo()->msa().cols();
  }
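A sketch of what the suggested `Emission::Length` replacement might look like. To stay self-contained it uses `std::vector<int>` for the read and a plain column count for the MSA instead of the Eigen types; those substitutions and the constructor signatures are assumptions for illustration only.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Each Emission subclass reports its own sequence length, so callers no
// longer need to compare data_type() strings.
class Emission {
 public:
  virtual ~Emission() = default;
  virtual int Length() const = 0;
};

// Simple HMM case: the length of the read.
class SimpleEmission : public Emission {
 public:
  explicit SimpleEmission(std::vector<int> read) : read_(std::move(read)) {}
  int Length() const override { return static_cast<int>(read_.size()); }

 private:
  std::vector<int> read_;
};

// Phylo HMM case: the number of MSA columns.
class PhyloEmission : public Emission {
 public:
  explicit PhyloEmission(int msa_cols) : msa_cols_(msa_cols) {}
  int Length() const override { return msa_cols_; }

 private:
  int msa_cols_;
};
```

The branch above would then collapse to a single call such as `int seq_size = emission.Length();`.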

Second, do we need a Phylo class independent of the PhyloEmission class as described?

Also, what are we going to do with NTIs?

Conclusion: just have one Germline object, which has everything from SimpleGermline and PhyloGermline data.

EmissionVector(emission_data, relpos, match_start, emission); will become emission_data.EmissionVector(relpos, match_start, emission);.

YAY!!!

Just to review (should probably go in a readme or diagram):

  • Queries house sequence information and germline alignment information
  • Emissions use sequence information and germline alignment information to create their data structure
  • Smooshables abstract away the sequence information into something that's just about probabilities.

Open question: can we eliminate Queries in favor of Emissions?

@dunleavy005 -- We should also work to define what API we want for the Joe object. The sooner we can have this, the happier Brian will be if we ask him for other fancy methods. So this should have pretty high priority.

I assigned myself to think about indels in the alignment. Seems to me

  • if there is an insertion relative to germline, then we will have gaps in the germline sequence. We should just treat these as unknowns, and marginalize out at the root. This is the same as what we are planning for the NTI region, but without the fancy HMM.
  • if there is a deletion relative to germline, then we should insert gaps into the MSA and do the usual likelihood computation (noting that we will only need one such column). So @dunleavy005 -- is there anything that would keep us from having the v_i have repeated elements?

An example of Pile updating (sanity check)

(1) {{{{{A1, B1}, C1}, D1}, E1}, F1}
(2) {{{{{A1, B1}, C1}, D2}, E2}, F2}
(3) {{{{{A1, B1}, C1}, D1}, E2}, F2}
(4) {{{{{F1, B1}, D1}, D2}, A1}, B2}

-Loop through (1)-(4), and mark all Smooshishs dirty.

(1)

-Extract A1-F1 SmooshablePtrs into a vector and update them (b/c they are dirty).
-Mark those SmooshablePtrs as clean (because they updated) and the associated ChainPtrs (i.e. {A1,B1}, {{A1,B1},C1}, etc.) as clean (because they depend only on A1-F1 being clean).

(2)

-Notice the inner Chain {{A1, B1}, C1} is clean from (1).
-Now we only extract D2-F2 SmooshablePtrs into a vector and update them.
-Mark D2-F2 as clean (b/c they updated) and the 3 outer Chains as clean (b/c {{A1, B1}, C1} is already clean).

(3)

-Notice the inner Chain {{{A1, B1}, C1}, D1} is clean from (1) AND E2,F2 are clean from (2).
-Because the 2 outer Chains are dirty, we will look for dirty SmooshablePtrs (but will fail to find any b/c E2 and F2 are clean)!
-Of course, we must mark the 2 outer Chains as clean (even though they were technically clean when we started (3))!

(4)

-None of the Chains are clean because we have never encountered them before (but all Smooshables except B2 are clean).
-We do a full traversal looking for dirty Smooshables (only finding B2).
-We update B2 as usual.
-We mark all the Chains clean (starting at the outer-most Chain, going inward), then B2 is marked clean.

DONE

:woot:

Closed by #43.