mcveanlab / treeseq-inference

Work for the tree sequence inference paper.

Save variant positions in .npy file

hyanwong opened this issue

We currently store the variant matrix S (after errors have been added) using np.save(). But this loses the variant positions, which are needed e.g. when doing argweaver inference. I don't want to have to read these in again from the hdf5 file when processing the data (and anyway, they might have changed from the original file, due to integer rounding).

So I think that the variant matrix S should be stored with positions as column (or row) names. But I don't know how to do this efficiently in numpy. I've had a look at https://docs.scipy.org/doc/numpy/reference/generated/numpy.recarray.html but it all seems opaque to me. For example, it seems to be set up to expect different storage types for different "named columns", whereas we definitely just want a single (boolean, as it happens) matrix, with one axis labelled by (integer or float) positions. I wonder if @jeromekelleher knows the best way to do this in numpy, and store the result?
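For reference, one way to keep a plain boolean matrix and its axis labels together without a structured array is to bundle two ordinary arrays in a single .npz archive with np.savez. This is only a minimal sketch: the names S, positions and variants.npz are illustrative rather than taken from the repository, and it assumes sites are stored on the columns.

```python
import os
import tempfile

import numpy as np

# Toy variant matrix: 4 samples (rows) x 3 sites (columns), boolean.
# Assumption: sites are on the columns; swap the axis if they are on rows.
S = np.array(
    [[0, 1, 0],
     [1, 1, 0],
     [0, 0, 1],
     [1, 0, 1]],
    dtype=bool,
)
# One position per site (integers here; floats would work the same way).
positions = np.array([12, 57, 103])

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "variants.npz")

# np.savez stores each keyword argument as a separately named array.
np.savez(path, S=S, positions=positions)

# np.load on an .npz file returns a dict-like object keyed by those names.
with np.load(path) as data:
    S_loaded = data["S"]
    positions_loaded = data["positions"]
```

The matrix keeps its single boolean dtype and the positions travel with it in the same file, at the cost of using .npz rather than a bare .npy.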

I've added a discretisation step for coordinates in the last push, which should get rid of the ambiguity. All coordinates from end-to-end should now be in the same space.

It would seem easier to me to read the coordinates from the tree sequence file rather than store them all over again in some other intermediate format. Is there a reason for not doing it this way?

Thanks. I would prefer to keep the files usable independently. For example, in the generate step we take the variant matrices (in .sites, .npy, and .hap format) and convert them to inferred ARGs, for which we need the variant positions. The .sites and .hap format contain the positions, but the .npy one doesn't. If the positions were stored in the .npy file, then the generate step could be run without having to touch the .hdf5 file at all, which seems much cleaner to me.

Also, I foresee a time when we might add mutations to the variant matrix that occur at sites which are not present in the TreeSequence (for example, extra non-tree errors). So retaining a dependency of the .npy file on a .hdf5 file seems fragile.

OK, well the simplest thing to do is to just save an extra file with the positions in it then. I wouldn't bother trying to create a structured array or whatever.
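The extra-file approach could look roughly like the following. It is a sketch under stated assumptions: the helpers save_variants and load_variants and the prefix-based file naming are hypothetical, and it assumes one position per column of the matrix.

```python
import os
import tempfile

import numpy as np


def save_variants(prefix, S, positions):
    """Save the variant matrix and its site positions as two .npy files.

    Hypothetical helper: writes <prefix>.npy and <prefix>.pos.npy.
    Assumes sites are on the columns of S.
    """
    S = np.asarray(S)
    positions = np.asarray(positions)
    if S.ndim != 2 or positions.ndim != 1:
        raise ValueError("expected a 2D matrix and a 1D position array")
    if S.shape[1] != positions.shape[0]:
        raise ValueError("need exactly one position per column of S")
    np.save(prefix + ".npy", S)
    np.save(prefix + ".pos.npy", positions)


def load_variants(prefix):
    """Load the matrix and positions written by save_variants."""
    S = np.load(prefix + ".npy")
    positions = np.load(prefix + ".pos.npy")
    if S.shape[1] != positions.shape[0]:
        raise ValueError("matrix and position files are out of sync")
    return S, positions


# Usage with toy data in a temporary directory.
prefix = os.path.join(tempfile.mkdtemp(), "example")
S = np.array([[0, 1, 0], [1, 1, 0]], dtype=bool)
positions = np.array([12.0, 57.0, 103.0])
save_variants(prefix, S, positions)
S2, positions2 = load_variants(prefix)
```

Checking the shapes on both save and load is a cheap guard against the two files drifting apart, which is the main risk of splitting the data this way.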

Right. Shall I code this?

Done, saved as XXX.pos.npy vs XXX.npy for the variants. @jeromekelleher how do I inject these positions back into the ts_inferred tree sequence object that is created in run_tzinf? Is this a problem if treeSeq objects are meant to be immutable?

Good question. It's simplest to update tsinf to read in the positions also and keep this state. We'll need it at some point anyway. I've opened a new issue for this.