Myxogast will be a network aligner based on the classic dynamic programming algorithms, and using a novel JSON reference sequence format supporting:
- wildcards ('N' and '*')
- branch points
- HMM-style probabilities
For example, this is a valid graph-JSON reference:
[{"id": "domain_1",
"seq": "*ATGCATGC"},
{"id": "domain_2",
"branch": [{"id": "A",
"seq": "GGCGGC"},
{"id": "B",
"seq": "TTATAG"}]
},
{"id": "domain_3",
"dist": [{"A": 0.5, "T": 0.5}, {"A": 0.1, "T": 0.1, "G": 0.8}]},
"TATTATA",
{"__META__": {"version": 0.1,
"gap-penalty": -1}}]
Broadly, there are 3 node types: sequence fragments, branches, and probability distributions. The simplest valid graph-JSON reference sequence is a simple JSON string, eg "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTG"
, and represents a "fragment" node without an id. Functionally, this is equivalent to {"id": "BA000007.2", "seq": "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTG"}
.
Currently, basic Needleman-Wunsch alignment works, basic graphs are parsed correctly, and some basic form of graph alignment seems to work.
Reference sequence graphs are represented like so:
enum SeqNode {
Nil,
Frag {
id: u32,
val: Sequence,
llocal: bool,
rlocal: bool,
next: Rc<SeqNode>
},
//Dist { id: u32, scores: ProbMatr },
Branch {
id: u32,
llocal: bool,
rlocal: bool,
members: Vec<SeqNode>
}
}
struct SeqGraph {
tree: SeqNode,
names: BTreeMap<u32, String>
}
Nodes are identified by a unique u32 integer. Names from the "id" field in the graph-JSON input are stored in a BTreeMap, and not with the nodes, because (1) names are optional, and (2) not guaranteed to be unique.
- set Frag llocal/rlocal fields based on splats
- add PARAMS block to JSON format
- enforce short-circuiting of alignment (eg, max 3 indels)
- HMM/Viterbi alignmed
- modify Sequence to bit-pack bases (ATGCN- -> 3 bits, or 10 bases per u32) (?? is this worth the effort?)