MurrellGroup / NextGenSeqUtils.jl

NextGenSeqUtils:

Home Page:https://murrellgroup.github.io/NextGenSeqUtils.jl/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

[Feature Request]: NextGenSeqUtils.deMux(): Seperate error by gene specific region of primer and barcode region of primer

dnk8n opened this issue · comments

Terminology:

Since I am new to bioinformatics, I am going to go over the terminology as I understand it.

example (dummy data), forward primer:

GATAGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
*****------------------------------

where * is the barcode (differentiates samples)
where - is the gene-specific region (selects which region gets amplified)

Currently error is calculated over the entire primer (barcode + gene-specific region).

It would be nice to be able to have the option to enter forward/reverse primers seperated into regions, e.g. instead of:

forwardPrimers = [
    "GATAGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT",
    "TCTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT",
    "CATCATTTTTTTTTTTTTTTTTTTTTTTTTTTTTT",
    "TGCATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"
]
reversePrimers = [
    "CTGTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "CTGTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "CTGTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "CTGTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
]

we could pass in:

forwardPrimers = [
    ["GATAG", "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"],
    ["TCTAC", "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"],
    ["CATCA", "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"],
    ["TGCAT", "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"]
]

reversePrimers = [
    ["CTGTC", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"],
    ["CTGTC", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"],
    ["CTGTC", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"],
    ["CTGTC", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"]
]

forwardNames = ["1", "3", "4", "5"]

reverseNames = ["A", "A", "A", "A"]

sequences = [
    "TGCATTTTTTTTTTTTTTTTGGGGGGTTTTTTTTTAAAAATTGGGGGGTTTTTTTTTTTTTTTTTTTTTTGACAG",
    "CTGTCAAAAAAAAAAAAAAAAAAAAAACCCCCCAATTTTTAAAAAAAAACCCCCCAAAAAAAAAAAAAAAATGCA"
]

and get back an output like this:

[
    ["5_A", "TGCATTTTTTTTTTTTTTTTGGGGGGTTTTTTTTTAAAAATTGGGGGGTTTTTTTTTTTTTTTTTTTTTTGACAG", false, [0, -6.0]],
    ["5_A", "TGCATTTTTTTTTTTTTTTTGGGGGGTTTTTTTTTAAAAATTGGGGGGTTTTTTTTTTTTTTTTTTTTTTGACAG", true, [0, -6.0]]
]

This way we can accept a higher error in the gene-specific region, but can still demultiplex according to the barcode error (with a lower tolerance for error).

One step further, it might be nice to just return all error and let the user make decisions based on it, e.g.:

[
    <forward barcode error>,
    <forward gene-specific region error>,
    <reverse barcode error>,
    <reverse gene-specific region error>
]

As I said, pretty new to this. Something I think would be useful. Happy to have a conversation and be put onto alternate solutions.

I am happy to dig into some Julia and work on this if it is decided that this would be a useful feature.

I agree that allowing different thresholds for barcode vs rest-of-primer is useful, because it lets you tolerate errors in the long barcode region. Note that how I currently work around this issue is I don't include all of the non-barcode part of the primer - I include just enough bases to distinguish FWD and REV, which also lowers the computational cost of the demux step, because it requires shorter alignments.

There are other benefits of matching the entire primer though. The main one is that we often want to trim off the primer itself, which is something I haven't done robustly. I'd recommend considering this together with the multiple-region primer match score, because you need info from the primer match to get the trimming right.

Please jump in. You'll likely want to build it around the IUPAC_nuc_nw_align in demux.jl. - you could do a single alignment, and then compute the scores for the different bits based on that alignment. You'll have to make an arbitrary decision about where to place an insertion that happens between the separated segments of the primer.