santiagosnchez / STRlength2repeat

Python code to convert STR length data to number of repetitions

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

STRlength2repeat

This program is meant to convert Short Tandem Repeat (STR) length data to number of tandem repetitions. This is specially useful for applying models of microsatellite evolution (Sainudiin et al. 2004) to infer divergence times or demographic models.

The CSV output can be uploaded directly to BEAST v2.6.2 by installing the BEASTvntr package.

Dependencies

Code is written to be functional under Python 2 and 3. However, I suggest using Python 3 to avoid future issues. Other than that, there are no other dependencies.

Installation

wget https://raw.githubusercontent.com/santiagosnchez/STRlength2repeat/master/STRlength2repeat.py

Running the code

python3 STRlength2repeat.py input_file.csv

Input file

The program expects a CSV file with STR data in standard length format. For example:

indiv1,124,124,345,347,233,239
indiv2,122,124,345,345,230,239
...

Here, both alleles for a single locus are contiguous. This means that if there are n STR loci then there will be (n * 2) + 1 columns.

Tandem size

The program includes a function to infer the the size of the tandem, which is based on the difference between minimum allele (A_i - A_min) % m, where m is the size of the tandem. In this case, m can go from 2 to 10.

motif = range(2,11)
mprop = []
for m in motif:
    y = [ (x - min(locus)) % m for x in locus if x != 0 ]
    mprop.append(round(float(sum([ x == 0 for x in y ]))/len(y),2))
shortest = min([ x for x in locus if x != 0 ])
bprop = max(mprop)
besti = [ i for i in range(len(mprop)) if mprop[i] == bprop ]
bestm = max([ motif[i] for i in besti])
return(bestm,shortest,mprop)

A predefined tandem size should be easy to implement by providing it in CSV format. That will be done in future versions.

Function for repeat number

The repeat number function is wrapped around get_repeats() and follows this logic: for i alleles A within locus j we take the difference between the A_ij and the minimum/shortest allele A_minj and divide it by the tandem size m and adding 1:

(A_ij - A_minj) / m_j) + 1

References

Sainudiin R, Durrett RT, Aquadro CF, Nielsen R. Microsatellite mutation models: insights from a comparison of humans and chimpanzees. Genetics. 2004;168(1):383‐395. doi:10.1534/genetics.103.022665 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1448085/

About

Python code to convert STR length data to number of repetitions

License:GNU General Public License v3.0


Languages

Language:Python 100.0%