A search tool for identifying V, D, and J gene segments in Ig and TCR loci. Optionally supports identifying pseudogenes and scanning a local NCBI database.
Project on hold (1/15/2020). Remaining tasks:
- implement screening methods for necessary search parameters:
  - locus_file
  - locus_type
  - gene_type
  - pref_name
  - custom_rules
- implement support for outputting as a .fasta file
- implement support for scraping the NCBI database
This program has been developed for use on several platforms, including command-line execution, Jupyter Notebook, and our website. Below are examples of what running the search tool should look like on each platform.
Method: python search( locus_file, locus_type, gene='ALL', custom_rules=False )
Example: `python3 search( './data/input/IGH_locus.fasta', 'IGH')`
Method: search( locus_file, locus_type, gene='ALL', custom_rules=False )
Example: `python3 search( './data/input/IGH_locus.fasta', 'IGH')`
Example:
`python3 vdjfinder.py -f Loci/IGH_locus.fasta -v IMGT_human/IMGT/IGHV.fasta -d IMGT_human/IMGT/IGHD.fasta -j IMGT_human/IMGT/IGHJ.fasta`
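The flags in the command above could be parsed along the following lines. This is only a sketch: the short flags `-f`, `-v`, `-d`, and `-j` come from the example, while the long option names, the help strings, and the `build_parser` helper are assumptions made for illustration and may not match what vdjfinder.py actually does.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the example command (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Find V, D, and J gene segments in a locus."
    )
    # Short flags mirror the example; the long names are assumed, not taken from vdjfinder.py.
    parser.add_argument("-f", "--locus-file", required=True,
                        help="FASTA file containing the locus to search")
    parser.add_argument("-v", "--v-ref", help="FASTA file of reference V genes")
    parser.add_argument("-d", "--d-ref", help="FASTA file of reference D genes")
    parser.add_argument("-j", "--j-ref", help="FASTA file of reference J genes")
    return parser


# Parse the same arguments as the example command, passed explicitly as a list.
args = build_parser().parse_args([
    "-f", "Loci/IGH_locus.fasta",
    "-v", "IMGT_human/IMGT/IGHV.fasta",
    "-d", "IMGT_human/IMGT/IGHD.fasta",
    "-j", "IMGT_human/IMGT/IGHJ.fasta",
])
print(args.locus_file, args.v_ref, args.d_ref, args.j_ref)
```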
A web application is in development as of January 2020. Once it is live, we will add sample instructions.
Only one method, search(), is intended to be public to the user. The remainder of the program performs the search using a rigorously tested set of search criteria and prepares the result for the user.
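As a rough illustration, the arguments to search() could be screened before any work is done. The accepted locus and gene values are the ones this README documents for prep_database(), plus the `'ALL'` default of search(). The `screen_search_args` helper and its error messages are hypothetical and not part of the program.

```python
import os

# Values documented for prep_database(), plus the 'ALL' default of search().
VALID_LOCI = {"IGH", "IGL", "IGK", "TRA", "TRB"}
VALID_GENES = {"V", "D", "J", "ALL"}


def screen_search_args(locus_file, locus_type, gene="ALL"):
    """Hypothetical pre-check that rejects bad arguments before a search starts."""
    if locus_type not in VALID_LOCI:
        raise ValueError(
            f"locus_type must be one of {sorted(VALID_LOCI)}, got {locus_type!r}"
        )
    if gene not in VALID_GENES:
        raise ValueError(
            f"gene must be one of {sorted(VALID_GENES)}, got {gene!r}"
        )
    if not os.path.isfile(locus_file):
        raise FileNotFoundError(f"locus file not found: {locus_file}")
    return locus_file, locus_type, gene


# An unsupported locus type is rejected before the file is even checked.
try:
    screen_search_args("./data/input/IGH_locus.fasta", "IGX")
except ValueError as err:
    print(err)
```

Checking the cheap string arguments first means a typo in `locus_type` is reported even when the locus file is missing.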
- Methods
  - search( locus_file, locus_type, gene='ALL', custom_rules=False )
- Global values
  - all_ref_dbs
  - NOTE: Nonamer matches >= 5, heptamer matches >= 5, and no restriction on sum finds all J genes with no false positives
- Methods
  - prep_output( gene_file, pseudogenes=False, pref_name=False, force=False )
    - Private to v/d/j_gene_search() methods
    - Prepares the output file that stores search results
    - Parameter(s):
      - gene_file: file that is being searched; the naming convention uses this file name as its base
      - pseudogenes: Boolean value, indicates whether to alter the naming convention to include 'pseudogenes'
      - pref_name: allows out_name to be passed as an argument
      - force: Boolean value, indicates whether to force an overwrite in the event of a duplicate file name
    - Return:
      - file location+name and mode to use
  - prep_database( locus_type, gene_type )
    - Private to v/d/j_gene_search() methods
    - Prepares the local reference database as dictionaries
    - Parameter(s):
      - locus_type: 'IGH', 'IGL', 'IGK', 'TRA', or 'TRB'
      - gene_type: 'V', 'D', or 'J'
    - Return:
      - dict of gene sequences and dict of gene types
  - prep_frame( nt )
    - Private to v_gene_search() method
    - Description TBD
    - Parameter(s):
      - nt: nucleotide sequence to operate on
    - Return:
      - amino acid frame
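
The J-gene NOTE under Global values above describes match thresholds for the heptamer and nonamer of a recombination signal sequence. A minimal sketch of that kind of check follows. The consensus motifs are the commonly cited ones; whether the program compares against these exact strings is an assumption, as are the function names. The candidate motifs are assumed to be already oriented to match the consensus.

```python
# Commonly cited RSS consensus motifs; the tool's actual reference motifs are an assumption.
HEPTAMER_CONSENSUS = "CACAGTG"
NONAMER_CONSENSUS = "ACAAAAACC"


def count_matches(candidate, consensus):
    """Count positions where candidate and consensus agree (case-insensitive)."""
    if len(candidate) != len(consensus):
        raise ValueError("candidate and consensus must have the same length")
    return sum(a == b for a, b in zip(candidate.upper(), consensus.upper()))


def passes_j_thresholds(heptamer, nonamer, min_heptamer=5, min_nonamer=5, min_sum=None):
    """Apply the thresholds from the NOTE: heptamer >= 5, nonamer >= 5, no sum limit by default."""
    heptamer_hits = count_matches(heptamer, HEPTAMER_CONSENSUS)
    nonamer_hits = count_matches(nonamer, NONAMER_CONSENSUS)
    if heptamer_hits < min_heptamer or nonamer_hits < min_nonamer:
        return False
    # min_sum=None means the combined score is unrestricted, as in the NOTE.
    return min_sum is None or (heptamer_hits + nonamer_hits) >= min_sum


# A candidate with one heptamer mismatch and a perfect nonamer.
print(passes_j_thresholds("CACTGTG", "ACAAAAACC"))
```

Keeping `min_sum` optional makes it easy to compare the unrestricted setting from the NOTE against stricter combined cutoffs.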
The search method's accuracy depends on reference data drawn from public data on the National Center for Biotechnology Information (NCBI) website. The program uses a local copy of this data, which we intend to keep as up to date as possible. Please report an issue if you find that it is out of date.
Python (version 3.6)
- Use: This program is written entirely in Python and can only be run in an environment that supports Python, or in a third-party environment designed to bridge this program with another platform (i.e., a web application).
Bio (version 1.66)
- Use: The Bio package is vital to this program's ability to read and write .fasta files, as well as to the initial handling of each DNA sequence.
Pickle (version )
- Use: During the search for V, D, and J genes, the program attempts to weed out false positives (also known as "pseudogenes"). To continuously improve the accuracy of the program and support further research on this topic, the program collects the pseudogene(s) it sorts out and pickles the collection for later investigation. While the goal of this collection is to increase knowledge of immunoglobulin genes across the board, this feature can be disabled by adding 'pseudogenes=False' as an argument to the initial call to search().
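
Saving and reloading such a collection with the standard-library pickle module might look like the sketch below. The record fields, the file name, and the helper names are invented for illustration; the program's actual pickle layout is not documented here.

```python
import os
import pickle
import tempfile


def save_pseudogenes(records, path):
    """Pickle a collection of pseudogene records to path (hypothetical helper)."""
    with open(path, "wb") as handle:
        pickle.dump(records, handle)


def load_pseudogenes(path):
    """Load a previously pickled collection. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Invented example record; the real collection's fields are an assumption.
records = [{"locus": "IGH", "gene_type": "V", "sequence": "CAGGTGCAGCTG"}]

# Round-trip the collection through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pseudogenes.pickle")
    save_pseudogenes(records, path)
    print(load_pseudogenes(path) == records)
```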
If you encounter a bug while using this program, please report it to the developers. You are also welcome to contact the developers with any questions or comments about running this tool.
This tool is a collaborative effort from researchers at the San Diego Supercomputer Center (SDSC) and Vanguard University.
- Bob Sinkovits, Ph.D. Director of Scientific Computing Applications, SDSC
- Bailey Passmore, Undergraduate Student, Computational and Data Science Researcher, SDSC
- Additional names to be added
Copyright 2019 San Diego Supercomputer Center