PrecisionProDB (Precision protein database), a tool improving the proteomics performance for precision medicine.
PrecisionProDB is a Python package for proteogenomics, which can generate a customized protein database for peptide search in mass spectrometry.
- PrecisionProDB
- Description
- Installation
- Citing PrecisionProDB
- Usage Information
- Outputs
- Benchmark
- PrecisionProDB_references
- Contact Information
The major goal of PrecisionProDB is to generate personalized protein sequences for protein identification in mass spectrometry (MS). Main features:
- Supports multithreading, which improves the speed of the program. A typical customized human protein database can be generated in 15 to 20 mins using 8 threads.
- Optimized for several widely used human gene models, including:
  - GENCODE: PrecisionProDB can download the latest version from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human, as shown in https://www.gencodegenes.org/human/.
  - RefSeq: PrecisionProDB can download the latest version from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/current
  - Ensembl: PrecisionProDB can download the latest version from:
    - ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
    - ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/
    - ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/pep/
  - UniProt: PrecisionProDB can download the latest version from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/
    - The files are UP000005640/UP000005640_9606.fasta.gz and UP000005640/UP000005640_9606_additional.fasta.gz, which may change in the future.
- The non-standard codons and rare amino acids (e.g. Selenocysteine (Sec or U)) in the human genome can be properly incorporated.
- Internal stops (*) in proteins are preserved.
- Supports variant file in text or VCF format.
- All input files can be in compressed gzip (.gz) format.
- Supports user generated gene models in GTF/GFF format. Species other than human are also supported.
  - For user-generated GTF files, protein annotations generated by TransDecoder were tested.
  - We provide an example of running TransDecoder with example files.
The figure below shows how PrecisionProDB works:
PrecisionProDB is tested under the base environment of Anaconda. It requires Python3, Biopython and Pandas.
If Anaconda is installed, only Biopython needs to be installed:
conda install -c anaconda biopython
Otherwise, it is recommended to use conda to manage packages and virtual environments. Install the required packages:
conda install numpy
conda install pandas
conda install -c anaconda biopython
If conda is not installed, pip (or pip3, as Python3 is required) can be used. pip is already installed if you are using Python3 >=3.4 downloaded from python.org.
pip3 install numpy
pip3 install pandas
pip3 install biopython
If the user has no root privilege on the system, the packages can be installed for the current user only with the "--user" option, which takes no argument:
pip3 install numpy --user
pip3 install pandas --user
pip3 install biopython --user
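After installation, a short check can confirm that the required packages are visible to the Python interpreter that will run PrecisionProDB. The following is a minimal sketch, not part of the package; the helper name `missing_packages` is hypothetical, and the only facts it relies on are the standard import names (`numpy`, `pandas`, and `Bio` for Biopython).

```python
import importlib.util
import sys

# Import names of the packages listed above; Biopython is imported as "Bio".
REQUIRED = ["numpy", "pandas", "Bio"]


def missing_packages(names=REQUIRED):
    """Return the entries of `names` that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3,):
        print("Python 3 is required")
    missing = missing_packages()
    print("missing packages:", ", ".join(missing) if missing else "none")
```

Running the script with the same interpreter used for PrecisionProDB avoids confusion when several Python installations exist on the system.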
To install the latest developments:
git clone https://github.com/ATPs/PrecisionProDB.git
To install other versions, download them from the release page directly.
Xiaolong Cao, Jinchuan Xing, PrecisionProDB: improving the proteomics performance for precision medicine, Bioinformatics, Volume 37, Issue 19, October 2021, Pages 3361–3363, https://doi.org/10.1093/bioinformatics/btab218
Note: python in the example scripts below refers to Python3. If you are unsure about the version of your python, use python --version to show the version. On some systems you might need to use python3 to specify Python3, or use the full path of the Python executable (e.g., /home/xcao/p/anaconda3/bin/python3.7), if multiple versions of Python exist on the system or Python is not in the system PATH.
We suppose that in most cases, users will have a variant file in VCF format. If there is only one sample in the VCF file, the simplest command will be:
python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -D GENCODE -o Prefix_of_output
- -m Name_of_variant_file defines the input variant file (include the full path if the input file is not in the current folder). If the variant file name ends with '.vcf' (case ignored), it will be treated as a VCF file. In all other cases the file will be treated as a TSV file. Files ending with '.gz' (e.g., '.vcf.gz' or '.tsv.gz') will be treated as gzip compressed files.
- -D GENCODE defines the annotation reference to be used. In this example, personalized protein sequences based on the GENCODE annotation will be generated. PrecisionProDB will download the required GENCODE files automatically. To use gene models from other supported resources, GENCODE can be changed to RefSeq, Ensembl or Uniprot.
- -o Prefix_of_output defines the prefix of the output filenames.
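The file-type rule described for -m can be written down as a small function. This is an illustrative sketch of the naming rule as stated in this README, not the code PrecisionProDB itself uses; the function name `classify_variant_file` is hypothetical.

```python
def classify_variant_file(filename):
    """Return (file_format, is_gzipped) following the naming rule above.

    A trailing ".gz" marks a gzip compressed file; after removing it, a name
    ending in ".vcf" (case ignored) is a VCF file, anything else is TSV.
    """
    name = filename.lower()
    is_gzipped = name.endswith(".gz")
    if is_gzipped:
        name = name[: -len(".gz")]
    file_format = "vcf" if name.endswith(".vcf") else "tsv"
    return file_format, is_gzipped
```

For instance, a file named with a ".txt.gz" extension is classified as a gzip compressed TSV file, so it must contain the header row described in the TSV section.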
If there are multiple samples in the VCF file, the -s option should be used to specify the sample name to be used in the VCF file.
python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -D GENCODE -o Prefix_of_output -s Sample_name
If there is a local version of gene annotation files from Ensembl, the command will be:
python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -o Prefix_of_output -s Sample_name -g Ensembl_Genome -p Ensembl_protein -f Ensembl_gtf -a Ensembl_GTF
Ensembl_Genome, Ensembl_protein, and Ensembl_gtf are the locations of the Ensembl genome, protein, and GTF files, respectively. These files can be downloaded from the Ensembl website as mentioned previously, or with the downloadHuman module in the package.
python Path_of_PrecisionProDB/src/downloadHuman.py -d Ensembl -o Output_folder
Output_folder is the path of output folder to store the downloaded files.
If the variant file is in the tab-separated values (TSV) format,
- it needs to include a header row, with at least four columns: chr, pos, ref, alt. There is no requirement for the order of these columns, as pandas is used to parse the file.
- additional columns are allowed, but will be ignored.
- the chr, pos, ref and alt columns are coded as in the VCF format. This means that a deletion should be written as "chr1 10146 AC A" rather than "chr1 10147 C .". Also, pos is 1-based as in a VCF file, not 0-based as in a BED file.
The simplest text file looks like:
chr	pos	ref	alt
1	10146	AC	A
1	15274	A	G
1	28563	A	G
1	49298	T	C
1	52238	T	G
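The deletion rule above can be illustrated with a short conversion. This is a sketch under stated assumptions: the function name `deletion_to_vcf_style` is hypothetical, and the caller is assumed to look up the reference base immediately before the deletion in the genome, since VCF-style records anchor a deletion on that base.

```python
def deletion_to_vcf_style(chrom, pos, deleted, preceding_base):
    """Rewrite a deletion as a VCF-style (chr, pos, ref, alt) record.

    `pos` is the 1-based position of the first deleted base, `deleted` holds
    the deleted bases, and `preceding_base` is the reference base at pos - 1
    (an input the caller must supply from the reference genome).
    """
    if pos < 2:
        # A deletion at the first base has no preceding base to anchor on.
        raise ValueError("pos must be at least 2 to anchor on the preceding base")
    return chrom, pos - 1, preceding_base + deleted, preceding_base
```

With the example from the list above, deleting C at chr1:10147 with A as the preceding reference base gives chr1, 10146, AC, A.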
python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -D GENCODE -o Prefix_of_output
- Name_of_variant_file is the name of the variant file. If the variant file name ends with '.vcf' (case ignored), it will be treated as a VCF file, as described above. In all other cases the file will be treated as a TSV file. Files ending with '.gz' (e.g., '.vcf.gz' or '.tsv.gz') will be treated as gzip compressed files.
- For text file format input, the -s option will be ignored as there is only one sample.
- Here, -D is set to GENCODE. GENCODE related files will be downloaded.
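Before a long run, it can help to check that a TSV variant file meets the requirements listed above. The sketch below uses only the standard library `csv` and `gzip` modules rather than pandas, so it is not how PrecisionProDB parses the file. The function names are hypothetical, and the check that ref and alt are non-empty is an assumption derived from the VCF-style coding.

```python
import csv
import gzip

REQUIRED_COLUMNS = ("chr", "pos", "ref", "alt")


def check_variant_rows(lines):
    """Validate TSV variant lines and return the number of variant rows."""
    reader = csv.DictReader(lines, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError("missing required column(s): " + ", ".join(missing))
    count = 0
    for row in reader:
        pos = row["pos"] or ""
        if not pos.isdecimal() or int(pos) < 1:
            raise ValueError(f"row {count + 1}: pos must be a 1-based integer, got {pos!r}")
        # Assumption: VCF-style coding implies ref and alt are never empty.
        if not row["ref"] or not row["alt"]:
            raise ValueError(f"row {count + 1}: ref and alt must not be empty")
        count += 1
    return count


def check_variant_file(path):
    """Open a plain or gzip compressed TSV file and validate its rows."""
    opener = gzip.open if path.lower().endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return check_variant_rows(handle)
```

Because the header is looked up by column name, the column order does not matter and additional columns are simply ignored, matching the behavior described above.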
We tested GTF annotations generated by TransDecoder.
Run TransDecoder in its "starting from a genome-based transcript structure GTF file" mode, then pass the resulting files to PrecisionProDB:
python Path_of_PrecisionProDB/src/PrecisionProDB.py -m Name_of_variant_file -o Prefix_of_output -s Sample_name -g TransDecoder_Genome -p TransDecoder_protein -f TransDecoder_gtf -a gtf
cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m celline.vcf.gz -g GENCODE.genome.fa.gz -p GENCODE.protein.fa.gz -f GENCODE.gtf.gz -o vcf_variant
Five files will be generated in the examples folder.
- vcf_variant.pergeno.aa_mutations.csv: annotations of amino acid changes.
- vcf_variant.pergeno.protein_all.fa: all proteins after incorporating the variants.
- vcf_variant.pergeno.protein_changed.fa: all proteins that differ from the input protein sequences after incorporating the variants.
- vcf_variant.vcf2mutation_1.tsv: variant file in text format extracted from the VCF file, containing the first alternative alleles.
- vcf_variant.vcf2mutation_2.tsv: variant file in text format extracted from the VCF file, containing the second alternative alleles.
Note:
- For altered proteins, __1, __2, or __12 will be added to the ID of the protein. __1 and __2 mean that the alleles of the protein are from the first and the second variant file, respectively. __12 means that the altered protein sequences are the same for the first and the second alleles.
  - e.g.,
    - >ENSP00000308367.7|ENST00000312413.10|ENSG00000011021.23|OTTHUMG00000002299|-|CLCN6-201|CLCN6|847__12 changed
    - ENSP00000263934.6|ENST00000263934.10|ENSG00000054523.18|OTTHUMG00000001817|OTTHUMT00000005103.1|KIF1B-201|KIF1B|1770__2 changed
    - ENSP00000332771.4|ENST00000331433.5|ENSG00000186510.12|OTTHUMG00000009529|OTTHUMT00000026326.1|CLCNKA-201|CLCNKA|687__1 changed
    - ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.6|OTTHUMG00000001094|OTTHUMT00000003223.1|OR4F5-202|OR4F5|326 unchanged
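The ID suffixes described in the note can be separated from the original protein ID with a few lines of Python. This is a minimal sketch based only on the header examples above; the function name `parse_header` is hypothetical, and it assumes the protein ID is the first whitespace-delimited token and the status word is the last one.

```python
import re

# "__12" must be tried before "__1" so the longer suffix wins.
ALLELE_SUFFIX = re.compile(r"__(12|1|2)$")


def parse_header(header):
    """Return (original_id, alleles, status) for one fasta header line.

    `alleles` is (1,), (2,), (1, 2), or () when no suffix is present;
    `status` is "changed", "unchanged", or None if neither word is found.
    """
    fields = header.lstrip(">").split()
    protein_id = fields[0]
    status = fields[-1] if fields[-1] in ("changed", "unchanged") else None
    match = ALLELE_SUFFIX.search(protein_id)
    if match is None:
        return protein_id, (), status
    alleles = tuple(int(digit) for digit in match.group(1))
    return protein_id[: match.start()], alleles, status
```

Grouping records by the returned original ID makes it easy to see which proteins carry different sequences on the two alleles and which carry the same altered sequence on both.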
- The variant file looks like
  chr	pos	ref	alt
  chr1	52238	T	G
  chr1	53138	TAA	T
  chr1	55249	C	CTATGG
  chr1	55299	C	T
  chr1	61442	A	G
cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m gnomAD.variant.txt.gz -g GENCODE.genome.fa.gz -p GENCODE.protein.fa.gz -f GENCODE.gtf.gz -o text_variant
Three files will be generated in the examples folder.
- text_variant.pergeno.aa_mutations.csv: annotations of amino acid changes.
- text_variant.pergeno.protein_all.fa: all proteins after incorporating the variants.
- text_variant.pergeno.protein_changed.fa: all proteins that differ from the input protein sequences after incorporating the variants.
Note:
- Protein names and descriptions in the fasta file are the same as in the input protein file, with a Tab symbol (\t) followed by changed or unchanged appended to indicate whether the protein sequence is altered.
  - e.g.,
    - ENSP00000328207.6|ENST00000328596.10|ENSG00000186891.14|OTTHUMG00000001414|OTTHUMT00000004085.1|TNFRSF18-201|TNFRSF18|255 unchanged
    - ENSP00000424920.1|ENST00000502739.5|ENSG00000162458.13|OTTHUMG00000003079|OTTHUMT00000368044.1|FBLIM1-210|FBLIM1|144 changed
There are several files in the src folder. Each of them is designed so that it can be run independently. To get help, run
python Path_of_PrecisionProDB/src/module_name.py -h
To get help for the main program, run
python Path_of_PrecisionProDB/src/PrecisionProDB.py -h
The following messages will be printed out.
usage: PrecisionProDB.py [-h] [-g GENOME] [-f GTF] -m MUTATIONS [-p PROTEIN] [-t THREADS] [-o OUT] [-a {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}]
[-k PROTEIN_KEYWORD] [-F] [-s SAMPLE] [-A] [-D {GENCODE,RefSeq,Ensembl,Uniprot,}] [-U UNIPROT] [--uniprot_min_len UNIPROT_MIN_LEN]
PrecisionProDB, a personal proteogenomic tool which outputs a new reference protein based on the variants data.
A VCF or a tsv file can be used as the variant input.
If the variant file is in tsv format, at least four columns are required in the
header row: chr, pos, ref, alt. Additional columns will be ignored. Convert the file to proper format if you have a bed file or other types of variant file.
optional arguments:
-h, --help show this help message and exit
-g GENOME, --genome GENOME
the reference genome sequence in fasta format. It can be a gzip file
-f GTF, --gtf GTF gtf file with CDS and exon annotations. It can be a gzip file
-m MUTATIONS, --mutations MUTATIONS
a file stores the variants. If the file ends with ".vcf" or ".vcf.gz", treat as vcf input. Otherwise, treat as TSV input
-p PROTEIN, --protein PROTEIN
protein sequences in fasta format. It can be a gzip file. Only proteins in this file will be checked
-t THREADS, --threads THREADS
number of threads/CPUs to run the program. default, use all CPUs available
-o OUT, --out OUT output prefix, folder path could be included. Three or five files will be saved depending on the variant file format. Outputs include the
annotation for mutated transcripts, the mutated or all protein sequences, two variant files from vcf. {out}.pergeno.aa_mutations.csv,
{out}.pergeno.protein_all.fa, {out}.protein_changed.fa, {out}.vcf2mutation_1/2.tsv. default "perGeno"
-a {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}, --datatype {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}
input datatype, could be GENCODE_GTF, GENCODE_GFF3, RefSeq, Ensembl_GTF or gtf. default "gtf". Ensembl_GFF3 is not supported.
-k PROTEIN_KEYWORD, --protein_keyword PROTEIN_KEYWORD
field name in attribute column of gtf file to determine ids for proteins. default "auto", determine the protein_keyword based on datatype.
"transcript_id" for GENCODE_GTF, "protein_id" for "RefSeq" and "Parent" for gtf and GENCODE_GFF3
-F, --no_filter default only keep variant with value "PASS" FILTER column of vcf file. if set, do not filter
-s SAMPLE, --sample SAMPLE
sample name in the vcf to extract the variant information. default: None, extract the first sample
-A, --all_chromosomes
default keep variant in chromosomes and ignore those in short fragments of the genome. if set, use all chromosomes including fragments when
parsing the vcf file
-D {GENCODE,RefSeq,Ensembl,Uniprot,}, --download {GENCODE,RefSeq,Ensembl,Uniprot,}
download could be 'GENCODE','RefSeq','Ensembl','Uniprot'. If set, PrecisonProDB will try to download genome, gtf and protein files from the
Internet. Download will be skipped if "--genome, --gtf, --protein, (--uniprot)" were all set. Settings from "--genome, --gtf, --protein,
(--uniprot), --datatype" will not be used if the files were downloaded by PrecisonProDB. default "".
-U UNIPROT, --uniprot UNIPROT
uniprot protein sequences. If more than one file, use "," to join the files. default "". For example, "UP000005640_9606.fasta.gz", or
"UP000005640_9606.fasta.gz,UP000005640_9606_additional.fasta"
--uniprot_min_len UNIPROT_MIN_LEN
minimum length required when matching uniprot sequences to proteins annotated in the genome. default 20
--PEFF If set, PEFF format file(s) will be generated. Default: do not generate PEFF file(s).
Notes
- -p PROTEIN, --protein PROTEIN is a file with proteins matching the GTF file provided!
- -k PROTEIN_KEYWORD is a keyword used to match the GTF file and the protein sequences. If not provided, the program will try to determine the keyword based on the datatype. The program needs the data to know the location of proteins in the genome, and codon matches to allow non-standard codons.
- -a {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf}, --datatype {GENCODE_GTF,GENCODE_GFF3,RefSeq,Ensembl_GTF,gtf} should be set if you use one of the formats above. For "gtf" format, the PROTEIN_KEYWORD and PROTEIN should match.
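When PrecisionProDB is called from another Python script, the command line can be assembled as an argument list. The sketch below is one possible wrapper, not part of the package. It uses only flags that appear in the help message above, and the function name `build_command` and its parameter names are hypothetical.

```python
import sys


def build_command(script, mutations, out, download=None, genome=None,
                  gtf=None, protein=None, datatype=None, sample=None,
                  threads=None):
    """Assemble an argument list for PrecisionProDB.py.

    `script` is the path to src/PrecisionProDB.py. Options left as None are
    omitted so that the program's own defaults apply.
    """
    command = [sys.executable, script, "-m", mutations, "-o", out]
    optional = [
        ("-D", download),   # GENCODE, RefSeq, Ensembl or Uniprot
        ("-g", genome),     # reference genome in fasta format
        ("-f", gtf),        # gtf file with CDS and exon annotations
        ("-p", protein),    # protein sequences matching the gtf file
        ("-a", datatype),   # e.g. Ensembl_GTF or gtf
        ("-s", sample),     # sample name in a multi-sample VCF
        ("-t", threads),    # number of threads
    ]
    for flag, value in optional:
        if value is not None:
            command += [flag, str(value)]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Using `sys.executable` keeps the call on the same Python3 interpreter as the calling script.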
For more information, visit the wiki page. https://github.com/ATPs/PrecisionProDB/wiki
The number of altered proteins will be shown while PrecisionProDB is running. In "PREFIX.pergeno.protein_all.fa", each fasta header ends with the word "changed" or "unchanged", and users may count the number of changed proteins based on this annotation.
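Counting based on this annotation takes only a few lines. The following is a minimal sketch, assuming each header line ends with "changed" or "unchanged" as described; the function name `count_status` is hypothetical.

```python
from collections import Counter


def count_status(lines):
    """Count fasta headers by their trailing "changed"/"unchanged" word.

    `lines` is any iterable of text lines, such as an open file handle.
    Headers ending in neither word are counted under "unknown".
    """
    counts = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line
        fields = line.split()
        last = fields[-1]
        counts[last if last in ("changed", "unchanged") else "unknown"] += 1
    return counts
```

A typical call is `count_status(open("PREFIX.pergeno.protein_all.fa"))`, where PREFIX is the output prefix given with -o.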
Generally, users can find annotations for variants in the "PREFIX.pergeno.aa_mutations.csv" file. It reports the effects of different variants, including AA substitutions, insertions, deletions, stop-loss, stop-gain, and frame-changes.
Users may use tools like https://github.com/pwilmart/fasta_utilities to further compare the difference of trypsin digested peptides.
Tested with a computing node with Intel Xeon CPU E5-2695 v4 @ 2.10GHz and 256GB memory, with GENCODE gene models and a variant file in text format from gnomAD 3.1 as input.
- Depending on the available resources, a thread count of 8 to 12 is recommended.
- If the variant file is in text format, typical running time will be 15 to 20 minutes.
- If the variant file is in VCF format, typical running time will be 30 to 40 minutes.
The Genome Aggregation Database (gnomAD) project provides variant allele frequencies in different populations based on genomes and exomes of hundreds of thousands of individuals, and this information can be integrated into a protein database. We applied PrecisionProDB to alleles from different populations from gnomAD 3.1 data. Results can be found at https://github.com/ATPs/PrecisionProDB_references.
Please leave comments on the issue tab.