LashaBuxo/Genome-Tools

Genome Working Tools

Python project is a multi-functional BioInformatics tool, which simplifies calculations on genome annotations provided by Ensembl or RefSeq (NCBI).

Instruction to use

The project requires desired annotations (GFF format), sequence files (FASTA format) or other required files for calculations to be imported into the project and declared in worker_genome_values.py file.

Project supports annotations and sequence files format provided from RefSeq (NCBI) or Ensembl. In addition to adding appropriate files, specific organism name must be declared in worker_genome_values.py as well, for which files are added.

Note: NCBI .gff files lacks UTR 5' and UTR 3' features. In order to add these features run script add_ustrs_to_gff.py after that use any converter tool to change the text format to the ANSII Unicode. add_ustrs_to_gff.py is written by "David Managadze" (works at NCBI) and this script was suggested for use in NCBI's readme files.

Genome Load

from worker_genome import *

genome = GenomeWorker(SPECIES.Homo_sapiens,  # For which organism genome must be loaded?
                      ANNOTATIONS.ENSEMBL,  # Which annotation to use?
                      ANNOTATION_LOAD.GENES_AND_TRANSCRIPTS,  # What type of features are required from annotation?
                      SEQUENCE_LOAD.NOT_LOAD)  # Should sequence also be loaded?

Sequence Analyzer Usage

from worker_analyzer import *

analyzer = AnalyzerData()

analyzer.analyze_sequence_stats("CCAGCAGCAG",  # Nucleotide Sequence
                                1)  # ORF for sequence
print(analyzer.analyzed_peptide)  # Output: 'QQQ' 
analyzer.analyze_sequence_stats("ATGATG", 0)
print(analyzer.analyzed_peptide == 'QQQMM')  # Output: 'QQQMM'
print(analyzer.get_gc_content())  # Output: 0.64285714285

Example Usage

Calculate protein-coding genes greater than 10000 (base pair) length:

total_genes = 0
genes_greater_1000000 = 0

for chr_id in range(1, genome.chromosomes_count() + 1):
    genes_cnt = genome.genes_count_on_chr(chr_id)
    total_genes += genes_cnt
    for i in range(0, genes_cnt):
        gene = genome.gene_by_indexes(chr_id, i)
        genes_greater_1000000 += 1 if gene.end - gene.start + 1 > 10000 else 0
print(f'{genes_greater_1000000} genes from {total_genes}')

14227 genes from 19176

How many RNA transcript starts with ATGGGG in human?

genome = GenomeWorker(SPECIES.Homo_sapiens, ANNOTATIONS.ENSEMBL,
                      ANNOTATION_LOAD.GENES_AND_TRANSCRIPTS_AND_CDS, SEQUENCE_LOAD.LOAD)

count = 0
for chr_id in range(1, genome.chromosomes_count() + 1):
    genes_cnt = genome.genes_count_on_chr(chr_id)
    for i in range(0, genes_cnt):
        gene = genome.gene_by_indexes(chr_id, i)
        transcript = genome.get_transcript_from_gene_by_criteria(gene.id, criteria=TRANSCRIPT_CRITERIA.LONGEST_CDS,
                                                                 tie_breaker_criteria=TRANSCRIPT_CRITERIA.RANDOM)
        seq = genome.retrieve_feature_sequence(chr_id, transcript)
        count += 1 if seq.startswith("ATGGGG") else 0

16 transcripts starts with ATGGGG

Based on all protein-coding genes, it can be used to outline gene GC content, by calculating GC content in specific regions of the gene and taking average value from all the gene. Regions are divided here into k=50 sub-regional parts.

Used organisms in analyses are: Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Danio Rerio, Mus Musculus, Rat norvegicus, Homo sapiens

When Ensembl annotated protein-coding genes are used:

When RefSeq (NCBI) annotated protein-coding genes are used:

3rd Party Resources

gffutils - We use gffutils for working with large GFF files

LashaBuxo / Genome-Tools