ensembl-production-imported

Production pipelines used by data-teams to process imported genomes (former EnsemblGenomes pipelines)


Ensembl production scripts for loading ad-hoc annotations (former eg-pipelines)

Prerequisites

Pipelines are intended to be run inside the Ensembl production environment. Make sure you have all the proper credentials, keys, etc. set up.

Installation and configuration

Getting this repo

git clone git@github.com:Ensembl/ensembl-production-imported.git

Configuration

Refreshing environment

Add lib/perl to the PERL5LIB environment variable (use this instead of modules) and lib/python to PYTHONPATH:

export ENS_ROOT_DIR=$(pwd) # or any other path: the directory the repo(s) were cloned into

export PERL5LIB=${PERL5LIB}:${ENS_ROOT_DIR}/ensembl-production-imported/lib/perl
export PYTHONPATH=${PYTHONPATH}:${ENS_ROOT_DIR}/ensembl-production-imported/lib/python

N.B. Define the ENS_ROOT_DIR environment variable beforehand.
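As a quick sanity check of the setup described above, something like the following could be run before initialising a pipeline. This is only a sketch: path_contains and check_ens_env are hypothetical helper names that are not part of this repo, and the directory layout is assumed to be exactly the one used in the export lines above.

```shell
# Return 0 if the colon-separated list $1 contains the entry $2.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check that ENS_ROOT_DIR is set, that the repo lib dirs exist,
# and that they are present in PERL5LIB / PYTHONPATH.
check_ens_env() {
    if [ -z "${ENS_ROOT_DIR:-}" ]; then
        echo "ENS_ROOT_DIR is not set" >&2
        return 1
    fi
    local repo="${ENS_ROOT_DIR}/ensembl-production-imported"
    if [ ! -d "${repo}/lib/perl" ] || [ ! -d "${repo}/lib/python" ]; then
        echo "lib/perl or lib/python not found under ${repo}" >&2
        return 1
    fi
    if ! path_contains "${PERL5LIB:-}" "${repo}/lib/perl"; then
        echo "${repo}/lib/perl is not in PERL5LIB" >&2
        return 1
    fi
    if ! path_contains "${PYTHONPATH:-}" "${repo}/lib/python"; then
        echo "${repo}/lib/python is not in PYTHONPATH" >&2
        return 1
    fi
    return 0
}

# Possible usage:
#   check_ens_env || echo "environment is not ready" >&2
```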

Updating / setting default configuration options

System-specific configuration options are handled by the Bio::EnsEMBL::EGPipeline::PrivateConfDetails module. The actual configuration is loaded from Bio::EnsEMBL::EGPipeline::PrivateConfDetails::Impl.

All the options used are listed in Impl.pm.example. Define them before running the pipelines.

This can be done either by copying this file and editing it:

cp ensembl-production-imported/lib/perl/Bio/EnsEMBL/EGPipeline/PrivateConfDetails/Impl.pm{.example,}
# edit ensembl-production-imported/lib/perl/Bio/EnsEMBL/EGPipeline/PrivateConfDetails/Impl.pm

Or by creating a separate repo containing lib/perl/Bio/EnsEMBL/EGPipeline/PrivateConfDetails/Impl.pm and adding the corresponding lib/perl directory to your PERL5LIB environment variable.
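The second option (a separate location for the private configuration) could be prepared roughly as follows. This is a sketch under assumptions: make_private_conf is a hypothetical helper that is not part of this repo, and the destination directory name in the usage comment is made up for illustration.

```shell
# make_private_conf SRC_REPO DEST_ROOT
# Copy Impl.pm.example from the cloned repo into DEST_ROOT, keeping the
# lib/perl/... layout, and print the lib/perl dir to add to PERL5LIB.
make_private_conf() {
    local src_repo="$1" dest_root="$2"
    local rel="lib/perl/Bio/EnsEMBL/EGPipeline/PrivateConfDetails"
    local example="${src_repo}/${rel}/Impl.pm.example"

    if [ ! -f "$example" ]; then
        echo "example config not found: $example" >&2
        return 1
    fi
    mkdir -p "${dest_root}/${rel}" || return 1

    # Never overwrite an Impl.pm that may already have been edited.
    if [ -e "${dest_root}/${rel}/Impl.pm" ]; then
        echo "keeping existing ${dest_root}/${rel}/Impl.pm" >&2
    else
        cp "$example" "${dest_root}/${rel}/Impl.pm" || return 1
    fi
    printf '%s\n' "${dest_root}/lib/perl"
}

# Possible usage (the destination path is hypothetical):
#   conf_lib=$(make_private_conf "$ENS_ROOT_DIR/ensembl-production-imported" "$HOME/private-conf")
#   export PERL5LIB="${PERL5LIB}:${conf_lib}"
#   # then edit $HOME/private-conf/lib/perl/Bio/EnsEMBL/EGPipeline/PrivateConfDetails/Impl.pm
```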

Resources and queues

You can override the default queue used to run a pipeline by adding the -queue_name option to the init_pipeline.pl command (see below).

Initialising and running pipelines

Every pipeline is derived from Bio::EnsEMBL::EGPipeline::PipeConfig::EGGeneric_conf (see the EGGeneric documentation for details).

The same Perl class prefix is used for every pipeline: Bio::EnsEMBL::EGPipeline::PipeConfig:: .
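Because of the shared prefix, the full module name can be derived from the short pipeline name. A minimal sketch, assuming the `<name>_conf` suffix seen in the pipeline modules listed in this README (pipeconfig_module is a hypothetical helper, not part of this repo):

```shell
# pipeconfig_module NAME
# Print the full PipeConfig module name for a short pipeline name.
pipeconfig_module() {
    if [ -z "${1:-}" ]; then
        echo "usage: pipeconfig_module PIPELINE_NAME" >&2
        return 1
    fi
    printf '%s\n' "Bio::EnsEMBL::EGPipeline::PipeConfig::${1}_conf"
}

# Possible usage:
#   init_pipeline.pl "$(pipeconfig_module RNAFeatures)" ...
```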

N.B. Don't forget to specify the -reg_file option for the beekeeper.pl -url $url -reg_file $REG_FILE -loop command.

init_pipeline.pl Bio::EnsEMBL::EGPipeline::PipeConfig::RNAFeatures_conf \
    $($CMD details script) \
    -hive_force_init 1 \
    -queue_name $SPECIFIC_QUEUE_NAME \
    -registry $REG_FILE \
    -production_db "$($PROD_SERVER details url)""$PROD_DBNAME" \
    -pipeline_tag "_${SPECIES_TAG}" \
    -pipeline_dir $OUT_DIR/rna_features \
    -species $SPECIES \
    -eg_pipelines_dir $ENS_DIR/ensembl-production-imported \
    ${OTHER_OPTIONS} \
    2> $OUT_DIR/init.stderr \
    1> $OUT_DIR/init.stdout

SYNC_CMD=$(grep -- '-sync$' "$OUT_DIR/init.stdout" | perl -pe 's/^\s*//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -sync

LOOP_CMD=$(grep -- -loop "$OUT_DIR/init.stdout" | perl -pe 's/^\s*//; s/\s*#.*$//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -reg_file $REG_FILE -loop

$SYNC_CMD 2> $OUT_DIR/sync.stderr 1> $OUT_DIR/sync.stdout
$LOOP_CMD 2> $OUT_DIR/loop.stderr 1> $OUT_DIR/loop.stdout
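If init_pipeline.pl fails, the extraction above yields empty commands without any error. A guarded variant might look like the sketch below. It assumes the init output contains the beekeeper.pl lines in the form shown in the comments above; extract_beekeeper_cmd is a hypothetical helper, and sed is used here in place of perl purely for illustration.

```shell
# extract_beekeeper_cmd INIT_STDOUT FLAG
# Print the first beekeeper.pl command in INIT_STDOUT that uses FLAG
# (e.g. -sync or -loop), with leading whitespace, trailing "# ..."
# comments and double quotes removed. Fail if no such line exists.
extract_beekeeper_cmd() {
    local init_stdout="$1" flag="$2"
    local cmd
    cmd=$(grep -- "beekeeper.pl.* ${flag}" "$init_stdout" \
        | head -n 1 \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*#.*$//' -e 's/"//g')
    if [ -z "$cmd" ]; then
        echo "no beekeeper.pl ${flag} command found in ${init_stdout}" >&2
        return 1
    fi
    printf '%s\n' "$cmd"
}

# Possible usage:
#   SYNC_CMD=$(extract_beekeeper_cmd "$OUT_DIR/init.stdout" -sync) || exit 1
#   LOOP_CMD=$(extract_beekeeper_cmd "$OUT_DIR/init.stdout" -loop) || exit 1
#   $SYNC_CMD 2> "$OUT_DIR/sync.stderr" 1> "$OUT_DIR/sync.stdout"
#   $LOOP_CMD 2> "$OUT_DIR/loop.stderr" 1> "$OUT_DIR/loop.stdout"
```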

Pipelines

| Pipeline name | Module | Description | Document | Comment |
| --- | --- | --- | --- | --- |
| EGGeneric | Bio::EnsEMBL::EGPipeline::PipeConfig::EGGeneric_conf | Generic pipeline configuration | EGGeneric | All other pipelines are derived from this one |
| RepeatModeler | Bio::EnsEMBL::EGPipeline::PipeConfig::RepeatModeler_conf | Building de novo repeat libs | RepeatModeler | |
| DNAFeatures | Bio::EnsEMBL::EGPipeline::PipeConfig::DNAFeatures_conf | Repeat masking | DNAFeatures | redat_repeatmasker_library should be explicitly specified |
| RNAFeatures | Bio::EnsEMBL::EGPipeline::PipeConfig::RNAFeatures_conf | Non-coding RNA features (tRNA, miRNA, etc.) discovery | RNAFeatures | |
| RNAGenes | Bio::EnsEMBL::EGPipeline::PipeConfig::RNAGenes_conf | Create non-coding genes based on RNA features | RNAGenes | Specify id_db_{host,port,user,dbname,...} options if run_context != "VB" |
| SRAAlignment_BRC4 | Bio::EnsEMBL::EGPipeline::PipeConfig::SRAAlignment_BRC4_conf | Perform RNA(DNA) short read alignments | SRAAlignment_BRC4 | |
| WGA2GenesDirect | Bio::EnsEMBL::EGPipeline::PipeConfig::WGA2GenesDirect_conf | Project transcripts and create genes based on compara lastz mappings | WGA2GenesDirect | |
| Xref_GPR | Bio::EnsEMBL::EGPipeline::PipeConfig::Xref_GPR_conf | Load Plant Reactome data | Xref_GPR | Use the -uppercase_gene_id 1 option to allow usage of uppercase gene stable IDs for mapping (e.g. for Oryza sativa (rice)) |
| AlignmentXref | Bio::EnsEMBL::EGPipeline::PipeConfig::AlignmentXref_conf | Alignment-based xrefs | AlignmentXref | Used as a part of the AllXref pipeline |
| Xref | Bio::EnsEMBL::EGPipeline::PipeConfig::Xref_conf | MD5-based UniParc/UniProt Xref pipeline | Xref | Used as a part of the AllXref pipeline |
| AllXref | Bio::EnsEMBL::EGPipeline::PipeConfig::AllXref_conf | Combined Xref/AlignmentXref pipeline | AllXref | |
| FindPHIBaseCandidates | Bio::EnsEMBL::EGPipeline::PipeConfig::FindPHIBaseCandidates_conf | Load Xrefs from PHIBase | FindPHIBaseCandidates | |
| Map_interspecies_interactions | Bio::EnsEMBL::EGPipeline::PipeConfig::Map_interspecies_interactions_conf | Loads interactions to Ensembl InterspeciesinteractionsDB from different sources | Map_interspecies_interactions | |

Obsolete pipelines

| Pipeline name | Module | Description | Document | Comment | Alternative |
| --- | --- | --- | --- | --- | --- |
| AnalyzeTables | Bio::EnsEMBL::EGPipeline::PipeConfig::AnalyzeTables_conf | Runs SQL ANALYZE / OPTIMIZE on tables for DBs present in the registry | | | |
| EC2Rhea | Bio::EnsEMBL::EGPipeline::PipeConfig::EC2Rhea_conf | Adding chemical and transport reactions (Rhea2RC) xrefs (used by 'microbes') | | Specify ec2rhea_file as there's no default | |
| ExonerateAlignment | Bio::EnsEMBL::EGPipeline::PipeConfig::ExonerateAlignment_conf | Aligning Fasta files to a genome with Exonerate | | Specify the -exonerate_2_4_dir option if using exonerate-server (-use_exonerate_server 1) | |
| ShortReadAlignment | Bio::EnsEMBL::EGPipeline::PipeConfig::ShortReadAlignment_conf | | | | |
| STARAlignment | Bio::EnsEMBL::EGPipeline::PipeConfig::STARAlignment_conf | | | | |
| BlastNucleotide | Bio::EnsEMBL::EGPipeline::PipeConfig::BlastNucleotide_conf | | | | |
| BlastProtein | Bio::EnsEMBL::EGPipeline::PipeConfig::BlastProtein_conf | | | EGPipeline::FileDump::GFF3Dumper could not be replaced with Production::Pipeline::GFF3::DumpFile as no join_align_feature param is provided | |
| Bam2BigWig | Bio::EnsEMBL::EGPipeline::PipeConfig::Bam2BigWig_conf | | | | |
| ProjectGenes | Bio::EnsEMBL::EGPipeline::PipeConfig::ProjectGenes_conf | | | | |
| ProjectGeneDesc | Bio::EnsEMBL::EGPipeline::PipeConfig::ProjectGeneDesc_conf | | | | |

Replaced pipelines

| Old pipeline module | Alternative | Description | Document | Comment |
| --- | --- | --- | --- | --- |
| CoreStatistics | Bio::EnsEMBL::Production::Pipeline::PipeConfig::CoreStatistics_conf | Core stats pipeline | | Use -skip_metadata_check 1 if core is not submitted (always for new species); set proper -pipeline_dir, -scratch_small_dir and -scratch_large_dir (see Bio::EnsEMBL::Production::Pipeline::PipeConfig::Base_conf) |
| FileDump | Bio::EnsEMBL::Production::Pipeline::PipeConfig::FileDump_conf | Serialize core | | |
| FileDump{Compara,GFF} | same as above | | | |
| FileDumpVEP | Bio::EnsEMBL::Production::Pipeline::PipeConfig::FileDumpVEP_conf | Dump VEP data | | |
| LoadGFF3 | Bio::EnsEMBL::Pipeline::PipeConfig::LoadGFF3_conf | Load gene models from GFF3 and accompanied files | | See new_genome_loader for details |
| LoadGFF3Batch | Bio::EnsEMBL::Pipeline::PipeConfig::LoadGFF3Batch_conf | Batch load models from GFF3 files | | See new_genome_loader for details |
| GeneTreeHighlighting | Bio::EnsEMBL::Production::Pipeline::PipeConfig::GeneTreeHighlighting | Populate compara table with GO and InterPro terms, to enable highlighting | | |
| GetOrthologs | Bio::EnsEMBL::Production::Pipeline::PipeConfig::DumpOrtholog | | | |

Runnables worth mentioning

| Runnable | Description | Document | Comment |
| --- | --- | --- | --- |
| Common::RunnableDB::CreateOFDatabase | | | |
| Analysis::Config::General | | | |

Scripts

| Script | Description | Document | Comment |
| --- | --- | --- | --- |
| brc4/repeat_for_masker.pl | .... | | |
| brc4/repeat_tab_to_list.pl | .... | | |
| misc_scripts/get_trans.pl | Get transcripts and translations | | In pipelines use Bio::EnsEMBL::EGPipeline::Common::RunnableDB::DumpProteome and Bio::EnsEMBL::EGPipeline::Common::RunnableDB::DumpTranscriptome |
| misc_scripts/load_xref.pl | | | |
| misc_scripts/remove_entities.pl | | | |
| misc_scripts/gene_stable_id_mapping.pl | | | |
| misc_scripts/add_karyotype.pl | | | |
| misc_scripts/load_karyotype_from_gff.pl | | | |
| rna_features/add_rfam_desc.pl | Prepare Rfam db for RNAFeatures | RNAFeatures | |
| rna_features/taxonomic_levels.pl | Prepare Rfam db for RNAFeatures | RNAFeatures | |
| phi_ontology/phi-base_ontologies.pl | Normalising phi-base data .csv based on ontologies in scripts/phi_ontology | FindPHIBaseCandidates | |

Replaced scripts

| Script | Substitution | Document | Comment |
| --- | --- | --- | --- |
| production_db/analysis_desc_from_prod.pl | Bio::EnsEMBL::Production::Pipeline::PipeConfig::ProductionDBSync_conf | | |
| production_db/attrib_type_from_prod.pl | Bio::EnsEMBL::Production::Pipeline::PipeConfig::ProductionDBSync_conf | | |
| production_db/external_db_from_prod.pl | Bio::EnsEMBL::Production::Pipeline::PipeConfig::ProductionDBSync_conf | | |
| production_db/add_species_analysis.pl | Bio::EnsEMBL::Production::Pipeline::PipeConfig::ProductionDBSync_conf | | |

Various docs

See docs

TODO

Tests, tests, tests...

Acknowledgements

For obvious reasons, the whole history of the source project had to go. Most of this code and documentation is inherited from the EnsemblGenomes project.

We appreciate the effort and time spent by developers of the EnsemblGenomes project.

Thank you!


License: Apache License 2.0

