GMOD / Apollo

Genome annotation editor with a Java Server backend and a Javascript client that runs in a web browser as a JBrowse plugin.

Home Page: http://genomearchitect.readthedocs.io/

Performance

hexylena opened this issue · comments

@abretaud and I are working to debug an issue where the slowness of findAllOrganisms (>30s) is killing the training we're giving.

This route should be fast, like <2 seconds fast. I've replaced it with a Flask app that talks directly to the DB and does all of the joins and filtering on the DB side, which seems to be MUCH more efficient.

Here's the Flask app, which replaces just that one route.

import time

from flask import Flask, jsonify
from flask_sqlalchemy import SQLAlchemy

# Module-level cache: the last query result and the time it was fetched.
CACHED_RESULT = None
CACHED_TIME = 0

app = Flask(__name__)

app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://...:5432/apollo"
app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False
db = SQLAlchemy(app)

QUERY = """
SELECT
    organism.common_name,
    organism.blatdb,
    organism.metadata,
    organism.obsolete,
    organism.directory,
    organism.public_mode,
    organism.valid,
    organism.genome_fasta_index,
    organism.genus,
    organism.species,
    organism.id,
    organism.non_default_translation_table,
    organism.genome_fasta,
    false AS currentorganism,
    sum(
        CASE
        WHEN feature.class
        IN (
                'org.bbop.apollo.RepeatRegion',
                'org.bbop.apollo.Terminator',
                'org.bbop.apollo.TransposableElement',
                'org.bbop.apollo.Gene',
                'org.bbop.apollo.Pseudogene',
                'org.bbop.apollo.PseudogenicRegion',
                'org.bbop.apollo.ProcessedPseudogene',
                'org.bbop.apollo.Deletion',
                'org.bbop.apollo.Insertion',
                'org.bbop.apollo.Substitution',
                'org.bbop.apollo.SNV',
                'org.bbop.apollo.SNP',
                'org.bbop.apollo.MNV',
                'org.bbop.apollo.MNP',
                'org.bbop.apollo.Indel'
            )
        THEN 1
        ELSE 0
        END
    ) AS annotationcount,
    count(distinct sequence.id) AS sequences
FROM
    organism
    LEFT OUTER JOIN sequence ON organism.id = sequence.organism_id
    LEFT OUTER JOIN feature_location ON
            sequence.id = feature_location.sequence_id
    LEFT OUTER JOIN feature ON
            feature.id = feature_location.feature_id
GROUP BY
    organism.common_name,
    organism.blatdb,
    organism.metadata,
    organism.obsolete,
    organism.directory,
    organism.public_mode,
    organism.valid,
    organism.genome_fasta_index,
    organism.genus,
    organism.species,
    organism.id,
    organism.non_default_translation_table,
    organism.genome_fasta
    ;
"""

# JSON keys for each row, in the same order as the SELECT list in QUERY.
columns = [
    "commonName", "blatdb", "metadata", "obsolete", "directory",
    "publicMode", "valid", "genomeFastaIndex", "genus", "species", "id",
    "nonDefaultTranslationTable", "genomeFasta", "currentOrganism",
    "annotationCount", "sequences"
]

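def _fetch():
    # Run the aggregate query and pair each row with the camelCase keys above.
    # Note: Engine.execute() is the SQLAlchemy 1.x API; it was removed in 2.0.
    rows = db.engine.execute(QUERY)
    return [dict(zip(columns, row)) for row in rows]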


@app.route("/get", methods=["GET", "POST"])
def doit():
    global CACHED_TIME
    global CACHED_RESULT
    now = time.time()
    # Re-run the query only if the cached result is more than 30 seconds old.
    if now - CACHED_TIME > 30:
        CACHED_RESULT = _fetch()
        CACHED_TIME = now

    return jsonify(CACHED_RESULT)

I'm running this service and we're just proxying that one route through our own version:

location /apollo/organism/findAllOrganisms {
   proxy_pass http://127.0.0.1:4321/get;
}
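The 30-second cache in the Flask app above relies on two module-level globals and has no locking, so several requests arriving at the moment of expiry could each re-run the query. A minimal sketch of the same idea with a lock is below. `TTLCache`, its `loader` and `clock` parameters, and `slow_query` are hypothetical names for illustration and are not part of the original app.

```python
import threading
import time


class TTLCache:
    """Cache one computed value for `ttl` seconds (hypothetical helper)."""

    def __init__(self, loader, ttl=30.0, clock=time.monotonic):
        self._loader = loader        # function that produces a fresh value
        self._ttl = ttl              # seconds a cached value stays valid
        self._clock = clock          # injectable so expiry can be tested
        self._lock = threading.Lock()
        self._value = None
        self._expires_at = None      # None means nothing is cached yet

    def get(self):
        # Hold the lock so concurrent callers trigger at most one refresh.
        with self._lock:
            now = self._clock()
            if self._expires_at is None or now >= self._expires_at:
                self._value = self._loader()
                self._expires_at = now + self._ttl
            return self._value


if __name__ == "__main__":
    calls = []

    def slow_query():
        # Stand-in for the real database query.
        calls.append(1)
        return [{"commonName": "demo", "sequences": 1}]

    cache = TTLCache(slow_query, ttl=30.0)
    cache.get()
    cache.get()
    print("loader calls:", len(calls))
```

In the route, `doit()` would then reduce to something like `return jsonify(cache.get())`, assuming `cache` wraps `_fetch`.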

I think there are a couple parts to the issue:

  • a lack of any indexes, not even on sequence.id, feature.id, etc.
  • doing operations in Groovy rather than in the DB, which means fetching more data and processing it more slowly than the DB could.
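To illustrate the two points above, here is a small self-contained sketch using SQLite from the Python standard library. It is not Apollo's schema or code: the tables are cut down to the join columns, the index names are made up, the sample rows are invented, and only one feature class is counted. It contrasts a per-organism query loop with a single GROUP BY query that lets the database do the join and the counting.

```python
import sqlite3

# Simplified, hypothetical schema: only the columns needed for the counts.
SCHEMA = """
CREATE TABLE organism (id INTEGER PRIMARY KEY, common_name TEXT);
CREATE TABLE sequence (id INTEGER PRIMARY KEY, organism_id INTEGER);
CREATE TABLE feature (id INTEGER PRIMARY KEY, class TEXT);
CREATE TABLE feature_location (
    id INTEGER PRIMARY KEY, feature_id INTEGER, sequence_id INTEGER);
-- Indexes on the join columns (index names are made up for this sketch).
CREATE INDEX idx_sequence_organism ON sequence (organism_id);
CREATE INDEX idx_floc_sequence ON feature_location (sequence_id);
CREATE INDEX idx_floc_feature ON feature_location (feature_id);
"""

GENE = "org.bbop.apollo.Gene"


def make_demo_db():
    """Build an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO organism VALUES (?, ?)",
                     [(1, "yeast"), (2, "fly"), (3, "empty")])
    conn.executemany("INSERT INTO sequence VALUES (?, ?)",
                     [(1, 1), (2, 1), (3, 2)])
    conn.executemany("INSERT INTO feature VALUES (?, ?)",
                     [(1, GENE), (2, GENE), (3, "example.Other"), (4, GENE)])
    conn.executemany("INSERT INTO feature_location VALUES (?, ?, ?)",
                     [(1, 1, 1), (2, 2, 2), (3, 3, 1), (4, 4, 3)])
    return conn


def counts_in_app(conn):
    """One extra query per organism: work grows with the organism count."""
    out = {}
    organisms = conn.execute("SELECT id, common_name FROM organism").fetchall()
    for org_id, name in organisms:
        (n,) = conn.execute(
            "SELECT count(*) FROM feature f "
            "JOIN feature_location fl ON fl.feature_id = f.id "
            "JOIN sequence s ON s.id = fl.sequence_id "
            "WHERE s.organism_id = ? AND f.class = ?",
            (org_id, GENE),
        ).fetchone()
        out[name] = n
    return out


def counts_in_db(conn):
    """A single GROUP BY query: the database joins and counts in one pass."""
    rows = conn.execute(
        "SELECT o.common_name, "
        "       sum(CASE WHEN f.class = ? THEN 1 ELSE 0 END) "
        "FROM organism o "
        "LEFT JOIN sequence s ON s.organism_id = o.id "
        "LEFT JOIN feature_location fl ON fl.sequence_id = s.id "
        "LEFT JOIN feature f ON f.id = fl.feature_id "
        "GROUP BY o.id, o.common_name",
        (GENE,),
    )
    return dict(rows)


if __name__ == "__main__":
    conn = make_demo_db()
    print(counts_in_app(conn))
    print(counts_in_db(conn))
```

Both functions return the same counts; the difference is that the first issues one query per organism while the second issues one query in total, which is the shape of the QUERY used in the Flask app.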

A key point for me is that I really don't think Apollo needs a graph database. I think it just needs some time spent understanding how to use SQL most effectively (I'm happy to offer my expertise there).

I did some profiling on my side: most of the time is spent in this for loop. It takes ~0.2 s per organism on my test setup, so ~10 s for 40 orgs, and you can easily hit a timeout if you have many orgs.
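The scaling in that estimate is linear, which a few lines make explicit. This is only a back-of-the-envelope sketch: the 0.2 s per organism comes from the profiling above, while the 30 s timeout and the function names are assumptions for illustration. Integer milliseconds are used to avoid floating-point rounding.

```python
def estimated_ms(organisms, ms_per_organism=200):
    """Linear cost model: total time grows with the number of organisms."""
    return organisms * ms_per_organism


def organisms_until_timeout(timeout_ms, ms_per_organism=200):
    """Smallest organism count whose estimated time exceeds the timeout."""
    return timeout_ms // ms_per_organism + 1


if __name__ == "__main__":
    # 40 organisms at 200 ms each.
    print(estimated_ms(40), "ms for 40 organisms")
    # Hypothetical 30 s proxy or client timeout.
    print(organisms_until_timeout(30_000), "organisms to exceed 30 s")
```

With exactly 0.2 s per organism the model gives 8 s for 40 organisms, in the same range as the roughly 10 s observed, and any fixed timeout is eventually exceeded as organisms are added.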

@hexylena I took the liberty of dockerizing your code here: https://github.com/galaxy-genome-annotation/apolpi
I hope that's OK with you (licensing too?)

Ahhh awesome @abretaud that'll make it easier to deploy.

Yeah, the license is fine :) (Normally I'd use AGPLv3 to force folks to contribute back their changes, but in this case I don't think it matters.)

Cool thanks, used on apololo.genouest.org and bipaa.genouest.org/apollo now

Updating organism info took 30 s just to change the name, which feels extremely high. Maybe that'll be the next target for apolpi.

But honestly, I cannot see anything in the code that would cause it to be that slow; that's wild. There are no for loops, nothing.

Yep, I noticed it's slow too, but I don't know why. Maybe it's doing things on the data dir!?

It's odd: the API responds quickly, and it was only slow through the UI. Anyway