Performance

Question

hexylena opened this issue 3 years ago · comments

@abretaud and I are working to debug an issue where the slowness of findAllOrganisms (>30s) is killing the training we're giving.

This route should be fast, like <2 seconds fast. I've replaced it with a Flask app that talks directly to the DB and does all of the joins and filtering on the DB side, which seems to be MUCH more efficient.

Here's the Flask app, which replaces just that one route.

import time

from flask import Flask, jsonify
from flask_sqlalchemy import SQLAlchemy

# Module-level cache: the last query result and the time it was fetched.
CACHED_RESULT = None
CACHED_TIME = 0

app = Flask(__name__)

app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://...:5432/apollo"
app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False
db = SQLAlchemy(app)

QUERY = """
SELECT
    organism.common_name,
    organism.blatdb,
    organism.metadata,
    organism.obsolete,
    organism.directory,
    organism.public_mode,
    organism.valid,
    organism.genome_fasta_index,
    organism.genus,
    organism.species,
    organism.id,
    organism.non_default_translation_table,
    organism.genome_fasta,
    false AS currentorganism,
    sum(
        CASE
        WHEN feature.class
        IN (
                'org.bbop.apollo.RepeatRegion',
                'org.bbop.apollo.Terminator',
                'org.bbop.apollo.TransposableElement',
                'org.bbop.apollo.Gene',
                'org.bbop.apollo.Pseudogene',
                'org.bbop.apollo.PseudogenicRegion',
                'org.bbop.apollo.ProcessedPseudogene',
                'org.bbop.apollo.Deletion',
                'org.bbop.apollo.Insertion',
                'org.bbop.apollo.Substitution',
                'org.bbop.apollo.SNV',
                'org.bbop.apollo.SNP',
                'org.bbop.apollo.MNV',
                'org.bbop.apollo.MNP',
                'org.bbop.apollo.Indel'
            )
        THEN 1
        ELSE 0
        END
    ) AS annotationcount,
    count(distinct sequence.id) AS sequences
FROM
    organism
    LEFT OUTER JOIN sequence ON organism.id = sequence.organism_id
    LEFT OUTER JOIN feature_location ON
            sequence.id = feature_location.sequence_id
    LEFT OUTER JOIN feature ON
            feature.id = feature_location.feature_id
GROUP BY
    organism.common_name,
    organism.blatdb,
    organism.metadata,
    organism.obsolete,
    organism.directory,
    organism.public_mode,
    organism.valid,
    organism.genome_fasta_index,
    organism.genus,
    organism.species,
    organism.id,
    organism.non_default_translation_table,
    organism.genome_fasta
    ;
"""

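# JSON keys, in the same order as the SELECT list in QUERY.
columns = [
    "commonName", "blatdb", "metadata", "obsolete", "directory",
    "publicMode", "valid", "genomeFastaIndex", "genus", "species", "id",
    "nonDefaultTranslationTable", "genomeFasta", "currentOrganism",
    "annotationCount", "sequences"
]

def _fetch():
    # Run the aggregate query and turn each row into a dict keyed by column name.
    # Note: Engine.execute() was removed in SQLAlchemy 2.0, so this targets 1.x.
    rows = db.engine.execute(QUERY)
    return [dict(zip(columns, row)) for row in rows]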


@app.route("/get", methods=["GET", "POST"])
def doit():
    global CACHED_TIME
    global CACHED_RESULT
    now = time.time()
    # Re-run the query at most once every 30 seconds; otherwise serve the cached rows.
    if now - CACHED_TIME > 30:
        CACHED_RESULT = _fetch()
        CACHED_TIME = now

    return jsonify(CACHED_RESULT)
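
The 30-second cache above lives in module-level globals, so two requests arriving at the same time on a threaded server could both re-run the query. A minimal sketch of the same idea behind a lock, assuming a hypothetical TTLCache helper (not part of the original app) and an injectable clock so it can be tested without sleeping:

```python
import threading
import time


class TTLCache:
    """Hold one computed value and recompute it after `ttl` seconds.

    Hypothetical helper for illustration; the original app uses two globals.
    """

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self._fetch = fetch          # zero-argument callable that does the slow work
        self._ttl = ttl
        self._clock = clock          # injectable so tests can control time
        self._lock = threading.Lock()
        self._value = None
        self._expires = None         # None means "never fetched"

    def get(self):
        # The lock ensures only one thread refreshes an expired value.
        with self._lock:
            now = self._clock()
            if self._expires is None or now >= self._expires:
                self._value = self._fetch()
                self._expires = now + self._ttl
            return self._value
```

In the route this would look like cache = TTLCache(_fetch) at module level and return jsonify(cache.get()) in the handler. Holding the lock while fetching makes other requests wait for the refresh instead of duplicating it, which seems the right trade-off for a query that takes a second or two.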

I'm running this service and we're just proxying that one route through our own version:

location /apollo/organism/findAllOrganisms {
   proxy_pass http://127.0.0.1:4321/get;
}

I think there are a couple of parts to the issue:

Lack of any indexes, not even on sequence.id, feature.id, etc.
Doing operations in Groovy rather than in the DB, which fetches more data and processes it more slowly than the DB would.
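
To illustrate the index point, here is a minimal sketch that checks whether a join actually uses an index. It assumes a cut-down toy schema in an in-memory SQLite database rather than Apollo's real PostgreSQL schema, and the index names are hypothetical; only the join columns are taken from the query above. Note that in PostgreSQL a PRIMARY KEY constraint already creates a unique index, so the foreign-key columns (organism_id, sequence_id, feature_id) may be the more relevant gap; that is an assumption, not something verified against an Apollo database.

```python
import sqlite3

# Toy schema: only the columns the join needs (not Apollo's real tables).
SCHEMA = """
CREATE TABLE organism (id INTEGER PRIMARY KEY, common_name TEXT);
CREATE TABLE sequence (id INTEGER PRIMARY KEY, organism_id INTEGER);
CREATE TABLE feature (id INTEGER PRIMARY KEY, class TEXT);
CREATE TABLE feature_location (
    id INTEGER PRIMARY KEY, sequence_id INTEGER, feature_id INTEGER
);
"""

# Hypothetical index names; the columns are the join keys used by the query.
INDEXES = """
CREATE INDEX idx_sequence_organism ON sequence (organism_id);
CREATE INDEX idx_floc_sequence ON feature_location (sequence_id);
CREATE INDEX idx_floc_feature ON feature_location (feature_id);
"""

JOIN = """
SELECT organism.id, count(DISTINCT sequence.id)
FROM organism
LEFT OUTER JOIN sequence ON organism.id = sequence.organism_id
LEFT OUTER JOIN feature_location ON sequence.id = feature_location.sequence_id
GROUP BY organism.id
"""


def query_plan(conn, sql):
    """Return the planner's description of each step as a list of strings."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
plan_before = query_plan(conn, JOIN)   # no user-defined indexes yet
conn.executescript(INDEXES)
plan_after = query_plan(conn, JOIN)    # planner can now use the new indexes
```

On PostgreSQL the equivalent check would be running EXPLAIN (or EXPLAIN ANALYZE) on the real query before and after adding indexes on the join columns.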

A key point for me is that I really don't think Apollo needs a graph database. I think it just needs some time spent understanding how to use SQL most effectively (I'm happy to offer my expertise there).

Anthony Bretaudeau · Answer 1 · Fri Jul 02 2021 19:55:03 GMT+0800 (China Standard Time)

On my side I've done some profiling: most of the time is spent in this for loop, which takes ~0.2s per organism on my test setup, so ~10s for 40 orgs. You can easily hit a timeout if you have many orgs.
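
A cost that grows per organism is what you would expect if work is done inside a loop, once per organism, instead of in one grouped query. A minimal sketch of the difference, assuming a toy two-table SQLite schema and made-up data (this is not Apollo's Groovy code, and the function names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (id INTEGER PRIMARY KEY, common_name TEXT);
CREATE TABLE sequence (id INTEGER PRIMARY KEY, organism_id INTEGER);
""")
# 40 organisms, each with 0, 1 or 2 sequences (made-up data).
conn.executemany("INSERT INTO organism VALUES (?, ?)",
                 [(i, f"org{i}") for i in range(1, 41)])
conn.executemany("INSERT INTO sequence (organism_id) VALUES (?)",
                 [(i,) for i in range(1, 41) for _ in range(i % 3)])


def counts_per_organism_loop(conn):
    """One query per organism: the number of round trips grows with the data."""
    out = {}
    for (org_id,) in conn.execute("SELECT id FROM organism").fetchall():
        (n,) = conn.execute(
            "SELECT count(*) FROM sequence WHERE organism_id = ?", (org_id,)
        ).fetchone()
        out[org_id] = n
    return out


def counts_single_query(conn):
    """One grouped query: the database does the join and the counting."""
    rows = conn.execute("""
        SELECT organism.id, count(sequence.id)
        FROM organism
        LEFT OUTER JOIN sequence ON organism.id = sequence.organism_id
        GROUP BY organism.id
    """)
    return dict(rows.fetchall())
```

Both functions return the same mapping of organism id to sequence count; the second does it in a single round trip, which is the approach the replacement query in the original post takes.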

Anthony Bretaudeau · Answer 2 · Fri Aug 27 2021 23:18:43 GMT+0800 (China Standard Time)

@hexylena I took the liberty of dockerizing your code here: https://github.com/galaxy-genome-annotation/apolpi
I hope/guess that's OK with you (licensing too?)

Helena · Answer 3 · Mon Aug 30 2021 16:51:34 GMT+0800 (China Standard Time)

Ahhh awesome @abretaud that'll make it easier to deploy.

Yeah, the license is fine :) (Normally I'd use AGPLv3 to force folks to contribute back their changes, but in this case I don't think it matters.)

Anthony Bretaudeau · Answer 4 · Mon Aug 30 2021 17:29:20 GMT+0800 (China Standard Time)

Cool thanks, used on apololo.genouest.org and bipaa.genouest.org/apollo now

Helena · Answer 5 · Fri Sep 03 2021 18:25:39 GMT+0800 (China Standard Time)

Updating the organism info took 30s to change the name, which feels extremely slow; maybe that'll be the next target for apolpi.

But honestly I cannot see anything in the code that would cause it to be that slow, that's wild. There are no for loops, nothing.

Anthony Bretaudeau · Answer 6 · Fri Sep 03 2021 19:55:53 GMT+0800 (China Standard Time)

Yep, I noticed it's slow too, but I don't know why. Maybe it's doing things on the data dir!?

Helena · Answer 7 · Mon Sep 06 2021 16:03:39 GMT+0800 (China Standard Time)

It's odd: the API responds quickly, so it was only slow through the UI. Anyway