Performance
hexylena opened this issue · comments
@abretaud and I are working to debug an issue where the slowness of findAllOrganisms (>30s) is killing the training we're giving.
This route should be fast. like <2 seconds fast. I've replaced it with a flask app that talks directly to the DB and does all of the joins and filtering on the DB side which seems to be MUCH more efficient.
Here's the flask app which just replaces that one route.
from flask import Flask
import codecs
from flask import jsonify
from functools import wraps
from flask import render_template
from flask import request
from flask_sqlalchemy import SQLAlchemy
import time
global CACHED_RESULT
global CACHED_TIME
CACHED_RESULT = None
CACHED_TIME = 0
app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://...:5432/apollo"
app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False
db = SQLAlchemy(app)
QUERY = """
SELECT
organism.common_name,
organism.blatdb,
organism.metadata,
organism.obsolete,
organism.directory,
organism.public_mode,
organism.valid,
organism.genome_fasta_index,
organism.genus,
organism.species,
organism.id,
organism.non_default_translation_table,
organism.genome_fasta,
false AS currentorganism,
sum(
CASE
WHEN feature.class
IN (
'org.bbop.apollo.RepeatRegion',
'org.bbop.apollo.Terminator',
'org.bbop.apollo.TransposableElement',
'org.bbop.apollo.Gene',
'org.bbop.apollo.Pseudogene',
'org.bbop.apollo.PseudogenicRegion',
'org.bbop.apollo.ProcessedPseudogene',
'org.bbop.apollo.Deletion',
'org.bbop.apollo.Insertion',
'org.bbop.apollo.Substitution',
'org.bbop.apollo.SNV',
'org.bbop.apollo.SNP',
'org.bbop.apollo.MNV',
'org.bbop.apollo.MNP',
'org.bbop.apollo.Indel'
)
THEN 1
ELSE 0
END
) AS annotationcount,
count(distinct sequence.id) AS sequences
FROM
organism
LEFT OUTER JOIN sequence ON organism.id = sequence.organism_id
LEFT OUTER JOIN feature_location ON
sequence.id = feature_location.sequence_id
LEFT OUTER JOIN feature ON
feature.id = feature_location.feature_id
GROUP BY
organism.common_name,
organism.blatdb,
organism.metadata,
organism.obsolete,
organism.directory,
organism.public_mode,
organism.valid,
organism.genome_fasta_index,
organism.genus,
organism.species,
organism.id,
organism.non_default_translation_table,
organism.genome_fasta
;
"""
columns = [
"commonName", "blatdb", "metadata" , "obsolete", "directory",
"publicMode", "valid", "genomeFastaIndex", "genus", "species", "id",
"nonDefaultTranslationTable", "genomeFasta", "currentOrganism",
"annotationCount", "sequences"
]
def _fetch():
roles = db.engine.execute(QUERY)
out = []
for role in roles:
out.append(dict(zip(columns, role)))
return out
@app.route("/get", methods=["GET", "POST"])
def doit():
global CACHED_TIME
global CACHED_RESULT
now = time.time()
if now - CACHED_TIME > 30:
CACHED_RESULT = _fetch()
CACHED_TIME = now
return jsonify(CACHED_RESULT)
I'm running this service and we're just proxying that one route through our own version:
location /apollo/organism/findAllOrganisms {
proxy_pass http://127.0.0.1:4321/get;
}
I think there are a couple parts to the issue:
- lack of any indexes, not even on sequence.id, feature.id, etc.
- doing operations in groovy rather than doing them in the DB, resulting in fetching more data and processing more slowly than the DB can.
A key point for me is that I really don't think apollo needs a graph database. I think it just needs some time spent understanding how to most effectively use SQL (I'm happy to offer my expertise there.)
On my side I've made some profiling: most of the time is spent in this for loop, it takes ~0.2s per organism on my test setup => ~10sec for 40 orgs => you can easily hit a timeout if you have many orgs
@hexylena I took the liberty to dockerize your code there: https://github.com/galaxy-genome-annotation/apolpi
I hope/guess it's ok for you (licensing too?)
Ahhh awesome @abretaud that'll make it easier to deploy.
yeah license is fine :) (Normally I'd do agpl3 to force folks to contribute back their changes, but, in this case I don't think it matters)
Cool thanks, used on apololo.genouest.org and bipaa.genouest.org/apollo now
Yep I noticed it's slow too but I don't know why, maybe it's doing things on the data dir!?
It's odd, the API responds quickly, it was just through the UI. Anyway