earthgecko / skyline

Anomaly detection

Home Page: http://earthgecko-skyline.readthedocs.io/en/latest/

Luminosity's possible process preventing condition

ashemez opened this issue · comments

In the section below, from the spin_process function in luminosity.py, I found a condition that can block Luminosity processing and produce noisy logs. Environments that have been processing their anomalies without any data loss probably won't hit it. I had previously loaded test metrics, then removed most of them, and probably cleaned up some other metrics too, while Skyline was processing. This causes the Luminosity process to get stuck at this point, because the metrics can no longer be found in the DB:

if not correlated_metrics:
    logger.info('no correlations found for %s anomaly id %s' % (
        base_name, str(anomaly_id)))
    return False

So this section always returns from the spinning process without marking the unprocessed anomaly_id as processed. There has been a data loss or cleanup somewhere, but the anomaly_id is still left in the DB. The reference anomaly_id actually comes from this section:

if not last_processed_anomaly_id:
    query = 'SELECT id FROM luminosity WHERE id=(SELECT MAX(id) FROM luminosity) ORDER BY id DESC LIMIT 1'
    results = None
    try:
        results = mysql_select(skyline_app, query)
    except:
        logger.error(traceback.format_exc())
        logger.error('error :: MySQL quey failed - %s' % query)
    if results:
        try:
            last_processed_anomaly_id = int(results[0][0])
            logger.info('last_processed_anomaly_id found from DB - %s' % str(last_processed_anomaly_id))
        except:
            logger.error(traceback.format_exc())

Because metrics can go missing, we should change this query: SELECT id FROM luminosity WHERE id=(SELECT MAX(id) FROM luminosity) ORDER BY id DESC LIMIT 1
Separately, the query can simply be rewritten as SELECT MAX(id) FROM luminosity, since the subselect is not needed here.
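As a quick sanity check of that simplification, here is a minimal sketch that runs both queries against an in-memory SQLite database. SQLite only stands in for MySQL here, and the two-column luminosity table is a hypothetical reduction of the real schema, so treat it as an illustration rather than a test of Skyline itself.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the Skyline MySQL database,
# and this two-column table is a simplified, hypothetical luminosity schema.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE luminosity (id INTEGER, metric_id INTEGER)')
conn.executemany('INSERT INTO luminosity VALUES (?, ?)',
                 [(3, 10), (7, 11), (7, 12), (5, 10)])

original = ('SELECT id FROM luminosity WHERE id=(SELECT MAX(id) FROM luminosity) '
            'ORDER BY id DESC LIMIT 1')
simplified = 'SELECT MAX(id) FROM luminosity'

# With rows present, both queries yield the same id.
original_id = conn.execute(original).fetchone()[0]
simplified_id = conn.execute(simplified).fetchone()[0]

# With an empty table they differ in shape: the original returns no rows,
# while MAX() returns a single row containing NULL.
conn.execute('DELETE FROM luminosity')
empty_original = conn.execute(original).fetchall()
empty_simplified = conn.execute(simplified).fetchall()
```

One thing to keep in mind if the query is swapped: on an empty table the MAX(id) form returns one row holding NULL, so a truthiness check such as `if results:` would pass and `int(results[0][0])` would raise. In the code quoted above that would land in the existing except block, but an explicit None check may be cleaner.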

But I am still uncomfortable with this query, since it does not properly handle the anomaly ids that Luminosity has not yet processed. First of all, MAX(id) does not guarantee the latest unprocessed anomaly id, as you know, because sequential primary ids in a table can be reused from previously deleted records once those ids become available again.

I tried the following query after the one above in my environment. It works, and the spinning process no longer gets stuck on an anomaly id that cannot be processed because its metrics no longer exist:

now = int(time())
after = now - 600
query = 'SELECT id FROM anomalies WHERE id NOT IN (SELECT DISTINCT id FROM luminosity) AND anomaly_timestamp > \'%s\' ORDER BY anomaly_timestamp ASC LIMIT 1' % str(after)
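To make the behaviour of that query concrete, here is a small sketch against an in-memory SQLite database. The table layouts, the helper name next_unprocessed_anomaly_id and the fixed `now` value are all assumptions made for illustration. It also binds the timestamp as a query parameter instead of formatting it into the string.

```python
import sqlite3

def next_unprocessed_anomaly_id(conn, now, window=600):
    """Return the oldest recent anomaly id with no luminosity row, or None.

    Hypothetical helper: the name and signature are not part of Skyline.
    """
    after = now - window
    query = ('SELECT id FROM anomalies '
             'WHERE id NOT IN (SELECT DISTINCT id FROM luminosity) '
             'AND anomaly_timestamp > ? '
             'ORDER BY anomaly_timestamp ASC LIMIT 1')
    row = conn.execute(query, (after,)).fetchone()
    return row[0] if row else None

# Assumption: simplified stand-ins for the anomalies and luminosity tables.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE anomalies (id INTEGER PRIMARY KEY, anomaly_timestamp INTEGER)')
conn.execute('CREATE TABLE luminosity (id INTEGER, metric_id INTEGER)')

now = 1624000000  # fixed timestamp so the example is deterministic
conn.executemany('INSERT INTO anomalies VALUES (?, ?)', [
    (1, now - 900),  # older than the 600 second window, ignored
    (2, now - 300),  # inside the window, already present in luminosity
    (3, now - 200),  # inside the window, not yet processed
    (4, now - 100),  # inside the window, not yet processed
])
conn.execute('INSERT INTO luminosity VALUES (2, 10)')

next_id = next_unprocessed_anomaly_id(conn, now)
```

Here next_id is 3: anomaly 1 falls outside the window, anomaly 2 already has a luminosity row, and 3 is the oldest of the remaining candidates.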

The time range can be tuned to an optimal window, but this is already a better-behaved DB query. I am still uncomfortable with the condition WHERE id NOT IN (SELECT DISTINCT id FROM luminosity), though, since the record count of the luminosity table grows continuously and this would degrade performance in the long run. It could be better to also have a luminosity_processed flag in the anomalies table for this case, so the condition could be changed like this:

query = 'SELECT id FROM anomalies WHERE luminosity_processed=0 AND anomaly_timestamp > \'%s\' ORDER BY anomaly_timestamp ASC LIMIT 1' % str(after)
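A rough sketch of how the flag-based variant could behave, again using in-memory SQLite as a stand-in for MySQL. The luminosity_processed column is the proposed addition and does not exist in the current schema, and the drain helper is purely illustrative. The point it tries to show is that the flag is set whether or not any correlations were found, so an anomaly whose metrics are gone cannot block the queue.

```python
import sqlite3

SELECT_NEXT = ('SELECT id FROM anomalies WHERE luminosity_processed=0 '
               'AND anomaly_timestamp > ? '
               'ORDER BY anomaly_timestamp ASC LIMIT 1')
MARK_DONE = 'UPDATE anomalies SET luminosity_processed=1 WHERE id=?'

def drain(conn, now, window=600):
    """Process every unflagged anomaly inside the window, oldest first.

    Hypothetical helper: real correlation work would replace the comment below.
    """
    processed = []
    while True:
        row = conn.execute(SELECT_NEXT, (now - window,)).fetchone()
        if row is None:
            return processed
        anomaly_id = row[0]
        # Correlation lookup would happen here. The flag is set regardless of
        # the outcome, so an anomaly without correlations is not retried.
        conn.execute(MARK_DONE, (anomaly_id,))
        processed.append(anomaly_id)

# Assumption: simplified anomalies table with the proposed flag column.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE anomalies ('
             'id INTEGER PRIMARY KEY, '
             'anomaly_timestamp INTEGER, '
             'luminosity_processed INTEGER NOT NULL DEFAULT 0)')

now = 1624000000  # fixed timestamp so the example is deterministic
conn.executemany('INSERT INTO anomalies (id, anomaly_timestamp) VALUES (?, ?)',
                 [(1, now - 900), (2, now - 300), (3, now - 200)])

processed_ids = drain(conn, now)
still_unflagged = conn.execute(
    'SELECT COUNT(*) FROM anomalies WHERE luminosity_processed=0').fetchone()[0]
```

After the run, anomalies 2 and 3 are flagged, while anomaly 1 stays unflagged because it is older than the window. An index covering luminosity_processed and anomaly_timestamp would probably help on a large table, though that is a guess rather than something measured.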

@earthgecko I just pushed a new branch to my fork: https://github.com/ashemez/skyline/tree/20210620-luminosity-mysql-check Could you please review it? I removed the other MySQL checks.