Why do some of the alert rules stop querying?
userzhangqg opened this issue · comments
Why did some of my alert rules stop searching after a network timeout?
When the network problem occurred, the query was interrupted. I am not sure whether this is strongly related to the network problem. When I restarted the service, the problem was resolved, but it appeared again some time later.
error log:
2023-08-25T00:10:22.415045935Z requests.exceptions.ReadTimeout: HTTPConnectionPool(host='10.10.xxx.xxx', port=9200): Read timed out. (read timeout=60)
2023-08-25T00:10:22.415957613Z ERROR:elastalert:Error writing alert info to Elasticsearch: ConnectionTimeout caused by - ReadTimeout(HTTPConnectionPool(host='10.10.xxx.xxx', port=9200): Read timed out. (read timeout=60))
2023-08-25T00:10:22.415977697Z Traceback (most recent call last):
2023-08-25T00:10:22.415980521Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
2023-08-25T00:10:22.415983119Z six.raise_from(e, None)
2023-08-25T00:10:22.415985367Z File "<string>", line 3, in raise_from
2023-08-25T00:10:22.415988231Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
2023-08-25T00:10:22.415990426Z httplib_response = conn.getresponse()
2023-08-25T00:10:22.415992524Z File "/usr/local/lib/python3.10/http/client.py", line 1374, in getresponse
2023-08-25T00:10:22.415994820Z response.begin()
2023-08-25T00:10:22.415997066Z File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
2023-08-25T00:10:22.415999308Z version, status, reason = self._read_status()
2023-08-25T00:10:22.416001510Z File "/usr/local/lib/python3.10/http/client.py", line 279, in _read_status
2023-08-25T00:10:22.416003717Z line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
2023-08-25T00:10:22.416005962Z File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
2023-08-25T00:10:22.416008089Z return self._sock.recv_into(b)
2023-08-25T00:10:22.416010135Z TimeoutError: timed out
2023-08-25T00:10:22.416012290Z
2023-08-25T00:10:22.416014463Z During handling of the above exception, another exception occurred:
2023-08-25T00:10:22.416016653Z
2023-08-25T00:10:22.416018689Z Traceback (most recent call last):
2023-08-25T00:10:22.416020785Z File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
2023-08-25T00:10:22.416022953Z resp = conn.urlopen(
2023-08-25T00:10:22.416024971Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
2023-08-25T00:10:22.416027153Z retries = retries.increment(
2023-08-25T00:10:22.416029335Z File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
2023-08-25T00:10:22.416031597Z raise six.reraise(type(error), error, _stacktrace)
2023-08-25T00:10:22.416033704Z File "/usr/local/lib/python3.10/site-packages/urllib3/packages/six.py", line 770, in reraise
2023-08-25T00:10:22.416035951Z raise value
2023-08-25T00:10:22.416038025Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
2023-08-25T00:10:22.416064496Z httplib_response = self._make_request(
2023-08-25T00:10:22.416067118Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 451, in _make_request
2023-08-25T00:10:22.416069400Z self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
2023-08-25T00:10:22.416072284Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 340, in _raise_timeout
2023-08-25T00:10:22.416074481Z raise ReadTimeoutError(
2023-08-25T00:10:22.416076522Z urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='10.10.xxx.xxx', port=9200): Read timed out. (read timeout=60)
2023-08-25T00:10:22.416078740Z
2023-08-25T00:10:22.416080853Z During handling of the above exception, another exception occurred:
2023-08-25T00:10:22.416083073Z
2023-08-25T00:10:22.416085048Z Traceback (most recent call last):
2023-08-25T00:10:22.416087459Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_requests.py", line 161, in perform_request
2023-08-25T00:10:22.416089913Z response = self.session.send(prepared_request, **send_kwargs)
2023-08-25T00:10:22.416091991Z File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
2023-08-25T00:10:22.416094191Z r = adapter.send(request, **kwargs)
2023-08-25T00:10:22.416096231Z File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 578, in send
2023-08-25T00:10:22.416098610Z raise ReadTimeout(e, request=request)
2023-08-25T00:10:22.416100830Z requests.exceptions.ReadTimeout: HTTPConnectionPool(host='10.10.128.82', port=9200): Read timed out. (read timeout=60)
2023-08-25T00:10:22.416103129Z
2023-08-25T00:10:22.416105160Z During handling of the above exception, another exception occurred:
2023-08-25T00:10:22.416107275Z
2023-08-25T00:10:22.416109224Z Traceback (most recent call last):
2023-08-25T00:10:22.416111330Z File "/usr/local/lib/python3.10/site-packages/elastalert/elastalert.py", line 1455, in writeback
2023-08-25T00:10:22.416113508Z res = self.writeback_es.index(index=index, body=body)
2023-08-25T00:10:22.416115728Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/client/utils.py", line 152, in _wrapped
2023-08-25T00:10:22.416123485Z return func(*args, params=params, headers=headers, **kwargs)
2023-08-25T00:10:22.416125898Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/client/__init__.py", line 398, in index
2023-08-25T00:10:22.416128176Z return self.transport.perform_request(
2023-08-25T00:10:22.416130213Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/transport.py", line 392, in perform_request
2023-08-25T00:10:22.416132406Z raise e
2023-08-25T00:10:22.416134382Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/transport.py", line 358, in perform_request
2023-08-25T00:10:22.416161284Z status, headers_response, data = connection.perform_request(
2023-08-25T00:10:22.416163776Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_requests.py", line 176, in perform_request
2023-08-25T00:10:22.416166090Z raise ConnectionTimeout("TIMEOUT", str(e), e)
2023-08-25T00:10:22.416168368Z elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeout(HTTPConnectionPool(host='10.10.xxx.xxx', port=9200): Read timed out. (read timeout=60))
2023-08-25T00:10:22.416171455Z WARNING:elastalert:Querying from 2023-08-25 07:54 CST to 2023-08-25 08:09 CST took longer than 0:00:30!
2023-08-25T00:10:26.406173555Z WARNING:apscheduler.scheduler:Execution of job "Rule: smbee-log-warn-live1-smebee-live (trigger: interval[0:00:30], next run at: 2023-08-25 08:10:26 CST)" skipped: maximum number of running instances reached (1)
As you can see from the ElastAlert 2 search log in the ES store, there have been two interruptions in the last 30 days; I restarted the service twice, and each time the alert rules resumed.
my version: 2.6.0
my config:
rules_folder: /opt/elastalert/rules
scan_subdirectories: true
run_every:
  seconds: 30
buffer_time:
  minutes: 15
es_host: xxx
es_port: 9200
es_conn_timeout: 60
es_username: xxx
es_password: xxx
writeback_index: elastalert2-env-live
use_ssl: false
alert_time_limit:
  minutes: 2880
rule:
  6d5ab811-f89c-46c9-b612-f9bde3002e3e-live.yaml: |
    name: sql-error-warn-fac-live
    alert:
    - alertmanager
    filter:
    - query:
        query_string:
          query: '"Data truncation" OR "TooManyResultsException"'
    index: lls.fac.prd-%Y.%m.%d
    is_enabled: true
    num_events: 1
    runevery:
      seconds: 30
    timeframe:
      minutes: 15
    timestamp_field: '@timestamp'
    timestamp_type: iso
    type: frequency
    use_strftime_index: true
    ignore_null: true
    alertmanager_hosts:
    - http://alertmanager-main:9093
    alertmanager_api_version: v1
    alertmanager_alertname: sql-error-warn-fac-live
    alertmanager_annotations:
      beecloud_notify_channel: ""
      beecloud_notify_template: ""
      beecloud_notify_webhook: ""
      beecloud_project: fac
      beecloud_receiver_groups: ""
      beecloud_receivers: ""
    alertmanager_labels:
      cluster_id: logging
      raw_rule: sql-error-warn
      severity: critical
      source_id: 6d5ab811-f89c-46c9-b612-f9bde3002e3e
    alertmanager_fields:
      env: env
      host_name: host.name
      hostname: hostname
      log: log
      log_file_path: log.file.path
      logPath: logPath
      message: message
      num_hits: num_hits
      num_matches: num_matches
      project: project
      service: service
      timestamp: '@timestamp'
      trace_id: trace_id
There's not enough of the log included for me to understand what happened to ElastAlert 2 after the connection problem. If you are saying that the entire ElastAlert 2 application ceased to function, I will need those logs.
However, the discussion title suggests that only the single rule is no longer firing. ElastAlert 2 will automatically disable rules that fail with uncaught exceptions. This behavior can be disabled if desired.
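For reference, the global option that controls this is, as I understand it, `disable_rules_on_error`, which defaults to true. A minimal sketch of turning it off in the global config file (verify the option name against the ElastAlert 2 documentation for your version):

```yaml
# Global ElastAlert 2 config (config.yaml), assumed option name.
# With this set to false, a rule that throws an uncaught exception
# (e.g. a ConnectionTimeout) stays scheduled instead of being disabled.
disable_rules_on_error: false
```

Note that leaving this enabled is the safer default: a rule that is repeatedly failing may be broken rather than merely unlucky, so disabling auto-disable trades silent rule loss for potentially noisy retries.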