Why do some of the alert rules stop querying?
userzhangqg opened this issue · comments
Why did some of my alert rules stop searching after a network timeout?
When the network problem occurred, the query was interrupted. I am not sure whether this is strongly related to the network problem. When I restarted the service, the problem was resolved, but it appeared again some time later.
error log:
2023-08-25T00:10:22.415045935Z requests.exceptions.ReadTimeout: HTTPConnectionPool(host='10.10.xxx.xxx', port=9200): Read timed out. (read timeout=60)
2023-08-25T00:10:22.415957613Z ERROR:elastalert:Error writing alert info to Elasticsearch: ConnectionTimeout caused by - ReadTimeout(HTTPConnectionPool(host='10.10.xxx.xxx', port=9200): Read timed out. (read timeout=60))
2023-08-25T00:10:22.415977697Z Traceback (most recent call last):
2023-08-25T00:10:22.415980521Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
2023-08-25T00:10:22.415983119Z six.raise_from(e, None)
2023-08-25T00:10:22.415985367Z File "<string>", line 3, in raise_from
2023-08-25T00:10:22.415988231Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
2023-08-25T00:10:22.415990426Z httplib_response = conn.getresponse()
2023-08-25T00:10:22.415992524Z File "/usr/local/lib/python3.10/http/client.py", line 1374, in getresponse
2023-08-25T00:10:22.415994820Z response.begin()
2023-08-25T00:10:22.415997066Z File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
2023-08-25T00:10:22.415999308Z version, status, reason = self._read_status()
2023-08-25T00:10:22.416001510Z File "/usr/local/lib/python3.10/http/client.py", line 279, in _read_status
2023-08-25T00:10:22.416003717Z line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
2023-08-25T00:10:22.416005962Z File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
2023-08-25T00:10:22.416008089Z return self._sock.recv_into(b)
2023-08-25T00:10:22.416010135Z TimeoutError: timed out
2023-08-25T00:10:22.416012290Z
2023-08-25T00:10:22.416014463Z During handling of the above exception, another exception occurred:
2023-08-25T00:10:22.416016653Z
2023-08-25T00:10:22.416018689Z Traceback (most recent call last):
2023-08-25T00:10:22.416020785Z File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
2023-08-25T00:10:22.416022953Z resp = conn.urlopen(
2023-08-25T00:10:22.416024971Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
2023-08-25T00:10:22.416027153Z retries = retries.increment(
2023-08-25T00:10:22.416029335Z File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
2023-08-25T00:10:22.416031597Z raise six.reraise(type(error), error, _stacktrace)
2023-08-25T00:10:22.416033704Z File "/usr/local/lib/python3.10/site-packages/urllib3/packages/six.py", line 770, in reraise
2023-08-25T00:10:22.416035951Z raise value
2023-08-25T00:10:22.416038025Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
2023-08-25T00:10:22.416064496Z httplib_response = self._make_request(
2023-08-25T00:10:22.416067118Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 451, in _make_request
2023-08-25T00:10:22.416069400Z self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
2023-08-25T00:10:22.416072284Z File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 340, in _raise_timeout
2023-08-25T00:10:22.416074481Z raise ReadTimeoutError(
2023-08-25T00:10:22.416076522Z urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='10.10.xxx.xxx', port=9200): Read timed out. (read timeout=60)
2023-08-25T00:10:22.416078740Z
2023-08-25T00:10:22.416080853Z During handling of the above exception, another exception occurred:
2023-08-25T00:10:22.416083073Z
2023-08-25T00:10:22.416085048Z Traceback (most recent call last):
2023-08-25T00:10:22.416087459Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_requests.py", line 161, in perform_request
2023-08-25T00:10:22.416089913Z response = self.session.send(prepared_request, **send_kwargs)
2023-08-25T00:10:22.416091991Z File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
2023-08-25T00:10:22.416094191Z r = adapter.send(request, **kwargs)
2023-08-25T00:10:22.416096231Z File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 578, in send
2023-08-25T00:10:22.416098610Z raise ReadTimeout(e, request=request)
2023-08-25T00:10:22.416100830Z requests.exceptions.ReadTimeout: HTTPConnectionPool(host='10.10.128.82', port=9200): Read timed out. (read timeout=60)
2023-08-25T00:10:22.416103129Z
2023-08-25T00:10:22.416105160Z During handling of the above exception, another exception occurred:
2023-08-25T00:10:22.416107275Z
2023-08-25T00:10:22.416109224Z Traceback (most recent call last):
2023-08-25T00:10:22.416111330Z File "/usr/local/lib/python3.10/site-packages/elastalert/elastalert.py", line 1455, in writeback
2023-08-25T00:10:22.416113508Z res = self.writeback_es.index(index=index, body=body)
2023-08-25T00:10:22.416115728Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/client/utils.py", line 152, in _wrapped
2023-08-25T00:10:22.416123485Z return func(*args, params=params, headers=headers, **kwargs)
2023-08-25T00:10:22.416125898Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/client/__init__.py", line 398, in index
2023-08-25T00:10:22.416128176Z return self.transport.perform_request(
2023-08-25T00:10:22.416130213Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/transport.py", line 392, in perform_request
2023-08-25T00:10:22.416132406Z raise e
2023-08-25T00:10:22.416134382Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/transport.py", line 358, in perform_request
2023-08-25T00:10:22.416161284Z status, headers_response, data = connection.perform_request(
2023-08-25T00:10:22.416163776Z File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_requests.py", line 176, in perform_request
2023-08-25T00:10:22.416166090Z raise ConnectionTimeout("TIMEOUT", str(e), e)
2023-08-25T00:10:22.416168368Z elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeout(HTTPConnectionPool(host='10.10.xxx.xxx', port=9200): Read timed out. (read timeout=60))
2023-08-25T00:10:22.416171455Z WARNING:elastalert:Querying from 2023-08-25 07:54 CST to 2023-08-25 08:09 CST took longer than 0:00:30!
2023-08-25T00:10:26.406173555Z WARNING:apscheduler.scheduler:Execution of job "Rule: smbee-log-warn-live1-smebee-live (trigger: interval[0:00:30], next run at: 2023-08-25 08:10:26 CST)" skipped: maximum number of running instances reached (1)
As you can see from the ElastAlert 2 search log in the ES store, there have been two interruptions in the last 30 days; I restarted the service twice, and each time the alert rules resumed.
my version: 2.6.0
my config:
rules_folder: /opt/elastalert/rules
scan_subdirectories: true
run_every:
  seconds: 30
buffer_time:
  minutes: 15
es_host: xxx
es_port: 9200
es_conn_timeout: 60
es_username: xxx
es_password: xxx
writeback_index: elastalert2-env-live
use_ssl: false
alert_time_limit:
  minutes: 2880
rule:
  6d5ab811-f89c-46c9-b612-f9bde3002e3e-live.yaml: |
    name: sql-error-warn-fac-live
    alert:
    - alertmanager
    filter:
    - query:
        query_string:
          query: '"Data truncation" OR "TooManyResultsException"'
    index: lls.fac.prd-%Y.%m.%d
    is_enabled: true
    num_events: 1
    runevery:
      seconds: 30
    timeframe:
      minutes: 15
    timestamp_field: '@timestamp'
    timestamp_type: iso
    type: frequency
    use_strftime_index: true
    ignore_null: true
    alertmanager_hosts:
    - http://alertmanager-main:9093
    alertmanager_api_version: v1
    alertmanager_alertname: sql-error-warn-fac-live
    alertmanager_annotations:
      beecloud_notify_channel: ""
      beecloud_notify_template: ""
      beecloud_notify_webhook: ""
      beecloud_project: fac
      beecloud_receiver_groups: ""
      beecloud_receivers: ""
    alertmanager_labels:
      cluster_id: logging
      raw_rule: sql-error-warn
      severity: critical
      source_id: 6d5ab811-f89c-46c9-b612-f9bde3002e3e
    alertmanager_fields:
      env: env
      host_name: host.name
      hostname: hostname
      log: log
      log_file_path: log.file.path
      logPath: logPath
      message: message
      num_hits: num_hits
      num_matches: num_matches
      project: project
      service: service
      timestamp: '@timestamp'
      trace_id: trace_id
There's not enough of the log included for me to understand what happened to ElastAlert 2 after the connection problem. If you are saying that the entire ElastAlert 2 application ceased to function, I will need those logs.
However, the discussion title suggests that only the single rule is no longer firing. ElastAlert 2 will automatically disable rules that fail with uncaught exceptions. This behavior can be disabled if desired.
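For reference, the global option that controls this is, as I understand it, `disable_rules_on_error`, which defaults to true. A minimal sketch of turning it off in the global config file (verify the option name against the ElastAlert 2 documentation for your version):

```yaml
# Global ElastAlert 2 config (config.yaml), assumed option name.
# With this set to false, a rule that throws an uncaught exception
# (e.g. a ConnectionTimeout) stays scheduled instead of being disabled.
disable_rules_on_error: false
```

Note that leaving this enabled is the safer default: a rule that is repeatedly failing may be broken rather than merely unlucky, so disabling auto-disable trades silent rule loss for potentially noisy retries.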