stats telemetry stops collecting after it encounters a server error
inqueue opened this issue
Jason Bryan commented
Rally version (get with esrally --version): esrally 2.9.0.dev0 (git revision: 50ebcb68d9f09de545a1bfb217fc9840b97a367e)
esrally race --pipeline=benchmark-only --track-repository="default" --track="nyc_taxis" --challenge="autoscale" --telemetry='["node-stats", "shard-stats", "blob-store-stats"]' --on-error="continue" --target-hosts=target-hosts.json --client-options=client-options.json --track-params=track-params.json --telemetry-params=telemetry-params.json --user-tags=user-tags.json --race-id=c5420fb2-d073-4a6f-a54a-f98244e9b74b --load-driver-hosts=127.0.0.1
Description of the problem including expected versus actual behavior:
Rally will stop retrying to collect stats telemetry once it has failed too many times.
- At the time of the last stats collection attempt, the benchmark showed a steady and prolonged increase in average bulk indexing latency.
- Rally recorded 0 bulk indexing failures, though indexing throughput dropped significantly.
- Subsequent manual stats calls to the cluster were successful.
Provide logs (if relevant):
2023-08-25 16:30:33,699 ActorAddr-(T|:45481)/PID:7942 esrally.telemetry ERROR Could not determine master node stats
Traceback (most recent call last):
  File "~/rally/esrally/telemetry.py", line 172, in run
    self.recorder.record()
  File "~/rally/esrally/telemetry.py", line 2249, in record
    info = self.client.nodes.info(node_id=state["master_node"], metric="os")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/utils.py", line 414, in wrapped
    return api(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/nodes.py", line 249, in info
    return self.perform_request(  # type: ignore[return-value]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/_base.py", line 390, in perform_request
    return self._client.perform_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/rally/esrally/client/synchronous.py", line 226, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(message=message, meta=meta, body=resp_body)
elasticsearch.ApiError: ApiError(503, "{'ok': False, 'message': 'The requested resource is currently unavailable.'}")
The benchmark was using the default node-stats-sample-interval of 1s. One second seems aggressive, and I will try with a value of 10s. We might consider a new default.
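For anyone who wants to try the same workaround, here is a rough sketch of setting the interval in the telemetry params. The contents of my telemetry-params.json are not shown in this issue, so the helper below (with_sample_interval is a made-up name) just merges the setting into whatever is already there. It assumes the value is given as a number of seconds, which is my reading of the Rally telemetry docs.

```python
import json


def with_sample_interval(params, seconds):
    """Return a copy of the telemetry params with the node-stats interval set.

    Hypothetical helper for illustration only; it is not part of Rally.
    Assumes "node-stats-sample-interval" is a positive number of seconds.
    """
    if seconds <= 0:
        raise ValueError("sample interval must be greater than zero")
    updated = dict(params)  # leave the caller's dict untouched
    updated["node-stats-sample-interval"] = seconds
    return updated


if __name__ == "__main__":
    # Start from an empty dict here; in practice, load the existing file first.
    print(json.dumps(with_sample_interval({}, 10), indent=2))
```

The printed JSON can be saved as the file passed to --telemetry-params.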
Jason Bryan commented
The issue appears to only affect the node-stats telemetry device.
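To illustrate the behavior I would expect, below is a minimal sketch of a sampler loop that logs a failed sample and tries again at the next interval instead of exiting. This is not Rally's actual sampler code. ResilientSampler, finish, and failures are names made up for the example, and the only thing taken from the traceback above is that the sampler calls recorder.record().

```python
import logging
import threading

logger = logging.getLogger(__name__)


class ResilientSampler(threading.Thread):
    """Hypothetical sampler that calls recorder.record() at a fixed interval."""

    def __init__(self, recorder, sample_interval=1.0):
        super().__init__(daemon=True)
        self.recorder = recorder
        self.sample_interval = sample_interval
        self.failures = 0
        self._stop_event = threading.Event()

    def finish(self):
        """Ask the loop to stop and wait for the thread to exit."""
        self._stop_event.set()
        self.join()

    def run(self):
        # Event.wait() returns False on timeout (take another sample)
        # and True once finish() has set the event (leave the loop).
        while not self._stop_event.wait(self.sample_interval):
            try:
                self.recorder.record()
            except Exception:
                # The try/except sits inside the loop, so a transient error
                # such as the 503 above costs one sample, not the whole run.
                self.failures += 1
                logger.exception("Sample failed; retrying at the next interval")
```

With the try/except inside the loop, a temporary 503 would leave a gap in the recorded stats rather than ending collection for the rest of the race.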