StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident response, troubleshooting, deployments, and more, for DevOps and SREs. It includes a rules engine, workflows, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org), and ChatOps. Installer at https://docs.stackstorm.com/install/index.html

Home Page: https://stackstorm.com/


MongoDB sort memory limit exceeded when retrieving execution history

machneit opened this issue

SUMMARY

Hi,
After updating StackStorm from 3.6 to 3.8, I ran into a sort memory limit issue with MongoDB when retrieving the execution history.

STACKSTORM VERSION

st2 3.8.0, on Python 3.8.13

OS, environment, install method

Single-box install: StackStorm 3.8, MongoDB 5.0.6, and RabbitMQ 3.8.8 on a single Red Hat 8 server.

Steps to reproduce the problem

You can reproduce the problem by opening a mongosh shell and running: db.adminCommand({setParameter: 1, internalQueryMaxBlockingSortMemoryUsageBytes: 1}).
This limits the allowed sort memory usage to 1 byte, which generates the exact same error I encountered.
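To undo the artificial limit after testing, the parameter can be reset to its default (100 MiB) in the same mongosh session. These are standard MongoDB admin commands, not StackStorm-specific:

```javascript
// Restore the default blocking-sort memory limit (100 MiB = 104857600 bytes).
db.adminCommand({
  setParameter: 1,
  internalQueryMaxBlockingSortMemoryUsageBytes: 104857600
})

// Verify the current value:
db.adminCommand({
  getParameter: 1,
  internalQueryMaxBlockingSortMemoryUsageBytes: 1
})
```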

Expected Results

I expected to see the history.

Actual Results

Error encountered when accessing the history (see detailed error message section of this issue):
[screenshots of the error shown in the StackStorm Web UI]
This is probably due to the way StackStorm gathers the execution history (line 1135, in _get_action_executions -> return super(ActionExecutionsController, self)._get_all() in actionexecutions.py).
Could this be fixed/optimized in the next StackStorm release?
I found a temporary fix by tripling the value of the MongoDB ‘internalQueryMaxBlockingSortMemoryUsageBytes’ parameter, but with it the history takes more than a minute to load, versus a couple of seconds before. This error never appeared in StackStorm 3.6, so it's probably related to a change in how StackStorm gathers the history.

Detailed error message:

2023-02-23 08:06:22,140 140164375924048 ERROR router [-] Failed to call controller function "get_all" for operation "st2api.controllers.v1.actionexecutions:action_executions_controller.get_all": Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in., full error: {'ok': 0.0, 'errmsg': 'Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.', 'code': 292, 'codeName': 'QueryExceededMemoryLimitNoDiskUseAllowed'}
Traceback (most recent call last):
File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2common/router.py", line 632, in call
resp = func(**kw)
File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/controllers/v1/actionexecutions.py", line 739, in get_all
return self._get_action_executions(
File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/controllers/v1/actionexecutions.py", line 1135, in _get_action_executions
return super(ActionExecutionsController, self)._get_all(
File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/controllers/resource.py", line 244, in _get_all
result = self.resources_model_filter(
File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/controllers/resource.py", line 570, in resources_model_filter
result = super(
File "/opt/stackstorm/st2/lib/python3.8/site-packages/st2api/controllers/resource.py", line 275, in resources_model_filter
for instance in instances[offset:eop]:
File "/opt/stackstorm/st2/lib/python3.8/site-packages/mongoengine/queryset/queryset.py", line 110, in _iter_results
self._populate_cache()
File "/opt/stackstorm/st2/lib/python3.8/site-packages/mongoengine/queryset/queryset.py", line 129, in _populate_cache
self._result_cache.append(next(self))
File "/opt/stackstorm/st2/lib/python3.8/site-packages/mongoengine/queryset/base.py", line 1574, in next
raw_doc = next(self._cursor)
File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/cursor.py", line 1207, in next
if len(self.__data) or self._refresh():
File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/cursor.py", line 1124, in _refresh
self.__send_message(q)
File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/cursor.py", line 999, in __send_message
response = client._run_operation_with_response(
File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1368, in _run_operation_with_response
return self._retryable_read(
File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1471, in _retryable_read
return func(session, server, sock_info, slave_ok)
File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1360, in _cmd
return server.run_operation_with_response(
File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/server.py", line 135, in run_operation_with_response
_check_command_response(first, sock_info.max_wire_version)
File "/opt/stackstorm/st2/lib/python3.8/site-packages/pymongo/helpers.py", line 164, in _check_command_response
raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in., full error: {'ok': 0.0, 'errmsg': 'Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.', 'code': 292, 'codeName': 'QueryExceededMemoryLimitNoDiskUseAllowed'} (_exception_class='OperationFailure',_exception_message="Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in., full error: {'ok': 0.0, 'errmsg': 'Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.', 'code': 292, 'codeName': 'QueryExceededMemoryLimitNoDiskUseAllowed'}",_exception_data={'_message': "Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in., full error: {'ok': 0.0, 'errmsg': 'Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.', 'code': 292, 'codeName': 'QueryExceededMemoryLimitNoDiskUseAllowed'}", '_error_labels': set(), '_OperationFailure__code': 292, '_OperationFailure__details': {'ok': 0.0, 'errmsg': 'Executor error during find command :: caused by :: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. 
Pass allowDiskUse:true to opt in.', 'code': 292, 'codeName': 'QueryExceededMemoryLimitNoDiskUseAllowed'}, '_OperationFailure__max_wire_version': 13})

Hi,

Thank you for reporting this issue. At this time, MongoDB v5 is not supported by StackStorm; you will need to use MongoDB v4.4.

Hi,
Do you know when MongoDB v5 will be supported?

I experience the same issue.
Is there any possible fix or workaround?
Is there a way to add the allowDiskUse:true option to the history query?
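As far as I know there is no st2 config option for this, but the underlying query can be tried by hand in mongosh on MongoDB 4.4+, which supports allowDiskUse() on find cursors. The collection name action_execution_d_b is an assumption (it is what mongoengine derives for the execution model); confirm it with show collections first:

```javascript
// Hand-run the history sort with external (disk-backed) sorting enabled.
// Collection name is an assumption -- confirm with `show collections`.
use st2
db.action_execution_d_b
  .find({})
  .sort({ start_timestamp: -1 })
  .allowDiskUse()
  .limit(50)
```

If this succeeds where the API call fails, it confirms the sort itself is the problem rather than the data.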

Also experiencing this issue on Mongo 4.4.
It seems like I ran an action with a large dataset, or ran it too many times? Not sure. I can filter on other actions and see them fine, but when I choose this specific action (or all actions) I get the error.
I noticed that the stdout of these workflows is extremely large, so that's definitely what causes the errors. The combined results of multiple executions surpass the MongoDB sort memory limit.
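This observation can be made concrete with a back-of-the-envelope calculation. The error message above gives the default blocking-sort limit (104857600 bytes, i.e. 100 MiB); the average document size here is a hypothetical input, not a measured StackStorm value:

```python
# MongoDB's default internalQueryMaxBlockingSortMemoryUsageBytes (100 MiB),
# taken from the error message in this issue.
SORT_LIMIT_BYTES = 104_857_600


def executions_until_sort_fails(avg_doc_bytes: int) -> int:
    """How many execution documents of the given average serialized size
    fit under the blocking-sort limit before MongoDB aborts with
    QueryExceededMemoryLimitNoDiskUseAllowed (code 292)."""
    return SORT_LIMIT_BYTES // avg_doc_bytes


# Executions whose captured stdout pushes each document to ~5 MiB:
print(executions_until_sort_fails(5 * 1024 * 1024))  # -> 20
```

Just 20 such executions are enough to break the in-memory sort, which matches the reports that a single chatty workflow run repeatedly triggers the error.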

Also seeing this with Stackstorm 3.8 on Mongo 4.4. I had to roll back to Stackstorm 3.6, which does not manifest this issue.

Is there a workaround or fix expected to resolve this?


Hi,
I'm facing the same problem after an update from 3.6 to 3.8. Is there any fix planned?


Thanks for the report.

Could this be a regression?
If someone could dig deeper into the v3.8.0 changelog (https://github.com/StackStorm/st2/releases/tag/v3.8.0), identify the PR that caused it, and provide a PR to fix it, that would get the fix into the next release.


This happens with very large output/input action executions on the history page.
To make a temporary fix I set:

setParameter:
  internalQueryExecMaxBlockingSortBytes: 103554432

in /etc/mongod.conf
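Note that internalQueryExecMaxBlockingSortBytes is the pre-4.4 name of this parameter; on MongoDB 4.4 and later (including the 5.0.6 reported above) it was renamed to internalQueryMaxBlockingSortMemoryUsageBytes, the name that appears in the error message. The equivalent mongod.conf entry for newer servers would therefore be (value shown is an example, roughly 3x the 100 MiB default):

```yaml
setParameter:
  internalQueryMaxBlockingSortMemoryUsageBytes: 314572800
```

Using the old parameter name on a 4.4+ server can prevent mongod from starting, which may explain the lost-connection report further down this thread.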


This might be useful to add to the troubleshooting documentation https://docs.stackstorm.com/troubleshooting/database.html


The default sort for the history page is -start_timestamp plus action.ref, yet there is no compound index covering that combination. The fix for the history page is to add a compound index to the actionexecutions model.
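To make the ordering concrete, here is a pure-Python sketch of the sort the history page requests: start_timestamp descending, then action.ref ascending. A matching MongoDB compound index would be [("start_timestamp", -1), ("action.ref", 1)]. The execution records below are invented for illustration:

```python
# Illustrative records only -- field names taken from the comment above.
executions = [
    {"start_timestamp": 100, "action": {"ref": "core.local"}},
    {"start_timestamp": 300, "action": {"ref": "core.http"}},
    {"start_timestamp": 300, "action": {"ref": "core.ask"}},
]

# start_timestamp descending, action.ref ascending (ties broken by ref).
ordered = sorted(
    executions,
    key=lambda e: (-e["start_timestamp"], e["action"]["ref"]),
)

print([(e["start_timestamp"], e["action"]["ref"]) for e in ordered])
# -> [(300, 'core.ask'), (300, 'core.http'), (100, 'core.local')]
```

Without an index matching this exact pattern, MongoDB must perform the sort in memory over every candidate document, which is what hits the 100 MiB limit.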


Here is the default history page sort code

Facing same problem. Any solution?

Facing same problem. Any solution?

The workaround for now is to enable the garbage collector by providing configuration for it.
I set a 3-day retention for execution history, and so far it seems to be enough.
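For reference, the garbage collector is configured in /etc/st2/st2.conf and the st2garbagecollector service must be running for it to take effect. A minimal sketch, using the 3-day retention mentioned above (check the StackStorm purging docs for the full set of options and any minimum TTL your version enforces):

```ini
# /etc/st2/st2.conf -- purge execution history older than N days.
[garbagecollector]
action_executions_ttl = 3
```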

> This happens with very large output/input action executions on the history page.

After appending these lines to that file, my mongo connection was lost.

Facing same problem. Any solution?

The workaround for now is to enable the Garbage collector by providing configuration for it. I set 3 days retention for jobs history, and currently it seems enough

What's your setup / number of action or workflow executions per day? 3 days seems like a very short retention.