gjanders / SplunkAdmins

Splunk Admins application to assist with troubleshooting Splunk enterprise installations

Home Page: https://splunkbase.splunk.com/app/3796/


SearchHeadLevel - Detect MongoDB errors - Possible Bug

deepak-kosaraju opened this issue

@gjanders
Not sure if you noticed this, but somehow "SHL - Detect MongoDB errors" is matching events from the wrong host. I ran both searches individually for the same time window and neither showed results, but when combined together they produced the weird result attached. This is the first time this has happened; it alerted OnCall, and we found there could be something wrong.

BTW: there was an issue with one of our index peers when the search was scheduled, but that shouldn't give false results like the ones we noticed with this alert.

I can also upload the search.log if needed to troubleshoot this issue.

SplunkAdmins app version: 2.3.5

03-10-2019 15:53:02.895 INFO  UnifiedSearch - Expanded index search = "" "" index=_internal host=p*splunkhead*.<internal domain> source=/opt/splunk/var/log/splunk/mongod.log ( " E " OR " F " ) NOT "SSL: error"

[screenshot: combined alert result showing events from the wrong host]

My current settings:

<index peer>$ sudo /opt/splunk/bin/splunk btool distsearch list
[distributedSearch]
authTokenConnectionTimeout = 5
authTokenReceiveTimeout = 10
authTokenSendTimeout = 10
bestEffortSearch = false
connectionTimeout = 10
defaultUriScheme = https
disabled = false
receiveTimeout = 600
sendTimeout = 30
serverTimeout = 10
servers =
shareBundles = true
statusTimeout = 10
useSHPBundleReplication = true
...
...
<index peer>$ sudo /opt/splunk/bin/splunk btool limits list  | grep 'keepalive'
search_keepalive_frequency = 30000
search_keepalive_max = 100

## from limits.conf doc
# Specifies how often, in milliseconds, a keepalive is sent while a search is running.
# Default: 30000 (30 seconds)
search_keepalive_frequency = 30000

# The maximum number of uninterrupted keepalives before the connection is closed.
# This counter is reset if the search returns results. 
# Default: 100
search_keepalive_max = 100

Hi Deepak,

I believe you have stumbled upon a known Splunk issue: the tstats command searches Splunk's index-time metadata, but Splunk does not discriminate between the indexed "host" field and a literal host:: token in the raw data.

I logged a support case about this and it ended up as an "enhancement request" to change the behaviour; effectively, if any of the raw data in _internal contains a host:: token, the tstats command may return that value as the host.
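
As a minimal illustration (a sketch only, reusing the index, host pattern and source from the expanded search in your log snippet above rather than the app's macros): if a raw event in _internal happens to contain a literal host::<some index peer> token, a tstats such as

| tstats count where index=_internal host=p*splunkhead*.<internal domain> source=/opt/splunk/var/log/splunk/mongod.log by host

can list that index peer under the host column, even though the where clause restricted the host field to the search heads.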

I should be able to correct this by adding a | search host=... after the initial tstats command but I will test and confirm.

 `comment("The main goal of this alert errors which might not appear in splunkd.log but are critical to keeping the kvstore running on the search heads. Please check the mongod.log file for further information, the additional count field is simply determining that mongo is still logging...")`\
`comment("Attempt to find errors in the mongod log and make sure the errors do not relate to shutdown events in the search head cluster. Since this does will ignore any events when either cluster shutsdown it might not be sensitive enough for some use cases...")`\
index=_internal `searchheadhosts` `splunkadmins_mongo_source` (" E " OR " F " OR " W ") `splunkadmins_mongodb_errors`\
| search `comment("Exclude time periods where shutdowns were occurring")` AND NOT [`splunkadmins_shutdown_time(searchheadhosts,60,60)`]\
| eventstats max(_time) AS mostRecent, min(_time) AS firstSeen by host\
| bin _time span=10m \
| stats values(_raw) AS logMessages, max(mostRecent) AS mostRecent, min(firstSeen) AS firstSeen by _time, host \
| search `comment("One final symptom that appears when mongodb is dead is that the logging just stops (zero data); however, this proved tricky in Splunk, so the query below uses a few tricks to ensure the data will show zero values even if the server stops reporting. timechart was recommended on Splunk Answers as it creates a time bucket with null values if no data is found...")`\
| append \
    [ tstats count where index=_internal `searchheadhosts` `splunkadmins_mongo_source` by host, _time span=5m \
    | search `searchheadhosts`\
    | timechart limit=0 span=5m sum(count) AS count by host \
    | fillnull count\
    | untable _time, host, count \
    | stats max(_time) AS mostRecent, min(_time) AS firstSeen, last(count) AS lastCount by host \
    | where lastCount=0 \
    | eval logMessages="Zero log entries found at this time, mongod might not be running, please investigate" \
    | fields - lastCount] \
| eval mostRecent = strftime(mostRecent, "%+"), firstSeen=strftime(firstSeen, "%+")\
| fields _time, host, firstSeen, mostRecent, logMessages\
| search `comment("Just in case...")` `splunkadmins_mongodb_errors2`

Or similar; the only part that would change is the extra "search" in there. I will test in the next day or two to make sure.

@gdv-deepakk in the tstats section of the search:
[ tstats count where index=_internal `searchheadhosts` `splunkadmins_mongo_source` by host, _time span=5m

Can you add this line afterwards?
| search `searchheadhosts`

Also the fillnull should have the keyword count on it:
| fillnull count

I believe that will fix the issue but you would have to run the alert over the timerange where you saw the problem.

FYI, the issue only occurs when host:: appears in the raw log data; Splunk indexes that as a keyword and therefore tstats does not work as expected!
Let me know if you can replicate the issue and whether the above update of adding | search `searchheadhosts` fixes it, as it is not the easiest issue to replicate...

@gjanders
Thanks for the quick response, the following did the trick. Thanks for the detailed explanation and the time spent on this request.

| search `searchheadhosts`

[screenshot: results after adding | search `searchheadhosts`]

But adding the count keyword to | fillnull does not give the same results as just adding the | search suggested above.
[screenshot: results with | fillnull count added]

@gdv-deepakk in retrospect I should not be changing the fillnull command at all for this search. I will add the | search part in the next version.
I double-checked my previous support case and confirmed that in any case where host:: appears in the raw log data, tstats may show that value in the results...

The workarounds offered were to change the segmenters.conf (to make : and :: major segmenters) or add an additional where clause (or search clause as I have done).
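
For reference, a hedged sketch of the where-clause alternative (not from the app itself; the regex is only an assumption based on the host pattern in your expanded search):

| tstats count where index=_internal `searchheadhosts` `splunkadmins_mongo_source` by host, _time span=5m
| where match(host, "splunkhead")

Either variant simply re-applies the search head host filter against the real host field after tstats, which drops any rows created by host:: tokens in the raw data; the | search `searchheadhosts` version is the one going into the app.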

Updated the testing branch https://github.com/gjanders/SplunkAdmins/tree/testing with the new release that fixes this issue; it will most likely be released in April...

Released version 2.5.0, which includes this fix.
Thanks for reporting this issue.