[BUG] Shard failures after upgrade (LME 0.3 --> 0.4)

joncojonathan opened this issue 3 years ago

After upgrading to v0.4 (following these instructions), some dashboards display a message saying that a number of shards have failed:

76 of 128 shards failed
The data you are seeing might be incomplete or wrong.

Clicking the "show details" button provides further information such as:

illegal_argument_exception at shard 0, index winlogbeat-05.01.2022-1, node 0QCqSPMpTmKe9h4AMqbG_g

Type
    illegal_argument_exception
Reason
    Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [host.name] in order to load field data by uninverting the inverted index. Note that this can use significant memory.

This does not happen for all dashboards, but I've not performed an exhaustive search as to why it sometimes happens. I have noticed that sometimes the additional info references old indexes that have been deleted since the re-index (part of the 0.3 --> 0.4 upgrade).

To Reproduce
Steps to reproduce the behavior:

1. Upgrade to LME 0.4 from 0.3
2. Go to menu > Dashboard
3. Choose Security dashboard
4. Choose Security log
5. See error

Expected behavior
Dashboards should load correctly, without error.

Linux Server (please complete the following information):

Docker: 20.10.12, build e91ed57
Docker compose stack file version: [e.g. version 0.1]
Linux: Ubuntu 20.04.3 LTS (issue was also present on Ubuntu 18.04 LTS)
Logstash Version: 0.4 - 24/03/21

Additional context
See also #127 which also occurred after the same upgrade, and may be related.

adam-ncc · Answer 1 · Wed Mar 09 2022 02:59:58 GMT+0800 (China Standard Time)

Hey @joncojonathan, apologies for the delayed response. It looks as though you are encountering the exact errors that the re-indexing is meant to avoid. Within LME the default field type has changed from an analysed text field to a keyword field, to comply with the Elastic Common Schema (ECS).

It's possible that either the new mapping file did not deploy correctly, or that you otherwise have old data that wasn't successfully re-indexed and is now causing issues as there are two types of data stored within the same field and Elasticsearch doesn't know how to query it.

If you go to Stack Management -> Index Management -> Index Templates -> lme_template -> Mappings, you should be able to verify that the mapping there roughly matches the file available here (there may be one or two minor differences depending on your WLB and LME version). The main thing you're looking for is that most fields have a "type" of "keyword" with "ignore_above" set to 1024. If the mapping is significantly different, you may need to run the upgrade script and re-index again, as the mapping file has not been applied for some reason.

If the mapping looks pretty much as you'd expect, you can check the mapping that was actually applied to your existing indices at write time by going into the Dev Tools console and running the following:

GET winlogbeat-*/_mapping/field/host.name

This should show the host.name field's mapping for every winlogbeat-* index in Elasticsearch, with one entry per index. Again, you're looking for each entry to show a "type" of "keyword".

You'll need to make sure that all of the indices match that mapping type. If some of them show "text" as their type and some show "keyword", this would explain the mismatch, and you'll need to re-index or otherwise delete the ones showing as "text".
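If there are a lot of indices, comparing the entries by eye gets tedious. As a rough sketch only (not part of LME), the following Python function takes the JSON body returned by the field-mapping request above, already parsed into a dict, and lists the indices whose host.name is not mapped as "keyword". The function name and the sample index names are made up for illustration, and the sample response is hand-written rather than taken from a real cluster.

```python
from typing import Any, Dict, List


def find_non_keyword_indices(response: Dict[str, Any], field: str = "host.name") -> List[str]:
    """Return the indices whose mapping for `field` is not of type "keyword"."""
    # The field-mapping response keys the innermost mapping by the leaf name ("name" for "host.name").
    leaf = field.split(".")[-1]
    mismatched = []
    for index_name, body in response.items():
        field_info = body.get("mappings", {}).get(field)
        if field_info is None:
            continue  # the field is not mapped in this index at all
        field_type = field_info.get("mapping", {}).get(leaf, {}).get("type")
        if field_type != "keyword":
            mismatched.append(index_name)
    return sorted(mismatched)


# Hand-written sample: one index still mapped as text, one mapped as keyword.
sample_response = {
    "winlogbeat-01.01.2022": {
        "mappings": {
            "host.name": {
                "full_name": "host.name",
                "mapping": {"name": {"type": "text"}},
            }
        }
    },
    "winlogbeat-01.01.2022-1": {
        "mappings": {
            "host.name": {
                "full_name": "host.name",
                "mapping": {"name": {"type": "keyword", "ignore_above": 1024}},
            }
        }
    },
}

print(find_non_keyword_indices(sample_response))  # ['winlogbeat-01.01.2022']
```

Any index the function returns is a candidate for re-indexing or deletion, as described above.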

Let me know how you get on with this, and whether this resolves your issue. I suspect this will also be what's causing your issue in #127, so hopefully fixing this will resolve both problems.

Jonathan Haddock · Answer 2 · Mon Mar 21 2022 22:30:41 GMT+0800 (China Standard Time)

Hi @adam-ncc, thanks for the pointers which I've started to look into today.

It appears I was missing the mappings for lme_template (there was nothing under "mapped fields" at Stack Management -> Index Management -> Index Templates -> lme_template -> Mappings). I also ran GET winlogbeat-*/_mapping/field/host.name, which returned the output below, so the mapping was clearly wrong:

  "winlogbeat-28.02.2022" : {
    "mappings" : {
      "host.name" : {
        "full_name" : "host.name",
        "mapping" : {
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  },

At this point I took a VM snapshot so I could tinker. I'll outline my steps below in case others have the same issue.

1. cd "/opt/lme/Chapter 3 files"
2. sudo ./deploy.sh update (this completes correctly)
3. sudo ./dashboard_update.sh (this dumps JSON to the shell, beginning {"successCount":234,"success":true,"warnings":[],"successResults)
4. Browse to Security -> Detect -> Rules and click "Update 172 Elastic prebuilt rules"
5. Filter the rules by tag "Windows"
6. Click "Select all 290 rules" above the rule list
7. Click "Bulk actions" then "Activate Selected"
8. At this point there were still no mappings under lme_template
9. Back in the shell, run sudo ./deploy.sh upgrade (note: upgrade, not update)
10. Mappings now show under lme_template
11. Go to the developer console (https://YOUR-LME-KIBANA-INSTANCE/app/dev_tools#/console) and run the re-indexing per the file at /lme/docs/painless-reindex.txt

I'm now waiting for that to complete, whereupon I'll feed back my findings. It's already looking positive, though. This makes me wonder if I misread or missed a step during the original upgrade.

Jonathan Haddock · Answer 3 · Mon Mar 21 2022 23:09:48 GMT+0800 (China Standard Time)

Hello again,

I can confirm that this resolved the issue once the old, duplicate indices (those without -1 in the name) were deleted. The full process for my fix seems to be as follows (slightly re-ordered from above, as I suspect this order makes more sense):

1. cd "/opt/lme/Chapter 3 files"
2. sudo ./deploy.sh upgrade
3. sudo ./dashboard_update.sh
4. Browse to Security -> Detect -> Rules and click "Update 172 Elastic prebuilt rules"
5. Filter the rules by tag "Windows"
6. Click "Select all 290 rules" above the rule list
7. Click "Bulk actions" then "Activate Selected"
8. Browse to Stack Management -> Index Management -> Index Templates -> lme_template -> Mappings and confirm mappings are present
9. Go to the developer console (https://your-lme-kibana-instance/app/dev_tools#/console) and run the re-indexing per the file at /lme/docs/painless-reindex.txt
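For anyone who wants to double-check which old indices are safe to delete, here is a small Python sketch. It assumes the re-indexed copies are named with a -1 suffix, as they were on my system, and that you already have a plain list of index names (for example from GET _cat/indices/winlogbeat-*?h=index). The function name and the sample names are made up for illustration. It only reports old indices that have a re-indexed counterpart and deletes nothing.

```python
from typing import Iterable, List


def superseded_indices(names: Iterable[str], suffix: str = "-1") -> List[str]:
    """Return the old indices for which a re-indexed copy (name + suffix) also exists."""
    present = set(names)
    return sorted(
        name
        for name in present
        if not name.endswith(suffix) and name + suffix in present
    )


# Made-up index names for illustration.
index_names = [
    "winlogbeat-27.02.2022",    # old index with a re-indexed copy
    "winlogbeat-27.02.2022-1",  # re-indexed copy
    "winlogbeat-28.02.2022-1",  # re-indexed copy, old index already deleted
    "winlogbeat-01.03.2022",    # old index that has not been re-indexed yet
]

print(superseded_indices(index_names))  # ['winlogbeat-27.02.2022']
```

An index with no -1 counterpart is deliberately left out of the result, since deleting it would lose data that has not been re-indexed.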

I hope this helps anyone else with the same issue. Many thanks for the help, I'll close this now.