logstash-plugins / logstash-patterns-core


revisit existing patterns to add more type casting

jsvd opened this issue

This issue's purpose is to open the discussion on adding more type casting to existing patterns.

Take for example the following pattern:

patterns/firewalls:CISCOFW313005 %{CISCO_REASON:reason} for %{WORD:protocol} error message: %{WORD:err_protocol} src %{DATA:err_src_interface}:%{IP:err_src_ip}(\(%{DATA:err_src_fwuser}\))? dst %{DATA:err_dst_interface}:%{IP:err_dst_ip}(\(%{DATA:err_dst_fwuser}\))? \(type %{INT:err_icmp_type}, code %{INT:err_icmp_code}\) on %{DATA:interface} interface\.  Original IP payload: %{WORD:protocol} src %{IP:orig_src_ip}/%{INT:orig_src_port}(\(%{DATA:orig_src_fwuser}\))? dst %{IP:orig_dst_ip}/%{INT:orig_dst_port}(\(%{DATA:orig_dst_fwuser}\))?

It contains multiple INT captures that aren't cast to integers, which forces users to add mutate filters after applying the grok pattern.
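For illustration, the workaround this currently forces looks roughly like the following (a sketch only; the field names are taken from the CISCOFW313005 pattern above):

filter {
  grok {
    match => { "message" => "%{CISCOFW313005}" }
  }
  # convert the numeric captures that grok leaves as strings
  mutate {
    convert => {
      "err_icmp_type" => "integer"
      "err_icmp_code" => "integer"
      "orig_src_port" => "integer"
      "orig_dst_port" => "integer"
    }
  }
}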

I don't think we can simply force all INT patterns to be converted to ints and all FLOATs to floats, but we can audit existing composed patterns and add :int / :float to some.

For example, transform the pattern above into:

patterns/firewalls:CISCOFW313005 %{CISCO_REASON:reason} for %{WORD:protocol} error message: %{WORD:err_protocol} src %{DATA:err_src_interface}:%{IP:err_src_ip}(\(%{DATA:err_src_fwuser}\))? dst %{DATA:err_dst_interface}:%{IP:err_dst_ip}(\(%{DATA:err_dst_fwuser}\))? \(type %{INT:err_icmp_type:int}, code %{INT:err_icmp_code:int}\) on %{DATA:interface} interface\.  Original IP payload: %{WORD:protocol} src %{IP:orig_src_ip}/%{INT:orig_src_port:int}(\(%{DATA:orig_src_fwuser}\))? dst %{IP:orig_dst_ip}/%{INT:orig_dst_port:int}(\(%{DATA:orig_dst_fwuser}\))?
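For reference, this relies on grok's %{SYNTAX:SEMANTIC:TYPE} suffix, where TYPE can be int or float. A minimal sketch with made-up field names:

filter {
  grok {
    # bytes ends up as an integer and duration as a float in the event,
    # instead of the string values grok would otherwise produce
    match => { "message" => "%{INT:bytes:int} %{NUMBER:duration:float}" }
  }
}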

Indeed a step forward, @jsvd, thanks for starting this.

It'd be great to have some way for Logstash to attempt to recognize the type, à la Elasticsearch, though I don't know how doable that'd be.

@jsvd

Just found this very interesting feature (never seen this before!?): https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-field-mapping.html#numeric-detection

DELETE my_index

PUT my_index
{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}

PUT my_index/my_type/1
{
  "my_float":   "1.0", 
  "my_integer": "1",
  "my_string": "blablabal"
}

GET my_index/_mapping

{
  "my_index": {
    "mappings": {
      "my_type": {
        "numeric_detection": true,
        "properties": {
          "my_float": {
            "type": "double"
          },
          "my_integer": {
            "type": "long"
          },
          "my_string": {
            "type": "string"
          }
        }
      }
    }
  }
}

Could this be the solution?

it looks like a very interesting solution indeed!

The only issues I can see are:

  • fields that contain base-16 (hex) numbers, where you could first receive a hash like 151612 and then a31f2c1 (see the sketch after this list)
  • or fields you know are numbers but want to treat as strings, like protocol version numbers (maybe?)
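To illustrate the first point, here is a sketch (hypothetical index and field names) of how numeric detection could lead to a mapping conflict:

PUT my_index
{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}

PUT my_index/my_type/1
{
  "hash": "151612"
}

The hash field is now dynamically mapped as long, so a later document such as

PUT my_index/my_type/2
{
  "hash": "a31f2c1"
}

would be rejected with a mapper_parsing_exception, since "a31f2c1" cannot be parsed as a long.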

@jsvd If you ask me, I think this should be enabled by default, as the cases you mention are a minority.

at least for a logging use case

@jsvd I'd be open to having grok have a set of patterns it knows should be converted (int, float).


Hi, I need these type casts, so I have created #211.


If we use Logstash's type-casting feature to convert numeric fields back into numbers before input to Elasticsearch, then there's no need to explicitly configure field types in its mappings. We're simply undoing the fprintf(logfile, "%d", ...) serialisation that the original daemon performed.

Ideally, Logstash would only use patterns like FLOAT to parse actual floating-point numbers (not things that look like floats but aren't, such as HTTP versions). Then (eg) NONNEGINT:foo:int would be a tautology and a violation of the Don't Repeat Yourself principle, that could eventually be removed.

@MattSANU You can use the mutate filter today to achieve what you are trying to do. Not to suggest that this is the best solution, though.

Making such a change is a good thing, but because it changes the basic schema of the data Logstash provides, it would be a pretty significant breaking change. The breaking-change aspect of this is why I haven't attended to this problem myself -- it's a big change, a useful change, but I am afraid of the damage it will cause to users. Let's figure out how to minimize/reduce/resolve the possible damages before moving forward with this.


Forgive the dumb question, but what would this break? In my (very limited) experience so far, it's solely the data type in Elasticsearch's indices that matters. It seems to be irrelevant whether the incoming JSON data expresses numeric data as numbers or as strings.

Use of mutate filters seems prone to bugs. One would have to enumerate in one's mutate filters all of the numeric fields being parsed by grok filters, and keep the two lists of fields in sync forevermore.


Ideally, nobody would depend on numeric data being expressed as strings. If people are doing that, then ideally they should fix that before they upgrade Logstash to a version containing additional type casts like these.
If a user was depending on numeric data being expressed as strings, and didn't want to fix that, and did want to upgrade Logstash, and we had landed a change like this, then couldn't such a person work around the broken-ness using mutate filters?
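For example (a sketch, assuming the casts added to CISCOFW313005 above), such a user could convert the affected fields back to strings:

filter {
  mutate {
    # undo the :int casts for consumers that still expect strings
    convert => {
      "err_icmp_type" => "string"
      "err_icmp_code" => "string"
    }
  }
}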

Closing: we now have type-casting in the ECS version of the patterns (from #297).