logstash-plugins / logstash-patterns-core


revisit existing patterns to add more type casting

jsvd opened this issue

This issue's purpose is to open the discussion on adding more type casting to existing patterns.

Take for example the following pattern:

patterns/firewalls:CISCOFW313005 %{CISCO_REASON:reason} for %{WORD:protocol} error message: %{WORD:err_protocol} src %{DATA:err_src_interface}:%{IP:err_src_ip}(\(%{DATA:err_src_fwuser}\))? dst %{DATA:err_dst_interface}:%{IP:err_dst_ip}(\(%{DATA:err_dst_fwuser}\))? \(type %{INT:err_icmp_type}, code %{INT:err_icmp_code}\) on %{DATA:interface} interface\.  Original IP payload: %{WORD:protocol} src %{IP:orig_src_ip}/%{INT:orig_src_port}(\(%{DATA:orig_src_fwuser}\))? dst %{IP:orig_dst_ip}/%{INT:orig_dst_port}(\(%{DATA:orig_dst_fwuser}\))?

It contains multiple INT captures that aren't cast to integers, which forces users to add mutate filters after applying the grok pattern.
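For illustration, the workaround this currently forces looks roughly like the following (a sketch only; the field names are taken from the CISCOFW313005 pattern above):

filter {
  grok {
    match => { "message" => "%{CISCOFW313005}" }
  }
  # convert the numeric captures that grok leaves as strings
  mutate {
    convert => {
      "err_icmp_type" => "integer"
      "err_icmp_code" => "integer"
      "orig_src_port" => "integer"
      "orig_dst_port" => "integer"
    }
  }
}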

I don't think we can simply force all INT patterns to be converted to ints and all FLOATs to floats, but we can audit existing composed patterns and add :int / :float to some.

For example, transform the pattern above into:

patterns/firewalls:CISCOFW313005 %{CISCO_REASON:reason} for %{WORD:protocol} error message: %{WORD:err_protocol} src %{DATA:err_src_interface}:%{IP:err_src_ip}(\(%{DATA:err_src_fwuser}\))? dst %{DATA:err_dst_interface}:%{IP:err_dst_ip}(\(%{DATA:err_dst_fwuser}\))? \(type %{INT:err_icmp_type:int}, code %{INT:err_icmp_code:int}\) on %{DATA:interface} interface\.  Original IP payload: %{WORD:protocol} src %{IP:orig_src_ip}/%{INT:orig_src_port:int}(\(%{DATA:orig_src_fwuser}\))? dst %{IP:orig_dst_ip}/%{INT:orig_dst_port:int}(\(%{DATA:orig_dst_fwuser}\))?
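For reference, this relies on grok's %{SYNTAX:SEMANTIC:TYPE} suffix, where TYPE can be int or float. A minimal sketch with made-up field names:

filter {
  grok {
    # bytes ends up as an integer and duration as a float in the event,
    # instead of the string values grok would otherwise produce
    match => { "message" => "%{INT:bytes:int} %{NUMBER:duration:float}" }
  }
}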

Indeed a step forward, @jsvd, thanks for starting this.

It'd be great to have some way for Logstash to attempt to recognize the type, à la Elasticsearch, though I don't know how doable that'd be.

@jsvd

Just found this very interesting feature (never seen this before!?): https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-field-mapping.html#numeric-detection

DELETE my_index

PUT my_index
{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}

PUT my_index/my_type/1
{
  "my_float":   "1.0", 
  "my_integer": "1",
  "my_string": "blablabal"
}

GET my_index/_mapping

{
  "my_index": {
    "mappings": {
      "my_type": {
        "numeric_detection": true,
        "properties": {
          "my_float": {
            "type": "double"
          },
          "my_integer": {
            "type": "long"
          },
          "my_string": {
            "type": "string"
          }
        }
      }
    }
  }
}

Could this be the solution?

it looks like a very interesting solution indeed!

The only issues I can see are:

  • fields that contain base-16 (hex) numbers, where you could first receive a hash like 151612 and then a31f2c1 (see the sketch after this list)
  • or fields you know are numbers but want to treat as strings, like protocol version numbers (maybe?)
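To illustrate the first point, here is a sketch (hypothetical index and field names) of how numeric detection could lead to a mapping conflict:

PUT my_index
{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}

PUT my_index/my_type/1
{
  "hash": "151612"
}

The hash field is now dynamically mapped as long, so a later document such as

PUT my_index/my_type/2
{
  "hash": "a31f2c1"
}

would be rejected with a mapper_parsing_exception, since "a31f2c1" cannot be parsed as a long.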

@jsvd If you ask me, I think this should be enabled by default, as the cases you mention are a minority.

at least for a logging use case

@jsvd I'd be open to having grok have a set of patterns it knows should be converted (int, float).


Hi, I need these type casts, so I have created #211.


If we use Logstash's type-casting feature to convert numeric fields back into numbers before input to Elasticsearch, then there's no need to explicitly configure field types in its mappings. We're simply undoing the fprintf(logfile, "%d", ...) serialisation that the original daemon performed.

Ideally, Logstash would only use patterns like FLOAT to parse actual floating-point numbers (not things that look like floats but aren't, such as HTTP versions). Then (eg) NONNEGINT:foo:int would be a tautology and a violation of the Don't Repeat Yourself principle, that could eventually be removed.

@MattSANU You can use the mutate filter today to achieve what you are trying to do. Not to suggest that this is the best solution, though.

Making such a change is a good thing, but because it changes the basic schema of the data Logstash provides, it would be a pretty significant breaking change. The breaking-change aspect of this is why I haven't attended to this problem myself -- it's a big change, a useful change, but I am afraid of the damage it will cause to users. Let's figure out how to minimize/reduce/resolve the possible damages before moving forward with this.


Forgive the dumb question, but what would this break? In my (very limited) experience so far, it's solely the data type in Elasticsearch's indices that matters. It seems to be irrelevant whether the incoming JSON data expresses numeric data as numbers or as strings.

Use of mutate filters seems prone to bugs. One would have to enumerate in one's mutate filters all of the numeric fields being parsed by grok filters, and keep the two lists of fields in sync forevermore.


Ideally, nobody would depend on numeric data being expressed as strings. If people are doing that, then ideally they should fix that before they upgrade Logstash to a version containing additional type casts like these.
If a user was depending on numeric data being expressed as strings, and didn't want to fix that, and did want to upgrade Logstash, and we had landed a change like this, then couldn't such a person work around the broken-ness using mutate filters?
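For example (a sketch, assuming the casts added to CISCOFW313005 above), such a user could convert the affected fields back to strings:

filter {
  mutate {
    # undo the :int casts for consumers that still expect strings
    convert => {
      "err_icmp_type" => "string"
      "err_icmp_code" => "string"
    }
  }
}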

Closing: we now have type-casting in the ECS version of the patterns (from #297).