nickholbrook / logging-pipeline

Securing DevOps' logging pipeline, a dev environment for Mozilla Hindsight

Logging Pipeline demo with Hindsight

This repository contains sample code to run experiments and demos with Hindsight. You can run the code using the mozilla/hindsight Docker container, as follows:

$ docker pull mozilla/hindsight

$ chmod -R 777 output run

$ docker run -it \
    -v $(pwd)/cfg:/app/cfg \
    -v $(pwd)/logs:/app/logs \
    -v $(pwd)/run:/app/run \
    -v $(pwd)/output:/app/output \
    mozilla/hindsight

This will mount the local directories into the Docker container and execute hindsight. Resulting data will be put into the output directory.

When developing, plugins should be put in the run directory.

Tutorial

Hindsight is the successor of Heka, a high-performance log processing pipeline built to process logs and telemetry from Firefox Services. Heka was written in Go and eventually ran into the limits of the runtime. The Heka developers decided to rewrite it as two components: a lightweight data-processing kernel, written in C, and Lua plugins that the kernel executes inside a sandbox. The project, called Hindsight, is available at github.com/mozilla-services/hindsight.

Hindsight sits behind the streaming layer of the logging pipeline, typically consuming logs from a Kafka broker. It can also be used to read local log files, which is what this tutorial uses. The processing works as follows:

  1. First, Hindsight consumes incoming messages using input plugins. These plugins can read data from local files, message brokers, databases, or anything else that can be accessed from the Lua sandbox. Fields can be standardized, timestamps converted, etc. Standardized messages are stored in a Heka protocol buffer message and injected into the next layer of processing.

  2. Standardized messages are then forwarded to analysis plugins. Routing and multiplexing happen here, allowing several plugins to receive a copy of a given message. Plugins run arbitrary code written to achieve very specific tasks: compute statistics, flag events containing a given string, etc. New data created by analysis plugins is injected into the payload buffer for incremental construction of the final payload output.

  3. Output plugins process the payload buffer and write data out to files, databases, message brokers, etc. (a minimal skeleton of each plugin type is sketched right after this list).
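
As a rough mental model, each plugin type boils down to a small set of Lua entry points. The sketch below is a simplification: inject_message and inject_payload appear in the real plugins later in this tutorial, while the read_message call in the output skeleton is an assumption based on the Lua sandbox message API. Each skeleton would live in its own .lua file under run/input, run/analysis or run/output.

-- Input plugin skeleton (run/input/<name>.lua):
-- read raw data, standardize it, inject it into the pipeline.
function process_message()
    local msg = { Type = "logfile", Logger = "nginx", Fields = {} }
    inject_message(msg)   -- provided by the Hindsight sandbox
    return 0
end

-- Analysis plugin skeleton (run/analysis/<name>.lua):
-- receive matched messages, compute something, publish results.
function process_message()
    -- inspect the current message, update internal state
    return 0
end

function timer_event()
    -- periodically push derived data to the output queue
    inject_payload("txt", "report", "derived data goes here")
end

-- Output plugin skeleton (run/output/<name>.lua):
-- take payloads and write them to a destination.
-- (read_message is an assumption, not code from this repository)
function process_message()
    local data = read_message("Payload")
    -- write data to a file, database, message broker, etc.
    return 0
end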

(Figure: Hindsight data flow)

In this repository, input plugins are stored in run/input, analysis plugins in run/analysis and output plugins in run/output.

$ tree run/
run/
├── input
│   ├── input_nginx.cfg               # Input plugin to load
│   └── input_nginx.lua               # an nginx log file
├── analysis
│   ├── counter.cfg                   # Analysis plugin to count
│   ├── counter.lua                   # log entries
│   ├── suspicious_signatures.cfg     # Analysis plugin to detect
│   └── suspicious_signatures.lua     # suspicious signatures
└── output
    ├── heka_debug.cfg                # Output plugin to print debug data while running Hindsight
    └── heka_inject_payload.cfg       # Output plugin to write output data to a local file

When running the Hindsight Docker container with the local directories mounted, the global configuration is loaded from cfg/hindsight.cfg. Sample log files are taken from logs/ and output plugins write their data to output/.

$ tree -L 1
.
├── cfg
├── logs
├── output
└── run
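
The global configuration in cfg/hindsight.cfg tells the hindsight daemon where to find the plugins and where to write its output. The file itself is not reproduced here; the snippet below is only an illustration of its general shape, using option names from the Hindsight documentation that may not exactly match the file shipped in cfg/.

-- illustrative hindsight.cfg sketch, not the file from this repository
output_path       = "output"   -- checkpoints and plugin output are written here
sandbox_run_path  = "run"      -- input/analysis/output plugins started at launch
sandbox_load_path = "load"     -- directory watched for dynamically loaded plugins
analysis_threads  = 1          -- number of analysis threads

-- per-sandbox resource limits (memory, instructions, output size)
analysis_defaults = {
    memory_limit      = 8 * 1024 * 1024,
    instruction_limit = 1000000,
    output_limit      = 64 * 1024,
}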

Let's take a quick look at some of these files to understand how Hindsight uses them. Our input is an Nginx access log file stored under logs/nginx_access.log. The configuration at run/input/input_nginx.cfg instructs the plugin at run/input/input_nginx.lua to read this file, standardize its fields and pass each log entry over to the analysis layer.

$ cat run/input/input_nginx.cfg 
filename          = "input_nginx.lua"
input_file        = "logs/nginx_access.log"
instruction_limit = 0

The Lua source code of the plugin, run/input/input_nginx.lua, is shown below. It reads the log file line by line and parses each line using a custom grammar configured to understand the Nginx log format. The parser uses a Lua library called lpeg which transforms a log line into a map of fields. The map is then stored into a Hindsight message and injected into the analysis queue.

require "io"

local clf = require "lpeg.common_log_format"

local msg = {
    Timestamp   = nil,
    Type        = "logfile",
    Hostname    = "localhost",
    Logger      = "nginx",
    Payload     = nil,
    Fields      = nil
}

local grammar = clf.build_nginx_grammar('$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"')
local cnt = 0
local fn = read_config("input_file")

function process_message()
    local fh = assert(io.open(fn, "rb"))

    for line in fh:lines() do
        local fields = grammar:match(line)
        if fields then
            msg.Timestamp = fields.time
            fields.time = nil
            msg.Fields = fields
            inject_message(msg, fh:seek())
            cnt = cnt + 1
        end
    end
    fh:close()

    return 0, tostring(cnt)
end
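
To make the grammar concrete, here is a hypothetical access-log line and the approximate table that grammar:match(line) would return for it. The field names follow the Nginx variables in the grammar above; the values are made up, and the exact types produced by lpeg.common_log_format (for the timestamp and byte count in particular) may differ.

-- hypothetical log line:
-- 192.168.1.10 - - [12/Mar/2016:19:02:33 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "curl/7.47.0"

-- approximate result of grammar:match(line):
local fields = {
    remote_addr     = "192.168.1.10",
    remote_user     = "-",
    time            = 1457809353 * 1e9,  -- $time_local converted to a numeric timestamp
    request         = "GET /index.html HTTP/1.1",
    status          = 200,
    body_bytes_sent = 512,
    http_referer    = "-",
    http_user_agent = "curl/7.47.0",
}

The plugin then moves the time field into msg.Timestamp, stores the remaining fields in msg.Fields, and injects the message into the analysis queue.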

Hindsight takes care of forwarding the message to the next layer, where analysis plugins will perform further work. In an environment that processes many different types of messages, a routing operation must happen to allow analysis plugins to only receive the types of message they care about. Let's take a look at the counter analysis plugin, whose only task is to count the number of messages that pass through it. Its configuration file is shown below. Note the message_matcher directive in this file. It contains a matching rule that gets applied to every message entering the analysis queue of Hindsight.

filename        = "counter.lua"
message_matcher = "Logger == 'nginx' && Type == 'logfile' && Fields[request] =~ '^GET ' && Fields[remote_addr] != '172.21.0.2'"
ticker_interval = 5
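
The matcher is a simple boolean expression over message headers and fields. As a purely illustrative variation, a rule that only selects POST requests from the same log source could be written with the same operators:

message_matcher = "Logger == 'nginx' && Type == 'logfile' && Fields[request] =~ '^POST '"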

When a message matches the configured rule, Hindsight sends it for processing to the plugin at run/analysis/counter.lua, shown below.

The source code of the counter analyzer couldn't be simpler. The analyzer counts the messages it receives in a msgcount variable and periodically publishes the latest total to the output queue through the inject_payload function.

The timer_event function is only executed periodically, as defined by the ticker_interval set in the plugin configuration. In our case, it will run every 5 seconds.

require "string"
msgcount = 0

function process_message()
    msgcount = msgcount + 1
    return 0
end

function timer_event()
    inject_payload("txt", "count", string.format("%d message analysed", msgcount))
end
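
If you wanted a per-interval count rather than a running total, a small variation of the same plugin (purely illustrative, using the same sandbox functions) could reset the counter each time it publishes:

require "string"

msgcount = 0

function process_message()
    msgcount = msgcount + 1
    return 0
end

function timer_event()
    -- publish the count for the last interval, then reset it
    inject_payload("txt", "count_per_interval",
        string.format("%d messages analysed in the last interval", msgcount))
    msgcount = 0
end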

When the counter plugin injects a payload, Hindsight forwards that payload to the output queue. We're in the last part of the processing logic, where plugins take data and write it to a destination. Here again, an output plugin takes a configuration file and a Lua file. This is where you'd want to write a plugin that inserts events into a database, or sends email to people. For development purposes, we will limit ourselves to output plugins that write data to disk, like the heka_inject_payload plugin that is provided with Hindsight.

As defined by its configuration, this output plugin will receive payloads injected by the analysis plugins and write them under the output/payload directory, effectively storing a count of the Nginx log entries whose request and remote IP address matched the filter of the counter plugin.

$ cat run/output/heka_inject_payload.cfg 
filename        = "heka_inject_payload.lua"
message_matcher = "Type == 'inject_payload'"
output_dir      = "output/payload"

The output can be seen in output/payload/analysis.counter.count.txt.

$ cat output/payload/analysis.counter.count.txt 
1731 message analysed
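
To go further than the built-in heka_inject_payload plugin, you could write your own output plugin, for instance to ship payloads to another destination. The sketch below is only an illustration and is not part of this repository: the file names are hypothetical, and the read_message call is an assumption based on the Lua sandbox message API.

-- run/output/payload_to_file.cfg (hypothetical)
filename        = "payload_to_file.lua"
message_matcher = "Type == 'inject_payload'"
ticker_interval = 5

-- run/output/payload_to_file.lua (hypothetical)
require "io"

local fh = assert(io.open("output/payloads.log", "a"))

function process_message()
    -- "Payload" returns the data injected by the analysis plugins
    local payload = read_message("Payload")
    if payload then
        fh:write(payload, "\n")
        fh:flush()
    end
    return 0
end

function timer_event()
    -- nothing periodic to do in this sketch
end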

This is it for a quick overview of how Hindsight works. For more complex analyzers, take a look at the content of the run/analysis directory. For extra documentation, refer to the Hindsight project at github.com/mozilla-services/hindsight.

License: Mozilla Public License 2.0

