An Apache, nginx, CloudFront and S3 log parser and transporter to ClickHouse databases, based on nginx-clickhouse.
For the past 20+ years, I've been generating traffic reports based on my web server's Apache Access Logs, using open-source tools like awstats and jawstats.
Development of those projects has tapered off, and their UIs are neither modern nor efficient (they store pre-computed stats in text files). I've been searching for a way to modernize my log telemetry, and eventually settled on building a data warehouse in ClickHouse, visualized with Grafana.
I recently found the nginx-clickhouse project, which imports nginx access logs (essentially the same format as Apache access logs) into ClickHouse. I wanted a bit more flexibility to analyze Apache, Amazon CloudFront, and Amazon S3 logs, so I forked the project for my needs.
Improvements over nginx-clickhouse:
- Process logs from:
  - Apache Access Logs
  - nginx Access Logs
  - Amazon CloudFront Access Logs
  - Amazon S3 Access Logs
  - (plus any other log that can be processed with simple regexes)
- Ability to run `-once` on a file and exit (for bulk loading historical data)
- Ability to read from `-stdin`
- Ability to specify the `-domain xyz.com` from the command line and/or `config.yml`
- Flexible ClickHouse column definitions and custom parsing based on the column name
- Apache, nginx, CloudFront and S3-focused Grafana dashboards
- Integration of ua-parser to provide Browser and OS stats
- Integration of crawlerdetect and isbot to detect bots
- Integration of maxmind to determine country
To build the binary from source:

```shell
make build
# or, directly:
go build -a -o apache-clickhouse.exe
```

To build a Docker image, just run the command below; it will compile the binary from sources and create the image. You don't need the Go development tools installed, as the build happens inside Docker.

```shell
make docker
```
apache-clickhouse utilizes the free MaxMind GeoLite2 Country database to tag countries based on the IP address. The MaxMind download can be placed in `data\GeoLite2-Country.mmdb` or as `apache-clickhouse.geolite2-country.mmdb`.

apache-clickhouse utilizes the ua-parser `regexs.yaml` to break down the `User-Agent` string into Browser and OS names and major versions. The ua-parser YAML download can be placed in `data\uaparser.yml` or as `apache-clickhouse.uaparser.yml`.
By default, apache-clickhouse will monitor the specified access log and import any new lines into ClickHouse on a regular basis (specified by `settings.interval`). This mode should be used for live server logs (i.e. files still being appended to).

Alternatively, you can have apache-clickhouse run `-once` (or set `settings.once=true`) to parse a file and exit. This mode can be used to bulk load historical data (i.e. files that are no longer changing).
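For example, bulk loading a set of rotated, gzipped historical logs might look like the following sketch (the loop and file paths are illustrative, and it assumes `-stdin` consumes piped log lines as described above):

```shell
# Illustrative bulk load: replay each rotated, gzipped log once via stdin
for f in /var/log/apache2/access_log.*.gz; do
    zcat "$f" | ./apache-clickhouse -config_path config.yml -once -stdin -domain test.com
done
```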
This project assumes you'll be loading data for multiple domains into the same ClickHouse tables, so you can specify `-domain [domain.xyz]` on the command line to differentiate the source of each log. If `clickhouse.columns.domain` is missing from the `config.yml`, this isn't necessary.
```shell
apache-clickhouse -config_path [config.yml or path] [-once] -log_path [path] -domain [test.com]
```
In the container:

- `/apache-clickhouse` is the binary
- `/config.yml` is the default config file location
- `/logs/access_log` is the default log to read from

These can be changed by running with a different command line or environment variables.
```shell
docker pull nicjansma/apache-clickhouse
docker run --rm --name apache-clickhouse -v ${PWD}/logs:/logs -v ${PWD}/config.yml:/config.yml nicjansma/apache-clickhouse

# or, with the full command specified:
docker run --rm --name apache-clickhouse -v ${PWD}/logs:/logs -v ${PWD}/config.yml:/config.yml nicjansma/apache-clickhouse /apache-clickhouse -config_path /config.yml -log_path /logs/access_log
```
The configuration is specified in `config.yml`, or via an alternative file given on the command line with `-config_path [path.yml]`.

A sample configuration file is provided in `config-sample.yml` in this repository.
Each log type will need a different `log.format` specified in `config.yml`. Example formats can be taken from `config-sample.yml`.
Apache access logs are configured in the Apache config via the `LogFormat` directive. The Combined Log Format is commonly used, though other formats should work via updated regular-expression rules in `config.yml`.

Example:

```
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog log/access_log combined
```
In nginx, the `ngx_http_log_module` configures request logs.

Example:

```
http {
    ...
    log_format main '$remote_addr - $remote_user [$time_local] "$request" $status $bytes_sent "$http_referer" "$http_user_agent"';
    ...
}
```

The site then specifies the `access_log` using this `main` format:

```
server {
    ...
    access_log /var/log/nginx/my-site-access.log main;
    ...
}
```
Amazon CloudFront access logging can be enabled via the AWS console.
The standard CloudFront log format can be parsed, though `log.optional_fields=true` is suggested in the `config.yml` to allow for fields that may be added or removed over time.
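In `config.yml`, that might look like the following sketch (pair it with the CloudFront `log.format` from `config-sample.yml`; the exact format string is omitted here):

```yaml
log:
  # format: (use the CloudFront format from config-sample.yml)
  optional_fields: true   # tolerate columns CloudFront adds or removes over time
```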
Amazon S3 server access logging can be enabled via the AWS console.
The standard S3 log format can be parsed, though `log.optional_fields=true` is suggested in the `config.yml` to allow for fields that may be added or removed over time.
If logs from other applications or services are delimited by spaces, tabs, or CSV/TSV separators, it should be possible for this project to parse them.
Each supported log type may have different fields and those fields can be mapped to columns in ClickHouse.
Below are some suggested schemas for each log type.
```sql
CREATE TABLE metrics.apache_logs (
    domain LowCardinality(String),
    remote_addr IPv4,
    time_local DateTime,
    date Date DEFAULT toDate(current_timestamp()),
    method LowCardinality(String),
    url String,
    url_extension LowCardinality(String),
    http_version LowCardinality(String),
    status UInt16,
    body_bytes_sent UInt32,
    referrer_domain String,
    user_agent_family LowCardinality(String),
    user_agent_major LowCardinality(String),
    os_family LowCardinality(String),
    os_major LowCardinality(String),
    device_family LowCardinality(String),
    country LowCardinality(String),
    bot Boolean
) ENGINE = MergeTree()
PARTITION BY (domain, toYYYYMM(date))
ORDER BY (domain, date, status);
```
```sql
CREATE TABLE metrics.nginx_logs (
    domain LowCardinality(String),
    remote_addr IPv4,
    time_local DateTime,
    date Date DEFAULT toDate(current_timestamp()),
    method LowCardinality(String),
    url String,
    url_extension LowCardinality(String),
    http_version LowCardinality(String),
    status UInt16,
    body_bytes_sent UInt32,
    referrer_domain String,
    user_agent_family LowCardinality(String),
    user_agent_major LowCardinality(String),
    os_family LowCardinality(String),
    os_major LowCardinality(String),
    device_family LowCardinality(String),
    country LowCardinality(String),
    bot Boolean
) ENGINE = MergeTree()
PARTITION BY (domain, toYYYYMM(date))
ORDER BY (domain, date, status);
```
```sql
CREATE TABLE metrics.cloudfront_logs (
    domain LowCardinality(String),
    remote_addr IPv6,
    time_local DateTime,
    date Date DEFAULT toDate(current_timestamp()),
    cluster LowCardinality(String),
    distribution LowCardinality(String),
    protocol LowCardinality(String),
    ssl_protocol LowCardinality(String),
    ssl_cipher LowCardinality(String),
    http_host LowCardinality(String),
    method LowCardinality(String),
    url String CODEC(ZSTD),
    url_extension LowCardinality(String),
    http_version LowCardinality(String),
    status UInt16,
    response_status LowCardinality(String),
    body_bytes_sent UInt32,
    request_bytes_received UInt32,
    content_type LowCardinality(String),
    duration UInt16,
    referrer_domain String,
    user_agent_family LowCardinality(String),
    user_agent_major LowCardinality(String),
    os_family LowCardinality(String),
    os_major LowCardinality(String),
    device_family LowCardinality(String),
    country LowCardinality(String),
    bot Boolean
) ENGINE = MergeTree()
PARTITION BY (domain, toYYYYMM(date))
ORDER BY (domain, date, status);
```
```sql
CREATE TABLE metrics.s3_logs (
    bucket LowCardinality(String),
    time_local DateTime,
    date Date DEFAULT toDate(current_timestamp()),
    remote_addr IPv6,
    ssl_protocol LowCardinality(String),
    ssl_cipher LowCardinality(String),
    http_host LowCardinality(String),
    operation LowCardinality(String),
    method LowCardinality(String),
    url String CODEC(ZSTD),
    url_extension LowCardinality(String),
    http_version LowCardinality(String),
    status UInt16,
    body_bytes_sent UInt32,
    duration UInt16,
    error_code LowCardinality(String),
    referrer_domain String,
    user_agent_family LowCardinality(String),
    user_agent_major LowCardinality(String),
    os_family LowCardinality(String),
    os_major LowCardinality(String),
    device_family LowCardinality(String),
    country LowCardinality(String),
    bot Boolean
) ENGINE = MergeTree()
PARTITION BY (bucket, toYYYYMM(date))
ORDER BY (bucket, date, status);
```
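Once data is flowing, a typical aggregation against these schemas might look like the following sketch (assuming the `metrics.apache_logs` table above and a hypothetical `test.com` domain):

```sql
-- Daily request counts by country, excluding detected bots
SELECT
    date,
    country,
    count() AS requests
FROM metrics.apache_logs
WHERE domain = 'test.com'
  AND NOT bot
GROUP BY date, country
ORDER BY date ASC, requests DESC
```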
```yaml
settings:
  interval: 5           # in seconds
  log_path: access_log  # path to logfile
  seek_from_end: false  # start reading from the last line (to prevent duplicates after restart)
  once: false           # whether to read the file once and exit
  domain: test.com      # domain name to use
  debug: false          # debug log level

clickhouse:
  db: metrics           # Database name
  table: apache_logs    # Table name
  host: localhost       # ClickHouse host
  port: 8123            # ClickHouse HTTP port
  credentials:
    user: default       # User name
    password:           # User password
```
Based on the chosen log format, you may want different columns parsed from the log and set in the ClickHouse table. The `config-sample.yml` has suggested columns for each log format.

Only the columns set in `clickhouse.columns` will be sent to ClickHouse.
```yaml
columns:
  #
  # Apache
  #
  - domain
  - remote_addr
  - remote_user
  - time_local
  - date
  - method
  - url
  - url_extension
  - http_version
  - status
  - body_bytes_sent
  - referrer
  - referrer_domain
  - user_agent
  - user_agent_family
  - user_agent_major
  - os_family
  - os_major
  - device_family
  - country
  - bot
```
The log format defines how each log line will be parsed. Examples for Apache, nginx, CloudFront and S3 are in `config-sample.yml`.
```yaml
# Apache
log:
  format: $remote_ip - $remote_user [$time_local] "$method $url" $status $bytes "$http_referer" "$http_user_agent"
```
For other log formats, the log parser may work simply by specifying any fields with `$variable_name` as above. The gonx parser is used, which converts those variables to regular expressions for extraction. By default, all unknown fields will be imported as strings.
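Conceptually, the variable-to-regex conversion works like this sketch (illustrative only; gonx's actual implementation differs, and `buildRegex` is a hypothetical helper, not part of this project):

```go
package main

import (
	"fmt"
	"regexp"
)

// buildRegex converts a gonx-style format string (with $variable_name
// placeholders) into a regular expression with named capture groups,
// roughly mirroring what a format-driven parser does internally.
func buildRegex(format string) *regexp.Regexp {
	// Escape regex metacharacters in the literal parts of the format
	// (this also escapes each '$' to '\$').
	escaped := regexp.QuoteMeta(format)
	// Replace each escaped \$name placeholder with a named, non-greedy capture group.
	placeholder := regexp.MustCompile(`\\\$([a-z_]+)`)
	pattern := placeholder.ReplaceAllString(escaped, `(?P<$1>.*?)`)
	return regexp.MustCompile("^" + pattern + "$")
}

func main() {
	format := `$remote_addr - $remote_user [$time_local] "$request" $status $bytes_sent`
	line := `203.0.113.7 - frank [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326`

	re := buildRegex(format)
	match := re.FindStringSubmatch(line)
	if match == nil {
		fmt.Println("line did not match format")
		return
	}
	// Collect the named groups into a field map.
	fields := map[string]string{}
	for i, name := range re.SubexpNames() {
		if i > 0 && name != "" {
			fields[name] = match[i]
		}
	}
	fmt.Println(fields["remote_addr"], fields["status"]) // → 203.0.113.7 200
}
```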
Example Grafana dashboards are available in the `grafana-dashboards/` folder.
Thanks to the nginx-clickhouse project for providing the starting point.