cloudfoundry / loggregator-release

Cloud Native Logging

System Log messages should include fraction of second

youngm opened this issue

  • I have an idea for a new feature - please document as "As a user, I would
    like to..."

As an operator of a CF deployment I would like system log messages logged by loggregator components to have fractions of seconds.

My log aggregation tools prefer the log message timestamp over the syslog timestamp because the log message timestamp is typically more accurate, down to the millisecond. However, loggregator components don't log fractions of a second, which disrupts this model when comparing loggregator component logs against those of other CF components.

Log message timestamp example today: 2018/05/17 16:00:19.

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/157668501

The labels on this github issue will be updated when the story is started.

@youngm The loggregator-api envelope does include timestamp in nanoseconds. Are you referring to the timestamp when you are consuming via cf logs? or perhaps the go-loggregator client?

@JohannaSmith I'm talking about the component logs that get created in /var/vcap/sys/log and are picked up by the syslog-release for system administrators to use when debugging potential issues with loggregator components.

@youngm Ah. Looking at the logs that do emit partial seconds, it's a very different approach to logging. Those bosh logs have gone with a structured log approach which enables this. For loggregator, we found that relying heavily on structured logs can cause the following:

  • high throughput on the system due to more reliance on logs
  • less readability

We aren't planning on adding fractional seconds to our current log format.
We've tried to transition to heavy reliance on metrics. Is there a missing metric you could have used in this scenario that we could add?

@JohannaSmith The biggest issue for me is attempting to correlate cross-component events or errors when debugging issues. Doing so is much easier with fractions of a second in log messages.

For example, say I'm attempting to diagnose an intermittent issue with a syslog drain. My drain server may be logging events with fractions of a second. I'd like to see whether a particular error on my drain coincides with an error on the adapter, which could help me discover the problem. Logging at one-second granularity makes it harder to correlate issues between components.

I'm not asking for structured log messages. Just more granular timestamps when loggregator does decide to log one of its unstructured messages.

Looks like ‘log.Lmicroseconds’ could be tacked on to the log pkg flags:

https://golang.org/pkg/log/#pkg-constants

It doesn’t add much noise to the output and is still human readable vs other bosh components that just use unix timestamp.
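For reference, a minimal sketch of what adding that flag looks like with Go's standard `log` package. The `render` helper, the `secondsOnly` and `withMicros` names, and the message text are made up for illustration; this is not necessarily how the loggregator components configure their loggers.

```go
package main

import (
	"bytes"
	"log"
)

// Hypothetical names for the two flag sets being compared.
const (
	secondsOnly = log.LstdFlags                     // 2018/05/17 16:00:19
	withMicros  = log.LstdFlags | log.Lmicroseconds // 2018/05/17 16:00:19.123456
)

// render logs msg through a throwaway logger using the given flags
// and returns the formatted line, so the two formats can be compared.
func render(flags int, msg string) string {
	var buf bytes.Buffer
	log.New(&buf, "", flags).Print(msg)
	return buf.String()
}

func main() {
	// For the package-level logger, one call at startup is enough.
	log.SetFlags(withMicros)
	log.Println("example message")
}
```

The only change from the default is the extra `log.Lmicroseconds` bit, so the rest of the line stays in the existing format.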

I should probably quit while I'm ahead. But, if it isn't much more trouble, a nice ISO 8601 format is easier for Splunk and other log aggregators to parse. For example: 2018-05-31T15:14:42.339Z

But, this would just be icing on the cake. :) I'd also be perfectly happy to just have sub seconds in the current format. Thanks @jasonkeene and @JohannaSmith

Yeah, I don't see support for ISO 8601 in the log pkg constants. I think sub seconds is a happy middle-ground. Like @JohannaSmith said if there are any metrics we can export that would help you in troubleshooting your issue please post them. We want to encourage folks not to rely on logs for debugging.
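For anyone who does want ISO 8601 from the standard `log` package, one common workaround is to turn off the built-in timestamp flags and prepend the timestamp in a wrapping `io.Writer`. The sketch below only illustrates that idea; `isoWriter`, `iso8601Millis`, `sampleLine`, and the message text are hypothetical names, and nothing here reflects what loggregator actually ships.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"os"
	"time"
)

// iso8601Millis is a Go reference-time layout that yields
// timestamps such as 2018-05-31T15:14:42.339Z when used with UTC.
const iso8601Millis = "2006-01-02T15:04:05.000Z07:00"

// isoWriter (hypothetical) prefixes each write with a UTC timestamp.
// The log package issues one Write per message, so each line gets one stamp.
type isoWriter struct {
	out io.Writer
	now func() time.Time // injectable clock, time.Now in real use
}

func (w isoWriter) Write(p []byte) (int, error) {
	stamp := w.now().UTC().Format(iso8601Millis)
	line := append([]byte(stamp+" "), p...)
	if _, err := w.out.Write(line); err != nil {
		return 0, err
	}
	return len(p), nil
}

// sampleLine formats one message at a fixed instant to show the output shape.
func sampleLine() string {
	fixed := time.Date(2018, time.May, 31, 15, 14, 42, 339000000, time.UTC)
	var buf bytes.Buffer
	logger := log.New(isoWriter{out: &buf, now: func() time.Time { return fixed }}, "", 0)
	logger.Println("example message")
	return buf.String()
}

func main() {
	log.SetFlags(0) // disable the built-in date and time prefix
	log.SetOutput(isoWriter{out: os.Stderr, now: time.Now})
	log.Println("example message")
}
```

This costs an extra allocation per log line, which is part of why simply adding sub-seconds to the existing format is the lighter option.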

@youngm We're working on this now. Can you tell us more specifically which component logs you're referring to? Are you saying that the problem exists before the syslog-release picks up the logs or after?

@toddboom I'm looking for subseconds in the actual log message before being picked up by the syslog-release

It seems pretty much all of the components produced by the logging and metrics team have this issue. Here are the ones I'm most interested in having changed.

  • Metron
  • Doppler
  • Traffic Controller
  • Reverse Log Proxy
  • Adapter
  • Scheduler

@youngm Thanks! That looks like the list I was putting together, but I just wanted to make sure we were on the same page. I'll get cracking on those and follow up here once it's done.

@toddboom We don't yet use log cache but I'm sure we will so don't forget about that one. :)

Ok, these commits should take care of it in pretty much everything I can think of:

They should be included in the next releases of each product.

Looks great! Thanks @toddboom!