System Log messages should include fraction of second

Question

youngm opened this issue 6 years ago · comments

I have an idea for a new feature - please document as "As a user, I would
like to..."

As an operator of a CF deployment I would like system log messages logged by loggregator components to have fractions of seconds.

My log aggregation tools prefer the log message timestamp over the syslog timestamp because the log message timestamp is typically more accurate, down to the millisecond. However, loggregator components don't log fractions of a second, which disrupts this model when comparing loggregator component logs against those of other cf components.

Log message timestamp example today: 2018/05/17 16:00:19.

cf-gitbot · Answer 1 · Fri May 18 2018 00:06:03 GMT+0800 (China Standard Time)

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/157668501

The labels on this github issue will be updated when the story is started.

Johanna Ratliff · Answer 2 · Tue May 22 2018 06:36:20 GMT+0800 (China Standard Time)

@youngm The loggregator-api envelope does include a timestamp in nanoseconds. Are you referring to the timestamp you see when consuming via cf logs, or perhaps via the go-loggregator client?

Mike Youngstrom · Answer 3 · Tue May 22 2018 06:53:40 GMT+0800 (China Standard Time)

@JohannaSmith I'm talking about the component logs that get created in /var/vcap/sys/log and are picked up by the syslog-release for system administrators to use when debugging potential issues with loggregator components.

Johanna Ratliff · Answer 4 · Thu May 31 2018 05:42:40 GMT+0800 (China Standard Time)

@youngm Ah. Looking at the logs that do emit partial seconds, it's a very different approach to logging. Those bosh logs have gone with a structured log approach which enables this. For loggregator, we found that relying heavily on structured logs can cause the following:

high throughput on the system due to more reliance on logs
less readability

We aren't planning on adding fractional seconds to our current log format.
We've tried to transition to heavy reliance on metrics. Is there a missing metric you could have used in this scenario that we could add?

Mike Youngstrom · Answer 5 · Thu May 31 2018 05:59:40 GMT+0800 (China Standard Time)

@JohannaSmith The biggest issue for me is correlating cross-component events or errors when debugging issues. Doing so is much easier with fractions of a second in log messages.

For example, say I'm attempting to diagnose an intermittent issue with a syslog drain. My drain server may be logging events with fractions of a second. I'd like to see whether a particular error on my drain matches some kind of error on the adapter at the same time, potentially helping me discover the problem. Logging at second granularity makes it harder to correlate issues between components.

I'm not asking for structured log messages. Just more granular timestamps when loggregator does decide to log one of its unstructured messages.

Jason Keene · Answer 6 · Thu May 31 2018 20:19:48 GMT+0800 (China Standard Time)

Looks like ‘log.Lmicroseconds’ could be tacked on to the log pkg flags:

https://golang.org/pkg/log/#pkg-constants

It doesn’t add much noise to the output and is still human readable, compared with other bosh components that just use a Unix timestamp.

Mike Youngstrom · Answer 7 · Thu May 31 2018 23:17:27 GMT+0800 (China Standard Time)

I should probably quit while I'm ahead. But if it isn't much more trouble, a nice ISO 8601 format is easier for Splunk and other log aggregators to parse. For example: 2018-05-31T15:14:42.339Z

But, this would just be icing on the cake. :) I'd also be perfectly happy to just have sub seconds in the current format. Thanks @jasonkeene and @JohannaSmith

Jason Keene · Answer 8 · Fri Jun 01 2018 00:41:25 GMT+0800 (China Standard Time)

Yeah, I don't see support for ISO 8601 in the log pkg constants. I think sub seconds is a happy middle ground. Like @JohannaSmith said, if there are any metrics we can export that would help you in troubleshooting your issue, please post them. We want to encourage folks not to rely on logs for debugging.

Todd Persen · Answer 9 · Fri Jun 29 2018 05:31:01 GMT+0800 (China Standard Time)

@youngm We're working on this now. Can you tell us more specifically which component logs you're referring to? Are you saying that the problem exists before the syslog-release picks up the logs or after?

Mike Youngstrom · Answer 10 · Fri Jun 29 2018 05:37:53 GMT+0800 (China Standard Time)

@toddboom I'm looking for subseconds in the actual log message, before it is picked up by the syslog-release.

It seems pretty much all of the components produced by the logging and metrics team have this issue. Here are the ones I'm most interested in having changed.

Metron
Doppler
Traffic Controller
Reverse Log Proxy
Adapter
Scheduler

Todd Persen · Answer 11 · Fri Jun 29 2018 05:43:00 GMT+0800 (China Standard Time)

@youngm Thanks! That looks like the list I was putting together, but I just wanted to make sure we were on the same page. I'll get cracking on those and follow up here once it's done.

Mike Youngstrom · Answer 12 · Fri Jun 29 2018 05:46:23 GMT+0800 (China Standard Time)

@toddboom We don't yet use log cache, but I'm sure we will, so don't forget about that one. :)

Todd Persen · Answer 13 · Sat Jun 30 2018 05:40:10 GMT+0800 (China Standard Time)

Ok, these commits should take care of it in pretty much everything I can think of:

They should be included in the next releases of each product.

Mike Youngstrom · Answer 14 · Sat Jun 30 2018 06:04:04 GMT+0800 (China Standard Time)

Looks great! Thanks @toddboom!