dwbutler / logstash-logger

Ruby logger that writes logstash events

Logging Suddenly Stops on Production

pctj101 opened this issue

So my production server is a puma cluster and it logs perfectly... until recently.

Currently, logging is successful for a number of hours, then suddenly stops.

I traced one cause back to Logstash::Event, where Event#to_json was throwing errors when encoding to JSON, and that was usually the last log message. So I upgraded Logstash::Event to the latest master branch, fixed a couple of logstash-event bugs, and the system is much more stable now.
[ Ref: https://github.com/pctj101/logstash/tree/event_timestamp ]
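
For illustration, a contrived sketch of the kind of to_json failure described; the require name and the exact exception class depend on the logstash-event version:

require 'logstash-event'

# A field whose bytes are not valid UTF-8 (contrived example)
event = LogStash::Event.new('message' => "caf\xE9".dup.force_encoding('BINARY'))

begin
  event.to_json
rescue StandardError => e
  # In this thread, an error like this was typically the last thing logged
  puts "#{e.class}: #{e.message}"
end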

But... after a few more hours, logging still stops.

So I don't think it's a Logstash::Event#to_json problem any more.

Granted I use lograge to reduce my logs and format into json, but I don't see any problems there either.

And to be fair, I don't see any problems in logstash-logger either.

So it's not clear what to debug.

So while I'm looking into this I thought I'd ask a few simple questions:

Is there anything in general where a bug could cause some exception to be raised, and permanently disable logstash-logger or Rails.logger until the next server restart?

Any similar problems seen in the past?

Thanks!

FYI: I'm sending my logs to Redis (instead of UDP directly to logstash).

Also making my Puma cluster reconnect on worker boot:

on_worker_boot do
  # Re-establish the ActiveRecord connection in each forked worker
  ActiveSupport.on_load(:active_record) do
    ActiveRecord::Base.establish_connection
  end

  # Give each worker its own LogStashLogger instance and connection
  logger_options = Rails.application.config.logstash
  logger = LogStashLogger.new(logger_options)
  Rails.application.config.logger = logger
end

So far, so good.

Nope... still randomly stops logging :( Won't even send a message to redis.

So when logstash-logger stops logging my Rails server output, I decided to add a new controller action that calls Rails.logger directly to log a message.

Such as:

def check_if_logstash_logger_still_alive
  Rails.logger.info { "whatever" }
end

However even that does not work.

Needless to say, no other logging works.

At this point, we're no longer using lograge. It's simply that logstash-logger will not send redis/udp messages outbound after a random amount of time.

For Redis: I'm currently trying to call @io.client.reconnect when @io.connected? is false.
But the Redis client should already reconnect automatically.

...

Nope, that doesn't matter. Reconnecting does not help.
https://github.com/pctj101/logstash-logger/blob/v060redis_reconnect/lib/logstash-logger/device/redis.rb
Line 34
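
Roughly, the guard being tried looks like the sketch below; this is not the exact code from the linked fork, and the helper and list names are only illustrative:

# Sketch of the Redis device's write path (@io is assumed to be a redis-rb client)
def write(message)
  @io ||= connect                              # hypothetical connect helper
  @io.client.reconnect unless @io.connected?   # attempted workaround; did not fix the stall
  @io.rpush('logstash', message)               # 'logstash' is the assumed default list key
end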

If you could answer the following questions, that would help narrow down the possible problems.

  • Are you using MRI or JRuby? Which version? (You mentioned Puma is running in clustered mode, so I'm guessing MRI?)
  • Are you using any gems that monkey around with log levels? The way Rails silences logs is not thread safe, so the logger can easily be silenced forever if you're running a threaded server such as Puma (see the sketch after this list). I've had the same issue as you when using the quiet_assets gem.
  • Which configuration(s) of logstash cause the problem? Any/all, or just certain ones?
  • Does the problem go away if you stop using LogStashLogger?
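
To make the thread-safety point concrete, here is a simplified sketch of how level-based silencing can go wrong under threads; this is not Rails' actual code, just the shape of the race:

require 'logger'

# Simplified sketch of level-based log silencing (not Rails' real implementation)
def silence(logger, temporary_level = Logger::ERROR)
  old_level = logger.level         # Thread B can read the level *after* Thread A has already raised it...
  logger.level = temporary_level
  yield
ensure
  logger.level = old_level         # ...so the "restore" leaves the logger stuck at ERROR.
end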

Also, can you modify your controller to print out Rails.logger.level? That will help us figure out if the log level is the issue.
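
Something as small as the following diagnostic action would do (the action name and route are placeholders):

def show_log_level
  # 0 = DEBUG, 1 = INFO, 2 = WARN, 3 = ERROR, 4 = FATAL
  render json: { logger_level: Rails.logger.level }
end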

Are you using MRI or JRuby? Which version? (You mentioned Puma is running in clustered mode, so I'm guessing MRI?)

MRI Ruby 2.1.2

Are you using any gems that monkey around with log levels? The way Rails silences logs is not thread safe, so the logger can easily be silenced forever if you're running a threaded server such as Puma.

Yes: https://github.com/roidrage/lograge
However, it shouldn't silence all logs. But I'll totally accept there might be something crazy in there. With that said, I've used lograge v0.3.0 for months without issue.

I've had the same issue as you when using the quiet_assets gem.
Which configuration(s) of logstash cause the problem? Any/all, or just certain ones?

UDP and Redis are all I've tested so far. Both have the issue.

Does the problem go away if you stop using LogStashLogger?

I turned off logstash-logger and lograge at the same time, then put in an instrumentation listener on "process_action.action_controller" which logs directly to logstash. Works like a charm (minus all the console debug messages...).
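
For reference, a rough sketch of that kind of subscriber; the payload keys are standard for process_action.action_controller, but the logger wiring and field names here are illustrative assumptions, not the exact code used:

# config/initializers/request_logging.rb (illustrative)
logstash = LogStashLogger.new(Rails.application.config.logstash)

ActiveSupport::Notifications.subscribe('process_action.action_controller') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  logstash.info(
    controller: event.payload[:controller],
    action:     event.payload[:action],
    status:     event.payload[:status],
    duration:   event.duration.round(1)  # milliseconds
  )
end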

Also, can you modify your controller to print out Rails.logger.level? That will help us figure out if the log level is the issue.

Totally can do after the weekend is over :)

From what you're saying, the idea of the logger.level getting changed unexpectedly sounds possible. I've grepped my code, and it's nothing I am doing. However... there is admittedly more than just my code. There are lots of gems too.

In a way, for me, logging the "process_action.action_controller" is all I really "need". The remaining console messages are helpful, but not my main goal.

However, the console messages become much more helpful when I really need to debug something, so there is value to getting this whole thing working.

More news next week!

Thanks for the insight!

@pctj101 Did you find anything out yet?

@dwbutler It looks like I'm hitting the same issue.

I'm using MRI 2.2.0, Unicorn 4.5.0, lograge 0.3.1, and logstash-logger 0.8.0. Data is sent to logstash via UDP and also logged to a file:

  config.logstash = [
    {
      type: :file,
      path: 'log/logstash_production.log'
    },
    {
      type: :udp,
      port: 5228,
      host: ENV["LOGSTASH_HOST"]
    }
  ]

It looks like you're right about the log level being changed. Really good catch! Printing it reveals that it is raised to 2 (Logger::WARN) for more and more Unicorn workers over time. Restarting the server obviously fixes the problem temporarily.

So now I'll try to hunt down who is changing the log level. In case you're interested, here is our "impressive" Gemfile.lock: https://gist.github.com/chopmo/88f10097f94db452089f

Well that didn't take as long as expected. I'm 99% sure that we found the problem.

We're using the Savon gem, and when calling Savon.client it is possible to pass in a logger object and a log level as options.

We accidentally passed in Rails.logger itself (without duplicating it) along with a log level of Logger::WARN, and Savon changed the level on the shared logger.

Simple as that :)
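
For anyone hitting the same thing, the failure and the fix look roughly like this (option names as in Savon 2.x; the WSDL URL is a placeholder):

# Problem: Savon applies the given :log_level to the logger it is passed,
# so the shared Rails.logger ends up stuck at WARN.
Savon.client(wsdl: 'https://example.com/service?wsdl',
             logger: Rails.logger,
             log_level: :warn)

# Fix: hand Savon its own copy so the level change stays local.
Savon.client(wsdl: 'https://example.com/service?wsdl',
             logger: Rails.logger.dup,
             log_level: :warn)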

Ouch... It sounds like Savon is not being a good logging citizen. I'll add a note to the Troubleshooting section of the readme. Thanks for the tip!

I added a general note about gems changing the log level to the Troubleshooting section and linked to this issue. I think it's been well-established that this issue is caused by other gems. I can't think of any good way to protect one's logger level from getting stomped. So I'm going to close this issue.

If anyone finds another gem causing a problem, or has any ideas about how to protect the log level from getting changed, feel free to add more comments to this issue.

I'm also suffering from this. I even followed the instructions in a book, which describes how to use lograge and logstash-logger to send logs off to logstash, but logs are only being sent to logstash when I restart the server (Apache + Passenger).

This looks to me like some sort of buffering issue, and indeed, if I set the sync: true option in the LogStashLogger initializer, logs land in my logstash.

The buffer is supposed to collect up to 50 messages by default and flush them every 5 seconds, but it doesn't in my case. Is there anything I'm missing?

Rails.application.configure do
  config.lograge.enabled = true
  config.lograge.custom_options = lambda do |event|
    exceptions = %w(controller action format id)
    {
      params:      event.payload[:params].except(*exceptions).to_json,
      type:        :rails,
      environment: Rails.env
    }
  end
  config.lograge.formatter = Lograge::Formatters::Logstash.new
  config.lograge.logger    = LogStashLogger.new uri: 'udp://logstash-server:1234'
end

@glaszig I noticed that you are using UDP output. Based on the symptoms you're describing, I believe you are observing normal Ruby behavior. When Ruby writes to an IO device such as a file or socket, it doesn't write immediately. It keeps its own internal buffer and flushes it "periodically." If you write infrequently enough and don't have enough memory pressure, "periodically" could mean "never." Ruby will, of course, flush all buffers when you exit the program (which is what happens when you restart the server). The only way to control this behavior is to set sync: true, as you did. I'll make a note of this behavior in the readme.
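
Concretely, with the configuration shown above, that would look something like this (same placeholder URI; sync: true is the LogStashLogger option you already found):

# Flush after every write instead of relying on Ruby's IO buffering
config.lograge.logger = LogStashLogger.new(uri: 'udp://logstash-server:1234', sync: true)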

Sorry, I didn't know enough about this kind of Ruby internals, but I was imagining something like that. There is indeed very low traffic on this system. Thanks for getting back to me.

If anybody else comes across this and is interested in Ruby's IO buffering, watch this little screencast by @jstorimer.

Hi, I believe I'm hitting this issue now myself (boo!). Everything is fine on staging, but as soon as I get to production, nada.

Are there any tips/tools that would allow me to search through all my gem dependencies to find the culprit, or is it just a case of lots of searching?