truemail-rb / truemail

New bug checklist

I have updated truemail to the latest version
I have read the Contribution Guidelines
I have read the documentation
I have searched for existing GitHub issues

Bug description

We're using Truemail to validate e-mail addresses in our open-source community software hitobito.

It works perfectly until some time has passed, then it just returns false for every e-mail checked. We're using it, for example, in the Person model: https://github.com/hitobito/hitobito/blob/master/app/models/concerns/validated_email.rb#L17

After some time, sometimes days, sometimes only hours, it stops working and Truemail.valid? always returns false. To fix this, we have to restart the rails server, after which it works perfectly again for some time. It affects multiple environments running in different openshift projects.

we created a health check endpoint for monitoring truemail: https://github.com/hitobito/hitobito/blob/8a6ef125be5f67413dec8f0cfc5e2c87dfa0f2ce/app/domain/app_status/truemail.rb#L8

If it's not working anymore, a simple touch tmp/restart.txt fixes the problem.

Since we do not have any clue what's causing this problem, we're reaching out here. Maybe someone has an idea what could cause this strange behaviour.

Complete output when running truemail, including the stack trace and command used

No stack trace here, because it's not throwing any exceptions at all.

@mtnstar Hi! Thanks for the report. Very strange issue... But you didn't provide any truemail context, I mean some logs from when this situation happens. You can collect them with the built-in event logger: https://truemail-rb.org/truemail-gem/#/event-logger

Could you configure logging and post the results here when your issue happens again? You can also run Truemail.validate(some_email) from the rails console to get more context.

Please post some log details here, and I will continue to look into this issue.
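For readers following along, a minimal sketch of what enabling the event logger might look like inside the application's Truemail.configure block. The verifier address and log path are placeholders, not values from this thread, and the option names should be checked against the event-logger documentation linked above.

```ruby
# Sketch of a Rails initializer; verifier address and log path are placeholders.
Truemail.configure do |config|
  # A verifier email is required for Truemail to be configured at all.
  config.verifier_email = 'verifier@example.com'

  # Log every validation event (successes and failures) to stdout and to a file.
  config.logger = {
    tracking_event: :all,
    stdout: true,
    log_absolute_path: '/path/to/log/truemail.log' # hypothetical path
  }
end
```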

@bestwebua thx for the input, I have now enabled the debug log for some environments and am hoping to catch some errors the next time Truemail stops working.

@bestwebua In the meantime, this happens every day on at least some of our ~10-15 production instances. Below is an example log entry, triggered from a health check call we installed:

E, [2022-12-02T15:48:02.740667 #1] ERROR -- : Truemail mx validation for hitobito@puzzle.ch failed (mx: target host(s) not found)
CONFIGURATION SETTINGS:
whitelist validation: false
not rfc mx lookup flow: false
smtp fail fast: false
smtp safe check: false
email pattern: default gem value
smtp error body pattern: default gem value

But if I then open a new rails console in that same container which continually has the error and execute the command you proposed, I get the following (validation success):

[1] pry(main)> Truemail.validate('hitobito@puzzle.ch')
I, [2022-12-02T15:51:29.000450 #18716]  INFO -- : Truemail mx validation for hitobito@puzzle.ch was successful

CONFIGURATION SETTINGS:
whitelist validation: false
not rfc mx lookup flow: false
smtp fail fast: false
smtp safe check: false
email pattern: default gem value
smtp error body pattern: default gem value

=> #<Truemail::Validator:0x00000000022f3940
 @result=
  #<struct Truemail::Validator::Result
   success=true,
   email="hitobito@puzzle.ch",
   domain="puzzle.ch",
   mail_servers=["5.102.145.24", "5.102.146.40"],
   errors={},
   smtp_debug=nil,
   configuration=
    #<Truemail::Configuration:0x00000000022f39b8
     @blacklisted_domains=[],
     @blacklisted_mx_ip_addresses=[],
     @connection_attempts=2,
     @connection_timeout=2,
     @default_validation_type=:mx,
     @dns=[],
     @email_pattern=/(?=\A.{6,255}\z)(\A([\p{L}0-9]+[\w\p{L}.+!~,'&%#*^`{}|\-\/?=$]*)@((?i-mx:[\p{L}0-9]+([\-.]{1}[\p{L}0-9]+)*\.\p{L}{2,63}))\z)/,
     @logger=#<Truemail::Logger:0x00000000049f4210 @event=:all, @file="/opt/app-root/src/log/truemail.log", @stdout=true>,
     @not_rfc_mx_lookup_flow=false,
     @response_timeout=2,
     @smtp_error_body_pattern=/(?=.*550)(?=.*(user|account|customer|mailbox)).*/i,
     @smtp_fail_fast=false,
     @smtp_port=25,
     @smtp_safe_check=false,
     @validation_type_by_domain={},
     @verifier_domain="puzzle.ch",
     @verifier_email="hitobito@puzzle.ch",
     @whitelist_validation=false,
     @whitelisted_domains=[]>>,
 @validation_type=:mx>

Meanwhile the rails puma process keeps on failing to validate even the simplest email addresses. So it must be something local to the process which runs Truemail. Restarting the puma process (by touch tmp/restart.txt) fixes the issue every time.

Maybe it has something to do with us running this application on an OpenShift (Kubernetes) architecture? The spontaneous way in which this error appears makes me think that Truemail / the MX lookup might try to keep some connection open, which breaks if the running pod is moved to another kubernetes node maybe..?

Very strange. hitobito@puzzle.ch is a normal email address which we know exists because we control it. But our problem is not limited to this address; it concerns any email address. Normally Truemail works as expected, but then spontaneously stops working inside our rails puma process. By "stops working" I mean it stops working completely. Checking any email address takes a long time (~10-30 seconds) and then fails.

So, based on my suspicion of some network problem when something spontaneously changes in a kubernetes cluster, can you point me to the code where you do the actual MX request which fails here? Then we can try to debug deeper and find out whether the problem lies in a dependency of truemail, or within truemail itself.

@carlobeltrame Thank you for your investigation ❤️ I investigated this issue today too, and I think it's 100% a network issue. To confirm this, I suggest adding some network status logging to the DNS validation layer. I'll provide some code snippets within the next hour. Much appreciated.

@carlobeltrame You can temporarily monkey patch Truemail::Wrapper#call to expose the network error context:

truemail/lib/truemail/wrapper.rb, line 19 in 8a570ff:

rescue ::Resolv::ResolvError, ::IPAddr::InvalidAddressError

To do so, add the following to a truemail.rb file in your Rails initializers:

# frozen_string_literal: true

module Truemail
  class Wrapper
    def call(&block)
      ::Timeout.timeout(timeout, &block)
    rescue ::Resolv::ResolvError, ::IPAddr::InvalidAddressError => error
      ::Logger.new($stdout).add(::Logger::ERROR) { error }
      false
    rescue ::Timeout::Error => error
      retry unless (self.attempts -= 1).zero?
      ::Logger.new($stdout).add(::Logger::ERROR) { error }
      false
    end
  end
end

Now we can see the error context during network interactions... Please let me know about your results. Thanks!!!

P.S.: It seems the feature with extended errors (network, DNS), which was skipped from the early truemail roadmap, would make sense after all.

Thanks, we'll try that and report back!

@bestwebua So, we just had the problem again, this time with the logging you suggested. From what I see, Truemail runs into timeouts when requesting (in that order):

mx_records
cname_record
a_record

After that, it concludes with a failure, because (understandably) mx: target host(s) not found. After restarting puma inside the container by touching tmp/restart.txt, the check runs fine again and the runtime for the check dropped from 12_000ms back to 11ms, which seems normal.
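The lookup order described above (MX, then CNAME, then A, each of which can time out) can be reproduced outside Truemail with Ruby's standard Resolv library, which may help to tell a resolver problem from a Truemail problem. This is only a sketch and not Truemail's implementation: timed_lookup and probe_domain are hypothetical names, and the 2 second default is an assumption chosen to mirror the connection_timeout shown in the configuration dump.

```ruby
require 'resolv'
require 'timeout'

# Hypothetical helper (not part of Truemail): run one DNS query under a
# timeout and report how it ended instead of raising.
def timed_lookup(resolver, domain, type, seconds)
  records = Timeout.timeout(seconds) { resolver.getresources(domain, type) }
  { type: type, status: records.empty? ? :empty : :ok }
rescue Timeout::Error
  { type: type, status: :timeout }
rescue Resolv::ResolvError => e
  { type: type, status: :error, message: e.message }
end

# Query MX, CNAME and A records in the order seen in the log.
# The resolver is injectable so a long-lived and a fresh one can be compared.
def probe_domain(domain, resolver: Resolv::DNS.new, seconds: 2)
  types = [
    Resolv::DNS::Resource::IN::MX,
    Resolv::DNS::Resource::IN::CNAME,
    Resolv::DNS::Resource::IN::A
  ]
  types.map { |type| timed_lookup(resolver, domain, type, seconds) }
end

# Example, e.g. from a console inside the affected container:
#   probe_domain('puzzle.ch').each { |result| puts result }
```

Running this both inside the failing puma process and in a fresh console would show whether all three lookups time out only in the long-running process.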

For the full backtrace, I attached a log-excerpt:

truemail-dns-lookup-problem.log

I removed tokens and requests to our normal health-check, which only says "yeah, the rails-process is here". Otherwise, I left the log untouched.

@kronn Thanks. I'm 100 percent sure it's a network problem. To confirm that this is actually the case, it seems you need to add some network interaction to the Timeout::Error rescue block.

I am a bit unsure what you mean.

Do you suggest adding some other network interaction in the rescue-block we added above?
That would be here:

https://github.com/hitobito/hitobito/blob/078f56e9861c81d28d9ed54f856e87694b7b6cf8/config/initializers/truemail_config.rb#L19-L23

Or did you mean something else? If so, please clarify 😃

@kronn I mean adding some kind of health check of your system (some interaction with an external host), something like the example below:

# frozen_string_literal: true

require 'logger'
require 'net/http'

module Truemail
  class Wrapper
    def call(&block)
      ::Timeout.timeout(timeout, &block)
    rescue ::Resolv::ResolvError, ::IPAddr::InvalidAddressError => error
      ::Logger.new($stdout).add(::Logger::ERROR) { error }
      false
    rescue ::Timeout::Error => error
      retry unless (self.attempts -= 1).zero?
      # All attempts timed out: probe an external host to see whether the process can still reach the network
      ::Logger.new($stdout).add(::Logger::ERROR) { "Healthcheck: #{::Net::HTTP.get_response(URI('https://github.com')).code}" }
      ::Logger.new($stdout).add(::Logger::ERROR) { error }
      false
    rescue => error
      # Any other error raised by the wrapped block
      ::Logger.new($stdout).add(::Logger::ERROR) { "Network error: #{error}" }
      false
    end
  end
end

We decided not to debug this any further. If we did, we would try to find out which layer caches the network topology: truemail, ruby-stdlib, ruby or wherever. While interesting, this is a rather long process with little benefit.

We now check a known-good address with Truemail, and if this check fails, we restart the container. After that, everything works again. Since restarting unhealthy containers is a Kubernetes feature, the problem is detected quickly and handled automatically.
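The workaround described above could be sketched roughly as follows. This is not hitobito's actual implementation: email_validation_healthy? is a hypothetical name, the validator is passed in as a callable so the check does not depend on how Truemail is wired up, and the 5 second limit is an assumption based on the 11ms vs. ~12 seconds runtimes reported earlier.

```ruby
# Hypothetical liveness check: validate a known-good address and treat a
# failed, raising, or overly slow validation as "unhealthy".
def email_validation_healthy?(known_good_email, validator:, max_seconds: 5)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  valid = validator.call(known_good_email)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  valid == true && elapsed <= max_seconds
rescue StandardError
  # Any exception from the validator also counts as unhealthy.
  false
end

# Example wiring (assumes the truemail gem is loaded and configured):
#   healthy = email_validation_healthy?(
#     'hitobito@puzzle.ch',
#     validator: ->(email) { Truemail.valid?(email) }
#   )
#   # An endpoint used by the Kubernetes liveness probe could answer 200 when
#   # healthy and 503 otherwise, so the container gets restarted on failure.
```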

Thanks for the debugging help.

@kronn You can also use the dockerized truemail instead of using the truemail gem directly in your application. Thanks for your report!

[QUESTION] Truemail not working after some time running in rails process
