skx / overseer

A golang-based remote protocol tester for testing sites & service availability

Delivery of notifications via MQ sometimes fails on the last test which is applied.

skx opened this issue · comments

NOTE: Before reading this issue, note that it is 100% specific to the "local" worker.

This morning I received a pair of alerts informing me that blogspam had an SSL certificate nearing expiration (one for IPv4 and one for IPv6). This alert was expected, and once I renewed the certificate I expected the notifications to clear, but they did not.

The test is in one file:

   root@www ~ # cat /opt/overseer/tests.d/blogspam.conf 
   # BlogSpam
   https://blogspam.net/xml/stats must run http with content '<spam>'

Running this test manually, like so, should have triggered an MQ notification:

    root@www ~ # /opt/overseer/bin/overseer local -verbose -mq localhost:1883 /opt/overseer/tests.d/blogspam.conf 
    Running 'http' test against blogspam.net (2a01:4f8:151:6083::101)
    SSLExpiration testing: blogspam.net:443
    SSLExpiration - certificate: blogspam.net expires in 2158 hours (89 days)
    SSLExpiration - certificate: Let's Encrypt Authority X3 expires in 24899 hours (1037 days)
    SSLExpiration - certificate: DST Root CA X3 expires in 29625 hours (1234 days)
    	[1/5] - Test passed.
    Running 'http' test against blogspam.net (176.9.183.101)
    SSLExpiration testing: blogspam.net:443
    SSLExpiration - certificate: blogspam.net expires in 2158 hours (89 days)
    SSLExpiration - certificate: Let's Encrypt Authority X3 expires in 24899 hours (1037 days)
    SSLExpiration - certificate: DST Root CA X3 expires in 29625 hours (1234 days)
    	[1/5] - Test passed.

So what went wrong? Well this is what should have happened:

  • Open the notifier (i.e. MQ connection)
  • Parse the tests.
    • For each test run it
    • For each test publish the result over MQ
  • No more tests? Exit

It's the last bit that is the problem:

  • The result of the final test was published to MQ
  • The process exited

However the MQ publishing didn't await an ack, or confirmation, so the actual action was:

  • Fire the message at MQ
  • Exit
    • Before that message was delivered to MQ.
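The sequence above can be reproduced with a small, self-contained sketch. Nothing here is overseer's real code: `broker`, `publishAsync` and `demo` are hypothetical names, the broker is an in-memory stand-in for the MQ server, and the sleeps are arbitrary stand-ins for test run time and network latency.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// broker is a hypothetical in-memory stand-in for the MQ server; it only
// records the messages that have actually arrived.
type broker struct {
	mu        sync.Mutex
	delivered []string
}

func (b *broker) deliver(msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.delivered = append(b.delivered, msg)
}

func (b *broker) count() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.delivered)
}

// publishAsync mimics a fire-and-forget publish: it returns immediately and
// the message reaches the broker only after a simulated network delay.
func publishAsync(b *broker, msg string) {
	go func() {
		time.Sleep(20 * time.Millisecond) // assumed delivery latency
		b.deliver(msg)
	}()
}

// demo runs two pretend tests, publishing a result after each, and reports
// how many results the broker holds at the moment the process would exit.
func demo() (deliveredAtExit, total int) {
	b := &broker{}
	tests := []string{"blogspam.net (IPv6)", "blogspam.net (IPv4)"}
	for _, t := range tests {
		time.Sleep(100 * time.Millisecond) // stand-in for running the test
		publishAsync(b, t+": passed")
	}
	// No more tests: a local run would exit here, without waiting.
	return b.count(), len(tests)
}

func main() {
	got, total := demo()
	fmt.Printf("delivered at exit: %d of %d\n", got, total)
}
```

Earlier results get through only because later tests keep the process alive long enough; the final result has nothing after it, so it is still in flight when the process exits.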

This behaviour explains why the overseer worker mode wasn't affected: in that mode the worker keeps running forever, and the persistent notification setup (as implemented in #17) means there is no join/part to the MQ server, so a published message always has time to be delivered.

In fact if you look at an older commit you can see where I added some code to work around this problem:

    //
    // This seems to be necessary ..  Sigh
    //
    time.Sleep(500 * time.Millisecond)

Adding a sleep is a bad solution because you never know how long you need to sleep - what you actually need to do is await the MQ-delivery, or otherwise have an acknowledgement of some kind.
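As a rough illustration of awaiting delivery instead of sleeping, here is a minimal sketch. It assumes a hypothetical `publish` that returns a channel closed on acknowledgement; it does not show the API of any particular MQ client library, and the latencies and timeouts are made up.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// publish is a hypothetical asynchronous publish. It returns at once, and the
// returned channel is closed when the (simulated) server acknowledges msg.
func publish(msg string, latency time.Duration) <-chan struct{} {
	acked := make(chan struct{})
	go func() {
		time.Sleep(latency) // stand-in for the network round trip
		close(acked)
	}()
	return acked
}

// publishAndWait blocks until the acknowledgement arrives, or gives up after
// timeout. Unlike a fixed sleep it returns as soon as delivery is confirmed,
// and it reports failure instead of silently losing the message.
func publishAndWait(msg string, latency, timeout time.Duration) error {
	select {
	case <-publish(msg, latency):
		return nil
	case <-time.After(timeout):
		return errors.New("timed out waiting for MQ acknowledgement")
	}
}

// demo exercises both outcomes: a prompt ack, and an ack that arrives too late.
func demo() (fast, slow error) {
	fast = publishAndWait("blogspam.net: passed", 10*time.Millisecond, time.Second)
	slow = publishAndWait("blogspam.net: passed", time.Second, 50*time.Millisecond)
	return fast, slow
}

func main() {
	fast, slow := demo()
	fmt.Println("prompt ack:", fast)
	fmt.Println("late ack:  ", slow)
}
```

The timeout bounds the wait if the server never answers, but it is an upper limit rather than a guess at how long delivery takes.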

In conclusion:

  • Our MQ-publish must await a successful delivery.
    • If that means a new/different client library then so be it.

This issue covers my problem:

Thinking about this further there are two useful approaches I could take:

  • Look for a new MQ library.
  • Drop the local-mode entirely.

Looking for a new MQ library, one that explicitly acknowledges sends, is the simplest solution, but after a quick search I can't find anything suitable off-hand. Moving to Amazon SQS would solve the problem, but I suspect being tied to Amazon might make the project less useful for others. (Similarly, moving to RabbitMQ would allow more reliability, but that's more heavyweight.)

Dropping the local mode solves the problem by accident: the worker mode always keeps a connection to MQ open, and since it never exits we don't need to worry about messages getting lost. This also simplifies the README.md file, as we don't have to distinguish between "small" and "large" networks.

However, making the project work in a distributed fashion by default means that users have a slightly more complex setup, and Redis becomes mandatory (probably not a huge concern). I'm actually leaning in this direction, but I'll dwell on it some more first.

(Kept the local mode; updated the MQ-handling to subscribe to the topic we're publishing upon, such that it can see its own message.)
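The subscribe-to-our-own-topic idea can be sketched as follows. This is only an illustration of the technique, not the code that was committed: `bus` is a hypothetical in-memory stand-in for the MQ server, and the topic name `overseer`, the latency and the timeout are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// bus is a hypothetical in-memory stand-in for the MQ server.
type bus struct {
	mu   sync.Mutex
	subs map[string][]chan string
}

func newBus() *bus { return &bus{subs: make(map[string][]chan string)} }

// subscribe returns a channel that receives every message sent to topic.
func (b *bus) subscribe(topic string) <-chan string {
	ch := make(chan string, 16)
	b.mu.Lock()
	b.subs[topic] = append(b.subs[topic], ch)
	b.mu.Unlock()
	return ch
}

// publish is fire-and-forget: it returns at once and the message is fanned
// out to subscribers after a simulated delivery delay.
func (b *bus) publish(topic, msg string) {
	go func() {
		time.Sleep(20 * time.Millisecond) // assumed delivery latency
		b.mu.Lock()
		defer b.mu.Unlock()
		for _, ch := range b.subs[topic] {
			select {
			case ch <- msg:
			default: // subscriber buffer full: drop rather than block
			}
		}
	}()
}

// publishConfirmed publishes msg and then waits until that same message comes
// back on echo, our own subscription to the topic. Seeing it return proves the
// server received it, so exiting afterwards is safe.
func publishConfirmed(b *bus, echo <-chan string, topic, msg string, timeout time.Duration) error {
	b.publish(topic, msg)
	deadline := time.After(timeout)
	for {
		select {
		case got := <-echo:
			if got == msg {
				return nil
			}
			// Some other message on the topic: keep waiting for ours.
		case <-deadline:
			return errors.New("message was not echoed back before the timeout")
		}
	}
}

// demo subscribes first, then publishes and waits for the echo.
func demo() error {
	b := newBus()
	echo := b.subscribe("overseer") // assumed topic name
	return publishConfirmed(b, echo, "overseer", "blogspam.net: passed", time.Second)
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("not confirmed:", err)
		return
	}
	fmt.Println("confirmed; safe to exit")
}
```

The subscription has to be in place before the publish, otherwise the echo could be missed. The appeal of this approach is that it needs no acknowledgement support from the client library, only the ability to subscribe.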