lalamove / konfig

Composable, observable and performant config handling for Go for the distributed processing era


General functionality of konfig

ehmo opened this issue · comments

When I ran into konfig, I really liked your approach and extensibility. I started writing an extension to support consul, fixed a couple of bugs, and was almost ready to submit a pull request. However, I ran into a more philosophical issue and wanted to check what your thinking is.

You built konfig to behave like a first-class citizen within other people's code. That means you terminate the program if significant issues are found with the integrity of your package, which is fine. However, you also completely stop processing if you run into any kind of integrity error within a config itself.

This is generally fine for local configs and debugging, but it poses major issues for all remote configs. You see, people are sloppy. Just a simple typo in a yaml config will take down the whole parser, which in turn stops konfig from functioning.

Yes, you added some helpers like NoExitOnError, but that only lets the program keep running without letting it retry pulling the remote config, while wreaking havoc in the logs.

My general strategy when using remote configs (sketched below) is:

  • ship a local config that is used until a remote config is successfully fetched
  • overwrite that default config with the remote config once one is found
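To make that concrete, here is a minimal sketch of the pattern using only the standard library. This is not konfig's API, just an illustration; fetchRemote, the keys, and the values are all made up:

	package main

	import (
		"fmt"
	)

	// defaults is the local config shipped with the binary; it is used as-is
	// until a remote config can be fetched successfully.
	var defaults = map[string]string{
		"listen_addr": ":8080",
		"log_level":   "info",
	}

	// overlay copies base and overrides it with whatever keys the remote
	// source returned; a failed or empty remote fetch leaves the defaults intact.
	func overlay(base, remote map[string]string) map[string]string {
		merged := make(map[string]string, len(base)+len(remote))
		for k, v := range base {
			merged[k] = v
		}
		for k, v := range remote {
			merged[k] = v
		}
		return merged
	}

	// fetchRemote is a stand-in for a real remote loader (consul, etcd, ...).
	func fetchRemote() (map[string]string, error) {
		return map[string]string{"log_level": "debug"}, nil
	}

	func main() {
		remote, err := fetchRemote()
		if err != nil {
			// remote fetch failed: keep running on the local defaults
			remote = nil
		}
		cfg := overlay(defaults, remote)
		fmt.Println(cfg["listen_addr"], cfg["log_level"])
	}

The point is that a broken or unreachable remote config degrades to the last known good values instead of stopping the process.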

Stopping the whole process means that no new configs will be pulled (and nobody is told about it). Imagine running 10,000 servers on k8s: because of a typo, none of them can pull a config from etcd, and your only solution is to restart/rebuild the whole cluster.

My suggestion would be to rebuild the logic so that it does not exhibit this behavior. I'm happy to chip in. Otherwise I can submit what I have, but I think that will be it for me.

Hi,

So if we fail to load, we retry as many times as loader.MaxRetry() returns; after that, you are right, we either stop the world or stop watching, based on NoExitOnError. We changed the stop-watching part a few days ago and merged it yesterday. Now, if NoStopOnFailure is set to true, the loop continues and waits for the next watch event.

Here is the code now in master at https://github.com/lalamove/konfig/blob/master/loader.go#L206-L217:

				if err := c.loaderLoadRetry(wl, 0); err != nil {
					// if metrics is enabled we record a load failure
					if c.cfg.Metrics {
						wl.metrics.configReloadFailure.Inc()
						t.ObserveDuration()
					}
					// if NoStopOnFailure is set, skip this loader and wait for the next watch event
					if c.cfg.NoStopOnFailure {
						continue
					}
					// otherwise stop all watchers and exit the watch loop
					c.stop()
					return
				}

The reason we did it this way initially is that we use it to rotate credentials across pods by requesting those credentials from Vault. The thing is, these credentials expire, so we set MaxRetry and RetryDelay to values we can cope with before the credentials expire and break our application; after that, we kill the app gracefully, hoping it affects a single pod, for example, and avoids sending a bunch of 5xx responses.
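To illustrate the sizing (purely hypothetical numbers, not our actual Vault settings): the worst-case time spent retrying, MaxRetry × RetryDelay, just has to end before the credential lease expires, with some margin:

	package main

	import (
		"fmt"
		"time"
	)

	func main() {
		// hypothetical numbers: a Vault lease of 1h and a safety margin of 10m
		leaseTTL := 1 * time.Hour
		safetyMargin := 10 * time.Minute

		// values a loader could return from MaxRetry() / RetryDelay()
		maxRetry := 5
		retryDelay := 2 * time.Minute

		// worst-case time spent retrying before giving up and stopping the world
		retryBudget := time.Duration(maxRetry) * retryDelay

		fmt.Println("budget:", retryBudget, "fits:", retryBudget < leaseTTL-safetyMargin)
	}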

My suggestion is to add a StopOnFailure() bool on the Loader interface and use this to decide whether we stop the world or not when a loader fails. For us, it would make sense to do it on the Vault loader, for example.
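Roughly, it could look like the sketch below. Only MaxRetry, RetryDelay, and the proposed StopOnFailure come from this thread; the loader types and handleFailure are hypothetical stand-ins, not konfig's actual code:

	package main

	import (
		"errors"
		"fmt"
		"time"
	)

	// Loader mirrors only the methods discussed here; the rest of the real
	// interface is out of scope for this sketch.
	type Loader interface {
		MaxRetry() int
		RetryDelay() time.Duration
		// StopOnFailure reports whether a definitive failure of this loader
		// should stop the whole watcher (and the app) instead of waiting for
		// the next watch event.
		StopOnFailure() bool
	}

	// vaultLoader: credentials expire, so failing to rotate them is fatal.
	type vaultLoader struct{}

	func (vaultLoader) MaxRetry() int             { return 5 }
	func (vaultLoader) RetryDelay() time.Duration { return 2 * time.Minute }
	func (vaultLoader) StopOnFailure() bool       { return true }

	// consulLoader: a bad remote config should not take the app down;
	// keep the last good values and retry on the next watch event.
	type consulLoader struct{}

	func (consulLoader) MaxRetry() int             { return 3 }
	func (consulLoader) RetryDelay() time.Duration { return 10 * time.Second }
	func (consulLoader) StopOnFailure() bool       { return false }

	// handleFailure is what the watch loop could do instead of consulting a
	// global NoStopOnFailure flag.
	func handleFailure(l Loader, err error) {
		if !l.StopOnFailure() {
			fmt.Println("load failed, keeping last good config:", err)
			return // wait for the next watch event
		}
		fmt.Println("load failed on a critical loader, stopping:", err)
		// c.stop() / return in the real watch loop
	}

	func main() {
		handleFailure(consulLoader{}, errors.New("typo in yaml"))
		handleFailure(vaultLoader{}, errors.New("vault unreachable"))
	}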

Let me know what you think about it.

> My suggestion is to add a StopOnFailure() bool on the Loader interface and use this to decide whether we stop the world or not when a loader fails. For us, it would make sense to do it on the Vault loader, for example.

I like it. It's easier than the other options. I like the whole proposal in #20.

I'm done with my implementation of the consul loader, so I will push that soon. As I mentioned, I ran into a couple of other things that I will split out and commit separately.

I've just merged the consul loader and the changes adding StopOnFailure to the Loader interface. I'm closing this one.

Thank you