Make cowbell compliant to usage with 'included_applications'

jaynel opened this issue 8 years ago · comments

If you do any initialization outside of the root supervisor, it is not possible to use cowbell as an included_application without replicating that pre-start initialization code in the including application.

Roberto Ostinelli · Answer 1 · Thu May 12 2016 16:00:48 GMT+0800 (China Standard Time)

Included applications will normally be started before your own application, so when you get to yours - they are already up and running.

What do you think is incorrect about connecting to nodes when you start your own application? Would you mind expanding and giving a short example?

Jay Nelson · Answer 2 · Thu May 12 2016 23:18:52 GMT+0800 (China Standard Time)

Included applications are not started before your application. They are part of your supervision hierarchy. See http://erlang.org/erldoc Chapter 8.

> The application controller automatically loads any included applications when loading a primary
> application, but does not start them. Instead, the top supervisor of the included application must be
> started by a supervisor in the including application.

The key is that no initialization should be done prior to a root supervisor's start_link being called. When other applications include yours, that initialization will not be included because you merely call the root supervisor's start_link from your own supervisor.
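As a minimal sketch of what "call the root supervisor's start_link from your own supervisor" can look like, the including application's supervisor lists the included top supervisor as an ordinary child. The names `myapp_sup` and the arity of `cowbell_sup:start_link/0` are assumptions for illustration, not taken from the cowbell source.

```erlang
-module(myapp_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    %% ASSUMPTION: cowbell_sup:start_link/0 is the included application's
    %% top supervisor entry point; check the library for the real arity.
    CowbellSup = #{id => cowbell_sup,
                   start => {cowbell_sup, start_link, []},
                   restart => permanent,
                   shutdown => infinity,
                   type => supervisor,
                   modules => [cowbell_sup]},
    {ok, {SupFlags, [CowbellSup]}}.
```

Because the included supervisor is just a child here, its restart behaviour is governed by `myapp_sup`'s own strategy and intensity settings.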

Most OSS applications just use the application:start approach which guarantees two things:

1. You will randomly find startup issues when "application XXX has not been started"
2. If an application crashes, it won't be restarted.

Jay Nelson · Answer 3 · Thu May 12 2016 23:22:30 GMT+0800 (China Standard Time)

My issue was triggered by your README recommendation:

start(_StartType, _StartArgs) ->
    %% connect to nodes
    cowbell:connect_nodes(),
    %% start sup
    myapp_sup:start_link().

Your actual cowbell:start does not do any initialization outside of the root supervisor start_link, but recommending that to others does. I suppose it only matters if someone else were to write a library that includes cowbell, expecting others to include their library.

In general, you should not do any initialization outside of the root supervisor's start_link. The cowbell:connect_nodes() should be part of some supervisor or gen_server initialization.
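One way to follow this advice, sketched under assumptions: a small gen_server, placed under the application's own supervisor, performs the connect call in its `init/1`, so the call is repeated whenever the process is restarted. The module name `myapp_connector` and the injectable `ConnectFun` are hypothetical; only `cowbell:connect_nodes/0` comes from the README excerpt above.

```erlang
-module(myapp_connector).
-behaviour(gen_server).

-export([start_link/0, start_link/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Default: use the call recommended in the README.
start_link() ->
    start_link(fun cowbell:connect_nodes/0).

%% ConnectFun is injectable so the process can be tried without cowbell.
start_link(ConnectFun) when is_function(ConnectFun, 0) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, ConnectFun, []).

init(ConnectFun) ->
    %% Runs on first start and again on every supervisor restart.
    _ = ConnectFun(),
    {ok, ConnectFun}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

With this in place, `start/2` would only call `myapp_sup:start_link()`, and `myapp_connector` would be one of that supervisor's children.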

Roberto Ostinelli · Answer 4 · Fri May 13 2016 04:25:40 GMT+0800 (China Standard Time)

> Included applications are not started before your application.

Unless you're using it for development, you are going to produce a release. When you do that, the included_applications will be started for you by the boot script.

On top of that, many applications provide a helper function (such as myapp:start/0) which also starts the dependencies. Just an example: exometer_core:start/0 will ensure that all dependencies are started. This is a common practice.
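For reference, a helper of this kind is often a thin wrapper around `application:ensure_all_started/1`. The sketch below uses a hypothetical `myapp` and is not the exometer_core implementation. Note that this call starts the applications listed under the `applications` key; entries under `included_applications` are loaded but not started by it.

```erlang
-module(myapp).
-export([start/0]).

%% Start myapp plus every application named under `applications` in its
%% .app file, in dependency order. Included applications are only loaded.
-spec start() -> {ok, [atom()]} | {error, term()}.
start() ->
    application:ensure_all_started(myapp).
```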

That being said, cowbell doesn't even have other dependencies than the kernel and stdlib, so I'm not sure why you're raising this point as this doesn't really matter.

You say that:

> The key is that no initialization should be done prior to a root supervisor's start_link being called.
> [...]
> The cowbell:connect_nodes() should be part of some supervisor or gen_server initialization.

No, I disagree. There are various kinds of initialization that can be done in the application module. Don't take my word for it; there are various examples that do so (for instance, here).

Anyway. Cowbell only connects nodes. Nothing more.

So now you've got me curious. I'm still waiting to know what your issue was, and for an example of why you believe that starting cowbell in the application's start/2 function is "NOT the recommended way". I'll be glad to change or nuance my example if needed.

Jay Nelson · Answer 5 · Fri May 13 2016 06:07:50 GMT+0800 (China Standard Time)

You've confused "included_applications" with "applications". There is a separate chapter on each in the documentation. This is a common mistake because nearly all OSS projects use dependent applications which are started before the main application.

I will requote the explicit requirement "the top supervisor of the included application must be started by a supervisor in the including application". You do not use application:start() for included_applications, therefore any code which executes in that function will not run (including examples like those in Cowboy).

Included applications are deployed as releases just like other applications. The difference is in the application resource file (.app / .app.src), which lists them under the included_applications key rather than the applications key.
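A hedged illustration of the two keys side by side, for a hypothetical including application named `myapp` (the description, version and module names are placeholders):

```erlang
%% myapp.app.src (hypothetical including application)
{application, myapp,
 [{description, "Example application that embeds cowbell"},
  {vsn, "0.1.0"},
  {registered, [myapp_sup]},
  {mod, {myapp_app, []}},
  %% Started by the application controller before myapp.
  {applications, [kernel, stdlib]},
  %% Loaded together with myapp but never started automatically;
  %% myapp's own supervisor must start the included top supervisor.
  {included_applications, [cowbell]}]}.
```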

Here's a simple example of using Yaws as an included_application: https://github.com/duomark/dk_yaws/blob/master/src/dk_yaws_app.erl. Note that application:start(yaws) is never called, because the yaws_sup is embedded directly into the supervisor hierarchy. Also look at the app.config properties https://github.com/duomark/dk_yaws/blob/master/src/dk_yaws.app.src#L8-L9.

If yaws did any initialization prior to starting yaws_sup, my code would not have that environment. Adding ets tables or connecting nodes or whatever would not be present, and the including application would fail to startup or would fail at some later arbitrary place that assumes that initialization to have been done.

I guess my erlang documentation link wasn't specific enough to point to the section with the erldoc link. Look at http://erlang.org/doc/design_principles/users_guide.html Chapter 8 "Included Applications" as compared to Chapter 7 "Applications". I don't see the admonishment to not do initialization in the application:start(), but I seem to recall it being there a decade ago, so maybe things have been "improved" in the documentation.

If you look at http://erlang.org/doc/apps/kernel/application.html and read application:start/1 you see it defaults to application:start(App, temporary). This means all the OSS examples you refer to which explicitly call application:start(App) will suffer from failure of App by not restarting. Your only choice is to make App permanent, thereby stopping the VM and all applications if App fails or stops. The only way to allow a crashing App to restart is to use included_application and wedge the root supervisor under a supervisor for which you control the restart logic.
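To make the restart type explicit rather than relying on the `temporary` default, a caller can use `application:start/2`. This is a small sketch; the module name `start_types` and the helper `start_dep/2` are hypothetical.

```erlang
-module(start_types).
-export([start_dep/2]).

%% Start App with an explicit restart type.
%%   permanent: if App terminates, all other applications and the
%%              runtime system are terminated as well.
%%   transient: as permanent, unless App terminates with reason normal.
%%   temporary: termination is reported, nothing else is stopped.
-spec start_dep(atom(), permanent | transient | temporary) ->
          ok | {error, term()}.
start_dep(App, Type) ->
    case application:start(App, Type) of
        ok -> ok;
        {error, {already_started, App}} -> ok;
        {error, _} = Error -> Error
    end.
```

None of the three types restarts a crashed application in place, which is the limitation described above.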

I have had this discussion before on the email list, and included_applications are almost universally not used, and especially so in OSS repos, but by providing a library which violates the rules you are precluding the possibility of anyone using included_applications with restart of your library logic. Just something to be aware of (and looking at your code, it is just the README instruction which recommends violating the restriction).

The best way to determine whether you are using included_applications is to call application:get_application(Pid_Or_Module). If I use application:start(cowbell) prior to starting MyApp, application:get_application(cowbell_sup) will return cowbell. If it is embedded as an included_application the same call will return MyApp.

Roberto Ostinelli · Answer 6 · Fri May 13 2016 20:46:10 GMT+0800 (China Standard Time)

I see.

There are very few people using the included_applications feature from OTP, and those who do (like yourself) generally know what they need to do to use it.

For the rest of us, providing a simple example on how to use a library in a "standard" application is generally seen as helpful. I still think that my recommended example is the way to go for non-included applications.

I'll consider adding a note for it in the README.

Jay Nelson · Answer 7 · Sat May 14 2016 00:17:29 GMT+0800 (China Standard Time)

Just note that your recommendation means that it is not possible for cowbell to restart when it fails. What advantage is there to doing initializations outside of a supervisor?

Roberto Ostinelli · Answer 8 · Sat May 14 2016 01:03:20 GMT+0800 (China Standard Time)

> Just note that your recommendation means that it is not possible for cowbell to restart when it fails.

No, cowbell is an app with its own supervisors and it will restart if it fails. Again, I won't care about the included_applications case which has very limited and marginal usages. Feel free to use it (or not) as it pleases you.

Jay Nelson · Answer 9 · Sat May 14 2016 01:22:29 GMT+0800 (China Standard Time)

If any of your gen_servers fail, the supervisor will restart them. If the application fails (i.e., if enough failures trigger your root supervisor to fail), the VM will not restart it.

Did you read the section on application:start/1 about temporary applications?

http://erlang.org/doc/apps/kernel/application.html

Roberto Ostinelli · Answer 10 · Sat May 14 2016 02:05:43 GMT+0800 (China Standard Time)

@jaynel, you might be surprised, but I'm quite familiar with Erlang's applications. :)

Ok, I got your point of wanting to use included_applications, which is why you keep on insisting on applications failure. And I already said: this is marginal, most installations don't include applications in other applications.

That being said: if you wish, issue a pull request with your recommendations and I'll do my best to review it.

Roberto Ostinelli · Answer 11 · Sat May 14 2016 16:44:29 GMT+0800 (China Standard Time)

Oops, didn't mean to close the issue. @jaynel, let me know if you want to issue a PR for the README with your recommendations or not.

Jay Nelson · Answer 12 · Tue May 17 2016 05:15:24 GMT+0800 (China Standard Time)

Your recommendations fit your intended and supported use case, so they are fine as is.

Roberto Ostinelli · Answer 13 · Tue May 17 2016 23:31:31 GMT+0800 (China Standard Time)

I've added a note similar to what you're suggesting to Syn's README. Is this what you were referring to for this repo as well? Or do you see other implications that need to be tackled for this library to be usable as an included application?

Jay Nelson · Answer 14 · Wed May 18 2016 00:52:47 GMT+0800 (China Standard Time)

It's a philosophy thing, which is especially unknown in the Open Source community because it is part of deeper aspects of reltool, releases, and upgrade/downgrade of live nodes, in addition to true non-stop services like 911 emergency service. Having an application fall over silently and never restart can't occur in essential services.

I objected in the case of cowbell and probably in the case of syn as well because these are deep foundational plumbing services that should always be on, and totally transparent to the integrator once the restart logic and notifications are wired in to a higher level service. But it is fighting a losing battle against the world of rapid prototypes which are never made bulletproof, and the type of OSS borrowing done by startups to get to a liquidity event.

"It's never happened to me" suffices in the software world (and often with just a few months of running time). It would not suffice in other engineering disciplines. Foundational libraries need to be engineered, and over time should incorporate much deeper knowledge of the language, environment, deployment scenarios, and use cases than what an integrator has to offer for their specific application.

Personally, I would never recommend using application:start/1 for a library dependency, nor for my library as a dependency. I would recommend application:start(App, permanent) for very specific cases -- when the application cannot function properly with the missing dependency. In all other cases, I would use included_applications if at all possible.

cowbell is a useful feature: maintaining a connection network. The initialization is a tricky bit, so I don't have any real suggestions for you on startup. It really depends on how connectivity plays into the integrating application. Should the cluster always be connected? Can I disconnect while service is unavailable, or during software upgrade/downgrade, etc.? And there is a chicken/egg startup issue, so I couldn't come up with a clear-cut suggestion for how to handle the connect_nodes/0 call. In general, I would try to make it happen on the poller for the first time and every other time, rather than having an exceptional first-time connect call. But that impacts assumptions about how the library works.

Library design, engineering and implementation is hard. And it takes multiple attempts, feedback from users, and often actual failures in the real world before you can find the right approach.

As to syn, I have misgivings about its goals, but that is my personal bias. In a distributed environment, I don't like fighting to make things serialized, single-threaded, registered singletons, or other forms of non-distributed representation. I would rather embrace distributed and employ it. Things like CRDTs feel more natural to me. But I don't have problems and solutions for things you are trying to solve so it is just an uneasiness I have with locking, blocking message flow, enforcing message ordering after the fact, etc. I try not to provide commentary when I don't have solutions, but I do speak up when something seems wrong or likely to fail.

Syn has initialize once only, and ordering, and other things. The whole philosophy of the library has to fit together and be perceived as a simple functional API that embeds neatly in an application. So maybe the approach you have taken is appropriate.

What I don't like is having initialization outside of a root supervisor. You can't relocate or restart that supervisor without knowing what happened before. There is no OTP mechanism to know or re-execute, or avoid executing, inits which happen inside start before the root supervisor initialization. Adding more of them, or insisting on using that approach as a standard style breaks one of the features of OTP applications.

Roberto Ostinelli · Answer 15 · Wed May 18 2016 20:58:41 GMT+0800 (China Standard Time)

> But it is fighting a losing battle against the world of rapid prototypes which are never made bulletproof

Actually I'm here now giving you proper attention, to hear what you'd think might be necessary to bring all of this to a higher integration level. :)

Thank you for sharing these thoughts.

Jay Nelson · Answer 16 · Thu May 19 2016 00:24:34 GMT+0800 (China Standard Time)

A similar issue came up yesterday from some people I am working with. The example was a cowboy integration. They were having trouble getting the dependencies to start properly (there was a list of 9 or so application:start/1 calls). In the main application:start, the cowboy dispatch table was initialized before starting the root supervisor of the main application.

The first bad outcome is that client requests will be accepted by Cowboy before the modules to serve them have been initialized (and generally it works in testing due to hot-code loading), leading to potential crashing if there are missing ets tables or other side-effect setups that need to occur.

The second issue is that cowboy could never be re-initialized if the dispatch table was corrupt, changed, or somehow cowboy crashed and restarted (using application:start(cowboy) just means it would silently die and not restart, so manual restart and reinit would have to live elsewhere as well).

I've had discussions with Ulf about this and he seemed to indicate that AXD raised most of these issues with OTP and they addressed them as best they could for that application, but the start, included_applications, and the start_phase stuff was never quite as thoroughly thought out as the rest of OTP (I may have this all wrong, it is from memory, so blame me if it is a wrong telling of the story).

Presumably, the correct way for something like the cowboy init is to start the root supervisor and do no other initialization. Then in the same module of the start(), add start_phases and maybe stop_phases if needed. The primary issue this addresses is the proper startup sequence. (Fred's writeup on "It's all about the Guarantees" talks about supervision startup and synch/asynch considerations.) All the supervisors, applications, and included_applications should be started by the time the first start_phase is synchronously called. The start_phases are listed in order in the app.config, and thus you can easily modify the startup sequence without recompiling and deploying new code. This is the obvious place to do things like connecting a node network and initializing the cowboy dispatch table.
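A minimal sketch of that arrangement, assuming a hypothetical `myapp` whose resource file declares two phases. The phase names, `myapp_app` and `myapp_sup` are illustrative; `cowbell:connect_nodes/0` is the call from the README excerpt earlier in the thread, and its return value is ignored here because it is not specified in this discussion.

```erlang
%% In myapp.app.src (hypothetical), phases run in the order listed:
%%   {mod, {myapp_app, []}},
%%   {start_phases, [{connect_nodes, []}, {init_dispatch, []}]}
-module(myapp_app).
-behaviour(application).

-export([start/2, start_phase/3, stop/1]).

%% Only the root supervisor is started here; no other initialization.
start(_StartType, _StartArgs) ->
    myapp_sup:start_link().

%% Called synchronously, once per phase, after start/2 has returned.
start_phase(connect_nodes, _StartType, _PhaseArgs) ->
    _ = cowbell:connect_nodes(),
    ok;
start_phase(init_dispatch, _StartType, _PhaseArgs) ->
    %% Placeholder for e.g. building an HTTP dispatch table.
    ok.

stop(_State) ->
    ok.
```

Each `start_phase/3` clause must return `ok` or `{error, Reason}`; an error aborts the application start.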

The problem with start_phases is that they are never called on restart (because applications never restart). So you could use a start_phase, and then call that particular start_phase from the node connect monitor for example, to have one place in the code that maintains connections. Another caveat is that start_phases interplay with included_applications. If there is a start_phase in your library and your library is an included_application, the same start_phase has to be called from any including application or else it won't happen (because included_applications are never started by OTP automatically).
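For the included_applications case, OTP ships a helper for this: the primary application names `application_starter` in its `mod` key, and that module calls `start_phase/3` for the primary application and for its included applications. The fragment below is a sketch with hypothetical names, and it assumes the included library defines a `connect_nodes` start phase with its own `mod` and `start_phases` keys, which is not confirmed for cowbell.

```erlang
%% myapp.app.src (hypothetical). application_starter is part of kernel.
{application, myapp,
 [{vsn, "0.1.0"},
  {applications, [kernel, stdlib]},
  {included_applications, [cowbell]},
  %% Phases the primary application wants run, in order.
  {start_phases, [{connect_nodes, []}]},
  %% application_starter calls myapp_app:start/2 and then the phases of
  %% myapp and of each included application that declares the same phase.
  {mod, {application_starter, [myapp_app, []]}}]}.
```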

This all gets complicated to explain clearly and succinctly, which is why there aren't good examples on GitHub. Libraries necessarily try to make a complicated feature simple to employ. Generally, they do so by taking a philosophical stance which allows assumptions to be used for consistency and to hide unnecessarily flexible details. The stronger and clearer your philosophy of how to use the library, the more likely you are to have a preferred opinion on how to implement it.

Stepping back from this discussion (which I have found useful to organize a whole soup of "gut feelings" I've been confronted with over the last week from various projects), I would organize things as follows:

I. Implementation

1. A single module that has pure functional "connect" functions (connect, disconnect, probe status)
2. A gen_server for polling, reconnecting, etc, that uses Module 1 for implementation
3. Add start_phases to your application module (which call Module 1 for implementation):
   a) Connect nodes
   b) Disconnect nodes
4. Only start the root supervisor in your application:start()

Now you are committed to neither an application, nor an included_application. Start_phases are available to connect the nodes, but at a controlled time after all other initialization has occurred.
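A rough sketch of the first item, the stateless "connect" module, is shown here under assumptions. The module name `conn_lib` and its function names are hypothetical and do not reflect cowbell's actual API; the calls it wraps (`net_kernel:connect_node/1`, `erlang:disconnect_node/1`, `nodes/0`) are standard OTP.

```erlang
-module(conn_lib).
-export([connect/1, disconnect/1, status/1]).

%% Try to connect to each node; returns {Node, Result} pairs.
%% Result is `ignored` when the local node is not distributed.
-spec connect([node()]) -> [{node(), boolean() | ignored}].
connect(Nodes) ->
    [{N, net_kernel:connect_node(N)} || N <- Nodes].

%% Force disconnection of each node.
-spec disconnect([node()]) -> [{node(), boolean() | ignored}].
disconnect(Nodes) ->
    [{N, erlang:disconnect_node(N)} || N <- Nodes].

%% Report which of the given nodes are currently reachable.
-spec status([node()]) -> [{node(), connected | disconnected}].
status(Nodes) ->
    Up = [node() | nodes()],
    [{N, case lists:member(N, Up) of
             true -> connected;
             false -> disconnected
         end} || N <- Nodes].
```

Both a polling gen_server and the `start_phase/3` callbacks could then call these functions, so the connect logic lives in one place.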

II. Documentation

1. Describe that start_phases are available
2. Give an example of using a start_phase for connect/disconnect
   a) Using application:start + start_phases
   b) Calling connect/disconnect from other code (e.g., live pause/reconfig/resume)

It is now configurable and under the control of the integrator when the network is connected and when it is not.

III. Advanced Documentation

1. Separate section for advanced details
2. Explain application:start + start_phases
3. Explain included_applications + start_phases
4. Show a mechanism for triggering connect/disconnect from eventing
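One possible eventing mechanism, sketched under assumptions, is a gen_server that subscribes to node status messages with `net_kernel:monitor_nodes/1` and tries to reconnect when a watched node goes down. The module name `conn_monitor` is hypothetical, and a production version would likely add backoff and notification hooks rather than reconnecting immediately.

```erlang
-module(conn_monitor).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Nodes is the list of nodes this process should keep connected.
start_link(Nodes) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Nodes, []).

init(Nodes) ->
    %% Subscribe to {nodeup, Node} and {nodedown, Node} messages.
    ok = net_kernel:monitor_nodes(true),
    {ok, Nodes}.

handle_info({nodedown, Node}, Nodes) ->
    %% Only chase nodes we were asked to keep connected.
    case lists:member(Node, Nodes) of
        true -> _ = net_kernel:connect_node(Node);
        false -> ok
    end,
    {noreply, Nodes};
handle_info(_Other, Nodes) ->
    {noreply, Nodes}.

handle_call(_Request, _From, Nodes) ->
    {reply, ok, Nodes}.

handle_cast(_Msg, Nodes) ->
    {noreply, Nodes}.
```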

This sort of library is really going to be deployed in a production distributed environment. Devops territory. The advanced documentation should cover all the issues that devops would have with releasing, deploying, starting, pausing, resuming, stopping, etc. It should cover how to do rolling upgrades (since that is what most everyone is doing), while managing the connected network in an orderly fashion.

This may not be the philosophy you take, but you should have a strong philosophy that allows you to answer questions like the ones I have raised.

Roberto Ostinelli · Answer 17 · Thu May 19 2016 18:16:06 GMT+0800 (China Standard Time)

Thank you @jaynel, I will keep this one open and hope to find time to go through this. If you feel like contributing with code please let me know.