Load balancing on actor groups

Question

LukeMathWalker opened this issue 4 years ago · comments

The use case

Certain types of workloads follow a request/response or RPC pattern, for example:

responding to an HTTP request in a REST API;
sending an acknowledgment to an AMQP broker when a consumer has successfully processed a message.

To ensure high availability, these workloads are typically fulfilled by a software system with several replicated processing units. Incoming requests/messages may modify the state of the system as a whole (e.g. trigger a change in a database), but more often than not the processing units themselves are stateless or hold only ephemeral state: the bare minimum required to fulfill a task (e.g. a connection to a database).
In other words, we don't care which specific processing unit handles a request or message - it won't affect the outcome.

Actors can be quite useful to reason about these workloads:

a REST API can be modeled as a TcpListener that dispatches requests to a set of actor workers;
a queue worker can be modeled as several groups of actors, one group for each handler or message type.
Each actor encapsulates the logic to build the required ephemeral state (e.g. connect to a db, create a queue and bind it to an exchange) and process incoming units of work.

Supervisors can be used to ensure that our system recovers gracefully when a transient failure occurs (e.g. a network disruption causes most of the queue workers to lose their connection to the broker and panic/return an error). Just model the happy case, let it crash if some of the resources you depend on are not available, and delegate recovery to the supervisor, which restarts the actors with exponential/linear backoff.

Current limitation

In 0.3.4 we can have redundancy for a children group via a supervisor and a restart policy, but:

There is no way to ask a group of children and receive a single response;
ChildrenRef is invalidated if a restart occurs;
There is no way to get a ChildrenRef/Vec<ChildRef> from a SupervisorRef.

In master, via Dispatcher, bastion natively supports the concept of an addressable target group, but the communication is one-sided: there is no way to get a response back, because different dispatcher implementations might implement different semantics (pure broadcasting or 1-to-many, noop or message dropping, load balancing or 1-to-1).

Proposal

Make a clear distinction in the Dispatcher trait between 1-to-1 (ask/tell) and 1-to-many (broadcast) communication.
Even if bastion's next release only ends up supporting one-way communication (tell/broadcast), this distinction will allow backward-compatible extension of the API and remove ambiguity about how a given communication flow should be implemented.
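To make the distinction concrete, here is a rough, std-only sketch of what a split trait could look like. It is not bastion's actual Dispatcher trait: GroupDispatcher, Mailbox, LeastLoaded and the method names are all made up for illustration, and mailboxes are plain vectors rather than real actor channels.

```rust
/// Stand-in for an actor mailbox (assumption: a real system would use channels).
pub type Mailbox = Vec<String>;

/// Hypothetical trait that keeps 1-to-1 and 1-to-many delivery separate.
pub trait GroupDispatcher {
    /// 1-to-1 (tell/ask): deliver to exactly one member and return its index,
    /// so a reply could later be correlated with that recipient.
    /// Returns `None` when the group is empty.
    fn tell_one(&mut self, group: &mut [Mailbox], msg: &str) -> Option<usize>;

    /// 1-to-many (broadcast): deliver a copy to every member and return
    /// how many members received it. No customisation is needed here.
    fn broadcast(&mut self, group: &mut [Mailbox], msg: &str) -> usize {
        for mailbox in group.iter_mut() {
            mailbox.push(msg.to_string());
        }
        group.len()
    }
}

/// Example 1-to-1 policy: route to the member with the shortest mailbox.
pub struct LeastLoaded;

impl GroupDispatcher for LeastLoaded {
    fn tell_one(&mut self, group: &mut [Mailbox], msg: &str) -> Option<usize> {
        // On ties, `min_by_key` returns the first minimum.
        let idx = group
            .iter()
            .enumerate()
            .min_by_key(|(_, mailbox)| mailbox.len())
            .map(|(i, _)| i)?;
        group[idx].push(msg.to_string());
        Some(idx)
    }
}
```

With this shape an implementor only decides how a single recipient is picked; broadcasting has one fixed meaning, so a caller never has to guess whether a given dispatcher can produce exactly one response.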

Luca Palmieri · Answer 1 · Fri Jun 19 2020 04:22:45 GMT+0800 (China Standard Time)

Speaking with very little awareness of implementation constraints, the API I have in mind looks more or less like this (compare it with the existing example https://github.com/bastion-rs/bastion/blob/master/src/bastion/examples/middleware.rs):

fn main() {
    env_logger::init();

    Bastion::init();

    // Workers that process the work. 
    let supervisor = Bastion::supervisor(|sp| sp.with_restart_strategy(RestartStrategy::default())).unwrap();
    let workers = supervisor.children(|children: Children| {
        children
            .with_redundancy(100)
            .with_exec(move |ctx: BastionContext| {
                async move {
                    loop {
                        msg! { ctx.recv().await?,
                            stream: TcpStream =!> {
                                // ... processing logic ...
                                answer!(ctx, stream).expect("Couldn't send an answer.");
                            };
                            _: _ => ();
                        }
                    }
                }
            })
    })
    .expect("Couldn't start a new children group.");

    let workers = Arc::new(workers);

    // Server entrypoint
    Bastion::children(|children: Children| {
        children.with_exec(move |ctx: BastionContext| {
            let workers = workers.clone();
            async move {
                println!("Server is starting!");

                let listener = TcpListener::bind("127.0.0.1:2278").unwrap();

                for stream in listener.incoming() {
                    let _ = ctx.ask_group(&workers, stream.unwrap(), RoundRobinStrategy).unwrap().await?;
                }

                // Send a signal to system that computation is finished.
                Bastion::stop();

                Ok(())
            }
        })
    })
    .expect("Couldn't start a new children group.");

    Bastion::start();
    Bastion::block_until_stopped();
}

where RoundRobinStrategy would implement a LoadBalance trait along these lines:

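pub trait LoadBalance {
    // Trait methods take no visibility qualifier; the explicit lifetime ties the
    // returned reference to `children` rather than to `&self`.
    fn choose<'a>(&self, children: &'a [ChildRef]) -> &'a ChildRef;
}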

The ask_group implementation would then call choose internally to determine which actor should handle the message. The same applies to tell_group. The ChildRefs that belong to the group could be retrieved from a concurrent hashmap that tracks group membership for actors, I imagine somewhat similarly to what is currently in place for dispatchers.
If the chosen actor dies between the moment its address is chosen and the actual dispatch, some smart logic in ask_group and tell_group could use dead letter mailboxes to re-route the message.
Broadcast would be implemented by bastion directly on ChildrenRef, without customisation options (i.e. send the message to all actors in the group).
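For what it's worth, a round-robin strategy for the LoadBalance trait described above could be sketched like this. Everything here is illustrative: ChildRef is a stand-in struct rather than bastion's type, and choose returns an Option (a deviation from the signature in the post) so that an empty group does not panic.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for bastion's `ChildRef`; only an id is modelled (assumption).
#[derive(Debug, PartialEq)]
pub struct ChildRef {
    pub id: usize,
}

/// The proposed trait, adjusted to return `None` for an empty group.
pub trait LoadBalance {
    fn choose<'a>(&self, children: &'a [ChildRef]) -> Option<&'a ChildRef>;
}

/// Hypothetical round-robin strategy: a shared counter walks the slice.
/// The atomic counter lets `choose` take `&self`, so one strategy value
/// can be shared between several callers.
#[derive(Default)]
pub struct RoundRobinStrategy {
    next: AtomicUsize,
}

impl LoadBalance for RoundRobinStrategy {
    fn choose<'a>(&self, children: &'a [ChildRef]) -> Option<&'a ChildRef> {
        if children.is_empty() {
            return None;
        }
        // `fetch_add` wraps around on overflow; the modulo keeps the index in range.
        let i = self.next.fetch_add(1, Ordering::Relaxed) % children.len();
        children.get(i)
    }
}
```

Note that if the group changes size between calls (e.g. after a restart), the rotation simply continues from the current counter value, so the order is only strictly round-robin while membership is stable.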

Feel free to tear this apart as non-sensical 😅

Luca Palmieri · Answer 2 · Fri Jun 19 2020 16:02:02 GMT+0800 (China Standard Time)

Re-reading this today, I noticed that I put the choice of load balancing strategy on the caller, while it makes more sense (and it's more coherent with the current structure) to have it on the consumer (ChildrenRef) using a method like with_lb_strategy, similar to the current with_dispatcher.
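A rough sketch of what moving the strategy to the consumer side could mean, using only std. WorkerGroup, First, Last and pick are invented names, members are plain strings instead of actors, and the trait is simplified to return an index; only with_lb_strategy mirrors the method name suggested above.

```rust
/// Simplified strategy trait for this sketch: pick an index in `0..group_size`.
/// `group_size` is never zero when called by `WorkerGroup`.
pub trait LoadBalance {
    fn choose(&self, group_size: usize) -> usize;
}

/// Two trivial strategies, just to show that the group owns the choice.
pub struct First;
impl LoadBalance for First {
    fn choose(&self, _group_size: usize) -> usize {
        0
    }
}

pub struct Last;
impl LoadBalance for Last {
    fn choose(&self, group_size: usize) -> usize {
        group_size - 1
    }
}

/// Hypothetical stand-in for a children group that carries its own strategy.
pub struct WorkerGroup {
    members: Vec<String>,
    strategy: Box<dyn LoadBalance>,
}

impl WorkerGroup {
    pub fn new(members: Vec<String>) -> Self {
        WorkerGroup { members, strategy: Box::new(First) }
    }

    /// Builder-style setter, configured once where the group is defined.
    pub fn with_lb_strategy(mut self, strategy: impl LoadBalance + 'static) -> Self {
        self.strategy = Box::new(strategy);
        self
    }

    /// Callers ask the group for a recipient without naming a strategy.
    pub fn pick(&self) -> Option<&str> {
        if self.members.is_empty() {
            return None;
        }
        let i = self.strategy.choose(self.members.len());
        self.members.get(i).map(String::as_str)
    }
}
```

The caller side then stays the same whichever strategy the group was built with, which is the point of putting the choice on the consumer.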

Jeremy Lempereur · Answer 3 · Fri Aug 21 2020 00:59:25 GMT+0800 (China Standard Time)

Just merged #268, which fixes a couple of dispatcher-related hiccups and allows you to have round robin with named DispatcherTypes and named BroadcastTargets.

I'm not a fan of the naming for now, but there's more in the works to make it easier, hopefully soon ™️ :D

If you want to define your own dispatch mechanism, we might need to dig a bit deeper and add examples on how to use the Dispatcher trait and the new ChildRef is_public() method that allows you to filter and choose a recipient for your messages.

So it's rather a workaround for now I guess, but I hope it helps!

Jeremy Lempereur · Answer 4 · Wed Apr 07 2021 16:15:59 GMT+0800 (China Standard Time)

Ok I think the Distributor api now allows for all of the requirements here!

As shown here, we can create a distributor with let workers = Distributor::named("workers") and send messages / questions to one or all of its members:

// in the setup side
let children = Bastion::children(|children: Children| {
    children
        .with_distributor(Distributor::named("workers"))
        .with_exec(/* ... */)
});

// and on the caller side
let workers = Distributor::named("workers");
workers.tell_one("hello you!"); // Result<(), SendError>
workers.tell_everyone("hello you!".to_string()); // Result<Vec<()>, SendError>
let answer = workers.ask_one("marco?"); // Result<Answer, SendError>
let answers = workers.ask_everyone("hows it going everyone?".to_string()); // Result<Vec<Answer>, SendError>

going to close this and hope it fits the use case! :) if not please let me know, I'll gladly iterate on that!