Fault Injection for Hosts' CPUs and Recovery mechanism which dynamically removes failed PEs from VMs and starts VM's snapshots when a Host or VM completely fails

Question

RaysaOliveira opened this issue 7 years ago

Feature

Implements a mechanism to inject random failures into Hosts' PEs (CPUs).

The HostFaultInjection class enables injecting random failures into Hosts' PEs. It uses a given Pseudo Random Number Generator (PRNG), following some statistical distribution, to generate the times of failure. PRNGs such as the new PoissonDistr can be used for this purpose. Internally, another PRNG is created to define how many Host PEs will fail when a fault is generated.
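The idea of generating failure times from a PRNG can be sketched in plain Java. This is only an illustration, assuming faults arrive as a Poisson process (so the times between faults are exponentially distributed) and assuming the number of failed PEs is drawn uniformly; it uses java.util.Random instead of the framework's PRNG classes, and the class FaultTimeSketch, its methods and its parameters are hypothetical names, not part of the CloudSim Plus API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch, not CloudSim Plus code: generates fault times for a Host. */
final class FaultTimeSketch {
    /**
     * Returns the times (in hours) of the first {@code count} faults,
     * assuming faults arrive as a Poisson process with the given mean rate.
     */
    static List<Double> faultTimes(double faultsPerHour, int count, long seed) {
        Random prng = new Random(seed);
        List<Double> times = new ArrayList<>();
        double clock = 0;
        for (int i = 0; i < count; i++) {
            // Inverse-transform sampling of an exponential inter-arrival time.
            // 1 - nextDouble() lies in (0, 1], so the logarithm is always finite.
            clock += -Math.log(1.0 - prng.nextDouble()) / faultsPerHour;
            times.add(clock);
        }
        return times;
    }

    /** Assumption: the number of PEs failing in one fault is uniform in [1, hostPes]. */
    static int failedPes(Random prng, int hostPes) {
        return 1 + prng.nextInt(hostPes);
    }

    public static void main(String[] args) {
        Random pesPrng = new Random(42); // a second PRNG, used only for the PE count
        for (double time : faultTimes(0.5, 3, 42)) {
            System.out.printf("fault at %.2f h, %d PE(s) fail%n", time, failedPes(pesPrng, 8));
        }
    }
}
```

Using a fixed seed makes the sequence of faults reproducible between simulation runs, which helps when comparing recovery strategies.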

The HostFaultInjection class works as a fault injector for the Hosts of a given Datacenter. The mechanism considers the following situations.

Removal of failed PEs from VMs

If the number of working PEs is lower than the total required by all VMs, then failed PEs will be removed from running VMs using a round-robin algorithm, one PE per VM at a time. If all PEs are removed from a VM, that VM is destroyed.
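The round-robin removal described above could look roughly like the following plain-Java sketch. It assumes that exactly as many PEs are removed as the Host can no longer provide (required PEs minus working PEs); the class RoundRobinPeRemovalSketch and its method are hypothetical and do not reproduce the actual HostFaultInjection code.

```java
import java.util.Arrays;

/** Hypothetical sketch, not CloudSim Plus code: round-robin removal of failed PEs. */
final class RoundRobinPeRemovalSketch {
    /**
     * @param vmPes      PEs currently assigned to each running VM
     * @param workingPes PEs still working in the Host after the fault
     * @return PEs left for each VM; 0 means the VM is destroyed
     */
    static int[] removeFailedPes(int[] vmPes, int workingPes) {
        int[] remaining = vmPes.clone();
        int required = Arrays.stream(remaining).sum();
        // Assumption: only the PEs the Host can no longer provide are removed.
        int toRemove = Math.max(0, required - Math.max(0, workingPes));
        int i = 0;
        while (toRemove > 0) {
            if (remaining[i] > 0) { // skip VMs that have already lost all PEs
                remaining[i]--;
                toRemove--;
            }
            i = (i + 1) % remaining.length; // one PE per VM, then move to the next VM
        }
        return remaining;
    }

    public static void main(String[] args) {
        // VMs require 4 + 2 + 1 = 7 PEs, but only 3 PEs are still working.
        int[] left = removeFailedPes(new int[]{4, 2, 1}, 3);
        System.out.println(Arrays.toString(left)); // prints [2, 1, 0]: the third VM is destroyed
    }
}
```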

Management of affected VMs

Generated failures may or may not affect running VMs. If the number of working PEs remaining in a Host is still enough for the total PEs required by all VMs, the failure will not cause any side effect.
In other words, if there are N free PEs in the Host and the number of failed PEs is less than or equal to N, no VM will be affected.

If no VM is affected by the failure, the failed Host PEs are just set to Pe.Status.FAILED and become unavailable. If an attempt is later made to place new VMs into that Host, such PEs will not be available for them.
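The rule above amounts to comparing the number of failed PEs with the number of free PEs. A minimal sketch, assuming the failed PEs were all working before the fault; FaultImpactSketch and its parameter names are hypothetical and not part of the framework:

```java
/** Hypothetical sketch, not CloudSim Plus code: checks whether a fault affects running VMs. */
final class FaultImpactSketch {
    /**
     * @param workingPesBeforeFault PEs working in the Host before the fault
     * @param pesRequiredByVms      total PEs required by all VMs on the Host
     * @param failedPes             PEs that fail in this fault
     * @return true if at least one PE has to be taken away from a VM
     */
    static boolean affectsVms(int workingPesBeforeFault, int pesRequiredByVms, int failedPes) {
        int freePes = workingPesBeforeFault - pesRequiredByVms; // the N free PEs
        return failedPes > freePes;
    }

    public static void main(String[] args) {
        System.out.println(affectsVms(8, 6, 2)); // prints false: the 2 free PEs absorb the fault
        System.out.println(affectsVms(8, 6, 3)); // prints true: one PE must be removed from a VM
    }
}
```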

Start a VM snapshot (clone) when all VMs from the same broker fail

If all PEs of a Host fail, all its VMs are immediately destroyed. When all VMs from a given broker fail (regardless of which Host they were on), a clone of the last failed VM is created. This cloning process copies the Cloudlets that were executing or waiting in the failed VM to the cloned VM. Cloning a VM simulates starting a snapshot of that VM, as in a real cloud infrastructure.
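The trigger condition for cloning can be illustrated with a small per-broker counter. This sketch only shows the "last VM of a broker failed" check; the class BrokerRecoverySketch, its methods and the use of a String to identify a broker are assumptions made for illustration and do not mirror how the framework tracks VMs.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not CloudSim Plus code: decides when a VM clone should be started. */
final class BrokerRecoverySketch {
    /** Number of VMs still running for each broker. */
    private final Map<String, Integer> runningVms = new HashMap<>();

    void vmCreated(String broker) {
        runningVms.merge(broker, 1, Integer::sum);
    }

    /**
     * Registers the failure of one VM of the given broker.
     * @return true if it was the broker's last running VM, so a clone should be started
     */
    boolean vmFailed(String broker) {
        int running = runningVms.getOrDefault(broker, 0);
        if (running == 0) {
            return false; // unknown broker or no VM left to fail
        }
        runningVms.put(broker, running - 1);
        return running == 1;
    }

    public static void main(String[] args) {
        BrokerRecoverySketch recovery = new BrokerRecoverySketch();
        for (int i = 0; i < 3; i++) {
            recovery.vmCreated("broker0");
        }
        System.out.println(recovery.vmFailed("broker0")); // prints false
        System.out.println(recovery.vmFailed("broker0")); // prints false
        System.out.println(recovery.vmFailed("broker0")); // prints true: clone the last failed VM
    }
}
```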

Increase completion time for cloudlets affected by removed VM PEs

Consider a VM that has N PEs. If some of its PEs fail while Cloudlets are using them, those Cloudlets will continue to execute but should take longer to finish.
- Example: the VM has 2 PEs and a Cloudlet is using both of them. If one PE fails, the Cloudlet will take twice as long to finish using just the remaining PE.
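The arithmetic behind this example can be written down as a short sketch. It assumes the Cloudlet's remaining work (in million instructions, MI) is split evenly across the PEs it uses and that every PE has the same MIPS capacity; CloudletSlowdownSketch and the numbers used are illustrative only.

```java
/** Hypothetical sketch, not CloudSim Plus code: finish time of a Cloudlet after PE failures. */
final class CloudletSlowdownSketch {
    /**
     * @param remainingMi remaining Cloudlet length in million instructions
     * @param mipsPerPe   capacity of each PE in MIPS
     * @param pes         number of PEs the Cloudlet can still use
     * @return remaining execution time in seconds
     */
    static double remainingTime(double remainingMi, double mipsPerPe, int pes) {
        if (pes <= 0) {
            throw new IllegalArgumentException("the Cloudlet needs at least one working PE");
        }
        return remainingMi / (mipsPerPe * pes);
    }

    public static void main(String[] args) {
        System.out.println(remainingTime(20_000, 1_000, 2)); // prints 10.0
        System.out.println(remainingTime(20_000, 1_000, 1)); // prints 20.0: twice as long with one PE
    }
}
```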

VM Migration when Host is overloaded because of failures

If the failure of PEs in a Host increases its percentage of CPU usage to the point of overloading the Host, using a VmAllocationPolicyMigration should cause VMs to be migrated to another Host.

Implementation Details

Using the HostFaultInjection.addVmCloner() method, a VmCloner object may be set to define how to clone a given VM when all PEs it was using fail. Setting a VmCloner enables simulating the creation of a snapshot for that VM. This way, HostFaultInjection will use this object to create a new VM when all VMs from a specific broker fail, recovering from the failure.

Since each broker represents a customer, you can simulate the execution of multiple VMs representing the same service, such as a Web Server. These multiple VMs may be used to simulate load balancing and fault tolerance of a hosted service. If you have, for instance, 3 VMs simulating the replication of the same service, this scenario has a 2-fault-tolerance level. That means your service will keep running as long as at most 2 of the VMs fail.

In this scenario, using the VmCloner you get a 3-fault-tolerance level. That is, if all these 3 VMs are destroyed, then a snapshot of the last destroyed VM will be created. The snapshot will take some time to be started, which is randomly chosen internally, simulating the time to get the new VM up and running. Meanwhile, the service will experience some downtime.

See #105 for more details.

Available Examples

HostFaultInjectionExample1.java