aleph-im / aleph-vm

Aleph.im VM execution engine

Running instances using pay-as-you-go

hoh opened this issue · comments

Compute Resource Nodes must be able to run instances using the pay-as-you-go (PAYG) model.

Requirements

  1. An instance, whether deployed via the scheduler or via the PAYG approach, must only be scheduled where enough resources are available.
  2. A user can create the SuperFluid streams to the node operator and the network.
  3. A CRN can do all required validation to ensure that scheduled instances match all requirements.

1. Resources available

A user must be able to select a CRN with enough resources available (compute units and storage) to run their PAYG VM.

The current scheduler plans the deployment of persistent VMs based on the resources available on a node and based on the resources reserved by other persistent VMs.

A CRN should therefore expose the resources that are reserved by VMs, both scheduled and PAYG, and the resources that are available.

CRNs currently expose system information on /about/usage/system, see an example here.

{"cpu": {"count": 12, "load_average": {"load1": 0.7177734375, "load5": 0.82958984375, "load15": 0.7412109375}, "core_frequencies": {"min": 800.0, "max": 4800.0}}, "mem": {"total_kB": 67337909, "available_kB": 65178669}, "disk": {"total_kB": 500673052, "available_kB": 470017478}, "period": {"start_timestamp": "2023-12-13T09:31:00+00:00", "duration_seconds": 60.0}, "properties": {"cpu": {"architecture": "x86_64", "vendor": "GenuineIntel"}}, "active": true}

CRNs also expose:

  1. a private API (restricted by a token available only to the node operator) with information about the currently running VMs, on /about/executions.
  2. a public API with information about VMs that have stopped running, on /about/executions/records, see example here.

The idea is that a malicious actor would not have live resource data about a VM they want to attack.

💡 Suggestion: I recommend adding an API that exposes the number of compute units available on the system and the extra storage available (available storage minus the storage needed by the maximum number of compute units the system can host). This accounting must include both scheduled persistent VMs and PAYG instances.
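A minimal sketch of what such an endpoint could look like in the aiohttp-based orchestrator. The endpoint path, the `get_reserved_resources()` helper and the compute-unit ratios are placeholders; the real accounting would have to come from the pool of scheduled and PAYG executions:

```python
# Sketch of a possible /about/usage/capacity endpoint (name is illustrative).
import psutil
from aiohttp import web

MEM_KB_PER_CU = 2 * 1024 * 1024    # placeholder compute-unit ratio
DISK_KB_PER_CU = 20 * 1024 * 1024  # placeholder compute-unit ratio

def get_reserved_resources() -> dict:
    # Hypothetical stub: in aleph-vm this would sum the resources of both
    # scheduled persistent VMs and PAYG instances from the execution pool.
    return {"vcpus": 0, "mem_kB": 0, "disk_kB": 0}

async def about_capacity(request: web.Request) -> web.Response:
    reserved = get_reserved_resources()

    total_vcpus = psutil.cpu_count()
    total_mem_kb = psutil.virtual_memory().total // 1024
    total_disk_kb = psutil.disk_usage("/").total // 1024

    free_vcpus = total_vcpus - reserved["vcpus"]
    free_mem_kb = total_mem_kb - reserved["mem_kB"]
    free_disk_kb = total_disk_kb - reserved["disk_kB"]

    # Available compute units = the bottleneck across CPU, RAM and disk.
    available_cu = min(free_vcpus, free_mem_kb // MEM_KB_PER_CU, free_disk_kb // DISK_KB_PER_CU)
    # Extra storage = what remains after provisioning that many compute units.
    extra_storage_kb = max(0, free_disk_kb - available_cu * DISK_KB_PER_CU)

    return web.json_response({
        "available_compute_units": available_cu,
        "extra_storage_kB": extra_storage_kb,
    })

app = web.Application()
app.add_routes([web.get("/about/usage/capacity", about_capacity)])
```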

2. Creating SuperFluid streams

Each CRN must be provided with a unique Avalanche wallet address, and must be aware of that address.

The best place to store this would be the aggregate that contains all node information (used by https://account.aleph.im/).

The CRN would then be able to fetch the address from there, or double-check it against its own configuration for added security. This address is not supposed to change frequently, and a restart of aleph-vm is acceptable if it does change, since a change would be a mess for currently scheduled instances anyway.
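A sketch of how the CRN could fetch its registered stream address from that aggregate and cross-check it against its local configuration. The aggregate owner, key and field names below are placeholders for whatever https://account.aleph.im/ actually uses:

```python
# Sketch: fetch the stream reward address of this CRN from the node-info
# aggregate and compare it with the locally configured address.
import aiohttp

API_SERVER = "https://api2.aleph.im"
AGGREGATE_OWNER = "0x..."      # placeholder: owner of the node-info aggregate
AGGREGATE_KEY = "corechannel"  # placeholder key

async def fetch_registered_address(crn_hash: str) -> str | None:
    url = f"{API_SERVER}/api/v0/aggregates/{AGGREGATE_OWNER}.json"
    async with aiohttp.ClientSession() as session:
        async with session.get(url, params={"keys": AGGREGATE_KEY}) as resp:
            resp.raise_for_status()
            data = await resp.json()
    nodes = data["data"][AGGREGATE_KEY].get("resource_nodes", [])  # field name assumed
    for node in nodes:
        if node.get("hash") == crn_hash:
            return node.get("stream_reward")  # field name assumed
    return None

async def check_address(crn_hash: str, configured_address: str) -> bool:
    registered = await fetch_registered_address(crn_hash)
    # Double security: both sources must agree before accepting streams.
    return registered is not None and registered.lower() == configured_address.lower()
```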

3. Stream validation

Extend the aleph.im message specification to include payment information, and update PyAleph to accept instance messages with no tokens held when the stream payment approach is used.

Question: Should PyAleph check that the streams are present to accept the instance message?
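As a starting point for the specification change, here is an illustrative sketch of the payment information an instance message could carry; the field names and values are a proposal, not the final spec:

```python
# Illustrative sketch of payment information in an instance message.
from dataclasses import dataclass
from enum import Enum

class PaymentType(str, Enum):
    hold = "hold"              # current model: tokens held by the sender
    superfluid = "superfluid"  # PAYG model: SuperFluid stream

@dataclass
class Payment:
    chain: str         # e.g. "AVAX" for Avalanche
    type: PaymentType
    receiver: str      # wallet address of the CRN operator

# Example payload fragment inside an INSTANCE message:
# "payment": {"chain": "AVAX", "type": "superfluid", "receiver": "0x..."}
```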

A CRN can be notified of a new instance scheduled on it either:

  • By watching new instance messages via websocket. Such a connection is already in place to update programs (see the sketch after this list).
  • By receiving a specific request on a new HTTP endpoint with the item_hash of the instance.
  • By the VM scheduler.
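For the first option, a minimal sketch of watching instance messages over the CCN websocket. The endpoint path (/api/ws0/messages) and the client-side filtering are assumptions; a server-side filter parameter may also be available:

```python
# Sketch: watch the CCN websocket for new INSTANCE messages.
# The endpoint path and message fields are assumptions and may differ.
import asyncio
import aiohttp

CCN_WS_URL = "wss://api2.aleph.im/api/ws0/messages"  # assumed endpoint

async def watch_instance_messages(handle_instance) -> None:
    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(CCN_WS_URL) as ws:
            async for msg in ws:
                if msg.type != aiohttp.WSMsgType.TEXT:
                    continue
                message = msg.json()
                # Filter client-side for instance messages.
                if message.get("type") == "INSTANCE":
                    await handle_instance(message)

# Example: asyncio.run(watch_instance_messages(my_handler))
```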

Once notified, a CRN will:

  1. Reserve the required resources
  2. Check for the presence of the PAYG streams.
  3. Check the volume of the streams based on existing PAYG instances on the same node. If the volume is invalid or the stream is missing, the resources will be freed.
    Our discussion on SuperFluid documents how to check the presence of the flows.
  4. Start the instance
  5. Start monitoring the volume of the streams. If invalid, the node will stop the PAYG resources, starting with the most recent ones, and publish a message on the network to notify that the resource has been de-allocated.
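A sketch of steps 1 to 4, using an asyncio lock to keep the reserve/check/free accounting consistent (see the asyncio synchronisation primitives linked in the action items below). Every helper here is a hypothetical stub standing in for aleph-vm internals and the SuperFluid checks:

```python
# Sketch of steps 1-4: reserve the resources, validate the PAYG stream,
# start the instance, and free the reservation if validation fails.
# All helpers are hypothetical stubs standing in for aleph-vm internals.
import asyncio

resources_lock = asyncio.Lock()  # protects the reserve/free accounting
reserved: dict[str, dict] = {}   # item_hash -> reserved resources

def compute_requirements(instance: dict) -> dict:
    return instance.get("content", {}).get("resources", {})  # field names assumed

async def get_flow_rate(instance: dict) -> int:
    return 500  # stub: would query the SuperFluid flow (see the last sketch below)

async def required_flow_rate(instance: dict) -> int:
    return 500  # stub: cost of this instance plus the existing PAYG instances

async def start_instance(instance: dict) -> None:
    print("starting", instance["item_hash"])  # stub

async def handle_payg_instance(instance: dict) -> None:
    item_hash = instance["item_hash"]
    async with resources_lock:
        # 1. Reserve the required resources.
        reserved[item_hash] = compute_requirements(instance)
    try:
        # 2-3. Check the presence and the volume of the streams, taking the
        # other PAYG instances already running on this node into account.
        if await get_flow_rate(instance) < await required_flow_rate(instance):
            raise ValueError("Missing or insufficient PAYG stream")
        # 4. Start the instance. Step 5 (monitoring) would reuse the same
        # checks in a recurrent task.
        await start_instance(instance)
    except Exception:
        async with resources_lock:
            reserved.pop(item_hash, None)  # free the reservation
        raise
```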

Once notified, the VM scheduler will take into account the lower amount of resources available on the CRN for the scheduling of future holder tier persistent VMs.

  1. @hoh will create the API that exposes the available resources - peer programming tomorrow?
  2. @hoh will start a skeleton for the recurrent check of SuperFluid streams and held balance from the CCN; @MHHukiewitz and @1yam will complete these and provide the functions for fetching fluid information (see the sketch after this list).
  3. @nesitor will work on section 3, Stream validation (reserving the required resources, ...). Have a look at https://docs.python.org/3/library/asyncio-sync.html.
  4. @nesitor will modify the CCN to accept instance messages for PAYG - no tokens held.
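For item 2, a sketch of fetching a flow from the SuperFluid CFAv1 (Constant Flow Agreement) contract with web3.py, as one possible "fetching fluid information" function. The contract and super token addresses are placeholders, the ABI fragment only covers getFlow, and depending on the contract version getFlow may revert when no flow exists, hence the try/except:

```python
# Sketch: read the flow rate of a SuperFluid stream from sender to receiver
# on the Avalanche C-Chain. Addresses are placeholders to fill in.
from web3 import Web3

AVAX_RPC = "https://api.avax.network/ext/bc/C/rpc"  # public Avalanche C-Chain RPC
CFA_V1_ADDRESS = "0x..."     # placeholder: SuperFluid ConstantFlowAgreementV1 on Avalanche
ALEPH_SUPER_TOKEN = "0x..."  # placeholder: ALEPH super token address

# Minimal ABI fragment for IConstantFlowAgreementV1.getFlow
CFA_V1_ABI = [{
    "name": "getFlow",
    "type": "function",
    "stateMutability": "view",
    "inputs": [
        {"name": "token", "type": "address"},
        {"name": "sender", "type": "address"},
        {"name": "receiver", "type": "address"},
    ],
    "outputs": [
        {"name": "timestamp", "type": "uint256"},
        {"name": "flowRate", "type": "int96"},
        {"name": "deposit", "type": "uint256"},
        {"name": "owedDeposit", "type": "uint256"},
    ],
}]

def get_flow_rate(sender: str, receiver: str) -> int:
    """Return the flow rate in super token wei per second, or 0 if no flow."""
    w3 = Web3(Web3.HTTPProvider(AVAX_RPC))
    cfa = w3.eth.contract(address=Web3.to_checksum_address(CFA_V1_ADDRESS), abi=CFA_V1_ABI)
    try:
        _timestamp, flow_rate, _deposit, _owed = cfa.functions.getFlow(
            Web3.to_checksum_address(ALEPH_SUPER_TOKEN),
            Web3.to_checksum_address(sender),
            Web3.to_checksum_address(receiver),
        ).call()
    except Exception:
        return 0  # flow absent (or the call reverted)
    return flow_rate
```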