tinkerbell / pbnj

Service for interacting with BMCs

Integrate PBNJ into Tinkerbell k8s model

pokearu opened this issue · comments

Currently PBNJ is a standalone service that performs power management operations. It would benefit from a formal integration with the Tinkerbell stack, in line with the changes for the k8s resource model.

Expected Behaviour

When provisioning baremetal nodes using Tinkerbell, the pbnj component would be responsible for the power/boot management of the nodes. The hardware CRD can be extended to contain the necessary BMC information that pbnj may leverage to perform actions. This would help power on nodes, create BMC users and set boot options. It also opens up scope to deprovision nodes, perform reboots/resets etc.

Current Behaviour

Manual intervention is required for powering up baremetal nodes and setting the boot order to net boot for Tinkerbell provisioning.

Initial Ideas

These are some rough ideas that can be discussed and expanded to a more formal proposal.

PBNJ as k8s Service

Currently PBNJ is a gRPC service. It can be run on the k8s cluster along with all the other Tinkerbell components (Boots, Hegel). The PBNJ service would have read access to the Hardware CRDs to fetch the BMC information and perform actions.

PBNJ as a k8s Controller

PBNJ can be redesigned to be a k8s controller. The controller could watch Workflow CRDs, pick up tasks tagged to it and perform power management actions.

PBNJ as a Hub action

This idea is based on tink-worker. We could possibly have a long-running pbnj-worker on the same cluster as the Tinkerbell stack. The pbnj-worker could run hub actions, which use the PBNJ binary to perform power management tasks.

Upon consideration of the initial ideas, PBNJ as a k8s controller is the approach I wish to elaborate and push forward.

PBNJ as a k8s controller

In this approach we convert PBNJ into a k8s controller that reconciles to perform the desired PBNJ power/boot management actions.

Hardware CR changes

We require an initial update to the tinkerbell hardware CRD. The idea here is that the Hardware CR would have a reference to its corresponding BMC object.

type HardwareSpec struct {
    ...

    // BmcRef references the BMC object that corresponds to this hardware.
    BmcRef BmcReference `json:"bmcRef,omitempty"`
}
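As a rough illustration of how such a reference could serialize, here is a minimal sketch. The fields of BmcReference (Name, Namespace) and the example values are assumptions, since the proposal does not define them, and the existing HardwareSpec fields are left out. One detail that is certain: Go's encoding/json never omits a non-pointer struct field under omitempty, so this sketch uses a pointer to let hardware without a BMC drop the field entirely.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BmcReference is a hypothetical shape for the reference type; the
// proposal does not define its fields.
type BmcReference struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
}

// HardwareSpec shows only the new field; existing fields are left out.
type HardwareSpec struct {
	// A pointer is used (an assumption of this sketch) so that omitempty
	// can drop the field; encoding/json never omits a struct value.
	BmcRef *BmcReference `json:"bmcRef,omitempty"`
}

// encode marshals a spec to its JSON form.
func encode(spec HardwareSpec) string {
	b, err := json.Marshal(spec)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Hypothetical object name and namespace.
	withRef := HardwareSpec{BmcRef: &BmcReference{Name: "bmc-node-1", Namespace: "tink-system"}}
	fmt.Println(encode(withRef))        // spec that points at a BMC object
	fmt.Println(encode(HardwareSpec{})) // spec with no BMC: field is omitted
}
```

Whether the reference should be optional at all is a design question for the CRD; the pointer form simply keeps that option open.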

BMC CRD

The PBNJ controller would be responsible for reconciling and maintaining the desired state of the BMC object on the cluster. The BMC object contains the required BMC information like host IP, vendor, etc., along with the desired state of the BMC like power, boot preference, NTP etc.
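To make the desired-state idea concrete, the following is a minimal sketch of how a controller might compare a BMC object's desired state with what it observes. All type names, field names and action strings here are hypothetical, and NTP and the actual BMC calls are left out; it only shows the comparison step of a reconcile.

```go
package main

import "fmt"

// PowerState and BootDevice are hypothetical enumerations.
type PowerState string
type BootDevice string

const (
	PowerOn  PowerState = "on"
	PowerOff PowerState = "off"
	BootPXE  BootDevice = "pxe"
	BootDisk BootDevice = "disk"
)

// BMCSpec holds connection details plus the desired state (assumed fields).
type BMCSpec struct {
	Host       string
	Vendor     string
	Power      PowerState
	BootDevice BootDevice
}

// BMCStatus holds the state last observed on the BMC (assumed fields).
type BMCStatus struct {
	Power      PowerState
	BootDevice BootDevice
}

// pendingActions lists what a reconcile would need to change. An empty
// desired value is treated as "no preference" and is skipped.
func pendingActions(spec BMCSpec, status BMCStatus) []string {
	var actions []string
	if spec.BootDevice != "" && spec.BootDevice != status.BootDevice {
		actions = append(actions, "set-boot:"+string(spec.BootDevice))
	}
	if spec.Power != "" && spec.Power != status.Power {
		actions = append(actions, "set-power:"+string(spec.Power))
	}
	return actions
}

func main() {
	spec := BMCSpec{Host: "10.0.0.5", Vendor: "example", Power: PowerOn, BootDevice: BootPXE}
	status := BMCStatus{Power: PowerOff, BootDevice: BootDisk}
	fmt.Println(pendingActions(spec, status))
}
```

Setting the boot device before changing power is a choice made in this sketch, not something the proposal specifies.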

PowerJob CRD

In addition to maintaining the desired state of the BMC, the PBNJ controller can perform a desired set of actions as a one-off job. The job may include tasks like Power Off -> Set one-time Net boot -> Power On -> Set persistent Disk boot. Once the job is complete, the controller brings the machines back to their desired state. This gives the clients the flexibility to power cycle or reset nodes for updates/maintenance.
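The task sequence described above could be modeled as an ordered list that is run one step at a time and stops at the first failure. This is only a sketch: the PowerJob fields, the task names, the Executor type and the failOn helper are all hypothetical, and the executor stands in for real BMC calls.

```go
package main

import (
	"errors"
	"fmt"
)

// Task names a single BMC operation (hypothetical values).
type Task string

const (
	PowerOff           Task = "PowerOff"
	OneTimeNetBoot     Task = "OneTimeNetBoot"
	PowerOn            Task = "PowerOn"
	PersistentDiskBoot Task = "PersistentDiskBoot"
)

// PowerJob is an assumed shape: a target BMC and an ordered task list.
type PowerJob struct {
	BMCName string
	Tasks   []Task
}

// Executor performs one task against the BMC; a stand-in for real calls.
type Executor func(Task) error

// runJob runs tasks in order and stops at the first error. It returns the
// number of tasks that completed successfully.
func runJob(job PowerJob, exec Executor) (int, error) {
	for i, t := range job.Tasks {
		if err := exec(t); err != nil {
			return i, fmt.Errorf("task %d (%s) failed: %w", i, t, err)
		}
	}
	return len(job.Tasks), nil
}

// failOn returns an Executor that fails only for the given task.
func failOn(bad Task) Executor {
	return func(t Task) error {
		if t == bad {
			return errors.New("simulated BMC error")
		}
		return nil
	}
}

func main() {
	job := PowerJob{
		BMCName: "bmc-node-1", // hypothetical object name
		Tasks:   []Task{PowerOff, OneTimeNetBoot, PowerOn, PersistentDiskBoot},
	}
	done, err := runJob(job, failOn(PowerOn))
	fmt.Println(done, err)
}
```

Returning the count of completed tasks lets a controller record progress in the job's status; whether a failed job should be retried or resumed is left open here.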

The Client

In this approach, the client of the pbnj controller can either be an end user, who does a kubectl apply of the BMC object to set the desired state for all BMCs in a data center, or automation like CAPT, which can create the necessary objects to get nodes to the desired power state for provisioning.

@jacobweinstock This probably deserves some labeling given we're pushing ahead.

Note the implementation isn't landing in pbnj, it'll be in its own repository. Currently that's the rufio repository but it may get renamed. This issue is probably worth leaving open until that work is complete just for tracking and linking purposes.

@chrisdoherty4 would you consider closing this now that https://github.com/tinkerbell/rufio has come along a bit further? Either way, perhaps offer a diff of the points raised in @pokearu's two comments that define the goals.

When running with a Kube back-end I'm not sure PBnJ makes sense because all the interactions it's required for are handled by Rufio.

If we want to talk about changing to use the Kubernetes back-end as the primary/only back-end then I suspect Rufio would only need integrating if users want to talk BMC with a request-response type API. This feels like a bigger discussion than this ticket and other issues in the Tinkerbell space have a similar commentary - let's chat at a community meeting.

Closing this as github.com/tinkerbell/rufio provides a Kubernetes based BMC service.