gardener / gardener-extension-provider-azure

Gardener extension controller for the Azure cloud provider (https://azure.microsoft.com).

Home Page: https://gardener.cloud

Bastion controller fails to create Bastion because wrong resource group name is used to get the subnet for shoots with own vNet

vpnachev opened this issue · comments

How to categorize this issue?

/area ops-productivity
/kind bug
/platform azure

What happened:
For shoot clusters that bring their own vNet, the bastion controller fails to get the subnet because it uses the shoot's resource group instead of the vNet's resource group.

In detail, the subnet retrieval here

subnet, err := subnetClient.Get(ctx, opt.ResourceGroupName, vNet, subnetWork, "")
uses the resource group name from opt, which is set here
opt, err := DetermineOptions(bastion, cluster, infrastructureStatus.ResourceGroup.Name)
with the resource group where the shoot infrastructure resources are deployed.

However, when the shoot brings its own vNet, the vNet lives in its own resource group, which should be used in this case, i.e. infrastructureStatus.Networks.VNet.ResourceGroup.
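The selection logic described above can be sketched as a small helper. Note that pickSubnetResourceGroup is a hypothetical name for illustration, not a function in the codebase; the actual patch threads the value through getSubnet instead.

```go
package main

import "fmt"

// pickSubnetResourceGroup illustrates the intended fix: when the
// infrastructure status carries a dedicated vNet resource group (the shoot
// brings its own vNet), that group must be used for the subnet lookup;
// otherwise the shoot's own resource group applies.
func pickSubnetResourceGroup(shootResourceGroup string, vNetResourceGroup *string) string {
	if vNetResourceGroup != nil && *vNetResourceGroup != "" {
		return *vNetResourceGroup
	}
	return shootResourceGroup
}

func main() {
	ownVNetRG := "my-vnet-rg"
	// Shoot with its own vNet: the dedicated vNet resource group wins.
	fmt.Println(pickSubnetResourceGroup("shoot--foo--bar", &ownVNetRG))
	// Shoot without its own vNet: fall back to the shoot's resource group.
	fmt.Println(pickSubnetResourceGroup("shoot--foo--bar", nil))
}
```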

I have prepared a quick patch that I think resolves this issue, but I still need to test it.

patch:
diff --git a/pkg/controller/bastion/actuator.go b/pkg/controller/bastion/actuator.go
index e7853410..39606d53 100644
--- a/pkg/controller/bastion/actuator.go
+++ b/pkg/controller/bastion/actuator.go
@@ -187,13 +187,18 @@ func getPublicIP(ctx context.Context, factory azureclient.Factory, opt *Options)
 	return ip, nil
 }
 
-func getSubnet(ctx context.Context, factory azureclient.Factory, vNet, subnetWork string, opt *Options) (*network.Subnet, error) {
+func getSubnet(ctx context.Context, factory azureclient.Factory, vNet, vNetResourceGroup, subnetWork string, opt *Options) (*network.Subnet, error) {
 	subnetClient, err := factory.Subnet(ctx, opt.SecretReference)
 	if err != nil {
 		return nil, err
 	}
 
-	subnet, err := subnetClient.Get(ctx, opt.ResourceGroupName, vNet, subnetWork, "")
+	rg := opt.ResourceGroupName
+	if vNetResourceGroup != "" {
+		rg = vNetResourceGroup
+	}
+
+	subnet, err := subnetClient.Get(ctx, rg, vNet, subnetWork, "")
 	if err != nil {
 		return nil, err
 	}
diff --git a/pkg/controller/bastion/actuator_reconcile.go b/pkg/controller/bastion/actuator_reconcile.go
index a5813505..eaff2182 100644
--- a/pkg/controller/bastion/actuator_reconcile.go
+++ b/pkg/controller/bastion/actuator_reconcile.go
@@ -74,7 +74,12 @@ func (a *actuator) Reconcile(ctx context.Context, bastion *extensionsv1alpha1.Ba
 		return errors.New("virtual network name and subnet must be set")
 	}
 
-	nic, err := ensureNic(ctx, factory, opt, infrastructureStatus.Networks.VNet.Name, infrastructureStatus.Networks.Subnets[0].Name, publicIP)
+	vNetResourceGroup := ""
+	if infrastructureStatus.Networks.VNet.ResourceGroup != nil {
+		vNetResourceGroup = *infrastructureStatus.Networks.VNet.ResourceGroup
+	}
+
+	nic, err := ensureNic(ctx, factory, opt, infrastructureStatus.Networks.VNet.Name, vNetResourceGroup, infrastructureStatus.Networks.Subnets[0].Name, publicIP)
 	if err != nil {
 		return err
 	}
@@ -293,7 +298,7 @@ func ensureComputeInstance(ctx context.Context, logger logr.Logger, bastion *ext
 	return nil
 }
 
-func ensureNic(ctx context.Context, factory azureclient.Factory, opt *Options, vNet, subnetWork string, publicIP *network.PublicIPAddress) (*network.Interface, error) {
+func ensureNic(ctx context.Context, factory azureclient.Factory, opt *Options, vNet, vNetResourceGroup, subnetWork string, publicIP *network.PublicIPAddress) (*network.Interface, error) {
 	nic, err := getNic(ctx, factory, opt)
 	if err != nil {
 		return nil, err
@@ -307,7 +312,7 @@ func ensureNic(ctx context.Context, factory azureclient.Factory, opt *Options, v
 
 	logger.Info("create new bastion compute instance nic")
 
-	subnet, err := getSubnet(ctx, factory, vNet, subnetWork, opt)
+	subnet, err := getSubnet(ctx, factory, vNet, vNetResourceGroup, subnetWork, opt)
 	if err != nil {
 		return nil, err
 	}

What you expected to happen:
Bastions are created successfully when shoots use their own vNet.

How to reproduce it (as minimally and precisely as possible):

  1. Create a shoot with its own vNet, but let the extension create a dedicated resource group for the other infrastructure resources
  2. Try to ssh to one of the nodes
  3. Observe that the bastion resource fails to be created with the error Error reconciling bastion: virtual network subnet must be not empty
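For step 1, a shoot that brings its own vNet is declared via the provider's InfrastructureConfig. The sketch below shows the relevant fields; names and CIDRs are placeholders, and the exact schema should be checked against the provider documentation:

```yaml
apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
kind: InfrastructureConfig
networks:
  vnet:
    # Pre-existing vNet owned by the user, living in its own resource group.
    name: my-existing-vnet
    resourceGroup: my-vnet-resource-group
  workers: 10.250.0.0/16
```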

Anything else we need to know?:
During the investigation, @petersutter pointed to #397 (comment) but it is hard to follow all of the comments in this huge PR.

Environment:

  • Gardener version (if relevant):
  • Extension version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

@tedteng Could you please have a look?

I will take a look, @dkistner, and verify the patch as well. Thanks @vpnachev

I tried the patch from the issue description. The bastion VM was created successfully in the shoot's resource group, but ssh did not succeed within several minutes.

I also tried another patch, aiming to create the bastion in the vNet resource group, but the bastion was still created in the shoot's resource group (I have no idea why and have not investigated further).

patch:
diff --git a/pkg/controller/bastion/options.go b/pkg/controller/bastion/options.go
index c11ce9c7..abfb902b 100644
--- a/pkg/controller/bastion/options.go
+++ b/pkg/controller/bastion/options.go
@@ -16,10 +16,12 @@ package bastion
 
 import (
 	"crypto/sha256"
+	"encoding/json"
 	"fmt"
 	"net"
 
 	"github.com/Azure/go-autorest/autorest/to"
+	api "github.com/gardener/gardener-extension-provider-azure/pkg/apis/azure"
 	"github.com/gardener/gardener/extensions/pkg/controller"
 	v1beta1constants "github.com/gardener/gardener/pkg/apis/core/v1beta1/constants"
 	extensionsv1alpha1 "github.com/gardener/gardener/pkg/apis/extensions/v1alpha1"
@@ -80,6 +82,17 @@ func DetermineOptions(bastion *extensionsv1alpha1.Bastion, cluster *controller.C
 		"Type": to.StringPtr("gardenctl"),
 	}
 
+	infrastructureConfig := &api.InfrastructureConfig{}
+	err = json.Unmarshal(cluster.Shoot.Spec.Provider.InfrastructureConfig.Raw, infrastructureConfig)
+	if err != nil {
+		return nil, err
+	}
+
+	rg := resourceGroup
+	if infrastructureConfig.Networks.VNet.ResourceGroup != nil && *infrastructureConfig.Networks.VNet.ResourceGroup != "" {
+		rg = *infrastructureConfig.Networks.VNet.ResourceGroup
+	}
+
 	return &Options{
 		BastionInstanceName: baseResourceName,
 		BastionPublicIPName: publicIPResourceName(baseResourceName),
@@ -88,7 +101,7 @@ func DetermineOptions(bastion *extensionsv1alpha1.Bastion, cluster *controller.C
 		WorkersCIDR:         workersCidr,
 		DiskName:            DiskResourceName(baseResourceName),
 		Location:            cluster.Shoot.Spec.Region,
-		ResourceGroupName:   resourceGroup,
+		ResourceGroupName:   rg,
 		NicName:             NicResourceName(baseResourceName),
 		Tags:                tags,
 		SecurityGroupName:   NSGName(clusterName),

Anyway, the ssh for both patches times out waiting on

Still waiting: bastion does not have BastionReady=true condition

I tried to manually ssh just to the bastion nodes, but this fails with

gardener@X.Y.Z.W: Permission denied (publickey).

while the sshd server is reachable with telnet

$ telnet X.Y.Z.W 22
Trying X.Y.Z.W...
Connected to X.Y.Z.W.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.6p1 Ubuntu-4ubuntu0.7
^]
telnet> Connection closed.

So, it looks like the bastion node is not properly provisioned with the ssh keys?

interesting
As you tested, the sshd server is reachable, so the path from the local client (public IP) through the firewall (ingress rules) is set up correctly, and port 22 on the bastion is accessible. Probably the ssh keys are not properly set? Could you send me the problematic cluster information via Slack, @vpnachev? Thanks.


Could you ping me the problematic cluster information via slack @vpnachev?

Unfortunately no, the cluster was created on my local gardener landscape and I have deleted it already.

I would have hoped that this is fixed by #518, which was released last week. Did you try the bastion with the latest release, @vpnachev?

/assign