MusicDin / kubitect

Kubitect provides a simple way to set up a highly available Kubernetes cluster across multiple hosts.

Home Page: https://kubitect.io

Bridge mode fails

asimpleidea opened this issue

Hi,

thank you so much for this project, it is really a life saver.

Recently I have been trying to create a bridged network and assign static IPs to all machines, but it keeps failing with this message:

Error: couldn't retrieve IP address of domain id: 7c5e009c-443d-4ba4-a61a-0d9bbf3f61d3. Please check following:
1) is the domain running proplerly?
2) has the network interface an IP address?
3) Networking issues on your libvirt setup?
 4) is DHCP enabled on this Domain's network?
5) if you use bridge network, the domain should have the pkg qemu-agent installed
IMPORTANT: This error is not a terraform libvirt-provider error, but an error caused by your KVM/libvirt infrastructure configuration/setup
 timeout while waiting for state to become 'all-addresses-obtained' (last state: 'waiting-addresses', timeout: 5m0s)

  with module.worker_module["1"].libvirt_domain.vm_domain,
  on modules/vm/vm.tf line 58, in resource "libvirt_domain" "vm_domain":
  58: resource "libvirt_domain" "vm_domain" {

Basically, it keeps waiting for IPs even though they are assigned statically, and it times out after 5 minutes:

2022-03-01T10:34:39.842Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] waiting for network address for iface=52:54:00:6C:3C:01
2022-03-01T10:34:39.842Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] qemu-agent used to query interface info
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] Interfaces info obtained with libvirt API:
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: ([]libvirt.DomainInterface) <nil>
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14:
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] ifaces with addresses: []
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] 52:54:00:6C:3C:01 doesn't have IP address(es) yet...
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] IP address not found for iface=52:54:00:6C:3C:01: will try in a while
2022-03-01T10:34:39.843Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [TRACE] Waiting 10s before next try
2022-03-01T10:34:39.880Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] waiting for network address for iface=52:54:00:6C:3C:02
2022-03-01T10:34:39.880Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] qemu-agent used to query interface info
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] Interfaces info obtained with libvirt API:
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: ([]libvirt.DomainInterface) <nil>
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14:
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] ifaces with addresses: []
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] 52:54:00:6C:3C:02 doesn't have IP address(es) yet...
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [DEBUG] IP address not found for iface=52:54:00:6C:3C:02: will try in a while
2022-03-01T10:34:39.881Z [DEBUG] provider.terraform-provider-libvirt_v0.6.14: 2022/03/01 10:34:39 [TRACE] Waiting 10s before next try
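
As the log shows, in bridge mode the provider relies on the qemu guest agent to report interface info. For reference, the agent can also be queried manually from the host (a sketch; the domain name is hypothetical, list the real ones with virsh list --all):

# Ask the guest agent for its interface info (hypothetical domain name)
virsh qemu-agent-command kubitect-worker-1 \
  '{"execute":"guest-network-get-interfaces"}' --pretty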

Does anyone know why this is happening?

Hi,

is the bridge interface preconfigured on the host machine?

Even if that is not the case, I will add a check for the bridge device to detect the error earlier.
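
For example, it can be quickly verified on the host like this (a sketch, assuming the bridge is named br18):

# List bridge devices and check that br18 is up and has an address
ip -br link show type bridge
ip addr show br18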

Yes, I tried different approaches, both with netplan and by following the guide in the example folder.

This is networkctl status -a:

● 2: ens160
       Link File: /lib/systemd/network/99-default.link
    Network File: /run/systemd/network/10-netplan-ens160.networ
            Type: ether
           State: routable (configuring)
            Path: pci-0000:03:00.0
          Driver: vmxnet3
          Vendor: VMware
           Model: VMXNET3 Ethernet Controller
      HW Address: 00:0c:29:19:84:1c (VMware, Inc.)
         Address: 192.168.1.160
         Gateway: 192.168.1.1
             DNS: 8.8.8.8

[...]

● 4: br18
       Link File: /lib/systemd/network/99-default.link
    Network File: n/a
            Type: ether
           State: routable (unmanaged)
          Driver: bridge
      HW Address: ba:8e:02:dc:4f:b9
         Address: 192.168.1.160
                  fe80::b88e:2ff:fedc:4fb9
         Gateway: 192.168.1.1

and this is /etc/systemd/network/br0-static-ip.network:

[Match]
Name=br18

[Network]
Address=192.168.1.160/24
Gateway=192.168.1.1
# Router's DNS
DNS=192.168.1.1
# Additional DNS if required:
# DNS=8.8.8.8

Thanks for the help!

I'm not sure what the correlation is between these two interfaces (br18 and ens160), as ens160 is created by VMware and is not enslaved to the bridge interface.
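
If the bridge is managed with systemd-networkd, the uplink normally needs its own .network file that enslaves it to the bridge, roughly like this (the file name is hypothetical):

# /etc/systemd/network/ens160-slave.network (hypothetical file name)
[Match]
Name=ens160

[Network]
Bridge=br18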

First, make sure that the created bridge is active and that it has been given an IP address (from the snippet above, it seems it has).

To me it seems that the bridge device is misconfigured and, as a consequence, the libvirt provider cannot gather IP addresses for the virtual machines, but I may be wrong.

For example, I would create my bridge interface using netplan as follows:

network:
  version: 2
  bridges:
    br18:
      interfaces:
      - ens160
      dhcp4: true
      dhcp6: false
  ethernets:
    ens160: {}
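
After saving this under /etc/netplan/ (the file name, e.g. 01-br18.yaml, is arbitrary), it can be applied with:

sudo netplan try    # validates the config and reverts if connectivity is lost
sudo netplan apply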

Can you also provide the network section from terraform.tfvars?

One question: do I need to have dhcp4: true on the bridge even though I am assigning static IPs in the network section?
With netplan, I created the bridge like this:

network:
  version: 2
  renderer: networkd
  ethernets:
    ens33:
      addresses: [ 192.168.1.26/24 ]
      gateway4: 192.168.1.1
      nameservers:
          addresses:
              - "192.168.1.1"
    ens160: 
      dhcp4: false
      dhcp6: false
  bridges:
    br18:
      dhcp4: false
      dhcp6: false
      nameservers:
          addresses:
              - "192.168.1.1"
      addresses: [ 192.168.1.160/24 ]
      interfaces:
      - ens160

Some relevant parts of terraform.tfvars:

# Network mode (nat, route, bridge) #
network_mode = "bridge"

# Network CIDR (example: 192.168.113.0/24) #
network_cidr = "192.168.1.0/24"

# Network (virtual) bridge #
# Note: For network mode 'bridge', the bridge on the host needs to be preconfigured (example: br0) #
network_bridge = "br18"

# Network gateway (example: 192.168.113.1) #
# Note: If not provided, it will be calculated as first host in network CIDR. #
#       +-> first host of 192.168.113.0/24 is 192.168.113.1 #
#network_gateway = "192.168.113.1"

# Network DNS list (if empty, network gateway is set as a DNS) #
network_dns_list = [
  "192.168.1.1",
  "8.8.8.8"
]

# Other stuff...

master_nodes = [
  {
    id  = 1
    ip  = "192.168.1.150"
    mac = "52:54:00:00:00:10"
  }
]

# Other stuff...
worker_nodes = [
  {
    id  = 1
    ip  = "192.168.1.151"
    mac = "52:54:00:00:00:11"
  }
]

If dhcp4: true is needed even with static IPs, then I will give it one more try, but I am sure I am making some other mistake somewhere.

Thank you so much for your help @MusicDin.

You don't need to enable dhcp4 if you don't use it.

Otherwise, both configurations seem valid to me.

How long did you let the script run before you stopped it?
If you stop the script too early, it may be that the qemu agent has not yet reported a received IP address.
For example, it sometimes happens that all VMs receive their IP addresses only after a good 2 minutes.
For this reason, I recommend letting the script run until it terminates on its own (max. 5 minutes).

Please let me know if this solves your problem, or what error is reported at the end.

I always let it run; it terminates on its own after 5 minutes, and the error that I posted in the first post appears.

Anyway, I think this has more to do with the terraform libvirt-provider. I will follow some of the related issues in their repository (e.g. dmacvicar/terraform-provider-libvirt#924) and will let you know if anything comes of it. Thank you! :)

I was able to recreate this issue.

For example, I have my network configured as follows:
CIDR (for LAN network): 10.10.0.0/20
GW (router's IP): 10.10.0.1

If I enter the following values when creating the cluster, the cluster gets successfully created:

# terraform.tfvars

network_mode = "bridge"
network_bridge = "br0"
network_cidr = "10.10.0.0/20"
network_gateway = "10.10.0.1" # In this case, GW can be omitted
...
master_nodes = [
  {
    id  = 1
    ip  = "10.10.6.5"
  }
]
...
worker_nodes = [
  {
    id  = 1
    ip  = "10.10.6.6"
  }
]

If I enter the wrong GW IP, the addresses are not retrieved and I get the same error message as you.
The same thing happens if the wrong network CIDR is specified. For example, if I enter network_cidr = "10.10.0.0/22", I again get the same error as you.

Can you verify that you entered the correct CIDR and GW?

  • the CIDR should be your LAN network with the appropriate number of network bits (subnet mask),
  • the GW should point to the router of the specified network,
  • the static addresses assigned to the VMs should be within the specified network (see the check sketched below).
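
For the last point, a quick sanity check (a sketch using python3; adjust the CIDR and addresses to your values):

# Prints True for addresses inside the network, False otherwise
python3 -c '
import ipaddress
net = ipaddress.ip_network("192.168.1.0/24")
for ip in ("192.168.1.150", "192.168.1.151", "192.168.1.1"):
    print(ip, ipaddress.ip_address(ip) in net)
'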

Will check this out asap, thank you! :)

Hi,

can you let me know if the above solved your problem? Thanks.

Hi @MusicDin, so sorry for not replying sooner. I double-checked everything and the values are indeed correct, but I still had the problem. After what you wrote, though, I am more convinced that the problem is a misconfiguration of mine somewhere else rather than the script itself.

BTW, I have a proxy server and modified your scripts to inject proxy environment variables into the cluster appropriately, and in nat mode everything works fine. My guess -- but I may be wrong -- is that the qemu agent cannot be reached because the proxy, at that point, is not yet configured in the guest, and so communication with the host is blocked. Do you think this could be the case?

Anyway, I have reverted to using nat mode, as it is still acceptable for my use case for now :)

In general, I don't think the proxy is the problem, because if your bridge interface gets its own IP address, so should the virtual machines. This is just a guess though, as I have no idea how your network is implemented.

I'm still not able to reproduce the issue other than with incorrect values, so it seems to me that it needs further investigation on your end. If the NAT mode is sufficient for your needs, that should do for now.

Please let me know if you have any more questions or information about this problem.

One more question @asimpleidea - can you please tell me which hypervisor you're using and which OS image you're installing on the nodes?
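
Older cloud images may not ship the qemu guest agent, which bridge mode relies on. A quick check inside a node (assuming an apt-based guest):

# Inside the guest: check whether the agent is installed and running
systemctl status qemu-guest-agent
# Install it if missing:
sudo apt-get install -y qemu-guest-agent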

I am using ESXi and, if I remember correctly, the images were Ubuntu 16.04. I may try again with 20.04, though.

So to conclude, I agree that I will have to investigate further, and I will let you know if I have any other news :)
Thanks so much @MusicDin!

Thanks again for the provided information and for opening the issue.

I'm closing it for now, but feel free to reopen if you have something new.