beringresearch / macpine

Lightweight Linux VMs on MacOS

Home Page:https://beringresearch.github.io/macpine/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

`Start` is very slow

maxzinkus opened this issue · comments

In testing #78, I also noticed that the Retry loop surrounding Exec("hwclock -s") fails a few times every boot on my machine. It looks like it's caused by startup not yet being ready -- that is, the SSH server hasn't completely come up.

Overall (especially with the current exponential backoff on Retry -- changed in #73) this means that Booting... takes around 2+4+8= 14 seconds.

If this loop is skipped, the hwclock -s is not run -- meaning the system time will not be reset based on the RTC. However, given #78, it seems that the RTC is wrong regardless after sleep and therefore this step might not be necessary.

If the loop is skipped, Start reaches qemu/ops.go:354 before qemu-system-... has finished starting the VM. It then checks for the pidfile which is not yet present and determines that Start failed.

However, the VM is still correctly started, and by the time the user follows up with alpine ssh ..., qemu-system-... has completed startup, the pidfile has been created, and the ssh server is ready.

Whatever the resolution is to #78 (either no change or some change), this Retry loop should instead await for one of two things to happen:

  1. qemu-system-... exits (likely with an error/after receiving a signal)
  2. alpine.pid is created (which qemu-system- seems to do late in the boot process)

If 1 occurs, the VM surely did not start correctly and the pidfile should be missing (TODO: check if pidfile is cleaned up after sigkill).

If 2 occurs, the VM likely started correctly.

This way, Start can wait asynchronously for bootup to finish, and then proceed when ready after a host-system-speed dependent amount of time. Instead, on slow systems, Retry will spend many seconds sleeping after multiple failures, and on fast systems, Retry will spend some seconds sleeping after a few failures -- in both cases waiting precious cycles and time.

@idroz you can assign me here and I'll make a PR when I can, or close if I'm missing something important here. I hope the reduced Retry times in #73 don't make Start fail on slower systems than mine, but if they do, fixing this issue will fix that too.

Is this related to what I'm seeing right now with over 10 minutes of dots?

$ alpine start simplistic-background
2023/02/28 19:56:40 Booting...................

It's 20:07:XX as of positing this.

Is there a way to verbose output the booting? I'm worried my vm has been corrupted some how.

It's not. I mean, it kind of is, but not really.

Likely, Start (the thing that boots the VM) is trying to SSH into the VM to run a command as part of the boot process. That SSH is failing in a Retry loop that waits a long time after failures. That is part of what this issue plans to fix, but it's not the root cause of your bug.

  • In ~/.macpine/simplistic-background/config.yaml, are sshuser and sshpassword correct?
  • Is the root password still the default (root)?
  • Is sshport correct and is anything on your host system listening on that port already?
  • Did you modify /etc/ssh/sshd_config inside the VM at all?

I don't want to take away from the original problem, but I waited it out and got this

2023/02/28 20:14:00 unable to sync clocks: after 10 attempts, last error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain

but checking

In ~/.macpine/simplistic-background/config.yaml, are sshuser and sshpassword correct?

was the correct fix.

@idroz Unfortunately, in testing on my machine the PR #93 only shaves a few seconds off the start time. If you have a chance to test, I'm curious about what you see on a different machine.

I guess it really does just take a while for the SSH server to come up :/