nodejs / build

Better build and test infra for Node.

MacStadium maintenance window on January 23rd

UlisesGascon opened this issue · comments

As described in ticket: SERVICE-176962

Dear OpenJS

On Tuesday, January 23rd, 2024, at 9 AM ET, we need to conduct a one-hour maintenance in our ATL data center that will impact your ORKA cluster for one hour. We apologize in advance.

Before the start of the maintenance, please save and shut down any VMs in advance of the maintenance start.

We will notify you once the nodes are back up here in the ticket. Again, we apologize in advance for any inconvenience this may cause. Thank you for your understanding.

Potentially affected machines:

Next steps

I am not sure that I will be able to manage the "save and shut down" for the VMs before the deadline (tomorrow). Is anyone available to do it (@nodejs/build)?

test-orka-macos10.15-x64-1:

  • Restore test-orka-macos10.15-x64-1 in Orka cluster
  • Re-ansible test-orka-macos10.15-x64-1
  • Re-enable test-orka-macos10.15-x64-1 in Jenkins
  • Save the state
  • Commit the image changes

test-orka-macos10.15-x64-2:

  • Restore test-orka-macos10.15-x64-2 in Orka cluster
  • Re-ansible test-orka-macos10.15-x64-2
  • Re-enable test-orka-macos10.15-x64-2 in Jenkins
  • Save the state
  • Commit the image changes

test-orka-macos11-x64-1:

  • Restore test-orka-macos11-x64-1 in Orka cluster
  • Re-ansible test-orka-macos11-x64-1
  • Re-enable test-orka-macos11-x64-1 in Jenkins
  • Save the state
  • Commit the image changes

test-orka-macos11-x64-2:

  • Restore test-orka-macos11-x64-2 in Orka cluster
  • Re-ansible test-orka-macos11-x64-2
  • Re-enable test-orka-macos11-x64-2 in Jenkins
  • Save the state
  • Commit the image changes

release-orka-macos11-x64-1:

  • Restore release-orka-macos11-x64-1 in Orka cluster
  • Re-ansible release-orka-macos11-x64-1
  • Manual steps on release-orka-macos11-x64-1
  • Re-enable release-orka-macos11-x64-1 in Jenkins
  • Save the state
  • Commit the image changes
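
The per-machine checklists above share the same steps, with one extra manual step for the release machine. A minimal Python sketch that generates them from a machine list could look like the following. The function names and the rule that any machine whose name starts with release- gets the manual step are assumptions for illustration only, not part of the build tooling.

```python
# Hypothetical helper: regenerate the recovery checklist for each Orka VM.
MACHINES = [
    "test-orka-macos10.15-x64-1",
    "test-orka-macos10.15-x64-2",
    "test-orka-macos11-x64-1",
    "test-orka-macos11-x64-2",
    "release-orka-macos11-x64-1",
]


def recovery_steps(machine: str) -> list[str]:
    """Return the ordered recovery steps for one machine."""
    steps = [
        f"Restore {machine} in Orka cluster",
        f"Re-ansible {machine}",
    ]
    # Assumption: only release machines need the extra manual steps.
    if machine.startswith("release-"):
        steps.append(f"Manual steps on {machine}")
    steps += [
        f"Re-enable {machine} in Jenkins",
        "Save the state",
        "Commit the image changes",
    ]
    return steps


def render_checklist(machines: list[str]) -> str:
    """Render one task-list block per machine as plain text."""
    lines = []
    for machine in machines:
        lines.append(f"{machine}:")
        lines.extend(f"  - [ ] {step}" for step in recovery_steps(machine))
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_checklist(MACHINES))
```

Keeping the steps in one function means that a change to the procedure only has to be made once instead of in five copies of the list.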

Update (9 AM ET):

We are beginning the maintenance and will update you once completed.

Update (10 AM ET):

The maintenance is now completed. Thank you.

We will need to recover the machines manually to get the Orka cluster working again. cc: @nodejs/build.

I am not available today, but I may be able to work on it tomorrow. Feel free to take the lead if you want.

IMPORTANT: You can use this table (#3240 (comment)) as a reference for where to place the VMs within the cluster, so that they align with the inventory.

I am afraid that I won't be able to work on it today. I will only be able to start on it next Monday. 😓

@UlisesGascon thanks for working on it. One question: is the machine recovery needed because the VMs were not shut down properly (I only noticed the original issue too late to help out), or would it have been required regardless?

This situation is a bit tricky, drawing from past experiences such as #3112. The VMs allocated in specific slots, including port mapping, are expected to be shut down and effectively 'removed' from the Orka cluster nodes.

Once the cluster is back, a manual relocation process is necessary to create new VMs using the images. This ensures the correct slots are filled, maintaining the expected mapping from the inventory and Jenkins (IPs and ports).
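
As a rough illustration of the alignment check described above, the sketch below compares the slots expected by the inventory with what is actually deployed. The node names, ports, and dictionary layout are made up for the example; the real values live in the inventory and in the table referenced earlier in this thread.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of a slot: (Orka node name, mapped SSH port).
Slot = Tuple[str, int]


def find_misplaced(expected: Dict[str, Slot], deployed: Dict[str, Slot]) -> List[str]:
    """List VMs that are missing or sit in a different slot than expected."""
    problems = []
    for vm, slot in sorted(expected.items()):
        actual = deployed.get(vm)
        if actual is None:
            problems.append(f"{vm}: not deployed, expected {slot[0]}:{slot[1]}")
        elif actual != slot:
            problems.append(
                f"{vm}: on {actual[0]}:{actual[1]}, expected {slot[0]}:{slot[1]}"
            )
    return problems


if __name__ == "__main__":
    # Example data only; these nodes and ports are not the real mapping.
    expected = {
        "test-orka-macos11-x64-1": ("node-a", 8822),
        "test-orka-macos11-x64-2": ("node-b", 8823),
    }
    deployed = {"test-orka-macos11-x64-1": ("node-b", 8822)}
    for problem in find_misplaced(expected, deployed):
        print(problem)
```

An empty result would mean every VM sits in the slot that the inventory and Jenkins expect.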

In this case, we didn't save and shut down the VMs before the maintenance. Consequently, I suspect that the images, now that the VMs have been destroyed, might be older than the state the VMs were in. This will require re-running Ansible on each VM once deployed, plus some manual configuration, particularly of the Jenkins tokens, depending on the state of the images.

I'll have a clearer picture on Monday. Unfortunately, I haven't been able to connect yet to check the status of the cluster or the nodes after the upgrade.

@UlisesGascon I don't think I'm up to speed enough to do the recovery myself, but if a second set of hands would be helpful when you have time to look at it and I'm around, I'm happy to get on a call and help, if that makes sense.

I will start to work on it now

The 10.15 machines are back. I am working on re-ansibling the macOS 11 VMs, but the process is taking time.

I'm currently facing some challenges with the LLVM installation on macOS 11. The build process seems unusually time-consuming, taking hours (whereas I recall it used to take around 30 minutes). The process was so lengthy that the Ansible SSH connection timed out. So I changed strategy and executed this step manually (via SSH).

[Screenshot 2024-01-29 at 18 20 42]

I'm also puzzled about why the applied patch is ...arm64... since these are Intel machines. I've decided to let the process run overnight to see if the build generates any errors or if it finalizes properly.

So, the machines made some progress during the night. They are currently still installing dependencies (after I restored the SSH sessions due to timeouts). I am not sure why it is so slow, but we are making progress.

I think these long compile/install steps are due to Homebrew removing support for outdated macOS (it has to install deps from source instead of downloading prebuilt binaries).

This makes total sense. We need to commit the image changes after this process because the recovery process is very long.

I am running into issues with the manual steps on release-orka-macos11-x64-1. sudo xcodebuild -license is hanging, and so is git. Not sure what the issue could be. 🤔

The Ansible process worked fine. I will soon finish the manual steps for release-orka-macos11-x64-1.

So, release-orka-macos11-x64-1 seems to be working. I re-ran this canary build to check that the machine is working as expected. This will unblock the releases 🥳

I am still working on the macOS 11 test machines. The dependency build takes quite a long time.

🥳 test-orka-macos11-x64-1 and test-orka-macos11-x64-2 are back!

I will commit the image changes once the queue is down to zero, to avoid creating further bottlenecks for the PRs.
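
The "wait until the queue is empty, then commit" idea can be sketched as a simple polling loop. This is only an illustration: get_queue_length is a hypothetical callable that would have to be backed by whatever reports the Jenkins queue size, and the polling interval and limit are arbitrary defaults.

```python
import time
from typing import Callable


def wait_until_queue_empty(
    get_queue_length: Callable[[], int],
    poll_seconds: float = 60.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the queue length is zero.

    Returns True once the queue is empty, or False if it is still
    non-empty after max_polls checks.
    """
    for _ in range(max_polls):
        if get_queue_length() == 0:
            return True
        sleep(poll_seconds)
    return False


if __name__ == "__main__":
    # Simulated queue lengths standing in for a real Jenkins query.
    lengths = iter([3, 1, 0])
    if wait_until_queue_empty(lambda: next(lengths), poll_seconds=0):
        print("Queue is empty, safe to disconnect the machines and commit the images")
```

Passing the queue lookup and the sleep function in as parameters keeps the loop easy to try out without touching Jenkins.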

Here are the first jobs from the queue. I will check that they are passing before committing the images:

Update: the CI jobs were fine as far as I can see.

I will start with the image commit, so at some point I will disconnect the machines from Jenkins while doing the commit.

I got an error while connecting to the VPN. I created a support ticket SERVICE-178721.

The login error was resolved, but I needed to open a separate support ticket (SERVICE-178790) because I am getting errors while saving the changes.

Current status

I've cleaned up all the macOS machines in Orka as they were starting to generate space issues again. Additionally, I've created a state save for each machine.

Initially, I thought I needed to push this state to the VM images. However, some of them are using a common image, so it might not be a good idea. I've asked support what the best strategy is to maintain the VMs in this state regardless of any changes. I am waiting for a final response before closing this issue.