nodejs / build

Better build and test infra for Node.

Nearform can no longer host machines

mhdawson opened this issue · comments

Creating this issue to capture/track this, as opposed to an email discussion, which is harder to pull people into.

Nearform has let the build WG know through email that they can no longer host the machines we had in their datacenter. These include:

  • 2 Windows on ARM machines
  • 3 OSX machines
  • 2 Large benchmarking machines.

They have proposed moving them to another hosting provider, which would cost 3,856 Euros as a one-time move cost and then 850 Euros per month as an ongoing cost.

From informal discussion so far we believe we don't need the Windows on ARM machines as they have been replaced by machines in Azure. That may make the cost a bit lower.

The options going forward at a high level would be:

  • Move all the machines (minus Windows on ARM) per the proposal from NearForm
    outlined above, which would require Foundation funding.
  • Find a new host/company willing to sponsor hosting or host the machines for free.
  • Find replacements (minus Windows on ARM):
    • Possibly reach out to MacInCloud for OSX machines (1 ARM and 2 x86)
    • Find a new sponsor for the large benchmarking machines.

Initial discussion is that we don't believe we can/should simply provision the larger machines with our existing hosting providers. As part of this process we should also confirm with the performance team what size of machines is actually needed.

Given that there have been discussions with the Foundation/Linux IT team about them helping to manage machines, and given their stated approach of "fully owning" what they manage, it would be good to see if Linux IT can take on solving this time-sensitive issue for the project.

@bensternthal could you take on getting Linux IT to give us a yes/no on taking this on, ideally within the timeframe needed by Nearform?

@efrisby could you share what the required timeframe for a move is?

Hi Michael,

We have a fibre line that needs to be removed, which provides the fixed IP addresses currently assigned to the servers. We can keep this in place for a period of time, but we would be hoping to shut it down in under 2 months.

So if we could set the 1st of April as the deadline, would that give enough time to address the above?

Thanks,
Eamonn

@nodejs/performance FYI

Hi Michael, Ryan Aslett from LF IT here.

I've been doing a bit of background research on this and wanted to make sure I understood what the requirements are.

If I understand the situation correctly, these are physical machines that Nearform is hosting for nodejs in their datacenter, which they can no longer continue to support having on their network.

My understanding is that the 2 Windows on ARM machines (Surface Pro,
https://github.com/nodejs/build/blob/main/ansible/inventory.yml#L227-L228) were donated/on loan from ARM (#2540 (comment) and #2540 (comment)) and have since been decommissioned in favor of resources at Azure. It looks like there is still some question as to what state they're in (#3286). It seems prudent that those either get returned to ARM or dealt with in some other way.

The OSX machines:
It also seems like there are 4(?) OSX machines: two x86 ones with 2 VMs running on them via VMware Fusion, and 2 ARM-based ones that I believe are just bare-metal machines (I couldn't find any info about the state of any VMs on the ARM Mac Minis in the issue history, other than the fact that the IPs in the inventory match what @efrisby mentioned here: #3390 (comment)).

One of the x86 Mac Minis had its VMs split between release and test for 10.15, and the other has 2 VMs dedicated to testing 10.15-x64.

The release machine was retired because 10.15 isn't able to run Xcode 13 and notarize.

It looks like there were some recent experiments to get 10.15 x64 tests to run on Orka: https://ci.nodejs.org/computer/test%2Dorka%2Dmacos10.15%2Dx64%2D1/builds.

For the 2 ARM-based Mac Minis, it seems like the test one has been unused for the last 11 days:
https://ci.nodejs.org/computer/test%2Dnearform%2Dmacos11.0%2Darm64%2D1/builds, and its jobs are already being run on MacStadium nodes: https://ci.nodejs.org/computer/test%2Dmacstadium%2Dmacos11.0%2Darm64%2D4/builds

I do not have access to the release Jenkins, so I'm not sure what the status of the NearForm ARM release machine is, other than that it seems we now have two functioning release machines for macos11-x64-1 (#3179 (comment)).

Given that, I wonder if we already have capacity at MacStadium and Orka to handle the roles these NearForm OSX machines are performing? (Though perhaps we might need an additional Orka test runner for 10.15/x64.)

Would the goal in pursuing MacInCloud or another provider (i.e. sponsored https://aws.amazon.com/pm/ec2-mac/ ) be mostly for redundancy and resiliency against provider outages?

**Large Benchmarking Machines**
Other than the specs of the machines themselves (#791 (comment)), I'm not sure how these are used, and whether or not there is a requirement to keep those specific machines as the benchmark machines, or if moving to another resource is an option.

Would changing the benchmark infra be an option? Can those be virtual/cloud-based machines, or is bare metal a requirement?

In any case I look forward to helping get this figured out.

Would changing the benchmark infra be an option? Can those be virtual/cloud-based machines, or is bare metal a requirement?

Virtual is really not an option. However, any "bare metal" host would do. I personally use a Hetzner machine for similar purposes (it's significantly worse/slower, but we need the consistency of results, not the actual speed).

In terms of resources, we could do with similar specs (I don't have those handy), or even something a bit less powerful.

My 2 cents is that those machines are likely near end-of-life.

Is there something speaking against using GitHub runners?

Anything running on VMs has too much interference, and the standard deviation between runs is too high to measure bytecode-level optimizations, e.g. microbenchmarks.
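To make that concrete, here is a small, self-contained sketch (the numbers are made up for illustration, not real benchmark data) of why run-to-run noise matters: with the spread typical of a shared VM, a ~2% microbenchmark improvement is indistinguishable from noise, while on a quiet bare-metal host it stands out clearly.

```python
# Hedged illustration (not project code): why run-to-run noise matters for
# microbenchmarks. All numbers below are invented for the example.
import statistics

def detectable(baseline_ops, patched_ops):
    """Rough check: is the mean difference larger than the combined noise?"""
    diff = statistics.mean(patched_ops) - statistics.mean(baseline_ops)
    noise = statistics.stdev(baseline_ops) + statistics.stdev(patched_ops)
    return diff, noise, abs(diff) > noise

# Bare metal: ~0.5% run-to-run spread, a ~2% win stands out.
metal_base    = [1000, 1004, 998, 1002, 1001]
metal_patched = [1021, 1024, 1019, 1022, 1020]

# Shared VM: ~5% spread from noisy neighbours swallows the same ~2% win.
vm_base    = [1000, 1060, 950, 1030, 970]
vm_patched = [1020, 965, 1075, 990, 1045]

for label, base, patched in [("bare metal", metal_base, metal_patched),
                             ("shared VM", vm_base, vm_patched)]:
    diff, noise, ok = detectable(base, patched)
    print(f"{label}: mean diff {diff:+.1f} ops, noise ~{noise:.1f}, "
          f"{'detectable' if ok else 'lost in the noise'}")
```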

Is there something speaking against using GitHub runners?

Anything running on VMs has too much interference, and the standard deviation between runs is too high to measure bytecode-level optimizations, e.g. microbenchmarks.

GitHub Actions supports self-hosted runners, even bare-metal ones, but converting the Jenkins CI infrastructure to a GitHub Actions infrastructure is an ambitious undertaking that would be unlikely to succeed in the timeframe of this immediate need.

#2247 seems like a good place to continue discussing whether or not that's an eventual or possible outcome.

2 Windows on ARM machines

I agree it seems we don't need them anymore.

The OSX machines

Currently almost unused because the current version of Node.js doesn't support macOS 10.15. These machines could be updated to macOS 12, 13, or even 14.
I'm not sure we have enough capacity to replace them at MacStadium (we already struggle with disk space), but I would be happier if we found other providers to donate resources (for example, Scaleway has bare-metal M1 and M2 Pro Mac Minis).

The Intel benchmarking machines

The systems we have now are based on dual-CPU Intel Xeon E5-2699 v4.
They each have a total of 88 logical cores and 64 GB of RAM.
I'm also familiar with Hetzner machines at work and maybe we should try to ask them for sponsorship. They have machines with up to 64 cores, and datacenters in different countries.

I personally use a Hetzner machine for similar purposes (it's significantly worse/slower, but we need the consistency of results, not the actual speed).

Thanks for confirming - that was one of the questions I had when discussing this with some members of build yesterday - I figured that was likely the case. Obviously the implication is that we won't be able to compare "old" runs with "new" runs without re-running them, but that shouldn't be too much of a problem (we can always re-run if required).

Does the performance team require two systems or would one be adequate for the capacity needs?

Would the goal in pursuing MacInCloud or another provider (i.e. sponsored https://aws.amazon.com/pm/ec2-mac/ ) be mostly for redundancy and resiliency against provider outages?

I believe that's the primary driver, yes. AWS should also be a viable option if they were willing to sponsor us.

Regarding OSX testing:

Based on everything I've been able to glean from issues and meeting notes, it seems like a good path forward would be to lean into what we're doing with MacStadium for the short term, with an eye on having a secondary provider longer term.

  • What is the current status of our relationship with MacStadium?
  • How happy are we with the performance/reliability of their service?
  • Is it possible to leverage them entirely for the short term, are there any blockers?
  • Are there things on our backlog to do with them for OSX testing? (Things like leveraging their ephemeral instances, upgrading to Orka 3.0, maybe using their Jenkins plugin? I see a lot of recurring OSX disk space / node health issues, which might be alleviated by using ephemeral instances - provided those can spin up and be ready for testing fast enough for the build pipeline.)
  • If we, LFIT, wanted access to the MacStadium account to audit instance sizes and requirements, what would be the process for that? (So we have a gauge on what to ask of a future provider.)

Regarding Benchmark testing:

  • How time-sensitive are benchmark tests? Are they blockers to an immediate commit? Or are they more like steps in a long release process?
  • How frequently are they used/needed?
  • One possibility is something like dedicated-tenancy (metal) EC2 spot instances for the Jenkins nodes to run those tests. They could spin up and down on demand, we could target a large and powerful enough size but only pay for/use credits when we're actually running the tests, and we could work with AWS to find the sweet spot of "very available" and "performant".
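As a rough sketch of the spin-up-on-demand idea (purely illustrative, not actual project infrastructure: the AMI ID, region, and tags below are placeholders, and whether bare-metal spot capacity is reliably available would need to be confirmed with AWS):

```python
# Minimal sketch, assuming AWS credentials and a prepared benchmark-agent AMI.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical benchmark-agent AMI
    InstanceType="c5.metal",           # bare metal, single-tenant by design
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={            # request spot pricing for the run
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "benchmark-ephemeral"}],
    }],
)
print(response["Instances"][0]["InstanceId"])
# The Jenkins agent would connect once the instance is up and the instance
# would be terminated after the job, so cost accrues only while a job runs.
```

In practice something like the Jenkins EC2 plugin (or an equivalent) would manage the lifecycle rather than a hand-rolled script; the point is only that cost would accrue per benchmark run rather than for an always-on machine.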

How time-sensitive are benchmark tests?

In order to land any performance-related PR, we run the benchmarks.
Usually, benchmarks run on dev machines are not effective.
I don't have actual stats, but I'd say we run them weekly.

Some of those jobs last 6-8 hours, and in the most extreme cases days.

Are they blockers to an immediate commit?

The lack of benchmarking machines would slow down progress on most things performance related.

Or are they more like steps in a long release process?

They are not part of our release process.


How frequently are they used/needed?

I'd guess a few times per week.

One possibility is something like dedicated-tenancy (metal) EC2 spot instances for the Jenkins nodes to run those tests. They could spin up and down on demand, we could target a large and powerful enough size but only pay for/use credits when we're actually running the tests, and we could work with AWS to find the sweet spot of "very available" and "performant".

One of the key strategies we employ is to rely on previous runs to compare.

I'd not really trust this setup, because the actual machine would change every time.

On top of that, the AWS spot instance cost for c5.metal (which seems a good choice in terms of resources) is likely 3x (or more) compared to a provider like Hetzner.

One of the key strategies we employ is to rely on previous runs to compare.

Do we? I thought each Benchmark CI job that is run goes through the requested benchmark(s) twice -- once with the base branch (i.e. what is being compared to) and once with the PR being tested.

benchmark-node-micro-benchmarks runs this shell script.
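For anyone not familiar with that flow, here is a rough schematic (purely illustrative; it is not the actual CI shell script, and the benchmark target and naive output parsing are placeholder assumptions) of the "run twice on the same machine" pattern: benchmark the base branch build, then the PR build, and compare the paired results. The real job drives node's own benchmark tooling (e.g. benchmark/compare.js with old/new binaries plus the R script for significance testing), which is why running both passes on the same hardware matters.

```python
# Illustrative only: a schematic of the two-pass comparison the benchmark CI
# job performs, not the real script. Paths, the benchmark target ("buffers"),
# and the output parsing are placeholders.
import statistics
import subprocess

def last_number(output: str) -> float:
    """Naive parser: grab the trailing number (ops/sec) from the last line."""
    last_line = [l for l in output.splitlines() if l.strip()][-1]
    return float(last_line.rsplit(" ", 1)[-1])

def bench(node_binary: str, checkout: str, runs: int = 5) -> list[float]:
    """Run the same micro-benchmark several times with one node build."""
    results = []
    for _ in range(runs):
        proc = subprocess.run(
            [node_binary, "benchmark/run.js", "buffers"],  # placeholder target
            cwd=checkout, check=True, capture_output=True, text=True,
        )
        results.append(last_number(proc.stdout))
    return results

# Two checkouts on the SAME machine: the base branch and the PR under test.
# Comparing results gathered on different hardware would not be valid.
base = bench("./node-base/out/Release/node", "./node-base")
pr = bench("./node-pr/out/Release/node", "./node-pr")
print(f"base mean: {statistics.mean(base):.1f}  pr mean: {statistics.mean(pr):.1f}")
```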

Would the goal in pursuing MacInCloud or another provider (i.e. sponsored https://aws.amazon.com/pm/ec2-mac/ ) be mostly for redundancy and resiliency against provider outages?

Our goal across platforms has been to have at least two providers for any platform. So while we might be able to use 1 for a short period of time, the plan should be to find a second provider if at all possible.

What is the current status of our relationship with MacStadium?
How happy are we with the performance/reliability of their service?

I'd say the relationship is good and we are happy with the machines they have provided. I believe most of the common issues we have relate to OSX itself versus the host. Many thanks to MacStadium for their continued support.

Do we? I thought each Benchmark CI job that is run goes through the requested benchmark(s) twice -- once with the base branch (i.e. what is being compared to) and once with the PR being tested.

Yes. We typically run the benchmark across different commits as a PR evolves. I'm not convinced that those results would be comparable across different HW.

Are only benchmarks run on those machines?

I'm not sure we have enough capacity to replace them at MacStadium (we already struggle with disk space), but I would be happier if we found other providers to donate resources (for example, Scaleway has bare-metal M1 and M2 Pro Mac Minis).

Pardon me if I'm intruding here, but if there is a need for M1 or M2 runners for GitHub Actions, may I suggest giving FlyCI a try? We offer macOS M1 and M2 runners (ARM64). For public repos, we offer 500 mins/month of free M1 usage (4 vCPUs, 7 GB RAM, 28 GB storage).

The setup is super easy:

  1. Install the FlyCI GitHub app.
  2. Give the FlyCI app permissions to this repo.
  3. Change your runs-on flag whenever you implement the ARM64 MacOS workflow:
jobs:
  ci:
-   runs-on: macos-latest
+   runs-on: flyci-macos-large-latest-m1
    steps:
      - name: 👀 Checkout repo
        uses: actions/checkout@v4

Do you think this might be a good option for nodejs / build?

Web: flyci.net

Update:

I apologize, I just realized you guys are using Jenkins, not GitHub Actions. Please ignore my comments above!

Issues with proposal from Linux IT for how to move forward on replacing NearForm OSX machines - #3638

@UlisesGascon FYI