nodejs / build

Better build and test infra for Node.

Windows release machines are all offline

targos opened this issue

Hey @targos, thanks for letting me know. The exception I see on the machines is the following:

INFO: Could not locate server among [https://ci-release.nodejs.org/]; waiting 10 seconds before retry
java.io.IOException: https://ci-release.nodejs.org/ provided port:11111 is not reachable on host ci-release.nodejs.org
        at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:304)
        at hudson.remoting.Engine.innerRun(Engine.java:809)
        at hudson.remoting.Engine.run(Engine.java:563)

As I recall, when new machines were added to the release CI, firewall rules were added for them. Is it possible that those rules were removed/edited recently? From what I see this started on Friday/Saturday.

P.S. I've replaced the actual port with 11111 to keep the real one a secret.

According to the command history on ci-release, @richardlau recently touched the iptables rules.

That would have been over a week ago, before the collab summit (for #3663).

So it does look like the IP addresses of the Windows Rackspace machines do not match the inventory in richard-20240326, the backup of /etc/iptables/rules.v4 from ci-release that I edited for #3663. I'm guessing this only manifested over the weekend because the machines self-updated/rebooted (the edit was made over a week ago)?

I'll update the firewall with the IP addresses from the inventory/secrets.
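
For context, /etc/iptables/rules.v4 presumably contains per-machine ACCEPT rules keyed on source IP, so a machine whose IP changed stops matching its rule and falls through to the drop policy, which fits the symptoms here. An illustrative entry in iptables-save format (placeholder TEST-NET-3 address and the masked port, not the real values):

```
# Allow inbound agent connections from one release machine (placeholder values)
-A INPUT -s 203.0.113.7/32 -p tcp -m tcp --dport 11111 -j ACCEPT
```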

Thanks for the update @richardlau. I was just about to say that ping ci-release.nodejs.org works, so the problem is the port, and therefore the firewall. Once it's fixed, should we let all of the started builds and update jobs finish (everything should be back to normal by tomorrow), or would you prefer to cancel the queued jobs?
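
Worth noting: ping only exercises ICMP, so it can succeed while the agent's TCP port is still filtered by the firewall. A minimal sketch of a direct TCP probe (POSIX sockets, with the placeholder port from above; on the Windows hosts themselves, PowerShell's Test-NetConnection with -Port checks the same thing):

```cpp
// Minimal TCP reachability probe -- a sketch, not the Jenkins agent code.
// Uses the placeholder port 11111 from the exception above, not the real one.
#include <cstdio>
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    const char* host = "ci-release.nodejs.org";
    const char* port = "11111";  // placeholder, as in the log above

    addrinfo hints{};
    hints.ai_family = AF_UNSPEC;     // try IPv4 and IPv6
    hints.ai_socktype = SOCK_STREAM;

    addrinfo* res = nullptr;
    if (getaddrinfo(host, port, &hints, &res) != 0) {
        std::fprintf(stderr, "DNS resolution failed\n");
        return 1;
    }

    int rc = 1;
    for (addrinfo* ai = res; ai != nullptr; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            std::printf("TCP connect to %s:%s succeeded\n", host, port);
            rc = 0;  // port reachable: the firewall is not the problem
        }
        close(fd);
        if (rc == 0) break;
    }
    freeaddrinfo(res);
    if (rc != 0)
        std::fprintf(stderr, "TCP connect failed -- consistent with a firewall drop\n");
    return rc;
}
```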

I've updated the firewall and made sure the changes are reflected in /etc/iptables/rules.v4. It looks like the machines are back online in Jenkins and are picking up jobs -- let's let them run and keep an eye out for any issues.

The queue has emptied and everything seems to be back to normal. I'll close this issue in 1-2 days if no further incidents occur.

FWIW, I think this is a new/separate problem, but today's nightly build failed on vs2022-arm64 (https://ci-release.nodejs.org/job/iojs+release/10098/nodes=vs2022-arm64/consoleFull) with:

07:02:12 c:\ws\deps\simdutf\simdutf.cpp(16719,7): error C2664: '__n128x4 neon_ld4m_q8(const char *)': cannot convert argument 1 from 'const uint8_t [64]' to 'const char *' [c:\ws\deps\simdutf\simdutf.vcxproj]
07:02:12 c:\ws\deps\simdutf\simdutf.cpp(16719,7): message : Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or parenthesized function-style cast [c:\ws\deps\simdutf\simdutf.vcxproj]
07:02:12 C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.37.32822\include\arm64_neon.h(6146,10): message : see declaration of 'neon_ld4m_q8' [c:\ws\deps\simdutf\simdutf.vcxproj]
07:02:12 c:\ws\deps\simdutf\simdutf.cpp(16719,7): message : while trying to match the argument list '(const uint8_t [64])' [c:\ws\deps\simdutf\simdutf.vcxproj]
07:02:12 c:\ws\deps\simdutf\simdutf.cpp(16719,73): fatal  error C1903: unable to recover from previous error(s); stopping compilation [c:\ws\deps\simdutf\simdutf.vcxproj]
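
For context, C2664 is MSVC rejecting an implicit conversion between unrelated pointer types: const uint8_t* and const char* don't interconvert in C++, so the array argument needs an explicit cast at the call site. A minimal reproduction of the pattern and the usual fix (load_q8 is a hypothetical stand-in modeled on the neon_ld4m_q8 signature in the message, not the actual simdutf code):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the neon_ld4m_q8 intrinsic from the error:
// a function whose parameter is const char*.
static void load_q8(const char* p) { std::printf("first byte: %d\n", p[0]); }

int main() {
    const uint8_t buf[64] = {42};
    // load_q8(buf);  // C2664: const uint8_t[64] doesn't implicitly convert to const char*
    load_q8(reinterpret_cast<const char*>(buf));  // the explicit cast MSVC asks for
    return 0;
}
```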

I don't think this occurred on the test CI, although https://ci.nodejs.org/job/node-compile-windows/55318/nodes=win-vs2022-arm64/consoleFull failed with:

08:58:52 C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.39.33519\include\tuple(47,90): fatal  error C1060: compiler is out of heap space [C:\workspace\node-compile-windows\node\tools\v8_gypfiles\v8_initializers.vcxproj]

simdutf was updated in nodejs/node#52381, but the test CI runs for that passed.

I opened simdutf/simdutf#407 for the simdutf error. I think we can close this issue, as the original problem is resolved.