Station Core crashed and did not restart


bajtos opened this issue 3 months ago

Station Desktop version v1.6.0.

I left my Mac in sleep mode overnight. macOS wakes the computer every 1-2 hours to process incoming notifications and similar tasks, which creates brief windows in which programs run and have network access.

Here is what I found in my activity log in the morning: Spark and Voyager exited via SIGTERM but did not start again.

Excerpt from the module logs (full file attached later):

[2024-05-21T21:11:30Z INFO  module:spark/main] Sleeping for 78 seconds before starting the next task...
{"type":"activity:error","module":"Zinnia","message":"Voyager has been inactive for 5 minutes, restarting..."}
{"type":"activity:info","module":"Zinnia","message":"Voyager exited via signal SIGTERM"}
{"type":"activity:info","module":"Zinnia","message":"Spark exited via signal SIGTERM"}
Zinnia main loop ended
WARN: Write to InfluxDB failed (attempt: 1). Error: connect ECONNREFUSED 3.123.149.45:443
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1595:16) {
  errno: -61,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '3.123.149.45',
  port: 443
}

and eventually

TypeError: fetch failed
    at node:internal/deps/undici/undici:12345:11
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async S (file:///Applications/Filecoin%20Station.app/Contents/Resources/core/node_modules/w3name/dist/index.mjs:1:2561)
    at async Module.q (file:///Applications/Filecoin%20Station.app/Contents/Resources/core/node_modules/w3name/dist/index.mjs:1:2298)
    at async getContractAddresses (file:///Applications/Filecoin%20Station.app/Contents/Resources/core/lib/zinnia.js:126:20)
    at async RetryOperation._fn (file:///Applications/Filecoin%20Station.app/Contents/Resources/core/node_modules/p-retry/index.js:57:20) {
  cause: AggregateError
      at internalConnectMultiple (node:net:1114:18)
      at afterConnectMultiple (node:net:1667:5) {
    code: 'ECONNREFUSED',
    [errors]: [ [Error], [Error] ]
  },
  attemptNumber: 11,
  retriesLeft: 0
}
Failed to get contract addresses. Retrying...
Usage: Filecoin Station Helper [options]

Options:
  -j, --json          Output JSON                                      [boolean]
      --experimental  Also run experimental modules                    [boolean]
  -v, --version       Show version number                              [boolean]
  -h, --help          Show help                                        [boolean]

TypeError: fetch failed
    at node:internal/deps/undici/undici:12345:11
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async S (file:///Applications/Filecoin%20Station.app/Contents/Resources/core/node_modules/w3name/dist/index.mjs:1:2561)
    at async Module.q (file:///Applications/Filecoin%20Station.app/Contents/Resources/core/node_modules/w3name/dist/index.mjs:1:2298)
    at async getContractAddresses (file:///Applications/Filecoin%20Station.app/Contents/Resources/core/lib/zinnia.js:126:20)
    at async RetryOperation._fn (file:///Applications/Filecoin%20Station.app/Contents/Resources/core/node_modules/p-retry/index.js:57:20) {
  cause: AggregateError
      at internalConnectMultiple (node:net:1114:18)
      at afterConnectMultiple (node:net:1667:5) {
    code: 'ECONNREFUSED',
    [errors]: [ [Error], [Error] ]
  },
  attemptNumber: 11,
  retriesLeft: 0
}

Full log file:
station-modules-1716355317414.log

Miroslav Bajtoš · Answer 1 · Wed May 22 2024 13:33:03 GMT+0800 (China Standard Time)

We should move the code fetching contract addresses outside the Zinnia module, so that we don't restart the loop when we need to restart Zinnia modules. We should also implement an infinite retry so that a failure to fetch contract addresses does not crash Station Core.
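A minimal sketch of what such an infinite retry could look like, assuming a capped exponential backoff. `retryForever` and its options are hypothetical names used for illustration; they are not functions from Station Core or p-retry.

```javascript
// Hypothetical helper: call `fn` until it succeeds, never giving up.
// `sleep` is injectable so the delay schedule can be exercised without waiting.
async function retryForever (fn, {
  minDelayMs = 1000,
  maxDelayMs = 60000,
  factor = 2,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  onFailedAttempt = () => {}
} = {}) {
  let delayMs = minDelayMs
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      // Report the failure, wait, then try again.
      onFailedAttempt(err, attempt, delayMs)
      await sleep(delayMs)
      // Grow the delay, but cap it so attempts never become too sparse.
      delayMs = Math.min(delayMs * factor, maxDelayMs)
    }
  }
}
```

The idea would be to wrap the call that fetches contract addresses (`getContractAddresses` in the stack trace above) in such a helper. The cap matters: without it, a long network outage would stretch the gap between attempts indefinitely.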

Finally, in Station Desktop, it would be great to implement an automatic restart of Station Core.

@juliangruber WDYT?

Miroslav Bajtoš · Answer 2 · Wed May 22 2024 14:15:45 GMT+0800 (China Standard Time)

> We should move the code fetching contract addresses outside the Zinnia module, so that we don't restart the loop when we need to restart Zinnia modules. We should also implement an infinite retry so that a failure to fetch contract addresses does not crash Station Core.

We have already moved runUpdateContractsLoop from lib/zinnia to lib/station 👍🏻

Now, we need to add an error handler to ignore failures. I opened a PR for that: filecoin-station/core#474
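For illustration only, here is a rough sketch of a background loop with such an error handler. It is not the code from the PR; `runUpdateLoop`, `updateOnce`, and the other option names are hypothetical.

```javascript
// Hypothetical sketch: run `updateOnce` repeatedly; a failed iteration is
// reported through `onError` and the loop simply carries on.
async function runUpdateLoop ({
  updateOnce,
  intervalMs,
  signal = { aborted: false }, // any object with an `aborted` flag, e.g. an AbortSignal
  onError = console.error,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
}) {
  while (!signal.aborted) {
    try {
      await updateOnce()
    } catch (err) {
      // Ignore the failure: the next iteration gets another chance.
      onError(err)
    }
    await sleep(intervalMs)
  }
}
```

With this shape, a rejected `updateOnce()` never propagates out of the loop, so it cannot take the whole process down.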

Julian Gruber · Answer 3 · Wed May 22 2024 14:17:10 GMT+0800 (China Standard Time)

Station Desktop 1.6.0 has outdated Station Core (20.4.1 vs 20.6.0), and these possibly related changes happened in Core:

- move rewards update loop out of zinnia loop
- refactor: code cleanup after rewards loop move

Can you reproduce the issue with Station Desktop 1.7.0?

> We should move the code fetching contract addresses outside the Zinnia module, so that we don't restart the loop when we need to restart Zinnia modules.

This already happened: https://github.com/filecoin-station/core/blob/eaac0cb1bdb5ae2a8f8d7d2cc8726ecf6ccbf879/commands/station.js#L144

> We should also implement an infinite retry so that a failure to fetch contract addresses does not crash Station Core.

We currently have 10 retries configured, which, with exponential backoff, I believe already adds up to a really long wait. I don't think it hurts to go higher, though.
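To put a rough number on that wait: the log above shows `attemptNumber: 11` with `retriesLeft: 0`, which is consistent with 10 retries. The sketch below adds up the delays under the assumption of a 1-second initial delay, a factor of 2, no upper cap, and no jitter. Those values are assumptions; the thread does not state how the retry is actually configured. `backoffSchedule` is a hypothetical helper.

```javascript
// Hypothetical helper: list the delay before each retry for a plain
// exponential backoff (delay_i = minTimeoutMs * factor^i, capped at maxTimeoutMs).
function backoffSchedule ({
  retries = 10,
  factor = 2,
  minTimeoutMs = 1000,
  maxTimeoutMs = Infinity
} = {}) {
  const delays = []
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(minTimeoutMs * Math.pow(factor, i), maxTimeoutMs))
  }
  return delays
}

const delays = backoffSchedule()
const totalMs = delays.reduce((sum, ms) => sum + ms, 0)

console.log('delays (s):', delays.map((ms) => ms / 1000).join(', '))
console.log('total wait (s):', totalMs / 1000)
```

Under these assumed values the delays run from 1 s to 512 s and total 1023 s, or about 17 minutes, not counting the time each connection attempt itself takes.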

> Here is what I found in my activity log in the morning: Spark and Voyager exited via SIGTERM but did not start again.

I think this is the main issue here. Why did it not restart? I'm going to take a code read

Julian Gruber · Answer 4 · Wed May 22 2024 14:19:14 GMT+0800 (China Standard Time)

> I think this is the main issue here. Why did it not restart? I'm going to take a code read

I hope this is fixed by upgrading to Station Desktop 1.7.0. Please reopen if you disagree

Miroslav Bajtoš · Answer 5 · Wed May 22 2024 14:19:21 GMT+0800 (China Standard Time)

> > Here is what I found in my activity log in the morning: Spark and Voyager exited via SIGTERM but did not start again.

> I think this is the main issue here. Why did it not restart? I'm going to take a code read

My understanding is that Station Core 1) detected that Voyager is inactive 2) initiated the restart loop 3) crashed because it was not able to fetch contract addresses.
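The three steps can be sketched roughly as follows. This is a simplified illustration, not Station Core's actual implementation; `InactivityWatchdog` and `restartIfInactive` are hypothetical names. The `try`/`catch` marks the spot where, in my reading of the logs, the failure escaped and ended the process.

```javascript
// Hypothetical sketch, not Station Core's actual implementation.
class InactivityWatchdog {
  constructor ({ timeoutMs, now = Date.now }) {
    this.timeoutMs = timeoutMs
    this.now = now
    this.lastActivityAt = now()
  }

  // Call whenever the module reports activity.
  touch () {
    this.lastActivityAt = this.now()
  }

  isInactive () {
    return this.now() - this.lastActivityAt >= this.timeoutMs
  }
}

// Step 1: detect inactivity. Step 2: attempt the restart.
// Step 3: if the restart rejects (for example because contract addresses
// cannot be fetched), report it instead of letting the rejection escape.
async function restartIfInactive ({ watchdog, restart, onError = console.error }) {
  if (!watchdog.isInactive()) return 'active'
  try {
    await restart()
    watchdog.touch()
    return 'restarted'
  } catch (err) {
    onError(err)
    return 'restart-failed'
  }
}
```

Because a failed restart leaves the watchdog untouched, the next check still sees the module as inactive and tries again.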

Miroslav Bajtoš · Answer 6 · Wed May 22 2024 14:20:38 GMT+0800 (China Standard Time)

> > I think this is the main issue here. Why did it not restart? I'm going to take a code read

> I hope this is fixed by upgrading to Station Desktop 1.7.0. Please reopen if you disagree

What do you think about improving Station Desktop to restart Station Core if it crashes?

Julian Gruber · Answer 7 · Wed May 22 2024 14:23:51 GMT+0800 (China Standard Time)

> > > Here is what I found in my activity log in the morning: Spark and Voyager exited via SIGTERM but did not start again.

> > I think this is the main issue here. Why did it not restart? I'm going to take a code read

> My understanding is that Station Core 1) detected that Voyager is inactive 2) initiated the restart loop 3) crashed because it was not able to fetch contract addresses.

You're right! Sorry I didn't take a close enough look at the logs.

> > > I think this is the main issue here. Why did it not restart? I'm going to take a code read

> > I hope this is fixed by upgrading to Station Desktop 1.7.0. Please reopen if you disagree

> What do you think about improving Station Desktop to restart Station Core if it crashes?

Also missed this one 😅 +100. I'm not actually surprised that we got this far without implementing this one. Shall we propose it for M4.5?
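As a starting point for that discussion, a minimal sketch of a restart-on-exit supervisor. `superviseProcess` and its options are hypothetical, and this is not how Station Desktop currently launches Core. The only thing it relies on is that the object returned by `start` emits an `exit` event with `(code, signal)` and has a `kill()` method, as a Node.js `ChildProcess` does.

```javascript
// Hypothetical supervisor: relaunch the child whenever it exits,
// unless the supervisor itself was asked to stop.
function superviseProcess ({
  start, // function that launches the process and returns it
  restartDelayMs = 1000,
  onRestart = () => {},
  schedule = (fn, ms) => setTimeout(fn, ms) // injectable for testing
}) {
  let stopped = false
  let child = null

  function launch () {
    child = start()
    child.once('exit', (code, signal) => {
      if (stopped) return // intentional shutdown, do not relaunch
      onRestart({ code, signal })
      schedule(launch, restartDelayMs)
    })
  }

  launch()

  return {
    stop () {
      stopped = true
      if (child) child.kill()
    }
  }
}
```

In Desktop, `start` could be something like `() => spawn(corePath, args)` using `spawn` from `node:child_process`, where `corePath` and `args` are placeholders for however Core is launched today. A fixed restart delay keeps the sketch short; a real implementation would probably want a backoff so a Core that crashes immediately on startup does not spin in a tight loop.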

Julian Gruber · Answer 8 · Wed May 22 2024 14:24:21 GMT+0800 (China Standard Time)

Added to space-meridian/roadmap#104