IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34


Investigate the recurring issue with Asynchronous execution as observed in Harvard production.

landreev opened this issue

An "investigate and diagnose" issue to keep track of the effort.
I'm opening the issue in this queue since it is oddly specific to our production instance. But depending on what we find, it may require dev work in the main repo to fix.

This apparently happens regularly: every weekend, something breaks in a way that prevents one of the nodes from executing any asynchronous tasks. The most apparent symptom is that no datasets can be published on that node - the dataset stays locked with a finalizePublication lock. The same dataset can be unlocked and published on the other node, app-2. A restart of app-1 fixes the issue.
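For the record, the manual workaround uses the native API's dataset-locks endpoints. A sketch, assuming a superuser token in `$API_TOKEN` and the dataset's database id in `$ID`:

```shell
# Inspect the stuck lock
curl -s "https://dataverse.harvard.edu/api/datasets/$ID/locks"

# Remove the finalizePublication lock (requires a superuser token)
curl -s -H "X-Dataverse-key: $API_TOKEN" -X DELETE \
  "https://dataverse.harvard.edu/api/datasets/$ID/locks?type=finalizePublication"

# Then republish; routed through app-2 it completes, per the behavior above
curl -s -H "X-Dataverse-key: $API_TOKEN" -X POST \
  "https://dataverse.harvard.edu/api/datasets/$ID/actions/:publish?type=major"
```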

There's evidence that this is not related to dataset publication or the way these locks are created in the process. Rather, it appears that the application simply stops executing everything @Async. I.e., it's not an issue with the FinalizeDatasetPublicationCommand bombing, but with the @Async-annotated method that's supposed to execute the command never running at all. This morning I confirmed the same behavior with a harvest: nothing happened when one was started from the dashboard on app-1, while the same harvest ran fine when started on app-2. The FinalizeDatasetPublicationCommand and a harvest run have very little in common aside from the @Async attribute.
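This also explains why nothing shows up as an error: an async submission "succeeds" from the caller's point of view even when the work never starts. A minimal sketch of that semantics in plain java.util.concurrent terms (not the container's actual machinery; the single-thread pool and method names are made up for illustration):

```java
import java.util.concurrent.*;

public class AsyncFireAndForget {

    // Analogue of an @Async method: hand the command to a pool and return
    // immediately. The caller gets no error even if the work never starts.
    static Future<?> finalizePublicationAsync(ExecutorService pool, Runnable command) {
        return pool.submit(command);
    }

    static String simulate() {
        // A single-thread pool stands in for the container's async machinery.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Wedge the pool's only thread, simulating whatever breaks on app-1.
        pool.submit(() -> {
            try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException e) { }
        });
        // The submission itself raises no exception, just as publish appears
        // to succeed in prod while the finalizePublication lock never clears.
        Future<?> f = finalizePublicationAsync(pool, () -> { });
        try {
            f.get(1, TimeUnit.SECONDS);
            return "command ran";
        } catch (TimeoutException e) {
            return "command never ran";
        } catch (Exception e) {
            return "error";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate());   // prints "command never ran"
    }
}
```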

There's a (less confirmed) report that some datasets moved between collections this morning didn't get reindexed (also an async operation?).

Common sense suggests that this must be caused by something that actually runs on app-1 over the weekend. It is not possible to run anything on a specific node from the outside, so it has to be something run or scheduled internally. One suspect not yet eliminated is the timer that runs exclusively on app-1, together with the bunch of harvesting jobs scheduled over the weekend. This may or may not be related to the fact that the harvests themselves appear not to be running either.
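A sketch of the suspected failure mode, again in plain java.util.concurrent terms: if a scheduled job hangs while holding the pool's only thread, every later async task starves at once, which would match all the symptoms above. The single shared pool and the hanging "weekend job" are assumptions for illustration, not something confirmed from the logs:

```java
import java.util.concurrent.*;

public class TimerWedgesPool {

    static String simulate() {
        // One shared scheduler stands in for the pool that (we assume) the
        // timer-driven jobs and other async work would both depend on.
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        CountDownLatch otherWorkRan = new CountDownLatch(1);

        // The "weekend" job hangs instead of completing...
        scheduler.schedule(() -> {
            try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException e) { }
        }, 0, TimeUnit.MILLISECONDS);

        // ...so unrelated async work queued later never gets a thread.
        scheduler.schedule(otherWorkRan::countDown, 50, TimeUnit.MILLISECONDS);

        try {
            boolean ran = otherWorkRan.await(1, TimeUnit.SECONDS);
            return ran ? "other async work ran" : "other async work starved";
        } catch (InterruptedException e) {
            return "error";
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate());   // prints "other async work starved"
    }
}
```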

Payara health monitoring is enabled on app-1 - no health warnings in the log.
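Since health monitoring shows nothing, a thread dump of the wedged instance might. A sketch, assuming jstack is on the path; the grep patterns for the async/timer thread names are guesses, as the exact names vary by Payara version:

```shell
# Capture a thread dump from the app server JVM on app-1
PID=$(pgrep -f payara | head -n1)
jstack "$PID" > /tmp/app1-threads.txt

# Look for async/timer pool threads that are all BLOCKED or WAITING
grep -A 3 -i -E 'ejb-thread-pool|async|timer' /tmp/app1-threads.txt
```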

To be precise,

> It is not possible to run anything on a specific node from the outside ...

Not purposefully; however, the first node is presumably favored by the ELB. So if there is some burst of activity - somebody's script uploads/updates/publishes a bunch of things at a specific time - it may have an objectively better chance of killing the first node rather than the second.