IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34


Investigate the recurring issue with Asynchronous execution as observed in Harvard production.

landreev opened this issue

An "investigate and diagnose" issue to keep track of the effort.
I'm opening the issue in this queue since it is oddly specific to our production instance. But depending on what we find, it may require dev work in the main repo to fix.

This apparently happens regularly: every weekend, something breaks in a way that prevents one of the nodes from executing any asynchronous tasks. The most apparent symptom is that no datasets can be published on that node - the dataset stays locked with a finalizePublication lock. The same dataset can be unlocked and published on the other node, app-2. A restart of app-1 fixes the issue.
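For the record, the manual workaround uses the native API's dataset-locks endpoints. A sketch, assuming a superuser token in `$API_TOKEN` and the dataset's database id in `$ID`:

```shell
# Inspect the stuck lock
curl -s "https://dataverse.harvard.edu/api/datasets/$ID/locks"

# Remove the finalizePublication lock (requires a superuser token)
curl -s -H "X-Dataverse-key: $API_TOKEN" -X DELETE \
  "https://dataverse.harvard.edu/api/datasets/$ID/locks?type=finalizePublication"

# Then republish; routed through app-2 it completes, per the behavior above
curl -s -H "X-Dataverse-key: $API_TOKEN" -X POST \
  "https://dataverse.harvard.edu/api/datasets/$ID/actions/:publish?type=major"
```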

There's evidence that this is not related to dataset publication or the way these locks are created in the process. Rather, it appears that the application simply stops executing everything @Async. I.e., it's not an issue with the FinalizeDatasetPublicationCommand bombing, but with the @Async-annotated method that's supposed to execute the command never running at all. This morning I confirmed the same behavior with a harvest: nothing happened when one was started from the dashboard on app-1, while the same harvest ran fine when started on app-2. The FinalizeDatasetPublicationCommand and a harvest run have very little in common aside from the @Async attribute.
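This also explains why nothing shows up as an error: an async submission "succeeds" from the caller's point of view even when the work never starts. A minimal sketch of that semantics in plain java.util.concurrent terms (not the container's actual machinery; the single-thread pool and method names are made up for illustration):

```java
import java.util.concurrent.*;

public class AsyncFireAndForget {

    // Analogue of an @Async method: hand the command to a pool and return
    // immediately. The caller gets no error even if the work never starts.
    static Future<?> finalizePublicationAsync(ExecutorService pool, Runnable command) {
        return pool.submit(command);
    }

    static String simulate() {
        // A single-thread pool stands in for the container's async machinery.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Wedge the pool's only thread, simulating whatever breaks on app-1.
        pool.submit(() -> {
            try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException e) { }
        });
        // The submission itself raises no exception, just as publish appears
        // to succeed in prod while the finalizePublication lock never clears.
        Future<?> f = finalizePublicationAsync(pool, () -> { });
        try {
            f.get(1, TimeUnit.SECONDS);
            return "command ran";
        } catch (TimeoutException e) {
            return "command never ran";
        } catch (Exception e) {
            return "error";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate());   // prints "command never ran"
    }
}
```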

There's a (less confirmed) report that some datasets moved between collections this morning didn't get reindexed (also an async operation?).

Common sense suggests that this must be caused by something that actually runs on app-1 over the weekend. It is not possible to run anything on a specific node from the outside, so it has to be something run or scheduled internally. One suspect not yet eliminated is the timer that runs exclusively on app-1, together with the bunch of harvesting jobs scheduled over the weekend. This may or may not be related to the fact that the harvests themselves appear not to be running either.
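A sketch of the suspected failure mode, again in plain java.util.concurrent terms: if a scheduled job hangs while holding the pool's only thread, every later async task starves at once, which would match all the symptoms above. The single shared pool and the hanging "weekend job" are assumptions for illustration, not something confirmed from the logs:

```java
import java.util.concurrent.*;

public class TimerWedgesPool {

    static String simulate() {
        // One shared scheduler stands in for the pool that (we assume) the
        // timer-driven jobs and other async work would both depend on.
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        CountDownLatch otherWorkRan = new CountDownLatch(1);

        // The "weekend" job hangs instead of completing...
        scheduler.schedule(() -> {
            try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException e) { }
        }, 0, TimeUnit.MILLISECONDS);

        // ...so unrelated async work queued later never gets a thread.
        scheduler.schedule(otherWorkRan::countDown, 50, TimeUnit.MILLISECONDS);

        try {
            boolean ran = otherWorkRan.await(1, TimeUnit.SECONDS);
            return ran ? "other async work ran" : "other async work starved";
        } catch (InterruptedException e) {
            return "error";
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate());   // prints "other async work starved"
    }
}
```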

Payara health monitoring is enabled on app-1 - no health warnings in the log.
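Since health monitoring shows nothing, a thread dump of the wedged instance might. A sketch, assuming jstack is on the path; the grep patterns for the async/timer thread names are guesses, as the exact names vary by Payara version:

```shell
# Capture a thread dump from the app server JVM on app-1
PID=$(pgrep -f payara | head -n1)
jstack "$PID" > /tmp/app1-threads.txt

# Look for async/timer pool threads that are all BLOCKED or WAITING
grep -A 3 -i -E 'ejb-thread-pool|async|timer' /tmp/app1-threads.txt
```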

To be precise,

> It is not possible to run anything on a specific node from the outside ...

Not purposefully; however, the first node is presumably favored by the ELB. So if there is some burst of activity - somebody's script uploads/updates/publishes a bunch of things at a specific time - it may have an objectively better chance of killing the first node rather than the second.