illinois-cs241 / broadway-api

This is the old repo for the Broadway API. Please see the new repo for the newest version of Broadway: https://github.com/illinois-cs241/broadway

Worker Node Display erroneous after restart of a worker node

ayushr2 opened this issue

Currently, when a new worker node joins, we create a new entry for the node in the DB as:

{
     _id: abcd
     hostname: host1
     alive: true, ....
}

Then if that worker dies and is restarted, the DB now looks like:

{
     _id: abcd
     hostname: host1
     alive: false, ....
},
{
     _id: efgh
     hostname: host1
     alive: true, ....
}

So when a course checks the system health using the /api/v1/worker/<course>/all endpoint, it gets back the DB contents, which have two entries for the same host.
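To make the symptom concrete, here is a small sketch that counts entries per hostname in the example documents above. The helper name hostnames_with_multiple_entries is made up for illustration and is not part of Broadway; the documents only carry the fields shown in this issue.

```python
from collections import Counter

# Worker documents as in the example above (other fields omitted).
workers = [
    {"_id": "abcd", "hostname": "host1", "alive": False},
    {"_id": "efgh", "hostname": "host1", "alive": True},
]


def hostnames_with_multiple_entries(docs):
    """Return {hostname: count} for every hostname that appears more than once."""
    counts = Counter(doc["hostname"] for doc in docs)
    return {host: n for host, n in counts.items() if n > 1}


print(hostnames_with_multiple_entries(workers))  # {'host1': 2}
```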

We want to keep the dead workers in the DB because grading jobs have a worker_id field containing the ID of the worker node that executed the job. This is for debugging purposes, so we can tell whether a worker node is malfunctioning (for example, if all grading jobs from that worker node have unexpected results).

One possible way to solve this is to use the hostname as the _id of the worker node. This would enforce that a hostname can only have one worker node (which is usually preferred, since we want the workers to have as much CPU access as possible).

After some discussion the preferred solution would be to do the following:

  • Currently, we assign the worker ID when a node registers. Instead, we will let the node decide what its ID is. If a dead node registers again with the same ID, we mark it as alive.
  • At its core, the issue was that we did not allow re-registration. This solution makes re-registration possible.

Downsides:

  • There is a potential 20-second delay between a node dying and the API declaring that node dead. During those 20 seconds, the dead node will not be able to re-register, because its ID is still considered alive.
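A rough sketch of that window, assuming the API declares a node dead once it has not heard from it for a fixed timeout. The heartbeat mechanism, the constant, and both function names are assumptions for illustration only; the issue states just the 20-second figure.

```python
# Assumed to match the 20-second delay mentioned above; not taken from Broadway's code.
HEARTBEAT_TIMEOUT_SECONDS = 20


def is_considered_alive(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_SECONDS):
    """The API still treats the node as alive until `timeout` seconds of silence."""
    return (now - last_heartbeat) < timeout


def can_reregister(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_SECONDS):
    """A restarted node with the same ID is only accepted once the old one is declared dead."""
    return not is_considered_alive(last_heartbeat, now, timeout)


# Node last seen at t=100s, restarted and retrying at t=105s and again at t=120s.
print(can_reregister(100.0, 105.0))  # False
print(can_reregister(100.0, 120.0))  # True
```

In practice this suggests a restarted worker should retry registration for at least that long before giving up.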

Implementation:

  • Change the Worker Registration endpoint:
    Endpoint - /api/v1/worker/[worker_id]
    Method - POST
    Body:
{
     "hostname": <hostname of machine>
}
  • If the ID exists and the node is alive, abort the registration request with an error saying the ID is a duplicate.
  • If the ID exists and the node is dead, treat the request as a re-registration and mark the node as alive again.
  • If the ID does not exist, treat the request as a new registration and create a new Mongo document.

Since we are letting the user specify the ID, we should set the _id field explicitly during document creation so that Mongo does not generate a default ID for that document.
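The three cases above could look roughly like this. It is a minimal sketch that uses a plain dict keyed by worker ID as a stand-in for the Mongo collection; register_worker and RegisterResult are hypothetical names, not existing Broadway code.

```python
from enum import Enum


class RegisterResult(Enum):
    NEW = "new"
    REREGISTERED = "re-registered"
    DUPLICATE = "duplicate"


def register_worker(workers, worker_id, hostname):
    """Apply the registration rules to `workers`, a dict standing in for the collection.

    Each value mimics a worker document with `_id`, `hostname`, and `alive` fields.
    """
    existing = workers.get(worker_id)
    if existing is None:
        # Unknown ID: new registration. The caller-supplied ID is stored as _id,
        # so no separate auto-generated ID is created for this document.
        workers[worker_id] = {"_id": worker_id, "hostname": hostname, "alive": True}
        return RegisterResult.NEW
    if existing["alive"]:
        # Known ID that is still considered alive: reject as a duplicate.
        return RegisterResult.DUPLICATE
    # Known ID that was marked dead: re-registration, reuse the same document.
    existing["alive"] = True
    return RegisterResult.REREGISTERED


workers = {}
print(register_worker(workers, "abcd", "host1"))  # RegisterResult.NEW
print(register_worker(workers, "abcd", "host1"))  # RegisterResult.DUPLICATE
workers["abcd"]["alive"] = False                  # the API declares the node dead
print(register_worker(workers, "abcd", "host1"))  # RegisterResult.REREGISTERED
```

Because the dead node's document is reused rather than replaced, existing grading jobs that reference its worker_id keep pointing at a valid entry, and the health endpoint sees one entry per worker ID.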

illinois-cs241/broadway-grader#24 needs to be resolved for this change to be deployed.