the simulator can't recover from certain errors

Question

bjuergens opened this issue 4 years ago

It is possible to end up in a situation where a simulation cannot be restarted without first restarting the entire process. This is a problem for NRP, because some threads are reused between experiments. One real-world example of such a situation is described in #358.

Minimum Viable Example (+ Workaround)

import pyNN.spiNNaker as p
import os
import tempfile
import shutil

try:
    initial_work_dir = tempfile.mkdtemp()
    os.chdir(initial_work_dir)
    print(os.getcwd())
    p.setup()
    print("provoke an unrecoverable error")
    os.chdir(tempfile.mkdtemp())
    shutil.rmtree(initial_work_dir)
    p.end()
except Exception as e:
    print(e)

print(os.getcwd())

for i in range(5):
    print("it's not possible to recover, no matter how often we try")
    try:
        p.setup()
        p.end()
    except Exception as e:
        print(e)

try:
    print("explicitly resetting doesn't seem to help either")
    p.reset()
    p.setup()
    p.end()
except Exception as e:
    print(e)

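print("workaround...")
from spinn_front_end_common.utilities import globals_variables
from spinn_front_end_common.interface.simulator_state import Simulator_State
# If the cached simulator is stuck in the SHUTDOWN state, drop it so that
# the next p.setup() starts from scratch. Note that _state and _simulator
# are private attributes and may change between releases.
if globals_variables.get_simulator()._state == Simulator_State.SHUTDOWN:
    globals_variables._simulator = None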

p.setup()
p.end()
print("...done")

Björn Jürgens · Answer 1 · Thu Mar 05 2020 16:01:18 GMT+0800 (China Standard Time)

I'd like to know whether this behavior is considered a bug, or whether we should adjust NRP accordingly.

Andrew Rowley · Answer 2 · Thu Mar 05 2020 16:26:06 GMT+0800 (China Standard Time)

I would say that yes, this is a bug. Is the issue that an exception is raised so that the p.end() within the try...except block is never reached, or that an exception is raised during p.end() itself, so that the state isn't correctly reset?

Note that p.reset() is used within a p.setup() ... p.end() "block" to reset the simulation to time t = 0, so this is not related.

If I am right about the exception during a simulation or during p.end(), I think the solution will depend on what sort of machine configuration you are using. The cases are:

Using a real machine. In this case, if there is an exception, calling p.setup() again is probably not going to work, as the machine is likely in an inconsistent state. If the BMP connection to the machine is provided, however, the board could be reset in the case of an error. If not, the best that can be done is to ask the user to reset the board.
Using a virtual machine. In this case calling p.setup() again should be no issue.
Using an allocated (spalloc or HBP) machine. In this case calling p.setup() again could work since a new machine can be allocated and the old one powered off.
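
The three cases above can be written down as a small decision table. This is only an illustrative sketch: MachineKind and recovery_action are hypothetical names, not part of sPyNNaker, and the returned strings simply restate the suggestions in this answer.

```python
from enum import Enum


class MachineKind(Enum):
    """Hypothetical labels for the three configurations discussed above."""
    REAL = "real"
    VIRTUAL = "virtual"
    ALLOCATED = "allocated"  # spalloc or HBP


def recovery_action(kind: MachineKind, has_bmp: bool = False) -> str:
    """Return the suggested recovery step after a failed simulation.

    has_bmp only matters for a real machine: it says whether a BMP
    connection to the board was provided.
    """
    if kind is MachineKind.VIRTUAL:
        return "call setup again"
    if kind is MachineKind.ALLOCATED:
        return "allocate a new machine, power off the old one, call setup again"
    # Real machine: the board is likely in an inconsistent state.
    if has_bmp:
        return "reset the board via BMP, then call setup again"
    return "ask the user to reset the board"
```

Only the real machine without a BMP connection needs manual intervention; in the other cases a retry could in principle be automated.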

Christian Y. Brenninkmeijer · Answer 3 · Thu Mar 05 2020 17:00:08 GMT+0800 (China Standard Time)

Actually, this is not a bug but a design choice, one that worked well to protect users when they were mainly using single boards, especially the 4-chip boards with no BMP.

Having discussed this with others, we conclude it may be time to change this, but that requires a larger cleanup of an extra layer we had to put in to support both PyNN 0.7 and PyNN 0.9 from the same PyNN script.

In the short term, a hackaround that is similar to your "workaround" but adds one more critical cleanup step is to call:
globals_variables.unset_simulator()
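
For a long-running process that reuses threads between experiments, the call could be wrapped in a small retry helper along these lines. This is a sketch under assumptions, not the library's recommended pattern: setup_with_recovery is a hypothetical helper, it assumes that the failed p.setup() raises an exception (as the loop in the example above suggests), and it takes the setup and cleanup functions as arguments so that it does not depend on sPyNNaker being importable.

```python
from typing import Callable, Optional


def setup_with_recovery(setup: Callable[[], object],
                        unset_simulator: Callable[[], None],
                        max_attempts: int = 2) -> object:
    """Call setup(); if it raises, clear the stale simulator and retry.

    Hypothetical helper. With sPyNNaker the arguments would be p.setup and
    globals_variables.unset_simulator (not executed here):

        import pyNN.spiNNaker as p
        from spinn_front_end_common.utilities import globals_variables
        setup_with_recovery(p.setup, globals_variables.unset_simulator)
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return setup()
        except Exception as error:
            # Remember the failure and drop the cached simulator so the
            # next attempt starts from a clean state.
            last_error = error
            unset_simulator()
    # Every attempt failed: re-raise the most recent error.
    raise last_error
```

Passing the functions in keeps the helper testable with stand-ins, and the cleanup also runs after the final failed attempt, so a later experiment does not inherit the stale simulator.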

Björn Jürgens · Answer 4 · Fri Mar 06 2020 22:13:05 GMT+0800 (China Standard Time)

Thanks, I will use globals_variables.unset_simulator().