Frequent SpinnMachineException about one way links

Question

gblomqvist opened this issue 4 years ago · comments

I'm using a couple of interconnected SpiNN-5 boards and frequently get the following exception when running a simulation:

spinn_machine.exceptions.SpinnMachineException: Your machine has One way links at (0, 0, 5) on board 192.168.2.33 which will cause algorithms to fail. Please report this to spinnakerusers@googlegroups.com

The core and IP referred to in the message vary. So far I've let the exception crash the program and then started it up again. Sometimes I get the exception multiple times in a row; sometimes I can run multiple simulations without any issue (each starting with a call to setup) before the exception is raised. Between runs I've modified the number of connections in the network and sometimes also the maximum number of neurons per core. I've only been running with 1000 neurons.

A traceback from the call to run:

    sim.run(**run_kwargs)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spynnaker8/__init__.py", line 667, in run
    return __pynn["run"](simtime, callbacks=callbacks)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/pyNN/common/control.py", line 111, in run
    return run_until(simulator.state.t + simtime, callbacks)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/pyNN/common/control.py", line 93, in run_until
    simulator.state.run_until(time_point)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spynnaker8/spinnaker.py", line 130, in run_until
    self._run_wait(tstop - self.t)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spynnaker8/spinnaker.py", line 173, in _run_wait
    super(SpiNNaker, self).run(duration_ms)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spynnaker/pyNN/abstract_spinnaker_common.py", line 334, in run
    super(AbstractSpiNNakerCommon, self).run(run_time)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinn_front_end_common/interface/abstract_spinnaker_base.py", line 748, in run
    self._run(run_time)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinn_front_end_common/interface/abstract_spinnaker_base.py", line 868, in _run
    self._get_machine(total_run_time, n_machine_time_steps)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinn_front_end_common/interface/abstract_spinnaker_base.py", line 1120, in _get_machine
    self._machine_by_hostname(n_machine_time_steps, total_run_time)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinn_front_end_common/interface/abstract_spinnaker_base.py", line 1169, in _machine_by_hostname
    inputs, algorithms, outputs, [], [], "machine_generation")
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinn_front_end_common/interface/abstract_spinnaker_base.py", line 1111, in _run_algorithms
    reraise(*exc_info)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinn_front_end_common/interface/abstract_spinnaker_base.py", line 1098, in _run_algorithms
    executor.execute_mapping()
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/pacman/executor/pacman_algorithm_executor.py", line 637, in execute_mapping
    self._execute_mapping()
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/pacman/executor/pacman_algorithm_executor.py", line 653, in _execute_mapping
    results = algorithm.call(self._internal_type_mapping)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/pacman/executor/algorithm_classes/abstract_python_algorithm.py", line 60, in call
    results = self.call_python(method_inputs)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/pacman/executor/algorithm_classes/python_class_algorithm.py", line 71, in call_python
    return method(**inputs)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinn_front_end_common/interface/interface_functions/machine_generator.py", line 104, in __call__
    return txrx.get_machine_details(), txrx
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinnman/transceiver.py", line 831, in get_machine_details
    self._update_machine()
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinnman/transceiver.py", line 679, in _update_machine
    self._repair_machine, self._ignore_bad_ethernets)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinnman/processes/get_machine_process.py", line 207, in get_machine_details
    machine, repair_machine, ignore_bad_ethernets)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinnman/processes/get_machine_process.py", line 233, in _fill_machine
    return machine_repair(machine, repair_machine)
  File "/home/gblomqv/.venv/spynnaker/lib/python3.6/site-packages/spinn_machine/machine_factory.py", line 207, in machine_repair
    raise SpinnMachineException(msg)
spinn_machine.exceptions.SpinnMachineException: Your machine has One way links at (4, 7, 2) on board 192.168.2.1 which will cause algorithms to fail. Please report this to spinnakerusers@googlegroups.com

What does this mean? And is Google Groups preferred over GitHub for reporting this issue?

Alan Stokes · Answer 1 · Mon Jun 08 2020 17:13:20 GMT+0800 (China Standard Time)

That error message SOUNDS like your machine has a dodgy link from chip 4,7 to 5,7. As this is a link between chips on the same board, I'd be more confident of this if you got the same errors using that board in a single-board setup.

If this is the case, you're best talking to @lplana about how to blacklist that link on that board. But you can also tell the tools to blacklist it (assuming you've got the same set of boards all the time) by adding down_links = [5,7,5] to the Machine section of your .cfg file.

If this is not the case, and a single-board setup does not flag this issue, it implies something more odd. Maybe your multi-board setup is not connected correctly, so that the boot messages fly around incorrectly and mess with the overall boot? @lplana would again be best suited to diagnose this. Maybe your SATA cables are a bit buggered?

There is a third option. You could set repair_machine=True in the Machine section of your .cfg file. This will make the tools remove the unidirectional link and behave as if the link between these 2 chips didn't exist. But the fact that the hardware is acting up means @lplana will likely want to look into it.

Basically, once the machine has booted, the tools check the setup, and here they found links that only work in one direction. As our routing assumes bidirectional links (i.e. if you can talk to 5,7 from 4,7, then 5,7 can talk to 4,7 as well), any unidirectional links are a major problem. Why they change from unidirectional to bidirectional from time to time is a very interesting question.
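
To make the idea of a one-way link concrete, here is a minimal sketch in plain Python. It is not the check the tools actually run: the function name `find_one_way_links` and the representation of links as (source chip, destination chip) pairs are assumptions made for illustration only.

```python
from typing import Iterable, List, Set, Tuple

Chip = Tuple[int, int]
DirectedLink = Tuple[Chip, Chip]  # (source chip, destination chip)


def find_one_way_links(links: Iterable[DirectedLink]) -> List[DirectedLink]:
    """Return every link whose reverse direction is missing (hypothetical helper)."""
    link_set: Set[DirectedLink] = set(links)
    return sorted(
        (src, dst) for (src, dst) in link_set if (dst, src) not in link_set
    )


# Toy data: (4, 7) <-> (5, 7) works both ways, (4, 7) -> (4, 8) works one way only.
working_links = [
    ((4, 7), (5, 7)),
    ((5, 7), (4, 7)),
    ((4, 7), (4, 8)),
]
one_way = find_one_way_links(working_links)
print(one_way)
```

A routing algorithm that assumes every link has a working return path would be wrong about exactly the links this returns.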

On Google Groups vs GitHub: six of one, half a dozen of the other.

Luis A. Plana · Answer 2 · Mon Jun 08 2020 17:42:24 GMT+0800 (China Standard Time)

@gblomqvist: From version 3.2.5 onwards, spinnaker_tools detects and avoids unidirectional links. Could you please check and let us know which version of spinnaker_tools you are using?

The simplest way is to go back to the top of the console output and look for the following lines:

2020-06-08 09:23:09 INFO: Attempting to boot machine
2020-06-08 09:23:15 INFO: Found board with version [Version: SC&MP 3.3.0 at SpiNNaker:0:0:0 (built Thu May 7 10:33:27 2020)]

The version of SC&MP should be 3.2.5 or higher.
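
If the console output has been saved to a file, the same check can be scripted. The following is a rough sketch using only the standard library; the helper names `scamp_version` and `is_new_enough` are hypothetical, and the regular expression assumes the log line has the format shown above.

```python
import re
from typing import Optional, Tuple

MIN_SCAMP_VERSION = (3, 2, 5)  # minimum version quoted in this thread
_VERSION_RE = re.compile(r"SC&MP (\d+)\.(\d+)\.(\d+)")


def scamp_version(log_line: str) -> Optional[Tuple[int, ...]]:
    """Extract the SC&MP version from a log line, or None if it is absent."""
    match = _VERSION_RE.search(log_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_new_enough(log_line: str) -> bool:
    """True if the line reports SC&MP 3.2.5 or higher."""
    version = scamp_version(log_line)
    return version is not None and version >= MIN_SCAMP_VERSION


line = (
    "2020-06-08 09:23:15 INFO: Found board with version "
    "[Version: SC&MP 3.3.0 at SpiNNaker:0:0:0 (built Thu May 7 10:33:27 2020)]"
)
print(scamp_version(line), is_new_enough(line))
```

Comparing the version as a tuple of integers avoids the string-comparison trap where "3.10.0" would sort before "3.2.5".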

If this is not the case, someone in the software team can help you upgrade.

If it is the case, then we would need to take further action. Do you have physical access to the boards, as in would you be able to read the serial number of each board? This is a small white label.
I might have that information but I would need to know where you are located.

Christian Y. Brenninkmeijer · Answer 3 · Mon Jun 08 2020 21:20:23 GMT+0800 (China Standard Time)

Reporting was suggested via the google group as that was considered easier for less experienced users.

Christian Y. Brenninkmeijer · Answer 4 · Mon Jun 08 2020 21:21:27 GMT+0800 (China Standard Time)

We used to have algorithms that depended on all links being bidirectional. To the best of my knowledge we no longer have these, but we still do not recommend using unidirectional links, so disabling the link is still recommended.

Christian Y. Brenninkmeijer · Answer 5 · Mon Jun 08 2020 21:23:59 GMT+0800 (China Standard Time)

The best way is to blacklist the link or get SCAMP to do it for you, as suggested above.

For users who do not control boards there is a cfg setting:
[Machine]
repair_machine = True

But we do NOT recommend using that, as it hides/works around the problem at the wrong level.

Andrew Rowley · Answer 6 · Mon Jun 08 2020 21:41:03 GMT+0800 (China Standard Time)

@gblomqvist, is this a 2-board system? If so can you tell us how the cables are connected?

Luis A. Plana · Answer 7 · Tue Jun 09 2020 02:08:48 GMT+0800 (China Standard Time)

Luis A. Plana commented 4 years ago

Luis A. Plana · Answer 8 · Tue Jun 09 2020 02:41:13 GMT+0800 (China Standard Time)

@gblomqvist 's original post mentions 2 different unidirectional links (there may be more):

first trace: (0, 0, 5) -- chip (0, 0) link 5
second trace: (4, 7, 2) -- chip (4, 7) link 2

These are both connections to neighbouring boards across spiNNlinks (please check figure above).
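
As a reading aid for the (x, y, link) triples, here is a small sketch that turns a link number into a compass direction. It assumes the conventional SpiNNaker link numbering (0 = east, then counting anticlockwise), which is not spelled out in this thread but agrees with the reading above; `LINK_DIRECTIONS` and `decode_link` are names made up for this example.

```python
from typing import Tuple

# Assumed link numbering: 0 = east, then anticlockwise around the chip.
LINK_DIRECTIONS = {
    0: ("east", (1, 0)),
    1: ("north-east", (1, 1)),
    2: ("north", (0, 1)),
    3: ("west", (-1, 0)),
    4: ("south-west", (-1, -1)),
    5: ("south", (0, -1)),
}


def decode_link(link: int) -> Tuple[str, Tuple[int, int], int]:
    """Return (direction name, (dx, dy) offset, link id of the return path)."""
    name, offset = LINK_DIRECTIONS[link]
    reverse_link = (link + 3) % 6  # the opposite direction on the neighbouring chip
    return name, offset, reverse_link


for x, y, link in [(0, 0, 5), (4, 7, 2)]:
    name, (dx, dy), reverse_link = decode_link(link)
    print(f"chip ({x}, {y}) link {link}: heads {name}, return path is link {reverse_link}")
```

Under that assumption, link 5 heads south and link 2 heads north, so both reported links leave the edge of a board rather than staying inside it.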

These cases cannot be completely solved by blacklisting because the blacklists are local to each board. They do not cross board boundaries.

They can be avoided using the down_links configuration file option, but only if the machine is not dynamically allocated, because chip coordinates change in dynamically allocated machines. The repair_machine configuration file option would work in this case.

It seems to me that the best way to solve this issue is to allow SCAMP to detect the unidirectional links at boot time and make sure that they are not used.

Alan Stokes · Answer 9 · Tue Jun 09 2020 16:14:41 GMT+0800 (China Standard Time)

Just to add to Luis's messages: I have been looking into how we report this error, to make it easier to interpret (given that both I and the rest of the software team had to discuss what the right next core was from the link data, and got it wrong on the first attempt). Anyhow, whilst cleaning up the message to explicitly say which chips, in which direction and on which board the issue is, I found another issue.

The error stops at the first unidirectional link it sees. Now in theory it's meant to be deterministic, so I'd like to say you should have seen the same link over and over again. But Python has bitten us before, and even though I can't see why it wouldn't be deterministic, I'm never 100% confident it would be. So when you say you see different links, it might be that the other links were still there and detected, but not reported.

Anyhow, once I've finished my wee fix, it will at least tell you ALL the unidirectional links, not just the first, and the error message should be much more useful in informing us of the actual problem.

Christian Y. Brenninkmeijer · Answer 10 · Tue Jun 09 2020 18:33:38 GMT+0800 (China Standard Time)

Unless we hear from https://github.com/gblomqvist, I will assume in SpiNNakerManchester/SpiNNMachine#134 that his two boards are only connected by a single FPGA cable and so have no wrap-around.

gblomqvist · Answer 11 · Wed Jun 10 2020 02:23:46 GMT+0800 (China Standard Time)

Sorry, I've been away the last couple of days. I'll try to answer your questions. I should note that I don't have physical access to the system, at least not for the time being.

Christian Y. Brenninkmeijer · Answer 12 · Wed Jun 10 2020 17:30:22 GMT+0800 (China Standard Time)

Please run the following to get the size of the machine that the two boards create.

import spynnaker8 as p
p.setup()
machine = p.get_machine()
print(machine.width, machine.height)
p.end()

or if that fails just add the lines
machine = p.get_machine()
print(machine.width, machine.height)
just before the end in a script that works for you

Alan Stokes · Answer 13 · Wed Jun 10 2020 22:30:20 GMT+0800 (China Standard Time)

And I presume that those 2 multi-board machines each contain more than 2 SpiNNaker boards? Are they behind a spalloc system, or allocated manually?

gblomqvist · Answer 14 · Thu Jun 11 2020 09:33:11 GMT+0800 (China Standard Time)

I've been unable to communicate with the system since my last post; I don't know if it's been unplugged or something else. For the time being I can provide some of the information asked for.

It's a six-board system.
I logged all exceptions from the simulations I ran earlier and so I could extract the different links complained about. The following links were extracted from 177 exceptions of this kind:
- (0, 0, 5) on board 192.168.2.33
- (1, 0, 4) on board 192.168.2.33
- (1, 0, 5) on board 192.168.2.33
- (2, 0, 4) on board 192.168.2.33
- (2, 0, 5) on board 192.168.2.33
- (3, 0, 4) on board 192.168.2.33
- (3, 0, 5) on board 192.168.2.33
- (4, 0, 4) on board 192.168.2.33
- (4, 7, 1) on board 192.168.2.1
- (4, 7, 2) on board 192.168.2.1
- (5, 7, 1) on board 192.168.2.1
- (5, 7, 2) on board 192.168.2.1
- (6, 7, 1) on board 192.168.2.1
- (6, 7, 2) on board 192.168.2.1
- (7, 7, 1) on board 192.168.2.1
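
A list like the one above can be pulled out of logged exception text with a few lines of Python. This is only a sketch: the function name `unique_one_way_links` is made up, and the regular expression assumes the wording of the message quoted at the top of this thread (the message was due to be reworded, so it may not match newer versions of the tools).

```python
import re
from typing import Iterable, List, Tuple

# Matches the message wording quoted in this thread; an assumption for other versions.
_ONE_WAY_RE = re.compile(
    r"One way links at \((\d+), (\d+), (\d+)\) on board (\d+\.\d+\.\d+\.\d+)"
)


def unique_one_way_links(messages: Iterable[str]) -> List[Tuple[str, int, int, int]]:
    """Collect the distinct (board, x, y, link) entries from exception messages."""
    found = set()
    for message in messages:
        match = _ONE_WAY_RE.search(message)
        if match:
            x, y, link, board = match.groups()
            found.add((board, int(x), int(y), int(link)))
    return sorted(found)


logged = [
    "Your machine has One way links at (0, 0, 5) on board 192.168.2.33 which will cause algorithms to fail.",
    "Your machine has One way links at (4, 7, 2) on board 192.168.2.1 which will cause algorithms to fail.",
    "Your machine has One way links at (0, 0, 5) on board 192.168.2.33 which will cause algorithms to fail.",
]
links = unique_one_way_links(logged)
print(links)
```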

@alan-stokes I'm not sure I understand your first question, but I'm guessing the answer is: it's one machine containing six boards. Regarding spalloc, my spynnaker.cfg says spalloc_server = None, and so I assume no spalloc system is in use.

Andrew Rowley · Answer 15 · Thu Jun 11 2020 14:42:13 GMT+0800 (China Standard Time)

Thanks for the information. This sounds like the cables between these two boards might not be connected completely, or else the cables themselves might be faulty. 192.168.2.1 is likely to be the first board in the frame (leftmost) and 192.168.2.33 is then the 5th board from the left (if I have the maths right). It would be worth checking the cable(s) between those two boards to make sure they are plugged in correctly.

As you have stated that you don't have access to the boards, your best option for now is to go with the suggestion of @Christian-B above and add the following to your .spynnaker.cfg file:

[Machine]
repair_machine = True

Once the machine has been checked, you will hopefully then be able to remove this option.
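
To remember to remove the workaround later, it can help to check what a cfg file currently sets. The sketch below parses cfg text with the standard library's configparser; it only reads the text it is given and does not reproduce how the tools locate or merge their configuration files. The helper name `machine_link_options` and the fallback values are assumptions for this example.

```python
import configparser
from typing import Dict, Optional, Union


def machine_link_options(cfg_text: str) -> Dict[str, Union[bool, Optional[str]]]:
    """Report the link-related options found in the [Machine] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        # Fallbacks apply when the section or option is missing from this text.
        "repair_machine": parser.getboolean("Machine", "repair_machine", fallback=False),
        "down_links": parser.get("Machine", "down_links", fallback=None),
    }


cfg = """
[Machine]
repair_machine = True
"""
options = machine_link_options(cfg)
print(options)
```

To check a real file, the text could be read with something like `pathlib.Path.home().joinpath(".spynnaker.cfg").read_text()`, assuming the file lives in the home directory.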

gblomqvist · Answer 16 · Sun Jun 14 2020 10:31:25 GMT+0800 (China Standard Time)

Okay. I've been told that the cable from the leftmost board to the fifth board from the left has been replaced. I've been able to run tests again and, unfortunately, the issue persists. I suppose the replacement cable could be bad (I don't know if that has been checked), or maybe 192.168.2.33 isn't the fifth board from the left in this setup? At least repair_machine = True seems to work, thanks.

@lplana Version 3.2.5 is used ("INFO: Found board with version [Version: SC&MP 3.2.5 at SpiNNaker:0:0:0 (built Thu Aug 1 10:15:06 2019)]"). I can probably get hold of the serial numbers, if they are still wanted.

@Christian-B Not sure if you still want to know the output of the small program you posted, but just in case, it's "24 12".

Luis A. Plana · Answer 17 · Sun Jun 14 2020 17:42:02 GMT+0800 (China Standard Time)

@gblomqvist Thank you for the version information. This tells us that these are dynamic faults, not picked up at boot time. The most likely source of dynamic faults is the SATA cable connections. This is confirmed by the fact that all reported faults are on board boundaries.

The cables have proven very reliable in our machines, but sometimes they are not properly seated inside the connectors, causing unreliable transmission. The solution is simply to unplug the cable and plug it back in again, making sure that it goes in all the way. If the cables have latches, there should be a distinctive 'click' sound when the cable is inserted correctly. It's very important to reseat both ends of the cable.

Your list of faults above relates to two different cables. If the machine is structured in the usual way, these would be:

NORTH on board 0 (192.168.2.1)
SOUTH on board 4 (192.168.2.33)

Boards are labelled 0 to n-1, starting from the right end. There are six SATA connectors on each board. NORTH is the fifth from the top and SOUTH is the sixth (last/bottom).
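
The board numbers above can be worked out from the IP addresses. The sketch below assumes that board addresses go up in steps of 8 from the first board, which is inferred only from the two pairs given in this thread (board 0 at 192.168.2.1, board 4 at 192.168.2.33); other installations may be addressed differently. `board_index_from_ip` is a hypothetical helper.

```python
import ipaddress


def board_index_from_ip(ip: str, first_ip: str = "192.168.2.1", step: int = 8) -> int:
    """Map a board IP address to its board number, assuming a fixed address step."""
    offset = int(ipaddress.IPv4Address(ip)) - int(ipaddress.IPv4Address(first_ip))
    if offset < 0 or offset % step != 0:
        raise ValueError(f"{ip} does not fit the assumed addressing scheme")
    return offset // step


for address in ("192.168.2.1", "192.168.2.33"):
    print(address, "-> board", board_index_from_ip(address))
```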

If the machine is structured in the usual way, the other end of both cables would be on board 2 (SOUTH and NORTH, respectively).

Andrew Rowley · Answer 18 · Mon Jun 15 2020 15:52:39 GMT+0800 (China Standard Time)

Ah sorry, I might have given bad advice suggesting to count boards from the left end...

Christian Y. Brenninkmeijer · Answer 19 · Mon Jun 15 2020 16:44:59 GMT+0800 (China Standard Time)

The size "24 12" confirms that this is a six-board machine configured to use full wrap-around.
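
The arithmetic behind that conclusion can be sketched as follows. It assumes 48 chips per SpiNN-5 board and that a fully wrapped machine tiles its width x height rectangle completely, with both dimensions a multiple of 12; `boards_in_wrapped_machine` is a made-up name for this example.

```python
CHIPS_PER_SPINN5_BOARD = 48  # assumption: every board is a 48-chip SpiNN-5


def boards_in_wrapped_machine(width: int, height: int) -> int:
    """Number of boards in a machine whose chips fill the whole rectangle."""
    if width % 12 != 0 or height % 12 != 0:
        raise ValueError("dimensions are not multiples of 12, so probably not fully wrapped")
    return (width * height) // CHIPS_PER_SPINN5_BOARD


n_boards = boards_in_wrapped_machine(24, 12)
print(n_boards)
```

With 24 x 12 = 288 chips, this gives 288 / 48 = 6 boards, matching the six-board system described earlier.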

gblomqvist · Answer 20 · Wed Jun 17 2020 11:01:10 GMT+0800 (China Standard Time)

@rowleya No worries. ;)

@lplana Checking those cables resolved the issue, thanks!

Thank you all for the help.