OPERA-SDS-VnV-11: Verify SDS can recover from disruptions
LucaCinquini opened this issue
Split this into multiple tickets, one per type of disruption simulated.
Case 1: Disrupt the system by killing Mozart
- Start system
- Run 100 jobs to completion
- Create a backup
- Start another 100 jobs
- In the middle of the run, take down the Mozart machine
- Restore the state of the system from the backup
- Re-enable the timers and verify that they automatically re-run the same jobs
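The final check in Case 1 could be scripted roughly as below. This is only a sketch: the job IDs, and the idea that the submitted and completed IDs can be exported as plain lists, are assumptions for illustration, and `find_unrecovered_jobs` is a hypothetical helper rather than part of the SDS.

```python
from typing import Iterable, Set


def find_unrecovered_jobs(submitted: Iterable[str], completed: Iterable[str]) -> Set[str]:
    """Return IDs of jobs submitted before the disruption that never
    completed after the restore (hypothetical helper)."""
    return set(submitted) - set(completed)


# Hypothetical IDs for the second batch of 100 jobs.
submitted_ids = [f"job-{i:03d}" for i in range(100)]
# Pretend that all but two jobs were re-run to completion after the restore.
completed_ids = [j for j in submitted_ids if j not in ("job-041", "job-077")]

missing = find_unrecovered_jobs(submitted_ids, completed_ids)
print(sorted(missing))  # ['job-041', 'job-077']
```

An empty result would mean every job from the interrupted batch was re-run by the timers.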
Case 2: Disrupt the system by killing one or more of the SPOT workers
- Verify that the jobs are automatically restarted on some other worker
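One possible way to automate the Case 2 check is sketched below. The `Attempt` record, its field names, and the worker names are all assumptions made for illustration. The real job records would need to be mapped into this shape first.

```python
from typing import Iterable, NamedTuple, Set


class Attempt(NamedTuple):
    """One execution attempt of a job (hypothetical record layout)."""
    job_id: str
    worker: str
    status: str  # "completed" or "failed"


def jobs_not_rescheduled(attempts: Iterable[Attempt], killed_workers: Set[str]) -> Set[str]:
    """Return jobs that were interrupted on a killed worker and have no
    completed attempt on a surviving worker."""
    attempts = list(attempts)
    # Jobs whose attempt on a killed worker did not finish.
    interrupted = {
        a.job_id for a in attempts
        if a.worker in killed_workers and a.status != "completed"
    }
    # Jobs that completed on a worker that was not killed.
    recovered = {
        a.job_id for a in attempts
        if a.worker not in killed_workers and a.status == "completed"
    }
    return interrupted - recovered


history = [
    Attempt("job-1", "spot-a", "failed"),     # interrupted when spot-a was killed
    Attempt("job-1", "spot-b", "completed"),  # restarted elsewhere
    Attempt("job-2", "spot-a", "failed"),     # interrupted, never restarted
]
print(jobs_not_rescheduled(history, {"spot-a"}))  # {'job-2'}
```

The test passes when the returned set is empty.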
Case 3: Disrupt the DAAC services used for querying metadata, downloading input data, and archiving output data
- Close the outbound ports on whichever machine is running the query, download, or upload
- Then open the ports again and verify that the system recovers
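The recovery check in Case 3 could be expressed as a poll with a timeout, as in the sketch below. The `check` callable stands in for whatever signal indicates recovery (for example, a query, download, or upload job succeeding again). It and the default timeout and interval values are assumptions, not SDS settings.

```python
import time
from typing import Callable


def wait_until_recovered(
    check: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses.

    Returns True if recovery was observed, False on timeout.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Simulated recovery signal: fails twice while ports are closed, then succeeds.
responses = iter([False, False, True])
recovered = wait_until_recovered(
    lambda: next(responses), timeout_s=5.0, interval_s=0.0, sleep=lambda s: None
)
print(recovered)  # True
```

A `False` result after the ports are reopened would indicate that the system did not recover within the allowed time.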