My notes from the DevOps Handbook

by Gene Kim, Jez Humble, Patrick Debois, John Willis

58. Institute Game Days to rehearse failures

Disaster recovery rehearsals are called Game Days, a concept from the discipline of resilience engineering: an exercise designed to increase resilience through large-scale fault injection across critical systems.

The aim is to ensure that services continue to operate when failures occur, ideally without crisis or even manual intervention.

The goal of a Game Day is to help teams simulate and rehearse accidents so that they get practice handling them. First, we schedule a catastrophic event, such as the simulated destruction of an entire data center, to happen at some point in the future. We then give teams time to prepare, to eliminate all the single points of failure and to create the necessary monitoring procedures, failover procedures, etc.
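One way to start the preparation is to check which services would disappear entirely if the chosen data center were destroyed. The sketch below assumes a hand-maintained placement map; the service names, data center names, and the `services_lost_if_destroyed` helper are all hypothetical and only illustrate the idea.

```python
def services_lost_if_destroyed(placement, data_center):
    """Return services whose every instance lives in the given data center."""
    return sorted(
        service
        for service, sites in placement.items()
        if sites and set(sites) <= {data_center}
    )


# Hypothetical placement map: service name -> data centers hosting an instance.
placement = {
    "web": ["dc-east", "dc-west"],
    "billing": ["dc-east"],
    "search": ["dc-west", "dc-west"],
}

# Services listed here are single points of failure for that data center.
print(services_lost_if_destroyed(placement, "dc-east"))  # ['billing']
```

Anything the check reports is a candidate to fix before the scheduled outage. A real inventory would also have to cover shared dependencies such as DNS, networking, and credentials, which this sketch ignores.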

The Game Day team defines and executes drills, such as conducting database failovers (i.e., simulating a database failure and ensuring that the secondary database works) or turning off an important network connection to expose problems in the defined processes. Any problems or difficulties that are encountered are identified, addressed, and tested again.
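A database failover drill can be sketched roughly as follows. This is a toy in-memory model, not a real database client: the `Database`, `Cluster`, and `run_failover_drill` names are assumptions made for illustration. The drill marks the primary as down, checks that reads still succeed through the secondary, and returns a list of problems to address and retest.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """Hypothetical stand-in for a database node."""
    name: str
    healthy: bool = True
    rows: dict = field(default_factory=dict)

    def read(self, key):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.rows[key]


@dataclass
class Cluster:
    primary: Database
    secondary: Database

    def read(self, key):
        # Try the primary first, fall back to the secondary on failure.
        try:
            return self.primary.read(key)
        except ConnectionError:
            return self.secondary.read(key)


def run_failover_drill(cluster, probe_key):
    """Simulate a primary failure and return a list of problems found."""
    problems = []
    expected = cluster.primary.read(probe_key)
    cluster.primary.healthy = False  # inject the fault
    try:
        actual = cluster.read(probe_key)
        if actual != expected:
            problems.append(f"secondary returned stale data for {probe_key!r}")
    except (ConnectionError, KeyError) as exc:
        problems.append(f"failover failed: {exc!r}")
    finally:
        cluster.primary.healthy = True  # restore the primary after the drill
    return problems


healthy = Cluster(Database("db-a", rows={"k": 1}), Database("db-b", rows={"k": 1}))
lagging = Cluster(Database("db-a", rows={"k": 2}), Database("db-b", rows={"k": 1}))
print(run_failover_drill(healthy, "k"))  # []
print(run_failover_drill(lagging, "k"))  # ["secondary returned stale data for 'k'"]
```

An empty list means the drill passed. A non-empty list is the set of findings to fix and then test again, matching the identify, address, and retest loop described above.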

At the scheduled time, we then execute the outage.

Doing so exposes the latent defects in our system: the problems that appear only because faults have been injected.
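A small illustration of how an injected fault can surface a latent defect. Everything here is hypothetical: `FlakyDependency` is a made-up wrapper that fails selected calls, and `get_price` is an invented function whose fallback path assumes a warm cache. In normal operation the fallback never runs, so the defect stays hidden.

```python
class FlakyDependency:
    """Wraps a callable and fails the calls whose 1-based index is listed."""

    def __init__(self, func, fail_on_calls):
        self.func = func
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise TimeoutError("injected fault")
        return self.func(*args, **kwargs)


def get_price(lookup, cache, item):
    """Look up a price, falling back to the cache if the lookup times out."""
    try:
        price = lookup(item)
        cache[item] = price
        return price
    except TimeoutError:
        # Latent defect: raises KeyError if the item was never cached.
        return cache[item]


prices = {"widget": 5}

# Normal operation: the fallback path never runs, so the defect stays hidden.
print(get_price(prices.__getitem__, {}, "widget"))

# Inject a fault on the very first call: the cold-cache defect surfaces.
flaky = FlakyDependency(prices.__getitem__, fail_on_calls=[1])
try:
    get_price(flaky, {}, "widget")
except KeyError as exc:
    print("latent defect exposed:", repr(exc))
```

The same fault injected on a later call would be masked by the warm cache, which is one reason drills are worth varying and repeating.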

An often-overlooked area of testing is business process and communications. Systems and processes are highly intertwined, and separating testing of systems from testing of business processes isn't realistic. A failure of a business system will affect the business process, and conversely a working system is not very useful without the right personnel.