This article originally appeared in ;login: Online on March 26, 2025.
The University of Michigan (U-M), founded in 1817, is the oldest institution of higher education in the state, with over 52,800 enrolled students, around 9,200 faculty and 47,900 staff, a main campus in Ann Arbor and two regional campuses in Dearborn and Flint, and a hospital system with many statewide general and specialty clinics. The College of Literature, Science, and the Arts (LSA) is the largest and most academically diverse of UM-Ann Arbor's 19 schools and colleges, with approximately 22,000 students and 3,600 faculty and staff, making it larger than some other universities. LSA Technology Services (LSA TS) has about 180 full-time staff to manage all of the administrative and classroom technology in nearly 30 central campus buildings and several remote facilities both in and outside the state. The LSA TS Infrastructure & Security team has 13 staff and two managers who provide or coordinate over 70 foundational IT services for the college and the university.
The Infrastructure & Security team holds a monthly "Innovation Day" where we can focus on a specific project or service outside of our regular daily operations, and completely ignore chat, email, and tickets for the day. On our Innovation Days in February and March 2024, we ran tabletop exercises of some of our Disaster Recovery Plans (DRPs).
Our team has had DRPs for at least our major customer-, patron-, and user-facing services since 2008 (the year of the oldest template I could find), but we identified several problems:
Plan authors often worked in isolation and could have different assumptions from other authors, such as one assuming that scheduled tasks would be easily remembered or recovered, even if the system was upgraded. Furthermore, the original author of the plan was likely the service owner, service manager, or technical lead when it was first written... but staff turnover and service reassignments meant a different person became responsible for that plan, and that person could have different assumptions from the previous owner.
The DRP for our web hosting service is an example of this: it was written by one person who left for another school, taken over by a second person who left for a different unit at the university, and is now owned by a third... whose tenure didn't overlap with the first's. Each of the three likely had different assumptions and blind spots. For example, when the servers were upgraded, the scheduled tasks that copied site archives to long-term storage and deleted them from local disk were themselves deleted, and nobody noticed until after the older OS backups expired.
We wanted all plan owners to:
To get everyone writing plans the same way, in 2022 we standardized our DRP template and embedded guidance about how to use it. Its major sections are:
Not all plans require all of that information. For example, the "Software" section might just be the operating system specification. You might not have, use, or need formal SLAs or SLEs.
Formal DRP tabletop exercises tend to have four roles:
Figure 1: Diagram illustrating DRP exercise roles. Top left has a "thinking face" emoji labeled "Facilitators facilitate;" connected by double-headed arrows to top center with a "speech bubble" emoji labeled "Players participate;" connected to top right with a "magnifying glass" emoji labeled "Observers act as SMEs." All three are connected with double-headed arrows to the bottom item, a "writing hand" emoji labeled "Data collectors scribe."
We decided early in our planning process that we didn't need that level of formality during the exercise itself and combined the observer and data collector roles. Instead our team used two consecutive Innovation Days to run informal guided tabletop exercises to:
We tested three services' plans:
People in service management roles (the service manager, service owner, and technical leads) played as themselves. Others played as other parties, such as our own management, the Dean's Office, central IT's Identity and Access Management (IAM) team, the network team, and so on. We tried to get everyone to participate as a player at least once, and we asked 3–4 different people to act as observers and data collectors in each exercise.
However, we did not:
We postulated that an attacker got into one hypervisor node and thus the entire cluster and its storage, including every VM's disk image and their snapshots and backups. Had the event been real, and had the attacker waited long enough for the external backup solution to expire its tapes, we would have had to rebuild the VM environment and every service on it from scratch without any trusted backups.
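To make that worst case concrete, here is a minimal sketch of the underlying retention arithmetic; the dates and the 90-day tape retention window are invented for illustration, not taken from our environment. The point is simply that if the attacker's dwell time exceeds the backup retention period, no trusted restore point survives.

```python
from datetime import date, timedelta

# Hypothetical values -- substitute your own retention policy and timeline.
TAPE_RETENTION = timedelta(days=90)       # external backups expire after 90 days
suspected_compromise = date(2024, 1, 5)   # earliest date the attacker may have had access
discovery = date(2024, 4, 20)             # date the compromise was discovered

# The oldest backup still on tape when the compromise is discovered.
oldest_surviving_backup = discovery - TAPE_RETENTION

if oldest_surviving_backup <= suspected_compromise:
    print("A surviving backup predates the compromise; restore from it.")
else:
    dwell = (discovery - suspected_compromise).days
    print(f"No trusted backups remain: {dwell} days of dwell time exceeds "
          f"{TAPE_RETENTION.days} days of retention. Plan on a rebuild from scratch.")
```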
We hypothesized that an attacker gained root access to the web server for over 130 hosted websites. This was mitigated by host-unique root passwords, no ssh-as-root, and (unlike our VM environment) no on-host backups, so the attack surface was limited and recovery from backup was possible. If the event had been real we still would have had to rebuild the server, reinstall the hosting software, recover all the sites from backups, and change all of the websites' passwords, keys, and certificates.
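As a side note on the ssh-as-root mitigation, a check along the following lines could be scripted. This is a hypothetical sketch rather than our actual tooling; it assumes OpenSSH's `sshd -T` dump mode is on the PATH and runnable with sufficient privileges.

```python
import subprocess

def root_login_permitted() -> bool:
    """Return True if the effective sshd configuration still allows root logins."""
    # `sshd -T` prints the effective configuration as lowercase "key value" lines.
    output = subprocess.run(
        ["sshd", "-T"], capture_output=True, text=True, check=True
    ).stdout
    for line in output.splitlines():
        key, _, value = line.partition(" ")
        if key == "permitrootlogin":
            return value.strip() != "no"
    return True  # be conservative if the directive is somehow missing

if __name__ == "__main__":
    if root_login_permitted():
        print("WARNING: root SSH login is still allowed")
    else:
        print("OK: root SSH login is disabled")
```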
We assumed that an attacker had gained physical access to our on-campus data center. Since we had time during the exercise we considered destruction, theft, and destruction to cover up theft. This was mitigated by having moved about 80% of our physical hardware to virtual hardware, our newer off-campus data center, or both.
We identified an interesting constraint that not everyone was aware of. One researcher's grant requires the server and its data to be in a specific locked rack, so we can't move it to another data center or to another rack in this data center. We also identified two open questions to answer after the session:
Had the event been real, we would have followed our specific documented processes for what to do and whom to contact in our Facilities, General Counsel, Plant Operations, and Risk Management offices, as appropriate for the nature of the event.
How did we define and measure whether these exercises were successful? First, we looked at our initial goals. We determined from the wrap-up discussions during each of the exercises that we did level-set our assumptions, dependencies, and expectations; improve our understanding of how we would actually implement our DRPs; consider what we may need to add to, change in, or remove from our environment (both the service and the organization); identify and fill any gaps in our DRPs; and identify any blind spots to remediate before our next disaster.
We also surveyed the participants after both days. We asked four questions:
After the second session we also asked if the first session was better, the second session was better, or if they were about the same.
Figure 2: Feedback survey results - numeric scores.
Our numeric scores were:
People thought that the first session was engaging. One respondent said they "learned quite a bit about DRPs and how we should handle things." A second respondent said "It was good to see the thought processes and priorities for each member of the team and the collaboration in solving problems (or creating them) as we worked our way through the virtual incident." A third said that "The opportunity to dig into some deep thoughts on some very multifaceted problems as an entire team was enormously helpful." Several felt the roundtable discussion was helpful and the session "worked out well as a brain-storming session" and "was helpful for those not involved in the operation of the service to understand the challenges it would experience during a disaster recovery procedure." People identified holes in their own plans, new things to think about, perspectives to change, and a deeper understanding of the service itself.
However, some thought that we got lost in the weeds a bit. One respondent said "It's easy to go down some pretty deep rabbit holes. There's a balance here, because those rabbit holes sometimes dig up value, and sometimes don't." Another respondent wanted the problem to get more clearly defined as we worked our way through the discussions. In addition, including our Major Incident (MI) communications and coordination process was an unnecessary complication. One person provided our favorite comment: "This exercise was terrifying."
We agreed with the feedback. For the second session we defined the initial problems more clearly and omitted the MI process.
After the second session people thought that the more-focused topics were easier to manage and that the open discussion was still helpful. We focused more on technical actions than on communication actions, which was an improvement. However, some still thought we spent too much time going down rabbit holes and that the scope may have been too narrow for everyone to feel like they were contributing.
At the end of each session we identified next steps for both the participants and the facilitators. Participants were asked to update their services' plans to include the common dependencies and anything else the plans were missing. Those common dependencies that some people thought were "too obvious to list" included:
Participants were also asked to create missing or update existing architecture, build, or design documents, to implement any remaining takeaways from the discussion, and to reconsider restoration time versus the declared maximum acceptable downtime. For example, if it would take two days to get replacement hardware spun up and another four days to restore the data, then without access to a time machine the maximum acceptable downtime cannot be less than six days. If there's a disconnect then either the business unit has to accept a longer maximum acceptable downtime or they have to provide funding for the design, implementation, and operation of the service to meet their requirement of that shorter maximum acceptable downtime.
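A back-of-the-envelope check captures that reasoning. The sketch below is illustrative, with made-up stage names and durations rather than figures from our plans: total the recovery stages and compare the result against the declared maximum acceptable downtime.

```python
from datetime import timedelta

# Illustrative recovery stages and durations -- substitute your own estimates.
recovery_stages = {
    "procure and provision replacement hardware": timedelta(days=2),
    "restore data from backups": timedelta(days=4),
}

# What the business unit says it can tolerate.
declared_max_downtime = timedelta(days=3)

realistic_downtime = sum(recovery_stages.values(), timedelta())

print(f"Realistic recovery time: {realistic_downtime.days} days")
print(f"Declared maximum acceptable downtime: {declared_max_downtime.days} days")

if realistic_downtime > declared_max_downtime:
    print("Disconnect: either accept a longer maximum acceptable downtime "
          "or fund the design, implementation, and operation needed to recover faster.")
else:
    print("The plan can meet the declared maximum acceptable downtime.")
```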
In the short term, the facilitators would analyze the survey results and feedback, discuss them at subsequent team meetings, decide on a cadence for repeating the exercises (annually in February and March), and develop a training module and documentation templates to train additional facilitators. At the university level, they also reached out to the Disaster Recovery/Business Continuity Community of Practice leads about presenting our exercises and results.
If you're working on your own disaster recovery plans, we have some recommendations for things you should consider.
When writing your plans:
When testing your plans:
We set out to revamp our DRPs and make them more useful: we wanted to explore a broader range of scenarios, find incomplete documentation and testing, and identify where team members held differing assumptions about our infrastructure capabilities and requirements. We examined three specific scenarios: virtual machine infrastructure compromise, web hosting environment breach, and on-campus data center intrusion. These exercises aimed to refine our DRPs by revealing weaknesses and improving the team's understanding of them... and they terrified some team members. In short, we learned a lot, both about our systems and about running DRP exercises: who should be involved, how broad or narrow the focus should be, and how to structure the discussions. We also had a lot of productive conversations with each other and with the business units we serve.
Some resources that may be helpful if you're going to run your own tabletop exercises include:
Also useful is the Cybersecurity and Infrastructure Security Agency (CISA) Tabletop Exercise Package (CTEP), especially: