Build your own role-playing game: the business continuity plan drill
Business continuity at WeTransfer
WeTransfer is proud to be ISO 27001 certified. Business continuity management and related policies are required to be defined and tested as part of that certification. As such, our Security and Site Reliability Engineering teams have collaborated on writing, reviewing and testing our Business Continuity Plan.
We have now created and run a number of business continuity plan drills and learned many things along the way. For example:
- We realized the importance of keeping a drill going at all times to keep all participants engaged and maximize learning potential.
- In a similar vein, we’ve learned that this type of drill is not a test of technical ability, but a test of process, collaboration and communication.
- A fake crisis can make people just as nervous as a real incident.
- A business continuity plan drill can be a great tool to get senior management buy-in for reliability and security roadmap items.
We have taken these and many more lessons learned while working on our drills and turned them into a 10-step plan to help you develop your own. But, just in case you are new to the world of business continuity plans and drills, let’s start with some context first.
Business continuity, what’s that?
Business continuity is an organization’s ability to maintain essential functions and services during and after a crisis or disaster. Whatever happens, we want to make sure we can continue or restart critical operations.
To help us achieve that goal, we can define a business continuity plan (BCP). In this plan, we capture procedures for responding to and recovering from disaster or crisis. If you’re curious to learn more about this type of plan, you can check out this resource.
With a business continuity plan drill, we can test whether our plan makes sense and whether people are prepared for emergencies.
Why run a business continuity plan drill?
It’s often clear how engineers respond to and manage incidents, but disasters that threaten a business’s ability to operate core functions don’t occur that often (✊🪵) and involve a different group of decision makers.
To reduce risk, test defined processes and build a disaster response muscle, we can run BCP drills (just like an incident drill) without endangering a company’s reputation. This is where simulations and table-top drills can help you out.
How to build your business continuity plan drill
Crafting a BCP drill may seem daunting, but no worries: based on our experience, we have defined ten steps to ✨ BCP drill greatness ✨.
1. Define what you want to test
Before you start crafting your drill, brainstorm what you would like to assess or which learnings you would like to uncover. For this type of drill, we ideally focus on critical services and functions. For inspiration you can read your business continuity plan, identify flaws in your systems, gaps in your processes, or single points/engineers of failure. Past incidents can also be a great source of inspiration.
Some real-life examples to help you get going here:
- Check the application access and permissions required for incident response to see whether all your participants are able to do what the BCP asks of them.
- Check whether contact information is available and up to date for all SaaS providers you rely on. Alternatively, is enough contact information available for all employees required during BCP activation?
- Identify superhero engineers who are always there to help during incident response. Maybe it’ll be interesting to see what happens if they’re not around.
- Almost anything that’s described in your actual BCP can be tested: Do participants act according to role descriptions? Do participants activate the plan when requirements are met? Do they have some of the BCP committed to memory? Are they aware of any disaster recovery plans?
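The contact information check from the list above is easy to automate before a drill. Here is a minimal sketch, assuming you keep a simple list of contacts with a "last verified" date. The `Contact` fields, the sample entries, the placeholder phone numbers and the 180-day threshold are all hypothetical and not taken from our actual BCP.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Contact:
    # All fields are illustrative assumptions, not a real BCP schema.
    name: str
    role: str
    phone: Optional[str]
    last_verified: Optional[date]


def stale_contacts(contacts, today, max_age_days=180):
    """Return contacts with no phone number or no recent verification."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        c for c in contacts
        if not c.phone or c.last_verified is None or c.last_verified < cutoff
    ]


# Placeholder data for illustration only.
contacts = [
    Contact("Example SaaS account manager", "vendor", "+31 20 000 0000", date(2023, 6, 1)),
    Contact("External communications lead", "employee", None, date(2024, 5, 1)),
    Contact("Platform engineer on call", "employee", "+31 20 000 0001", date(2024, 5, 20)),
]

for contact in stale_contacts(contacts, today=date(2024, 6, 1)):
    print(f"Needs review: {contact.name} ({contact.role})")
```

Anything this flags is a candidate for a drill objective: can participants still reach that person or vendor when it matters?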
2. Pick a scenario
A solid BCP contains scenarios that would trigger activation and provide you with a starting point, e.g. infrastructure failure, loss of core functionality, or a security breach.
However, feel free to take such a scenario and make it challenging and fun: What led to this particular trigger? Who was (not) around at the time of impact? Is there any collateral damage? Don’t be afraid to go big for the sake of learning.
For one of our drills, we decided to simulate a widespread infrastructure outage that also took out the majority of our communication and collaboration platforms like Slack and Notion as collateral damage.
3. Choose your environment
Your drill environment can vary depending on your organization’s way of working, participants, and drill experience. You can choose to run your drill on-site or remotely. The latter can be done in a call, (dedicated) chat workspace, or a combination of both.
We ran our first drill on-site in a meeting room with all players physically present. A game master was present to facilitate and ran the drill in tabletop role-playing game style (like Dungeons & Dragons). Choosing this environment was a great first step for our organizing team as we lacked drill experience, and it was easier to manage. However, it lacked some realism, which some of our participants found difficult.
Our next drill was run from a dedicated Slack workspace that partly mirrored our organization’s workspace. Our participants were also free to use a video call if they felt that would be more efficient. Instead of having all participants present in the same space from the start, they joined at different points in time during the drill. All these factors contributed to a more realistic and effective drill that allowed us to assess communication and collaboration much more.
4. Choose your players
First, define roles and responsibilities for those who are part of the organizing team: you’re looking for a game master who will facilitate the drill, actors to play colleagues you include in your drill (in our case we included a platform engineer and support team member) and observers to review and score participants.
Next, define which participants you need to be part of the drill. Check which roles and responsibilities are listed in your BCP. In our case, this would include the CEO, CTO, CPO, and an external communications lead. It’s preferred to have a healthy mix of technical skill levels in this group to learn from multiple points of view. Depending on your scenario, you can include more participants, like a VP of Engineering or a particular engineer, or decide not to invite one of your ‘usual suspects’ to test what happens when they are not around or when they join at a later time.
5. Write a script and prepare supporting materials
This will most likely take up most of your time. A good drill script should include:
- participant briefing, including the rules of the game
- a timeline, either based on milestones or minutes (or a mix of both)
- prompts to get participants unblocked, either in the form of game master or actor interaction
- screen captures of dashboards, status pages, customer complaints, anything that could be requested or needs to be prompted
Last but not least: don’t forget to prepare an evaluation form for scoring participants.
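One way to keep the timeline and prompts manageable is to write them down as structured data. Below is a minimal sketch, assuming a minute-based timeline. The `Inject` type, its fields and the sample events are hypothetical illustrations loosely modelled on the outage scenario mentioned earlier. They are not our actual script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inject:
    # Hypothetical structure for one scripted event in the drill.
    minute: int         # minutes after drill start
    channel: str        # who or what delivers it (game master, actor, screenshot)
    message: str        # what participants see or hear
    fallback: str = ""  # prompt to use if participants get stuck


# Sample timeline for illustration only.
SCRIPT = [
    Inject(0, "game master", "Participant briefing and rules of the game"),
    Inject(5, "actor: support", "Customers report that the service is unavailable",
           fallback="Share a screenshot of customer complaints"),
    Inject(15, "actor: platform engineer", "Infrastructure outage confirmed",
           fallback="Game master asks who should activate the BCP"),
    Inject(30, "game master", "Chat and documentation tools are unavailable"),
]


def due_injects(script, elapsed_minutes, delivered):
    """Return injects that are due by now and have not been delivered yet."""
    return [
        i for i in sorted(script, key=lambda i: i.minute)
        if i.minute <= elapsed_minutes and i not in delivered
    ]


delivered = {SCRIPT[0]}  # the briefing has already happened
for inject in due_injects(SCRIPT, elapsed_minutes=20, delivered=delivered):
    print(f"[{inject.minute:>2} min] {inject.channel}: {inject.message}")
```

Keeping a fallback next to each event means the game master always has a prompt at hand when participants stall, and the `delivered` set doubles as a record of which prompts have been shared.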
6. Do a test run
Running a drill is not easy: people are unpredictable, and getting them back on your track can be challenging. Doing a test run with a group of engineers (who doesn’t want to play CTO?) will improve your drill, trust me.
7. Run the actual drill
Running your BCP drill should be a breeze once you have written your script, but there are a couple of things to keep in mind during the drill:
- Ensure you can easily access all resources you might need during the drill. If you use screenshots, store them on your hard drive and give them meaningful names. You don’t want to lose too much time finding images or prompts while running your drill.
- Mute notifications and get rid of distractions.
- A participant briefing should be part of your drill script, but do allow time for participants to ask questions about how the drill is run. Also allow them to ask questions about actions they can take during the actual drill.
- Record your drill meeting (when applicable) for easier scoring.
- Keep track of drill progress as you run: which milestones are hit? Which prompts have been shared?
- Communicate with your co-organizers throughout the drill in case you need to improvise. This can be done in a (private) Slack channel.
- Don’t be afraid of game master interventions or improvisation when the unexpected happens. Ideally, participants are able to complete all stages of your drill, meaning you will have to guide them through with clues. Sometimes you will have to tell participants not to dive too deep into a rabbit hole and bring them back on your track.
8. Debrief
Allow for some time to debrief before moving on to your next meeting. During this time, blow off some steam (drills can be stressful!), and collect initial feedback and learnings while they’re still fresh.
9. Evaluate, follow up and update
Use the evaluation form you created to score your participants and share how and where they, your processes and your organization can improve. Combined with potential follow-up actions expressed during or after the drill, this can be turned into a BCP drill report. Do ensure follow-up actions defined by drill participants and organizers are placed on the appropriate backlogs.
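If your observers record scores per criterion, turning the evaluation form into report input can be a few lines of code. This is a minimal sketch, assuming a 1 to 5 scale. The criteria, roles, scores and the 3.5 threshold are made up for illustration and do not reflect our own form or results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observer scores: (criterion, participant role, score from 1 to 5).
OBSERVATIONS = [
    ("activated the BCP when criteria were met", "CTO", 4),
    ("activated the BCP when criteria were met", "CEO", 3),
    ("communicated status to stakeholders", "CEO", 2),
    ("communicated status to stakeholders", "External communications lead", 4),
    ("acted according to role description", "CTO", 4),
    ("acted according to role description", "CPO", 1),
]


def average_by_criterion(observations):
    """Average the observer scores per criterion."""
    scores = defaultdict(list)
    for criterion, _role, score in observations:
        scores[criterion].append(score)
    return {criterion: mean(values) for criterion, values in scores.items()}


def follow_up_candidates(averages, threshold=3.5):
    """Criteria scoring below the threshold, lowest average first."""
    return sorted(
        (c for c, avg in averages.items() if avg < threshold),
        key=lambda c: averages[c],
    )


averages = average_by_criterion(OBSERVATIONS)
for criterion in follow_up_candidates(averages):
    print(f"{averages[criterion]:.1f}  {criterion}")
```

Averaging per criterion rather than per person keeps the report focused on process, collaboration and communication instead of on individual performance.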
Last but not least, dedicate some time to updating the BCP in collaboration with relevant stakeholders. Do we need to add or edit scenarios that warrant BCP activation? Were all instructions useful during the drill? Any contact information missing?
10. Repeat
Create another opportunity to learn from failure in order to improve. SREs, you know the drill (pun intended 🥁). For a next iteration, you can decide to focus on follow-up actions from the previous edition, introduce a more challenging scenario or invite a different set of participants.
What we learned creating and running drills
Most of our learnings are captured in our 10-step plan, but there are some we would like to highlight in more detail than at the beginning of this post:
- Even a fake crisis can make people nervous. You don’t always need fancy chaos engineering capabilities to run a meaningful drill.
- Keep the drill going at all times. Make sure you have enough prompts to get participants unstuck, because if they stay stuck, it’s demotivating, and you will not get all the learning potential out of the drill. For example, if they fail to contact an account manager of a SaaS tool impacted during your ‘disaster’, make sure they can still get the same information you intended to relay via another route (let’s say, through an alert received by an engineer).
- Don’t turn a BCP drill into a test of technical ability or a mystery that can only be solved one way because it will distract from what you’re trying to assess. In this type of drill, we care about process, collaboration, and communication. It’s about showcasing the right behaviours and knowledge required to guide a business through every stage of a crisis.
- A BCP drill can be a great tool to get senior management buy-in for reliability and security roadmap items. As the organizing team, you can decide to uncover flaws during the drill that demonstrate a need for some of the work you’re planning to do.
We’re aware this was a long read but, hopefully, you’re inspired to start running your first business continuity plan drill and to put this article into action 💪.