Contingency Planning and Business Continuity

The past few weeks have been unsettling and disruptive for everyone. We shared our advice for rapidly spinning up telework for your workforce earlier this week, and today we’re back to discuss business continuity and contingency plans.

While we all hope that none of these plans will ever be needed, thinking through how to maintain your business operations before you have to put emergency plans into action will make your response smoother and let you react rapidly and effectively to changes. Whether these are changes in personnel and worksite availability due to an earthquake or other physical disaster, or a crisis response due to geopolitical events like September 11, resilient companies can and should plan for contingencies before they occur.

What should I do right now?

Creating a robust business continuity plan requires time and coordination across the company, but there are things that you can do today to help your company through the current pandemic. A quick-response crisis plan may already exist within your company; if not, here are the things we recommend that you do today:

  1. Create a call tree of your essential staff so that you can get news out quickly.

  2. Build a list of contacts at suppliers and partner organizations. If you run on AWS, do you know who your account contact is, or how you would get security help? Ensure the list includes ISP contacts, DNS providers, bank contacts, landlords, and other critical providers. Ensure that the knowledge is shared; for instance, if you have a small networking team, ensure that others in the company know how to reach your ISP.

  3. Develop and use a critical communications plan for your employees. Send regular updates sharing company status and make sure that they know who to reach out to with any problems or questions.

  4. Cross train business leaders and key technical staff on critical tasks.

  5. Check our list of technical tips for rapidly shifting to a work-from-home setup.
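If it helps to make the call tree from step 1 concrete, here is a minimal sketch in Python. The role names and the tree itself are hypothetical placeholders, not a recommended structure; the point is simply to record who calls whom and to check that everyone on your essential staff list is actually reachable through the tree.

```python
from collections import deque


def call_order(tree, root):
    """Walk the call tree breadth-first and return (caller, callee) pairs
    in the order the calls would be made."""
    calls = []
    seen = {root}
    queue = deque([root])
    while queue:
        caller = queue.popleft()
        for callee in tree.get(caller, []):
            if callee not in seen:
                seen.add(callee)
                calls.append((caller, callee))
                queue.append(callee)
    return calls


# Hypothetical call tree: each person phones the people listed under them.
call_tree = {
    "ceo": ["it-lead", "ops-lead"],
    "it-lead": ["sysadmin", "network-admin"],
    "ops-lead": ["facilities"],
}

# Hypothetical essential staff list to check the tree against.
essential_staff = {
    "ceo", "it-lead", "ops-lead", "sysadmin",
    "network-admin", "facilities", "payroll-clerk",
}

calls = call_order(call_tree, "ceo")
reached = {"ceo"} | {callee for _caller, callee in calls}
unreached = essential_staff - reached

for caller, callee in calls:
    print(f"{caller} calls {callee}")
print("Not covered by the tree:", sorted(unreached))
```

In this made-up example the payroll clerk is on the essential list but nobody is assigned to call them, which is exactly the kind of gap you want to find before an emergency.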

 

How do I build a plan?

Now that you’ve tackled the immediate priorities, you’ll want to begin to expand and formalize your plans. If you’ve never developed a business continuity plan (“BCP”), then you may be stumped about how to begin. First, you will probably need to develop several targeted BCPs, such as one for your core enterprise operations and one for any websites or services that you operate, as well as a comprehensive plan that encompasses your targeted plans.

Scoping your plans: Determine what your core business operations are and what essential programs and services your company operates. Make a list of all of these, and group them together in a way that makes sense for your operations.

Determining impact: Completing a business impact analysis (BIA) of each essential operation helps you prioritize your planning and responses. If your business has a highly critical function whose failure would prevent other processes from continuing, then rank that function near the top of your criticality list.

As part of a formal BIA, you would determine a recovery time objective (RTO) and a recovery point objective (RPO) for each process. While we strongly encourage you to perform this analysis, don’t let a lack of time for a formal methodology stop you from developing your initial BCP. You can always make estimates now and fill in these metrics on a second pass.

An RTO is the maximum amount of time that a process can be stalled or an asset unavailable before your business suffers harm it cannot recover from. It can be measured in hours, days, or weeks. An RPO is generally focused on data backups and describes the maximum amount of data that can be lost, measured in time. For instance, if you run a system that processes logfiles and takes some action based upon events, it might be acceptable to queue logs for processing up to two weeks later, as long as you never lose more than 15 minutes’ worth of incoming logs; in this case, your RTO would be two weeks, and your RPO would be 15 minutes. Alternatively, a system like an advertisement display board might have a short RTO, due to the economic losses your company will sustain while it is inoperable, but a long RPO, as any data the board loses can be easily rebuilt from other systems.
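To make the two metrics concrete, here is a small Python sketch that checks a single outage against the log-processing example above (RTO of two weeks, RPO of 15 minutes). The class and function names, the timestamps, and the idea of comparing against the last good backup are illustrative assumptions, not a standard tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # longest tolerable downtime
    rpo: timedelta  # longest tolerable window of lost data


def meets_objectives(objectives, outage_start, service_restored, last_good_backup):
    """Return (rto_met, rpo_met) for one outage."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_good_backup
    return downtime <= objectives.rto, data_loss_window <= objectives.rpo


# The log-processing system from the text: RTO two weeks, RPO 15 minutes.
log_processor = RecoveryObjectives(rto=timedelta(weeks=2), rpo=timedelta(minutes=15))

# Hypothetical outage: restored after 3 days, last backup 40 minutes before failure.
outage = datetime(2020, 3, 20, 9, 0)
result = meets_objectives(
    log_processor,
    outage_start=outage,
    service_restored=outage + timedelta(days=3),
    last_good_backup=outage - timedelta(minutes=40),
)
print(result)  # (True, False): back within the RTO, but too much data was lost
```

Here the three-day recovery is comfortably inside the RTO, yet the 40-minute gap since the last backup breaks the RPO, which would point you toward more frequent backups rather than faster restoration.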

Impacts to consider include not just data loss, but financial concerns, health and safety, operational backlogs, and other impacts that might be specific to your business.

While there are several ways to perform a BIA, the output should be some metric that allows you to understand the different criticality levels of each of your processes. Here’s some guidance on performing BIAs:

·       Ready.gov’s Business Impact Analysis How-to, including how to use FEMA’s BIA questionnaire

·       Smartsheet’s BIA toolkit
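If you don’t have time for a formal methodology yet, one rough way to get a criticality metric is a weighted score across the impact categories mentioned above. The sketch below is only an illustration: the weights, the 1 to 5 scores, and the process names are all made-up assumptions that you would replace with your own estimates.

```python
# Hypothetical weights: how much each impact category matters to this business.
IMPACT_WEIGHTS = {"financial": 3, "safety": 5, "operational": 2, "data_loss": 2}


def criticality(scores):
    """Weighted sum of 1-5 impact scores; a higher total means more critical."""
    return sum(IMPACT_WEIGHTS[category] * score for category, score in scores.items())


# Hypothetical processes with estimated impact scores (1 = minor, 5 = severe).
processes = {
    "payroll": {"financial": 4, "safety": 1, "operational": 3, "data_loss": 2},
    "public website": {"financial": 5, "safety": 1, "operational": 4, "data_loss": 3},
    "widget production line": {"financial": 4, "safety": 4, "operational": 5, "data_loss": 1},
}

ranked = sorted(processes, key=lambda name: criticality(processes[name]), reverse=True)

for name in ranked:
    print(f"{criticality(processes[name]):3d}  {name}")
```

With these invented numbers the production line scores 44, the website 34, and payroll 27. The exact values matter less than having an ordering that everyone has seen and agreed on.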

Prioritize Critical Resources and Processes: Now that you have ranked important processes and assets, capture this criticality rating in your BCP. You don’t want confusion and mixed priorities in an emergency situation, so make sure that your rankings are agreed upon and documented. Now, beginning with your most critical processes and assets, identify people and resources that would be needed to run the process or return the asset to a functional state. In order to create widgets, does your factory need an inbound supply of Acme RoadRunner Food? Does your company need the CFO and two other people from finance to access a website in order to run payroll? To bring your website back up if you lose your production AWS environment, do you need access to GitHub, your master AWS account, and a large supply of espresso drinks? The BCP for each of these should document these requirements and address how to supply them.

Identify Single Points of Failure: Many times, critical business processes run with only a single supplier providing a core component, or only one person in the company able to run a particular task. Now is the time to imagine how you would respond if that single point of failure were to fail. Can you find alternate suppliers? Should you stockpile components, and is that even possible? Which of your employees can you cross-train to perform various tasks? How will you grant them the permissions required to run sensitive tasks? Can you build in a break-glass procedure to give some users emergency deployment powers to push website updates?
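A simple way to start hunting for single points of failure is to write down who (or which supplier) can perform each critical task and flag anything with only one name next to it. The following Python sketch assumes a hypothetical mapping of tasks to people; the task and role names are placeholders.

```python
def single_points_of_failure(coverage):
    """Return the tasks that only one person or supplier (or nobody) can cover."""
    return {
        task: sorted(owners)
        for task, owners in coverage.items()
        if len(owners) <= 1
    }


# Hypothetical coverage map: task -> set of people able to perform it.
coverage = {
    "run payroll": {"cfo", "finance-analyst-1", "finance-analyst-2"},
    "deploy website": {"lead-dev"},
    "contact ISP": {"network-admin"},
    "restore backups": {"sysadmin-1", "sysadmin-2"},
}

spofs = single_points_of_failure(coverage)
for task, owners in sorted(spofs.items()):
    print(f"{task}: only {owners}")
```

Each flagged task is a candidate for cross-training, an alternate supplier, or a documented break-glass procedure.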

Communication Plans: Document your communication plans as well. How will you update partners, employees, and customers on your status during a disaster or recovery period? If you have a status page for your website, ensure that it is on a different hosting platform than your main website, and decide who will have the authority to determine messaging and who has permissions to push updates. Decide on a regular cadence for communication to your partners and your employees. In a hurricane, for example, you might send out updates about facilities status once every day at 4 PM. Let your employees know the schedule in advance, and prioritize issuing each communication on time, even if it’s only a “no status change” message. In a stressful situation, you will want to over-communicate rather than under-communicate.

Plan for different scenarios: This is where BCPs interact closely with disaster recovery (DR) plans. Your team should consider all the ways that your business could be affected, from health issues of key employees to resource availability issues to internet outages to large-scale disasters like earthquakes and floods. Many people have pushed “pandemic” off into a “to plan for later” category and are now scrambling to adapt to the current situation. Here are a few scenarios to get you started:

  1. Hurricanes: may cause power outages, destroy parts of a network, make buildings unreachable or unusable, cause cellphone outages, may trap key people in particular locations; may also cause surges in consumer demand for some product types or affect key suppliers

  2. Earthquakes: similar to hurricanes, but may be over a larger geographic area

  3. Data corruption: due to disk or service failures, or corruption by an insider with access to sensitive systems

  4. Ransomware attacks: whether affecting only user laptops or critical servers or a mixture; may cause data loss or denial of services and expose weaknesses with backup and restoration systems

  5. DDoS (“distributed denial of service”) attacks

  6. Flooding: may make geographic areas unreachable, destroy electronic equipment on ground floors, cause loss of facilities or access issues

 

How will your staff communicate during a power outage? If your plans are all on binders in a main office, how will you execute your plans if a fire burns down this building? Who is responsible for overseeing the company’s response if a key staff member is ill or otherwise unavailable?

Do you need to plan for geographic dispersion of critical data? If you have two data centers, but they are both along the same earthquake fault line, you might lose both in a single event. Do you have out-of-region storage, and if so, how long would it take you to get the data back online? For some businesses, replicating key data to another AWS Availability Zone or Region may make financial sense. For others, backing up to tape and shipping the tapes to storage in another region may be preferable.
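One way to answer the geographic dispersion question is to list where each copy of your critical data lives, tag every location with the hazard zone it sits in, and flag any dataset whose copies all share one zone. The sketch below uses invented site names and zone labels purely for illustration.

```python
def shared_risk_zones(replicas):
    """Return datasets whose copies all sit in a single risk zone."""
    flagged = {}
    for dataset, locations in replicas.items():
        zones = {zone for _site, zone in locations}
        if len(zones) < 2:
            flagged[dataset] = zones
    return flagged


# Hypothetical inventory: dataset -> list of (site, risk zone) pairs.
replicas = {
    "customer-db": [("dc-1", "fault-zone-a"), ("dc-2", "fault-zone-a")],
    "code-repos": [("dc-1", "fault-zone-a"), ("cloud-region-x", "out-of-region")],
}

flagged = shared_risk_zones(replicas)
for dataset, zones in flagged.items():
    print(f"{dataset}: all copies are in {sorted(zones)}")
```

In this example the customer database has two copies, but both sit on the same fault zone, so a single earthquake could take out both.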

·       Learn some of the basics of a DR plan

·       AWS’s Disaster Recovery Guidance

·       Azure’s Disaster Recovery Guidance

Develop a high-level plan and tactical playbooks: Your BCP should include high-level company guidance that delineates authorities, key resources, and overall strategy. These documents help ensure that the various organizations within your business are all on the same page and know who to contact. Playbooks for individual teams are also extremely helpful, as they provide guidance while your employees tackle problems that arise. List out contact numbers, step-by-step recovery plans for IT systems, alternate sources of critical resources, and anything else you would need in the moment to respond to emerging situations. Document restoration steps and ensure that criticality rankings are reflected in the prioritized list of responses your teams should undertake.

Document these playbooks in a way that is easy to follow, and make sure that they will be accessible to everyone who needs access in an emergency. While having tabbed binders available in a centralized conference room that can be your operations center is helpful, put copies online as well. Ensure that these plans are protected from unauthorized access, but also that your staff can always access them.

Tabletops, or testing your plans: The best way to find out whether your plans are missing critical information is to test them. Tabletop exercises provide a safe and robust way to do a dry run of your processes: get all the stakeholders and responders in a room and do a step-by-step run-through of a scenario that you planned for.

First, identify a non-involved party to run the tabletop exercise; they will be responsible for providing prompts to your response team. You can come up with your own scenarios, crib off of your brainstorming sessions during your BCP planning, or look online for various tabletop scenarios.

After the exercise, make sure to capture what happened and how you responded. Did moments of confusion arise? Did you identify areas for improvement in your responses? Did you know who was responsible for making decisions and communicating them to others? Take the lessons you learned in your tabletop exercise and use them to add detail to your BCP and playbooks; do not assume that because you learned something in an exercise, it doesn’t need to be written down.

 

Live Tests: Tabletop exercises can be a powerful way to ensure that all the stakeholders understand their part in a response, but there are other testing options available once you’ve got a few under your belt. You can also learn from real incidents that occur: did you have a power outage that took out a data center? How did your response go? Incorporate anything you learned from the experience into your playbooks.

With your management’s buy-in, you can also begin to run live simulations in your systems to test the ability of your teams to respond. Have leadership approve a general concept and time window. During that window, your drill leader will contact a handful of key responders and ask them to pretend to be unreachable, and then will contact the rest of the responders and start off the scenario. You can let them know that it’s a drill, or keep that information only with the drill leader and the few folks sitting out the exercise, and let the teams figure out what’s happening. This is an excellent way to observe their ability to respond to an emerging crisis in a way that is contained, and which can be halted if needed.

Capture the lessons you learn along the way, and hold a review after the exercise to fold them into your plans.

 

·       How and why to run tabletop exercises from Ready.gov

·       NIST SP 800-84, Guide to Test, Training, and Exercise Programs for IT Plans and Capabilities

·       Six Tabletop exercise scenarios from CIS

 

Finally, remember that business continuity planning is an ongoing process, not a one-and-done task. Plans that are developed once and then stored in a binder for years will be outdated and less helpful in a crisis than ones that are regularly reviewed and practiced. Review and update your business continuity plans at least annually, as well as after any test or real-world use of a plan.
