You Need a Plan B
Incidents that can affect the operation of a business are always occurring, both internally and externally. An internal incident might be a server or two failing, causing a dip in service availability. An external incident might be a natural disaster like flooding, a tsunami, or an earthquake. Regardless, these things can and do happen.
Rather than crossing your fingers and hoping the worst doesn't come to pass, your business should be taking steps toward continuity. Not just for the particularly big events, but also for the smaller, everyday ones whose brief annoyance is still undesirable. We'll go into some detail about how to approach business continuity, from the little things to the bigger ones.
The Basic Roadmap of Continuity
Regardless of what you're preparing for, the process generally looks the same, which makes it easy to build a template and fill in the blanks.
A continuity plan for an incident should look something like this:
- Title/type of incident - you need to be able to identify what you're planning for and what incident(s) the continuity plan applies to.
- Measure of qualification - a methodology to identify, measure, and qualify the extent to which an incident is affecting or could affect operations. Having methods of measurement allows you to react to incidents proportionately. An incident may be isolated and not necessarily trigger a more urgent continuity response.
- Target time to continuity - incidents may occur abruptly. When they do, how long should it take before your continuity plan kicks in and efforts are diverted to taking action?
- System/Business state of operations - normal operations will more than likely be impacted. What is affected, and to what degree? This is where you define the acceptable operating capacity of the company.
- Actions and owners - what are the actionable items that need to be implemented in preparation for incidents, who takes ownership of them, and who acts upon these policies/parameters?
- Scope - who is affected when the continuity plan takes effect and the necessary actions are carried out? Do your employees need to be ready to abandon the office and work remotely? Are there employees who are put on mandatory leave because equipment cannot be operated at normal scale during the incident? Does it affect business partners, suppliers, clients, etc.?
- Communication plan - how your organization and the owners of key points of the continuity plan relay the state of affairs, at what cadence, and to whom.
- Escalation path - unfortunately, sometimes things get worse instead of improving over time. A solid continuity plan factors in recommendations of action in case the situation continues to deteriorate over a period of time.
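The roadmap above can be captured as a fill-in-the-blanks template. The following is a minimal sketch in Python, not a prescribed format: the class name, field names, and example values are all hypothetical and only mirror the eight items in the list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ContinuityPlan:
    """One fill-in-the-blanks plan. All field names are illustrative assumptions."""
    incident_type: str                  # title/type of incident
    qualification: str                  # how the incident is measured and qualified
    target_minutes_to_continuity: int   # target time to continuity
    acceptable_capacity_pct: int        # system/business state of operations
    actions: Dict[str, str] = field(default_factory=dict)   # action -> owner
    scope: List[str] = field(default_factory=list)          # who is affected
    communication: List[str] = field(default_factory=list)  # channels for updates
    escalates_to: Optional[str] = None  # more urgent plan if things get worse

    def missing_sections(self) -> List[str]:
        """Return the names of optional sections that are still blank."""
        blanks = [name for name in ("actions", "scope", "communication")
                  if not getattr(self, name)]
        if self.escalates_to is None:
            blanks.append("escalates_to")
        return blanks


# Hypothetical example values, not taken from any real plan.
plan = ContinuityPlan(
    incident_type="Office power outage",
    qualification="Outage lasts longer than 15 minutes",
    target_minutes_to_continuity=30,
    acceptable_capacity_pct=60,
    actions={"Start backup generator": "Facilities lead"},
    scope=["On-site staff"],
)
print(plan.missing_sections())  # ['communication', 'escalates_to']
```

A check like `missing_sections` is one way to spot plans that were drafted but never finished, here the communication plan and escalation path.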
Target time to continuity works much like its counterpart in natural disaster plans. A good example from outside of business is "how long should it take backup generators to start producing power in the event of a power loss?" It asks and answers the question: if an incident occurs, how much lead time is needed before the plan takes effect? There may not be a single, simple answer or timeframe.
When an incident occurs, or when you believe one has occurred or may occur, you need a way to identify and qualify it. Without methods to measure and qualify an incident, you risk triggering an excessive continuity response. You also risk the opposite: responding too little or too late.
A thoroughly prepared organization will take into account what may be many moving pieces - departments and locations of the company as well as internal and external systems - to work out what dependencies exist. This is the System/business state of operations part of the plan, and it commonly requires the most time. Thoroughness here gives a great map of how the company functions and what it relies upon. Even apart from incident preparation, this map is very useful to have and to review periodically - keeping it current takes far less time than the initial drafting.
The system/business state of operations definitions highlight dependencies for certain aspects of the business. It is also where you identify the mission critical functions of the business as well as what capacity those functions should be operating at during the incident. Naturally, this means any mission critical aspects of the business must have alternatives or resilience to ensure availability. For example, if a key department relies on SharePoint that is hosted within the company, you need to ensure the employees who need to use it can do so - likely both on and off premises. However, if you were using Office 365, the only requirement for continuity for that application would be reliable internet connectivity regardless of the physical location of the employee (and trusting Office 365 itself remains available during the incident).
Even if you have worked through and identified the core functions of the business that need to continue operating, it is highly unlikely any single individual within the company is able to ensure that all the requirements are met. By assigning a council of owners and actions, you spread the responsibility and promote cooperation across boundaries. This has the added benefit of improved cohesion across teams and departments even beyond contingency situations. Owners of actions are those best positioned to implement and document them. This may mean the owners of actions on your continuity plan council are not necessarily management or leadership.
Actions for continuity will likely affect your company's staff to a degree, and may result in some number of staff not having sufficient ability to be productive during the incident or the business function those employees support not being operable. An example of the latter situation would be in manufacturing where a fab is ravaged by a natural disaster. The manufacturing employees cannot be productive without their equipment and nearby fabs may not have sufficient ability to overflow workers into them to offset lost production. Scope identifies who is affected and in what ways by the incident and continuity actions. However, it can affect many more than just your company's own staff. If your company is a services provider like NocTel, it may mean delays in project work or a need to adjust deadlines with external entities.
While having parameters of how the business will operate during an incident, what is to be implemented and by whom, and who this all affects is a great start, it can all come tumbling down in short order if you don't have a communication plan. Reliable communication on a normalized cadence is crucial to ensuring smooth operation during the onset of the incident as well as on the path to recovery. NocTel's own Status Page is a good simple example of effective communication. Without communication to employees, external stakeholders, and internal stakeholders there is no controlled state of reality. When you lack reliable communication, perception starts to become reality and becomes harder to control when left unattended or inconsistent across stakeholder segments.
How easy is it to get a real grasp of the situation when everyone is tweeting their two cents and news outlets are reporting slightly different variations of information? It isn’t easy at all. This is why organizations often use an official channel to disseminate all pertinent information at that point in time.
But don't rely on email always being available. Consider other means of disseminating information to ensure staff are always informed. This is especially important in regard to stakeholders. There should be no fewer than two methods of communicating. Combine channels such as an information hotline that plays back a recording of the current situation, an SMS mailing list, an email list, and even Twitter with an official handle.
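The "no fewer than two methods" rule is simple enough to check automatically whenever a communication plan is drafted or revised. Here is a minimal sketch in Python; the function name and the channel labels are hypothetical and stand in for whatever channels your organization actually uses.

```python
from typing import Iterable


def has_redundant_channels(channels: Iterable[str], minimum: int = 2) -> bool:
    """Return True if at least `minimum` distinct, non-blank channels are listed."""
    # Normalize labels so "Email" and "email " count as the same channel.
    distinct = {c.strip().lower() for c in channels if c.strip()}
    return len(distinct) >= minimum


# Hypothetical channel lists for illustration.
print(has_redundant_channels(["email"]))                    # False
print(has_redundant_channels(["Email", "email "]))          # False
print(has_redundant_channels(["email", "sms", "hotline"]))  # True
```

Counting distinct labels does not prove the channels fail independently. An email list and a chat tool that both depend on the same office network would still pass this check, so it is a starting point rather than a guarantee.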
Finally, while it's an uncomfortable consideration, it's always prudent to also ask, "What else could go wrong and how do we handle that?" Even in smaller companies, there is no shortage of things that can go wrong in an already unfavorable situation. If a smaller incident spirals into a larger, more dire one, it is useful to have an escalation path established. Escalation paths typically transition from a less urgent continuity plan to a better fitting and more urgent plan. If you have measures of qualification, you already have a good start on identifying when things are getting further out of control and reacting accordingly.
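To illustrate how measures of qualification can drive an escalation path, here is a minimal sketch in Python. The single measure (percentage of operations affected), the thresholds, and the plan names are all hypothetical. A real plan would likely combine several measures.

```python
# Hypothetical escalation ladder, ordered from least to most urgent.
# Each entry: (minimum percent of operations affected, plan to activate).
ESCALATION_LADDER = [
    (0, "monitor only"),
    (10, "minor incident plan"),
    (40, "major incident plan"),
    (75, "disaster recovery plan"),
]


def select_plan(percent_affected: float) -> str:
    """Pick the most urgent plan whose threshold has been reached."""
    chosen = ESCALATION_LADDER[0][1]
    for threshold, plan in ESCALATION_LADDER:
        if percent_affected >= threshold:
            chosen = plan
    return chosen


def should_escalate(previous: float, current: float) -> bool:
    """True when a worsening measurement crosses into a more urgent plan."""
    return current > previous and select_plan(current) != select_plan(previous)


print(select_plan(5))           # monitor only
print(select_plan(40))          # major incident plan
print(should_escalate(30, 45))  # True
print(should_escalate(12, 20))  # False
```

Writing the thresholds down ahead of time means the decision to escalate is made calmly in advance, not improvised while the situation is deteriorating.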
Planning Means Nothing Without Practice
Disaster and incident planning are often mental exercises we do, which is a great place to start for any undesirable situation. Thinking on a potential problem can spur us to take preventative steps. However, without proper practice, preparations or plans to prepare may as well not exist.
It's a natural reaction to feel put at ease after having thought through various problem scenarios or taken steps to handle them. Unfortunately, like fear or worry, your comfort may be immaterial. Practicing your preparations is the surest way to build confidence that you've taken appropriate measures and that they are indeed effective. Perhaps more valuable is observing where the plan is weak or ineffective so you can improve it ahead of the curve. You are in a far better position if you discover faults in your plan during practice than during an actual incident. Don't take failures in practice or missed details that weren't accounted for as discouraging - these should be appreciated and recognized as opportunities to do better without having to suffer the real consequences.
Failure is the Mother of Success
However, you may not always have the opportunity to test your plan before disaster strikes. Things will become hectic, but once the dust settles the conversation should never be a matter of finger pointing and assigning responsibility for failures and losses. A proper postmortem after an incident will holistically look at what occurred, what failed, and why with the intention of improving.
Frustration, anger, and blame are common coping mechanisms for us humans when we find ourselves in difficult situations we might not have any influence over. In postmortems, avoid patterns of "X failed because Y". This pattern easily assigns blame or fault, which can quickly put all participating stakeholders on defensive edge. Instead, examine failures or problems as "X did not happen or failed. To prevent this in the future, we should...". Phrasing reflections this way asserts the recognition of a problem and orients the conversation toward solutions rather than attempting to assign blame. It also does not specifically single out individuals and force remediation upon them. Always remember continuity planning is a cooperative effort.
If you find that cooperation between stakeholders is difficult or strained, it may be a symptom of a fragmented company culture. Defaulting to attitudes of self-preservation insinuates a lack of trust, which can form destructive patterns even when not in the midst or aftermath of an incident. Attempts to deflect or push follow up actions to others may also be a defensive mechanism for individuals or teams that are heavily burdened.
Invest Resources in Planning
You may be reading this as a manager, an IT technician, or someone else who is concerned about what to do in the face of a disaster or other business-impacting incident and wants to develop a continuity plan. Whatever your role, you will generally not get very far without the support of leadership.
Unfortunately, leadership does not always see the value in investing time, resources, and effort into continuity plans. The business calculus tends to not work out since it can easily be viewed as putting money into a sort of insurance policy rather than toward budgets or growth initiatives. Leadership often evaluates direction based on cost, benefit, and risk. Continuity planning is often viewed as high cost, low benefit, and low risk. Compare that to the ideal evaluation of low cost, high benefit, and low risk, and it becomes a little easier to understand why leadership may initially balk.
Fortunately, leadership tends to measure prospects, success, and failure. In trying to build a case to leadership for continuity planning, you should present in terms leadership will recognize and make sense of quickly.
To kickstart things, here are several dimensions in which you can assert the business will suffer without a proper continuity plan, or benefit from having one:
- Loss of revenue due to diminished workforce or unavailability of profit generating services and equipment/facilities
- Negative impact on brand reputation due to incoherency of information internally and externally
- Decline in productivity from being unable to source materials when primary supply channels are unavailable and no secondary methods of procurement exist
- Disarray and decline in employee sentiment due to perceived lack of direction during crisis
- Fostering of internal subject matter experts that strengthen the company even outside of active incidents
- Positive impact on brand reputation as being reliable or professional attributed to well-rehearsed communication strategy and channels of communication internally and externally
- Positioning of the organization as a leader to advise associates, community, and partners how to prepare and execute their own continuity plans
- Opportunity to improve sales/service adoption due to minimal impact on day-to-day business operations while competition falters
With sufficient analysis, measures, and metrics to support the claims, leadership now has information presented in terms they are familiar with. What many do not recognize is the potential competitive advantage that can come with a strong continuity plan. Performing better is not limited to a race to see who performs higher consistently - it can also be a matter of who is affected the least negatively.
A simple example of a cost with no real benefit at face value paying off is the website and service operators who implemented HTTPS (SSL/TLS) when most were only using HTTP. When Google shifted and gave higher preference to sites that were using HTTPS over those that weren't, it became a competitive advantage. It also started becoming more of a standard requirement to be considered trustworthy on the internet. Those who prepared and started using HTTPS ahead of the curve didn't have to scramble over technical hurdles the way the competition did.
Complacency is the Death of Good Intentions
While the effort to design, implement, and test a continuity plan can be extremely rigorous, the plan should never be considered "complete" and the case closed. Businesses constantly evolve and change, which means parameters and requirements are also changing. While it may not be the thing you look forward to most, giving a focused effort periodically - typically once a year - to review your continuity plan is an exercise worth pursuing. This becomes easier to repeat if you combine the review with practice of the plan, as you will have results to reflect upon that give a more concrete idea of what to expect when a real incident occurs.
Even the smoke alarms in your home require testing every 6 months to ensure correct functioning. If something so trivial takes a couple of minutes to check by poking a button with a broomstick every 6 months, how could you afford not to test, review, and improve your continuity plan?
So Where Do You Begin?
Business continuity and disaster recovery happen to be two very similar scenarios IT often has to prepare for. Given how common they are in industry, there are already many resources you can refer to for guidelines, templates, and checklists. Here is a brief list of resources and literature that offer good insight and examples to help speed the process along:
- SANS Institute, Introduction to Continuity Planning. Largely focused on IT application - general info starts on page 6.
- Curtis Keliiaa, Business Continuity Planning Concept of Operations. An approachable introduction to how continuity may be structured.
- Bryan Martin, Disaster Recovery Plan Strategies and Processes. A thorough primer on structure, audience, and identification of incident (disaster) scenarios.
- Patrick Kral, Business Continuity on a Stick. A heavily IT-focused paper that explores the secondary pitfalls of continuity implementation as it pertains to the introduction of additional risk in security. Particularly, this is applicable for organizations that rely on virtualization through things like Citrix or who may allow employees to work remotely on company issued workstations via VPN. In a modern workplace, cloud-hosted services replace many traditionally locally installed applications and data sources.
- SANS Institute, Pandemic Response Planning Policy. A SANS Institute policy template to set the foundation of how organizations plan, monitor, and react to pandemic incidents as it relates to shifting operations in affected localities/regions, employee wellness, and prevention of transmission of infectious illnesses.
The above are just a handful of the many resources available on the internet. If you browse through the linked documents, you may get the impression continuity is IT's responsibility. While it is true that IT touches nearly all aspects of business, IT cannot be solely responsible nor exert force over other departments at large to comply. Continuity planning is a cooperative, collaborative effort to protect the organization as a whole.
The First Step is the Most Important
If you've read through this article you might be reeling at the magnitude of what lies ahead. That's a normal reaction because it really isn't a trivial task.
Companies today have many more concerns and responsibilities to manage in order to stay in operation. Many employees have a primary role or focus and may not have very good visibility of what lies beyond it. But clearly, there is much more that keeps businesses running - even during difficult times and events - than meets the eye. It should come as no surprise that companies implement roles like Data Security Officer to ensure the organization is compliant with regulations like GDPR and California's newer CCPA law and to protect against data related incidents.
Do you know who the Emergency or Incident Response Director is in your company? If you don't, or there isn't one, it could be you who takes up the mantle.