Business Continuity


There are two concepts to consider when planning for disaster recovery and data protection:

  • Business Continuity (BC)
  • Disaster Recovery


Business Continuity (BC) Concepts

The concept of Business Continuity is the holistic approach of defining guidelines and procedures for the continuation of a business in the face of any disaster situation. From a technical aspect, high availability and Disaster Recovery relate to business continuity in the sense that a well planned and executed business continuity strategy includes both high availability and disaster recovery components. How and when high availability and disaster recovery strategies are put into place is based on the RTO and RPO values of each business-critical system.


Consider the following critical BC points and questions as they relate to DR planning:

  • Facilities – How secure is the main data center? Is the air conditioner right on top of the data center? How reliable is the power source? Is there a generator? How often is it tested? How much fuel does it have?
  • Chain of command – Who is in charge when the person in charge is not there? Who's next on the list? Who on the management team do you contact if you need to make a substantial emergency purchase? What are ALL methods to contact ALL people in the chain?
  • Communication – Who is our cellular provider and what are their contingency plans in the event of disaster? Who is responsible for communicating with them? In the case of disaster, how will management communicate with employees on status updates?
  • Contingencies – What happens when Disaster Recovery plans need to be changed? How does the company deal with extended outages such as utilities where the ability to restore power or communication is out of the company's hands?
  • Continuation of business – How will employees work if there is no facility to work from? How will they access resources? How will they communicate?



Business Continuity concepts showing management, high availability, and disaster recovery methods






Disaster Recovery Concepts

Disaster recovery or 'DR' is much more than backing up data and sending it off-site. Like other areas of technology, disaster recovery has been refined to a science encompassing all aspects of data protection, data preservation and data recovery. This science has been molded to a point where several key concepts and definitions are commonly used when planning, testing and implementing DR plans.


The following information provides a high level overview of each of these concepts:

  • Service Level Agreement (SLA)
  • Recovery Time Objective (RTO)
  • Recovery Point Objective (RPO)
  • Gap Analysis
  • Risk Assessment
  • Cost Reduction vs. Risk Reduction


Service Level Agreement (SLA)

An SLA defines a guaranteed "response time" or "resolution time" for various incidents that may occur within your enterprise. It is a contract between a business owner and the IT department. An overall SLA takes into account different objectives like Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO and RPO will be applied on two different levels (business level and system level) and then integrated and articulated in an organizational DR Plan within an enterprise. It is important to apply different measures of RTO and RPO on these levels to determine the priority or sequence of recovery since resources such as power, hardware and bandwidth may be scarce when a disaster strikes.


Recovery Time Objective (RTO)

An RTO defines the time to recover a business system. Depending on the level of disaster and defined SLAs the RTO may be based on recovery time from point of disaster or from the point where the recovery process begins. This will be determined by the level of disaster and should be quantified by business system owners and other business units. Technologies such as clustering, virtualization and disk replication/mirroring are implemented with the intention to reduce and in some cases eliminate system outages. These systems provide a level of high availability that, when planned right, can guarantee a high level of up time. However, it is important to properly understand the type of disasters that may occur and how they might affect RTO.


Recovery Point Objective (RPO)

An RPO defines the frequency of recovery points that will be created and maintained. Another way of looking at an RPO is that it defines the acceptable amount of data loss that can occur. If backups are conducted daily, the result is a 24-hour RPO. If tapes are sent off-site weekly, the RPO for a disaster that destroys the on-site copies can potentially increase to seven days. Mission-critical systems such as databases will typically conduct transaction log backups at short intervals (10-20 minutes), which results in shorter RPOs. Snapshots, synchronous or asynchronous replication, and off-site vaulting technologies are also commonly implemented to shrink RPO times. RPO values are just as important as RTO values, if not more so. Not meeting an RTO could cost your company money in lost production, but not meeting an RPO could result in data loss that can never be recovered.
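
The arithmetic behind these RPO figures can be sketched in a few lines of Python. This is a minimal illustration and not part of the Commvault software; the function names and the one-hour target are assumptions chosen for the example. It treats the worst-case data loss as the interval between recovery points.

```python
from datetime import timedelta


def worst_case_data_loss(recovery_point_interval: timedelta) -> timedelta:
    """Worst case: the failure happens just before the next recovery point is created."""
    return recovery_point_interval


def meets_rpo(recovery_point_interval: timedelta, rpo_target: timedelta) -> bool:
    """True if the protection schedule can satisfy the target RPO."""
    return worst_case_data_loss(recovery_point_interval) <= rpo_target


# Intervals taken from the text above; the target value is hypothetical.
schedules = {
    "daily backup": timedelta(days=1),
    "weekly off-site tapes": timedelta(weeks=1),
    "transaction log backup": timedelta(minutes=20),
}
rpo_target = timedelta(hours=1)

for name, interval in schedules.items():
    status = "meets" if meets_rpo(interval, rpo_target) else "misses"
    print(f"{name}: worst-case loss {worst_case_data_loss(interval)}, {status} the target")
```

With a one-hour target, only the transaction log schedule qualifies, which is why short-interval protection is usually reserved for mission-critical systems.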


Gap Analysis

Gap Analysis is a process in which business units define SLA values for various business systems and then pass them along to technical teams. The technical teams conduct tests to establish current capabilities to meet SLAs. Gap analysis is then performed to see if the established SLAs can be met. If they cannot be met, the technical team must address shortcomings and adapt to better meet the business unit's requirements. In some cases procedural adjustments can be made to better meet the business needs. In other cases additional investments must be made to meet SLA requirements. If the business unit's needs cannot be met or budget limitations prevent gap reduction, then the business units must redefine their SLAs to be more in line with the realistic capabilities of the technical teams.


Another key point regarding gap analysis is that each business unit will always think that their systems are the most important. Fairly determining system priority and properly defining SLAs is sometimes a better fit for outside consultants or auditors. If outside consultants are to be used it is important that they do not represent specific products and technologies as they will sometimes push what they want and not provide the best solution for your situation. Auditors can be a big benefit as their knowledge of compliance requirements such as Sarbanes-Oxley can be used to push through technology upgrades and change legacy processes that impede progress towards providing a sound disaster recovery strategy.
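
The comparison at the heart of gap analysis can be sketched as follows. This is a simplified illustration under stated assumptions: the class, field names, and all figures are hypothetical, and a real analysis would cover many more factors than two numbers per system.

```python
from dataclasses import dataclass


@dataclass
class SystemObjectives:
    """Required values come from the business unit; measured values from technical tests."""
    system: str
    required_rto_hours: float
    measured_rto_hours: float
    required_rpo_hours: float
    measured_rpo_hours: float


def find_gaps(objectives):
    """Return, per system, how far measured capability falls short of the SLA."""
    gaps = {}
    for item in objectives:
        rto_gap = max(0.0, item.measured_rto_hours - item.required_rto_hours)
        rpo_gap = max(0.0, item.measured_rpo_hours - item.required_rpo_hours)
        if rto_gap > 0 or rpo_gap > 0:
            gaps[item.system] = {"rto_gap_hours": rto_gap, "rpo_gap_hours": rpo_gap}
    return gaps


# Hypothetical figures for two business systems.
objectives = [
    SystemObjectives("email", 4.0, 6.0, 1.0, 0.25),
    SystemObjectives("file server", 24.0, 12.0, 24.0, 24.0),
]
gaps = find_gaps(objectives)
print(gaps)
```

Systems that appear in the result are the ones where either the technical approach must improve or the business unit must redefine its SLA.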


Risk Assessment

Risk Assessment is a companywide coordinated effort to address the likelihood of a disaster, the effect it may have on business and the cost involved in preparing for it. Risks such as air conditioner leaks, fire, hacking or sabotage are disaster situations that every company should be prepared to deal with. Major disasters such as tornado, hurricane, volcanic eruption or terrorist attack are more complicated disasters that, depending on the nature of a business may or may not be considered in a DR plan. This may sound contrary to what a DR course should state, but the truth is that location, disaster probability, nature of the business and data being protected will all factor in planning a sound DR strategy.


If you work for a small company on the outskirts of Mt. Rainier, defining the same short SLAs for a volcanic eruption that you would define for an air conditioner leak may not be worth the money and effort when the likelihood of an eruption is very small. In this case the cost of meeting short SLAs for an eruption would be substantially greater than for an air conditioner leak. On the other hand, if you work for a major bank in the same location, short SLAs would most likely be required. The point here is not that a DR plan should not be put in place, but rather that the SLAs for the various levels of disaster should be realistically weighed on a cost/benefit scale before investing in meeting SLA requirements. Not all disasters are created equal, so risk assessment should be considered at various disaster levels: business system outage, limited site disaster, site disaster and regional disaster.
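
One common way to put this weighing into numbers is to multiply the yearly probability of a disaster by the cost of the resulting outage and compare the result with the yearly cost of protection. The sketch below assumes that simplification; the function names and every probability and cost figure are hypothetical and chosen only to mirror the examples above.

```python
def expected_annual_loss(annual_probability: float, outage_cost: float) -> float:
    """Probability-weighted yearly cost of one disaster type."""
    return annual_probability * outage_cost


def worth_meeting_short_sla(annual_probability: float, outage_cost: float,
                            annual_protection_cost: float) -> bool:
    """True if the expected loss exceeds what the protection would cost per year."""
    return expected_annual_loss(annual_probability, outage_cost) > annual_protection_cost


# Hypothetical figures: (yearly probability, outage cost, yearly protection cost).
small_company_ac_leak = worth_meeting_short_sla(0.2, 50_000, 5_000)
small_company_eruption = worth_meeting_short_sla(0.001, 2_000_000, 250_000)
major_bank_eruption = worth_meeting_short_sla(0.001, 1_000_000_000, 250_000)

print(small_company_ac_leak, small_company_eruption, major_bank_eruption)
```

The same eruption probability leads to opposite decisions for the small company and the bank because the cost of the outage differs by orders of magnitude.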


Cost Reduction vs. Risk Reduction

Companies are always seeking out ways to reduce costs. In some cases reducing costs results in a compromise in effective DR planning. In other cases cost reduction can actually benefit DR planning. Consider the virtualization of data centers as a cost savings measure that actually benefits DR planning. It would be impractical to request 100 standby servers at a DR site for most companies but to request four servers to host virtual machines may be in the budget. The choice to terminate a contract with a dedicated DR facility might at first seem to be a negative towards DR planning but if the company has another facility a few towns over, it may be a perfect location for a new (and cheaper) DR facility. With bandwidth becoming considerably cheaper and Commvault features such as deduplication, DASH Full and DASH Copy, a branch office can be quickly and inexpensively converted into a hot DR site.


In some cases cost reduction can have a negative effect on DR. Consider deduplication, a major concept in data protection. When blocks are deduplicated they are only stored once. In this case the cost reduction in disk storage is countered by an increased risk that a corrupt block will affect the ability to recover data. This is the concept of cost reduction vs. risk reduction: saving money on disk storage results in increased risk. Another example is implementing archiving solutions where data is moved to secondary storage to free up space in production. Like deduplication, this results in data being stored in one location, which may increase risk. Even so, technologies such as deduplication and archiving can reduce cost without increasing risk. When the Commvault software is configured properly and Commvault best practices are followed, cost and risk reduction can both be achieved.




Disaster levels

Not all disasters are created equal and SLAs should not be defined as one size fits all. The severity of a disaster, the disruption of communication and transportation, and the accessibility of key personnel can all have an impact on meeting SLAs. These factors should be considered, and in some situations SLAs can be adjusted depending on the disaster type. In other cases, where SLAs are strict, proper technology solutions must be implemented to ensure SLAs are met.

The following information provides a high level overview of disaster levels:

  • Business System Failure / Interruption
  • Limited Site Disaster
  • Site Disaster
  • Regional Disaster


Business System Failure / Interruption

A business system disruption will affect a single business component. Email or a database going down would be the two best examples, since in each case the disruption would be noticed by most end users. In this case, the SLA should be very short. In many circumstances, high availability measures such as clustering or replication can prevent end users from noticing this type of failure at all.


Limited Site Disaster

A limited site disaster would affect several business systems but not prevent end users from working; they just would not be able to work on the affected business systems. A primary example of this would be an unforeseen event, such as an air conditioning leak, that forces power to be cut to a portion of the data center. In this case, depending on whether the business systems were brought down gracefully or crashed suddenly, SLAs may be longer. End users will be able to work but will not have access to the offline business systems.


Site Disaster

A site disaster will affect a substantial number of business systems but will also affect the end user's ability to work. An example of this situation would be a gas line leak that forces the shutdown of all power to the building. This example is used since there is no way to avoid this situation and the time to recover may not be dependent on you doing your job. This is a case where off-site facilities would be needed for the continuation of business.


Regional Disaster

A regional disaster such as a hurricane, earthquake or tornado could have a major impact not only on business systems but also on communication and transportation. In this case, even with a DR site, if you cannot get there to put a DR plan into action, or you can but end users cannot access data at the DR facility, the disaster could be more severe and therefore have a longer SLA.




Disaster Recovery Analysis

The following information provides an analysis for disaster recovery:

  • What's in Your Control and What's Not?
  • Picking the 'Low Hanging Fruit'
  • Cost Benefit Analysis
  • How Many Copies of Data do I Need and Where Should they be?
  • Who Should be Included in DR Planning?
  • How will Everyone Communicate?


What's in Your Control and What's Not?

The first thing to consider when analyzing DR readiness and addressing concerns is understanding what is and what is not in your control. Though this may be a difficult point to accept it is quite easy to understand. Lack of control can be due to management resistance, legacy administrators not wanting to change, legacy equipment and lack of budget or lack of time to plan, test and implement. Understanding what you can control will help in analyzing DR readiness and addressing what you can address. Always remember that although there may be some things out of your control, you always have the ability to influence. In some cases changes may be quick but in other cases the power of influence can be used to change mindsets over time.


Picking the 'Low Hanging Fruit'

What if a disaster happened today? How prepared would you be? Look around your environment and see what could be quickly altered to make a DR situation easier to handle. That is the low hanging fruit. Small things that could be changed or addressed that are easy to do but could go a long way in a disaster situation. Consider the following 'Low Hanging Fruit' situations that can be addressed today:

  • Communicate with other administrators and ask them what they are doing or at least have in their mind for DR preparedness. You may find that the systems they administer are in good shape if a disaster strikes, or a few tweaks to your DR methodology or theirs could better prepare you for a disaster situation.
  • Gather documentation for all systems. Gather as much information and locate it in a secure and accessible location.
  • Make small adjustments in backups such as proper filtering, backup job types, media locations or CommCell® console configurations that could potentially shrink RTO and RPO numbers.


Cost Benefit Analysis

Wouldn't it be nice if a production environment could be configured with full synchronous replication, with point-in-time snapshots to additional disk arrays that are also replicated to a location where a complete setup of your entire datacenter is waiting just in case a disaster strikes? Of course this is unattainable for most environments, so a proper cost benefit analysis should be conducted. The cost of implementation and training should be weighed against potential downtime and its quantitative effect on business operations.


Cost / Benefit Analysis Process

  • Communicate with business system owners and assist them in determining their requirements for RTO and RPO and associated cost of downtime. Although IT can provide assistance, it is important that the business system owners come up with the final numbers on their own.
  • Analyze business requirements to determine how best to meet requirements and associated costs in planning and implementing the solution.
  • With all parties involved determine the cost / benefit of implementing the solutions and whether they are attainable or requirements must be readdressed.
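
The comparison in the last step of this process can be sketched as below. It is a minimal model, assuming that implementation and training costs are spread per year and that downtime has a single hourly cost supplied by the business system owner; the class, the solution names, and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    name: str
    annual_cost: float              # implementation and training, spread per year
    expected_downtime_hours: float  # downtime per year with this solution in place


def total_annual_cost(solution: Solution, downtime_cost_per_hour: float) -> float:
    """Cost of the solution plus the cost of the downtime it still allows."""
    return solution.annual_cost + solution.expected_downtime_hours * downtime_cost_per_hour


def cheapest_overall(solutions, downtime_cost_per_hour: float) -> Solution:
    """Pick the solution with the lowest combined cost."""
    return min(solutions, key=lambda s: total_annual_cost(s, downtime_cost_per_hour))


# Hypothetical candidates.
solutions = [
    Solution("nightly backup only", 10_000, 24),
    Solution("replication to a DR site", 80_000, 2),
]

for hourly_cost in (1_000, 10_000):
    choice = cheapest_overall(solutions, hourly_cost)
    print(f"downtime at {hourly_cost}/hour -> {choice.name}")
```

The preferred solution flips as the cost of downtime rises, which is why the downtime figure needs to come from the business side rather than from IT.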


How Many Copies of Data do I Need and Where Should they be?

A critical part of analyzing DR readiness is determining how many copies of data will be required and where the copies will be located. Consider the DR site or sites that may be used and the proximity of each site. If a DR site is located five blocks from the main data center, then having two copies of data may not be adequate if a regional disaster affects both locations. Sending additional copies to locations beyond the perceived reach of a regional disaster should be factored into determining how many copies should be made.
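
A quick way to reason about copy placement is to count how many copies sit outside the area a regional disaster could reach. The sketch below assumes a disaster centered on the primary data center with a fixed radius, which is a deliberate simplification; the function name, distances, and radius are hypothetical.

```python
def surviving_copies(copy_distances_km, disaster_radius_km):
    """Count copies stored farther from the primary site than the disaster radius."""
    return sum(1 for distance in copy_distances_km if distance > disaster_radius_km)


REGIONAL_RADIUS_KM = 50.0           # hypothetical reach of a regional disaster

two_copies = [0.0, 0.5]             # primary site plus a DR site a few blocks away
three_copies = [0.0, 0.5, 300.0]    # adds a copy well outside the region

print(surviving_copies(two_copies, REGIONAL_RADIUS_KM))
print(surviving_copies(three_copies, REGIONAL_RADIUS_KM))
```

If the count for a given disaster level is zero, the number of copies matters less than where they are.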


Who Should be Included in DR Planning?

It is important that as many people as possible buy into the need for proper DR planning. This means that most who buy into the idea will want to be actively involved. In some cases using employees and outside consultants to provide expertise and mediation may be the best approach. In other cases having outside suggestions may be preferable to direct plan involvement, especially if egos start getting in the way. There is no best answer for the question of who should be involved in planning but the initial plan should determine who will actively be involved and who will not be actively involved.


How will Everyone Communicate?

The most critical aspect of DR planning should be communication. This is an absolute first step. Without communication everything will fall apart. For DR analysis, consider the many communication methods available and who is best suited to communicate using each method.


Consider the following methods of communicating:

  • Cell phones – This is an obvious choice but consider the effect a disaster may have on cell towers and the ability to communicate. Also consider the effect of a major disaster on phone switches and the ability to get through when disaster strikes.
  • Land lines – Not VoIP and not fancy phone systems but traditional land lines with traditional phones (no cordless phones that must be plugged into power). It may be beneficial to ensure all key personnel have land lines at the office and home.
  • Email – Clearly the most effective method of communication since one message can reach everyone and the store and forward protocol doesn't require someone to immediately answer a message. However consider corporate email being down in a disaster. Key IT and business personnel should all have third party alternate email addresses that can be used in a disaster situation.
  • Social Media – This is a powerful tool to reach out not only to employees but also customers in a disaster situation.
  • Two Way Radios – Can provide limited distance but direct communication that does not require third party assistance.
  • Satellite Phones – Provide an advantage of not relying on cell towers and can make key personnel reachable almost anywhere in the world. Consider a situation where your CEO is overseas when a disaster strikes and how important it would be to communicate no matter what the situation was.


Training & Cross Training

Training is absolutely essential to best prepare for a disaster situation. Personnel who are adequately trained on the multiple systems that will be used in a DR scenario will not only allow DR scenarios to play out more smoothly, but will also be better prepared to handle unexpected situations. Key personnel should be cross trained on as many systems as possible. Other personnel including system administrators, network administrators, or non-technical people should also be trained as much as possible.


Additionally, consider training people from different locations on the same systems. One example is the backup system. If multiple employees are trained on recovering systems but they all work in the same datacenter that crashed, there may be no one available to recover systems at the disaster recovery site.




Building a Sound Data Protection Strategy


When building a protection plan, it is important to collect as much information as possible. The information can be classified in three main categories:

  • Data Description
  • Data Availability
  • Protected Storage Requirements


Data Description

When designing a data protection strategy, it is important to assess entire business systems, not just servers. In today's data centers, it is very common for business systems to have many components including backend servers, front end servers, storage resources, and network resources. Consider all of these components in their entirety to properly define protection requirements.


Identify and Classify all Components of Business System

When surveying the business environment, all the components that make up a business system should be analyzed: who owns the system, its value to the company, the cost of downtime, the cost of recreating data, the cost of data loss, the servers it runs on, the storage it uses, the networks it relies on, etc. Each component should also be classified as IT or business. Each classified component may require different protection and retention methods.


Business Classification
Data whose primary purpose is to directly support business functions is classified as Business Data Types. This would be the actual data being managed for business purposes such as email, financial databases, home folders, or web content. If this data is lost, it could cost thousands or even millions to recreate, if it can be recreated at all. Although IT may manage the servers, the data owners are ultimately responsible for the business data. DBAs, managers, chief officers, and VPs all invest a lot of time and money to build or purchase business systems which make their work more efficient and more profitable.


The loss of the data on these systems could be catastrophic. Rebuilding a database server is easier than recreating a lost database. Business data may have different protection requirements than the core IT data on that same system. Compliance requirements may also require the data to be kept for long periods of time, encrypted, placed on WORM media, etc.

  • Business data can be an entire system or component of system (e.g. critical database running on a database server or a sales tracking system in SharePoint).
  • Business data can be containerized into subclients. These will be used to determine different SLAs for different business systems.



IT Classification
IT data classifications include operating systems, system databases, domain controllers, DNS servers, etc. This data does not directly serve a business purpose but it is the foundation on which business systems run. The primary purpose of protecting IT data is disaster recovery. For example, a database server has a system database, some configuration files, and an underlying operating system which all qualify as IT data. There is also a financial database that runs on the server which is classified as business. The financial database may require different protection and retention methods that will be defined by its owner.

  • IT systems that support business systems.
  • Dependencies required for business system to function.
    • Domain controllers.
    • Network configurations including: routers, switches, VPN, and SAN configurations.
    • Front end and back end servers.


Granular Classification of Business Data

Depending on protection and recovery requirements, business systems can be divided and categorized to meet very specific requirements. An email server would be classified as both IT and business. It must be protected for disaster recovery purposes, which is primarily a function of IT. The ability to recover or preserve specific mailboxes or mail databases is associated with business classifications. Using Commvault agents and subclients, different data can be containerized and protected to meet both business and IT requirements. This adds a level of administrative complexity but allows the administrator to implement solutions to meet business requirements.


Understand Value & Protection Requirements
The value of a business system determines protection requirements. Mission critical business systems have shorter Recovery Time Objective (RTO) and Recovery Point Objective (RPO) values. Financial and communication data may have longer retention and data preservation requirements. Each business system should be looked at granularly and protection requirements should be defined.


Gather Technical Data

Once the data has been classified, technical information must be gathered. Technical statistics in a well-organized and documented environment can be gathered through reports, documentation, and system analysis.

  • Physical location of each component of the system
  • Server location within physical or virtual environment
  • Current data size and projected growth


Location of Data
The location of data relative to storage can greatly affect the performance of data protection operations. Is the data direct attached, network attached, or SAN attached? Is the data on a physical or virtual server? Is it at a local or remote location, on the local subnet, or accessed over a VPN? These questions can affect the solution chosen to protect the data. Snapshots might be better than traditional backups; replication may be better than relying on someone at a remote location to swap tapes; or a MediaAgent could be located in closer proximity to the data to avoid remote backups.


Size, Change and Growth of Data
Understanding current and future storage capacity needs is essential in determining where data should go, how long it can stay there for, and if additional investment in storage is required. Predicting and trending growth expectations can be accomplished through historical reporting and analysis tools. Estimating growth requirements can allow you to anticipate storage requirements which may alter your purchase decisions for more hardware or persuade decision makers to go with more efficient storage methods such as deduplication. Not planning for future requirements can result in adjusting protection requirements to fit capacity needs. That change in policy could have negative effects on you and your company later down the road.
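
As a rough illustration of growth trending, the sketch below projects data size with a constant compound annual growth rate and reports the first year in which a given capacity would be exceeded. A constant rate is an assumption made for simplicity; real growth should be taken from historical reports. The function names and figures are hypothetical.

```python
def projected_size_tb(current_tb: float, annual_growth_rate: float, years: int) -> float:
    """Size after the given number of years at a constant compound growth rate."""
    return current_tb * (1 + annual_growth_rate) ** years


def years_until_full(current_tb: float, annual_growth_rate: float, capacity_tb: float) -> int:
    """First whole year in which the projected size exceeds the available capacity."""
    if current_tb <= 0 or annual_growth_rate <= 0:
        raise ValueError("current size and growth rate must be positive")
    years = 0
    while projected_size_tb(current_tb, annual_growth_rate, years) <= capacity_tb:
        years += 1
    return years


# Hypothetical environment: 10 TB today, 25% yearly growth, 20 TB of storage.
print(projected_size_tb(10, 0.25, 2))
print(years_until_full(10, 0.25, 20))
```

Knowing the year in which capacity runs out gives a concrete date to bring to a purchasing discussion, rather than adjusting protection requirements after the fact.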


System & Business Dependencies
This may be one of the most overlooked aspects of providing adequate protection for data. The simplest example would be protecting an Exchange server but not protecting your domain controller. The thought might be "We have so many domain controllers, we don't need to protect them." Then Active Directory becomes corrupt or a full site disaster destroys all your domain controllers. The dependency required to rebuild your Exchange server is now unavailable. Granted, this is an extreme example, but it should be noted that dependencies and the time it takes to rebuild them will influence your recovery objectives. All system dependencies should be considered for all business systems.


Business dependencies can also be important. Consider the CFO who is the only person who knows a critical password which is required before a system can be rebuilt. Consider a Web provider who must perform actions on their end so remote users can access a database on your end. The point is, when it comes to system dependencies, you should leave no stone unturned. Figure out every dependency within your environment for each system.
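
Once dependencies are documented, a recovery order follows from them: every system is rebuilt only after the systems it depends on. The sketch below uses the `graphlib` module from the Python standard library (Python 3.9 or later) to derive such an order. The system names and the dependency map are hypothetical examples, not a description of any particular environment.

```python
from graphlib import TopologicalSorter

# Each system maps to the set of systems it depends on (hypothetical names).
dependencies = {
    "exchange_server": {"domain_controller", "dns"},
    "domain_controller": {"network"},
    "dns": {"network"},
    "network": set(),
}

# static_order() yields every system after all of its dependencies.
recovery_order = list(TopologicalSorter(dependencies).static_order())
print(recovery_order)
```

If the map contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is itself a useful finding during planning.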


Production and Storage Infrastructure

Where production data is located and its proximity to protected storage plays a large role in designing storage policies. The following section addresses the three key aspects of infrastructure:

  • Production data location
  • Library configuration and placement
  • Data paths from production to storage


Production Data Location
The location of production data should be taken into consideration when planning MediaAgent placement and storage policy design. Large amounts of data being transmitted over a production network can not only slow down backup performance but also inconvenience end users (not to mention frustrate network administrators). Take the following into consideration for addressing the location of production data:

  • Direct attached – data will require movement over the network when backing up data. If possible consider multi-homing the server and connecting it to a dedicated backup network.
  • SAN attached – data can be protected using a LAN Free path if a MediaAgent is installed directly on the client. Consider using this approach when large amounts of data require protection.
  • Network attached – storage can be backed up over the network or directly into a SAN if the NAS device is capable of SAN integration. The Commvault software supports either method.
  • Remote data – can either be backed up over a WAN or a MediaAgent can be installed at the remote location. Using Commvault deduplication with client side deduplication would be the best method for protecting data over the WAN using minimal bandwidth. If a MediaAgent is at the remote location, using deduplication and DASH Copy allows data to be Auxiliary copied over the WAN using minimal bandwidth.


Chart sample for Data Description

Here is a chart sample of the information that can be gathered about the data that requires protection. This can simply be reproduced in an Excel sheet, where you can segregate your data and compile information relative to it.


Data description chart sample
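
For readers who prefer to build such a sheet programmatically, the sketch below writes a few data description records as CSV, which Excel can open. The column names are assumptions derived from the items discussed in this section (owner, classification, location, size, growth); they are not the columns of the original chart, and the rows are made-up examples.

```python
import csv
import io

# Hypothetical columns based on the data description topics in this section.
COLUMNS = [
    "business_system", "owner", "classification",
    "location", "attachment", "current_size_gb", "annual_growth_pct",
]

rows = [
    {"business_system": "Finance database", "owner": "CFO office",
     "classification": "Business", "location": "Main data center",
     "attachment": "SAN", "current_size_gb": 800, "annual_growth_pct": 20},
    {"business_system": "Domain controllers", "owner": "IT",
     "classification": "IT", "location": "Main data center",
     "attachment": "Direct", "current_size_gb": 120, "annual_growth_pct": 5},
]


def to_csv(records) -> str:
    """Render the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(to_csv(rows))
```

Further columns for data availability and storage requirements can be appended to the same sheet as that information is gathered.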



Data Availability


Determine Service Level Agreements

Service level agreements are used to establish protection and recovery windows and acceptable amount of data loss within those windows.

  • Recovery Time Objectives
  • Recovery Point Objectives
  • Retention Requirements:
    • On and off-site disaster recovery
    • Data recovery
    • Data preservation and compliance copies


Prioritize Data Type
Set priorities for different data types to establish their value to the company. For data protection, the priority levels will affect scheduling times, job priorities, and performance tuning to provide higher priority jobs with adequate resources. For recovery, a high priority data type can ensure certain business systems become available before others.


An example of data type prioritization would be dividing email databases into different subclients. Group higher priority mailboxes into smaller databases on the mail server and lower priority mailboxes into other databases. Consider a mail server recovery time if the total size of all databases was 600GB with mailboxes thrown into different databases with no rhyme or reason. Now consider that same server with the highest priority mailboxes in a small dedicated database about 60GB in size. The high priority database can be recovered first and the lower priority databases recovered later.
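
The effect of this split on recovery time can be estimated with simple arithmetic: restore time is roughly data size divided by restore throughput. The sketch below uses the 600GB and 60GB sizes from the example above; the throughput figure is a hypothetical assumption, and real restore rates vary with hardware, network, and data type.

```python
def restore_hours(size_gb: float, throughput_gb_per_hour: float) -> float:
    """Rough restore time, ignoring job startup and index overhead."""
    return size_gb / throughput_gb_per_hour


THROUGHPUT_GB_PER_HOUR = 200.0  # hypothetical restore rate

all_databases = restore_hours(600, THROUGHPUT_GB_PER_HOUR)
priority_database = restore_hours(60, THROUGHPUT_GB_PER_HOUR)

print(f"all databases restored together: {all_databases:.1f} hours")
print(f"high priority database alone:    {priority_database:.1f} hours")
```

Under this assumption the highest priority mailboxes are back in a tenth of the time, while the remaining databases are restored afterward.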


Chart sample for Data Availability

Here is the continuation of the previous chart sample. This chart displays columns related to data availability.


Data availability chart sample



Protected Storage Requirements

Identify Retention Objectives for Each Data Type

Retention objectives should be based on the three primary reasons for protecting data: Disaster Recovery, Compliance, and Data Recovery. Disaster recovery retention requirements are best handled by IT and should be based on how many complete sets or cycles should be kept. Compliance copies are usually point-in-time copies such as month end or quarter end, and the retention should be based on how long the data needs to be kept. Data Recovery may include all protected data within a time period (full and incremental), and the retention should be based on how far back in time data must be recoverable.


Retention times can be customized for different business data types. For example, on an Exchange server there is a data recovery requirement for regular users to recover a deleted message for 60 days, but for sales people the requirement may be one year. By creating these different business data types, different retentions can be set to meet business requirements.
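
The Exchange example above can be expressed as a small lookup. This is only an illustration of the idea that each business data type carries its own retention; it does not show how retention is configured in the Commvault software, and the names used are hypothetical.

```python
from datetime import date

# Retention per business data type, using the figures from the example above.
RETENTION_DAYS = {"regular_users": 60, "sales": 365}


def is_recoverable(data_type: str, deleted_on: date, today: date) -> bool:
    """True if the deletion falls within the retention defined for the data type."""
    return (today - deleted_on).days <= RETENTION_DAYS[data_type]


today = date(2024, 6, 30)
deleted_on = date(2024, 3, 1)  # 121 days before 'today'

print(is_recoverable("regular_users", deleted_on, today))
print(is_recoverable("sales", deleted_on, today))
```

The same message is past retention for a regular user but still recoverable for a sales mailbox, which is the business requirement the separate data types exist to meet.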


Chart sample for Storage Requirements

Here is the continuation of the previous chart sample. This time, it shows columns related to storage requirements.


Storage requirements chart sample



 

Design Methodology

Properly designing a CommCell® environment can be a difficult process. In some environments, a simple design may suffice, but in more complex environments, careful planning must be done to ensure data is properly protected and the CommCell® environment can properly scale to meet future requirements.


There are three phases to designing and implementing a proper solution:

  1. Plan
  2. Build
  3. Assess & Modify


The following highlights the key elements of each phase:

  • The Planning Phase – focuses on gathering all information to properly determine the minimum number of storage policies required. Careful planning in this step makes it easier to build or modify policies and subclients. The objective is to determine the basic structure required to meet protection objectives. Modifications can later be made to meet additional requirements.
  • There are three design methods that can be used during the plan phase:
    • Basic Planning Methodology which focuses on generic guidelines to building storage policies and subclients.
    • Technical Planning Methodology which focuses on technical requirements for providing a basic design strategy.
    • Content Based Planning Methodology which takes a comprehensive end-to-end approach taking into consideration all aspects of business and IT requirements as well as integrating multiple technologies for a complete solution.
  • The Build Phase – focuses on configuring storage policies, policy copies, and subclients. Proper implementation in this phase is based on proper planning and documentation from the planning phase.
  • The Assess & Modify Phase – focuses on key points for meeting backup/recovery windows, media management requirements and environmental/procedural changes to modify, remove, or add any additional storage policy or subclient components. It is important to note that the 'Design-Build-Modify' approach is a cyclical process since an environment is always changing. Not only is this important for data growth and procedural changes, but it also allows you to modify your CommCell® environment and protection strategies based on emerging technologies. This provides greater speed and flexibility for managing protected data as our industry continues to change at a rapid pace.