December 2003 Mfg.Trust

Mfg.Trust is a monthly feature of the
            NCMS InfraGard Manufacturing Industry Association
                        Infrastructure assurance for manufacturers
                                    Powered by NCMS.

This month – Disaster Recovery
…with an Information Technology flavor

See the Resources Page for this Story 


Editor's Preface:

This month’s article focuses on disaster recovery from the point of view of the Information Technology group, although the lessons generalize easily. The source of this information is Tom Raupp’s Disaster Recovery Template, which he uses in his MBA program instruction at Walsh College (http://www.walshcollege.edu/). Tom also serves as Manager, Global Emergency Response Services, for Delphi Corporation.

We are indebted to Mr. Raupp for permission to use his material. For impact, we took the liberty of re-writing this as instructions for <your company>. In fact, this is a significant abbreviation of those instructions. You may request the full template by contacting us as indicated at the bottom of this article.

As always, our Resources Section at http://trust.ncms.org contains a rich set of links for further reading.

Editor


DISASTER RECOVERY

Overview

Disaster Recovery/Business Recovery (DR/BR) is a process. The primary objective of this process is to enable <your company> to survive a disaster and to reestablish normal business operations in a timely and efficient manner. To meet this objective, the facility must ensure that critical operations can resume normal processing within a reasonable time. DR/BR Plans must be created, implemented and tested so that <your company> can resume critical business operations and processes without jeopardizing the expectations of its customers and/or stakeholders.

Managers will be responsible for the development of procedures within their own business unit area. An effective recovery plan is a Living Document, which will require the appropriate resources to keep it current.

DR/BR Plans must provide a complete, consistent, and practical statement of all actions, roles, and responsibilities before and after a disaster that will ensure minimum disruption of services to the affected business unit. This program is developed and coordinated in conjunction with the <your company> Corporate Emergency Response plan and will be a joint responsibility of Emergency Response Services (ERS), <your company> Corporate Security and <your company> Management.


Recovery Strategies

In the event of a disaster or service interruption, it is critical that we ensure the timely resumption of services on any application or hardware-based entity. We need to resume operations on a failed system or application within the timeframe determined by internal criteria (Business Impact Assessment – see below) or external criteria (End User Agreement, or Service Level objectives).

The Disaster Recovery Plan (DRP) should include the following documents or guidelines:
1. DR Team Members and their contact numbers
2. Business Impact Assessment
3. Business Resumption Plan, which includes Server/Network Documentation, Hardware and Software Support contracts, and contact numbers
4. Backup Documentation
5. Restore Documentation


The Disaster Recovery Team

This team needs to be available and responsive in the event of a disaster. It should include representatives of management, members of the DR/BR department, and the associated infrastructure representatives who will perform the initial assessment and advise management of the necessary remediation strategies.

End User and/or Service Level objectives will determine what level of recovery is needed to ensure that customer expectations or contracted requirements are met or exceeded.
 


Determine/Evaluate Potential Sources of Outage

Identify the potential sources of outages. Scenarios that should be addressed include: Hardware failure, Network failure, Software corruption, Malicious attack, Facility Disaster, or any event that would prevent employees from accessing the systems or facility.

In most cases, hardware failures can be repaired within a few hours, though this is dependent upon the level of contracted vendor support.

The same is true if a failure is caused by network equipment. If the failure is caused by a disruption of phone service or network connectivity, the length of time to recover may not be knowable and may not be within your control.

A software error can be a user error that causes corrupted or erroneous data to be committed to a database, or data to be deleted from it. It can also be the result of a software patch that has not been properly tested.

A malicious attack is any form of attack with the goal of stopping or corrupting your operation. Threats include user error, hardware damage, malicious software, denial of service, and virus attacks from inside or outside the facility.

Severe weather, fires and explosions can destroy a computer room or building. Even if the room or building is not destroyed, these events can prevent users from accessing the computers by disrupting communication or network facilities.

Even if the system and its applications are stable and operational, you have a disaster from which to recover if your support personnel are victims of a natural or man-made disaster. Employees who are directly affected will need time to recover from their own emergency situation before they can be productive, if they elect to return to work.


Business Impact Assessment (BIA)

<Your company> must determine the time frame in which recovery from a disaster must be completed. Factors to include in the assessment are: cost of the failure vs. loss of revenue; cost of recovery vs. loss of revenue; and acceptable temporary procedures to restore the service offerings during the failure.
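
As a rough illustration of weighing the cost of recovery against the loss of revenue, the sketch below compares two recovery options. The option names, dollar figures, outage frequency and function names are all assumptions made for this example and do not come from Mr. Raupp's template.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str                # hypothetical label for the option
    annual_cost: float       # yearly cost of keeping this option ready
    hours_to_recover: float  # expected downtime when a disaster occurs


def expected_annual_cost(option, revenue_loss_per_hour, outages_per_year):
    """Yearly cost of the option plus the revenue lost while recovering."""
    lost_revenue = option.hours_to_recover * revenue_loss_per_hour * outages_per_year
    return option.annual_cost + lost_revenue


def cheapest_option(options, revenue_loss_per_hour, outages_per_year):
    """Pick the option with the lowest expected yearly cost."""
    return min(
        options,
        key=lambda o: expected_annual_cost(o, revenue_loss_per_hour, outages_per_year),
    )


# Illustrative numbers only; substitute figures from your own assessment.
options = [
    RecoveryOption("restore from off-site tape", annual_cost=10_000, hours_to_recover=48),
    RecoveryOption("standby site", annual_cost=120_000, hours_to_recover=4),
]
best = cheapest_option(options, revenue_loss_per_hour=6_000, outages_per_year=0.5)
print(best.name)
```

With these made-up figures the standby site wins because each hour of downtime is expensive. At a much lower hourly loss, the cheaper tape-based option comes out ahead, which is the trade-off the assessment is meant to expose.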

Acceptable outage restoration timelines must be determined for each type of potential outage: Hardware/Network failure, Software error/corruption, Malicious attack, Facility disaster and Loss of personnel.

Disaster Recovery and Business Resumption Plans should be assigned a restoration priority dependent on the application or function. Categories can include Critical, Essential, or General. Guidelines for maximum allowable down times need to be developed and assigned to each category.
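
One possible way to record the categories and their maximum allowable down times is sketched below. The hour limits and the application names are hypothetical placeholders; each company would set its own values in the BIA.

```python
from datetime import timedelta

# Hypothetical limits; each company sets its own in the BIA.
MAX_DOWNTIME = {
    "Critical": timedelta(hours=4),
    "Essential": timedelta(hours=24),
    "General": timedelta(hours=72),
}

# Hypothetical applications and the category assigned to each.
APPLICATIONS = {
    "intranet": "General",
    "order entry": "Critical",
    "email": "Essential",
}


def restoration_order(applications):
    """Return application names with the tightest deadline first."""
    return sorted(applications, key=lambda app: MAX_DOWNTIME[applications[app]])


def within_limit(app, estimated_recovery):
    """True if the estimated recovery time fits the category's allowance."""
    return estimated_recovery <= MAX_DOWNTIME[APPLICATIONS[app]]


print(restoration_order(APPLICATIONS))
print(within_limit("order entry", timedelta(hours=6)))
```

Comparing an estimated recovery time against the category limit shows quickly where the current plan cannot meet the guideline.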


Document the Restoration Process

The Recovery documents ideally should be written in plain English with the assumption that a skilled person with a limited understanding of your particular operating system or applications can follow the restoration steps necessary to rebuild or restore the application or hardware.


Develop and Document Backup Plan

The frequency of tape backups should comply with established departmental guidelines, such as weekly full backups with daily incremental backups. Backup tapes should be stored at an off-site location and, if necessary, sent to a contracted tape facility on a weekly basis. Times for tape delivery from the storage facility or off-site location must be taken into account when developing your timeline for system or application recovery.
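
To show how the backup schedule and the tape delivery time feed into a recovery timeline, here is a minimal sketch. It assumes one full backup per week on Sunday and an incremental on every other day, where each incremental holds only the changes since the previous backup. The weekday, the durations and the function names are illustrative assumptions.

```python
from datetime import date, timedelta

FULL_BACKUP_WEEKDAY = 6  # assumption: full backups run on Sunday (Monday is 0)


def tapes_needed(restore_to):
    """List (date, kind) for each tape needed to restore to the given date.

    Assumes one full backup per week and an incremental on every other day.
    """
    days_since_full = (restore_to.weekday() - FULL_BACKUP_WEEKDAY) % 7
    last_full = restore_to - timedelta(days=days_since_full)
    tapes = [(last_full, "full")]
    for offset in range(1, days_since_full + 1):
        tapes.append((last_full + timedelta(days=offset), "incremental"))
    return tapes


def recovery_time(tape_count, delivery, full_restore, incremental_restore):
    """Estimated time from requesting the tapes to a restored system."""
    return delivery + full_restore + (tape_count - 1) * incremental_restore


tapes = tapes_needed(date(2003, 12, 10))  # a Wednesday
estimate = recovery_time(
    len(tapes),
    delivery=timedelta(hours=4),             # hypothetical courier time
    full_restore=timedelta(hours=6),         # hypothetical
    incremental_restore=timedelta(hours=1),  # hypothetical, per tape
)
print(len(tapes), estimate)
```

In this example a restore to Wednesday needs the Sunday full tape plus three incrementals, and the delivery time is a sizable share of the total estimate.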

Backup procedures must be documented with step-by-step details on how the backups are completed, when they are done, how they are logged and what is backed up.


Develop and Test Recovery Plan

You must test and document all steps or instructions associated with your recovery plan. Perform various types of restoration activities that will ensure that newly developed procedures are correct and accurate.


Recovering from a Disaster

The guidelines that you have developed will help you recover from a hardware or software failure. For an event that disables the current office or hosting space, you must also develop the steps and procedures needed to host the service or application from another location or facility that can be connected or rerouted using existing network resources.


Maintaining Readiness

Now that you have documented your tested recovery strategies and tape backups, you must be diligent in keeping your documentation up to date. If you change anything about how backups are done, document it!

It is a good idea to develop self-audit checklists to document your diligence. Every time you do a restore, whether because a user accidentally deleted a file or because you are refreshing the database, document your actions. Constant documenting may seem tedious, but ongoing maintenance of accurate plans is critical to the success of the recovery. Change management processes must be followed and included as a part of plan maintenance. Members of the DR Team must also be kept up to date regularly on new issues or procedures that will promote a successful recovery.
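
A self-audit can be as simple as flagging plan documents that have not been reviewed recently. The sketch below assumes a 90-day review interval and an invented review log; both are placeholders, and the document names simply follow the list of DRP documents given earlier.

```python
from datetime import date, timedelta

# Assumption: each plan document should be reviewed at least this often.
REVIEW_INTERVAL = timedelta(days=90)


def overdue_documents(last_reviewed, today):
    """Return the names of documents whose last review is older than the interval."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > REVIEW_INTERVAL
    )


# Hypothetical review log.
last_reviewed = {
    "DR team contact list": date(2003, 11, 15),
    "Business Impact Assessment": date(2003, 6, 1),
    "Business Resumption Plan": date(2003, 10, 1),
    "Backup documentation": date(2003, 8, 20),
    "Restore documentation": date(2003, 12, 1),
}
print(overdue_documents(last_reviewed, today=date(2003, 12, 15)))
```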


Conclusion

If you think that the ideas in this article look like the tip of an iceberg, you are completely correct. Disaster Recovery / Business Recovery may be just applied common sense, but your company needs to take the time to think through its situation and particular needs to arrive at an efficient program, one that delivers good value and balanced benefits for a modest investment. Outside help can be a powerful asset for this purpose.

Please take a moment to review the Resources Page. You will find more detail there.
 


LINKS

http://trust.ncms.org, select ‘Publications Index’ tab to find:
Business Continuity Planning, a special feature July-Aug 2002 Corner.Office


If you liked Mfg.Trust, please forward it to a colleague in your company!

For a copy of Mr. Raupp’s template, for questions, comments, or for NCMS Alliance Partners to request their own FREE subscription to Mfg.Trust, send email to johns@sheridansolutions.com

To unsubscribe, please send an email to listserv@listserv.ncms.org and insert the words "unsubscribe mfgtrust", without the quotes, in the BODY of the message. This is a moderated list.

 

 

Copyright 2004
National Center for Manufacturing Sciences