December 2003 Mfg.Trust
Mfg.Trust is a monthly feature of the
NCMS InfraGard Manufacturing Industry Association
Infrastructure assurance for manufacturers
Powered by NCMS.
This month – Disaster Recovery
…with an Information Technology flavor
See the Resources Page for this Story
Editor's Preface:
This month’s article focuses on disaster recovery from the point of
view of the Information Technology group, although the lessons generalize
easily. The source of this information is Tom Raupp’s Disaster Recovery
Template, which he uses in his MBA program instruction at Walsh College (http://www.walshcollege.edu/).
Tom also serves as Manager, Global Emergency Response Services, for Delphi
Corporation.
We are indebted to Mr. Raupp for permission to use his material. For
impact, we took the liberty of re-writing this as instructions for <your
company>. In fact, this is a significant abbreviation of those
instructions. You may request the full template by contacting us as
indicated at the bottom of this article.
As always, our Resources Section at
http://trust.ncms.org contains a rich
set of links for further reading.
Editor
DISASTER RECOVERY
Overview
Disaster Recovery/Business Recovery (DR/BR) is a process. The primary
objective of this process is to enable <your company> to survive a
disaster and to reestablish normal business operations in a timely and
efficient manner. To achieve this goal, the facility must ensure that
critical operations can resume normal processing within a reasonable time.
DR/BR Plans must be created, implemented, and tested to ensure that the
<your company> organization can resume critical business operations and
processes without jeopardizing the expectations of <your company>
customers and/or stakeholders.
Managers will be responsible for the development of procedures within
their own business unit area. An effective recovery plan is a Living
Document, which will require the appropriate resources to keep it current.
DR/BR Plans must provide a complete, consistent, and practical statement
of all actions, roles, and responsibilities before and after a disaster
that will ensure minimum disruption of services to the affected business
unit. This program is developed and coordinated in conjunction with the
<your company> Corporate Emergency Response plan and will be a joint
responsibility of Emergency Response Services (ERS), <your company>
Corporate Security, and <your company> management.
Recovery Strategies
In the event of a disaster or service interruption, it is critical that
we ensure the timely resumption of services on any application or
hardware-based entity. We need to resume operations on a failed system or
application within the timeframe determined by internal criteria (Business
Impact Assessment – see below) or external criteria (End User Agreement,
or Service Level objectives).
The Disaster Recovery Plan (DRP) should include the following documents or
guidelines:
1. DR Team Members and their contact numbers
2. Business Impact Assessment
3. Business Resumption Plan which includes Server/Network Documentation,
Hardware and Software Support contracts and contact numbers
4. Backup Documentation
5. Restore Documentation
The Disaster Recovery Team
This team needs to be available and responsive in the event of a
disaster. It should include representatives of management, members of the
DR/BR department, and associated infrastructure representatives who will
perform the initial assessment and advise management of the necessary
remediation strategies.
End User and/or Service Level objectives will determine what level of
recovery will be needed to ensure that customer expectations or contracted
requirements are met or exceeded.
Determine/Evaluate Potential Sources of Outage
Identify the potential sources of outages. Scenarios that should be
addressed include hardware failure, network failure, software corruption,
malicious attack, and a facility disaster or other event that would
prevent employees from accessing the systems or facility.
In most cases, hardware failures can be repaired within a few hours,
though this is dependent upon the level of contracted vendor support.
The same is true if a failure is caused by network equipment. If the
failure is caused by a disruption of phone service or network
connectivity, the length of time to recover may not be knowable and may
not be within your control.
A software error can be a user error that causes corrupted or erroneous
data to be committed to, or deleted from, a database. It can also be the
result of a software patch that has not been properly tested.
A malicious attack is any attack intended to stop or corrupt your
operation. Threats include user error, hardware damage, malicious
software, denial of service, and virus attacks from inside or outside the
facility.
Severe weather, fires and explosions can destroy a computer room or
building. Even if the room or building is not destroyed, these events can
prevent users from accessing the computers by disrupting communication or
network facilities.
Even if the system and its applications are stable and operational, if
your support personnel are victims of a natural or man-made disaster, you
have a disaster from which to recover. Employees who are directly
affected will need time to recover from their own emergency situation
before they can be productive, if they elect to return to work.
Business Impact Assessment (BIA)
<Your company> must determine the time frame in which recovery from a
disaster must be completed. Factors to include in the assessment are the
cost of the failure vs. the loss of revenue, the cost of recovery vs. the
loss of revenue, and acceptable temporary procedures to restore the
service offerings during the failure.
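As a rough illustration of weighing the cost of recovery against the loss
of revenue, the sketch below compares recovery options by adding each
option's cost to the revenue lost while it is being carried out. The
option names, dollar figures, and the simple "hourly revenue times outage
hours" model are all assumptions made for the example, not part of Mr.
Raupp's template.

```python
def revenue_loss(hourly_revenue, outage_hours):
    """Revenue lost during an outage, assuming revenue accrues evenly per hour."""
    return hourly_revenue * outage_hours


def cheapest_option(options, hourly_revenue):
    """Pick the recovery option with the lowest total cost.

    options maps a name to (recovery_cost, expected_outage_hours).
    Total cost = recovery cost + revenue lost while recovering.
    """
    def total_cost(name):
        recovery_cost, outage_hours = options[name]
        return recovery_cost + revenue_loss(hourly_revenue, outage_hours)

    return min(options, key=total_cost)


# Hypothetical figures, for illustration only.
OPTIONS = {
    "hot standby site": (50_000, 2),    # expensive, back in 2 hours
    "restore from tape": (5_000, 24),   # cheap, back in 24 hours
}

print(cheapest_option(OPTIONS, hourly_revenue=1_000))
print(cheapest_option(OPTIONS, hourly_revenue=10_000))
```

With these made-up numbers, the cheaper tape restore wins when an hour of
downtime costs little, and the standby site wins once an hour of downtime
becomes expensive. Your own assessment will need real figures.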
Acceptable outage restoration timelines must be determined for each type
of potential outage: Hardware/Network failure, Software error/corruption,
Malicious attack, Facility disaster and Loss of personnel.
Disaster Recovery and Business Resumption Plans should be assigned a
restoration priority dependent on the application or function. Categories
can include Critical, Essential, or General. Guidelines for maximum
allowable down times need to be developed and assigned to each category.
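One way to record these categories is a small table that maps each one to
a maximum allowable downtime, which can then be used to order restoration
work. The sketch below is only an illustration: the category names come
from this article, but the hour values and the application names are
hypothetical placeholders to be replaced by the results of your own BIA.

```python
from datetime import timedelta

# Hypothetical maximum allowable downtimes per restoration priority.
MAX_DOWNTIME = {
    "Critical": timedelta(hours=4),
    "Essential": timedelta(hours=24),
    "General": timedelta(hours=72),
}

# Hypothetical application inventory: name -> restoration priority.
APPLICATIONS = {
    "intranet-wiki": "General",
    "order-entry": "Critical",
    "payroll": "Essential",
}


def restoration_order(applications):
    """Return application names, shortest allowable downtime first."""
    return sorted(applications, key=lambda name: MAX_DOWNTIME[applications[name]])


def exceeds_allowance(name, outage, applications=APPLICATIONS):
    """Return True if an outage of this length breaks the guideline."""
    return outage > MAX_DOWNTIME[applications[name]]


print(restoration_order(APPLICATIONS))
print(exceeds_allowance("order-entry", timedelta(hours=6)))
```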
Document the Restoration Process
Ideally, the recovery documents should be written in plain English, on
the assumption that a skilled person with a limited understanding of your
particular operating system or applications can follow the restoration
steps necessary to rebuild or restore the application or hardware.
Develop and Document Backup Plan
The frequency of tape backups should comply with established
departmental guidelines, such as weekly full backups with daily
incremental backups. Backup tapes should be stored at an off-site location
and, if necessary, sent to a contracted tape storage facility on a weekly
basis. Tape delivery times from the storage facility or off-site location
must be taken into account when developing your timeline for system or
application recovery.
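To see how a weekly-full, daily-incremental schedule feeds into the
recovery timeline, the sketch below works out which tapes a restore would
need and adds tape delivery time to the restore time. It rests on several
assumptions that are not in the source: the full backup runs on Sunday, a
backup completes at the end of every day, and each tape takes the same
time to restore.

```python
from datetime import date, timedelta

FULL_BACKUP_WEEKDAY = 6  # assumption: full backup every Sunday (Monday == 0)


def tapes_needed(failure_day):
    """List (backup_date, kind) pairs needed to restore the latest backup.

    The most recent usable tape is assumed to be from the day before
    the failure.
    """
    last_backup = failure_day - timedelta(days=1)
    days_since_full = (last_backup.weekday() - FULL_BACKUP_WEEKDAY) % 7
    full_day = last_backup - timedelta(days=days_since_full)
    tapes = [(full_day, "full")]
    for offset in range(1, days_since_full + 1):
        tapes.append((full_day + timedelta(days=offset), "incremental"))
    return tapes


def estimated_recovery_hours(tape_count, delivery_hours, hours_per_tape):
    """Tape delivery time plus restore time for each tape, in hours."""
    return delivery_hours + tape_count * hours_per_tape


tapes = tapes_needed(date(2003, 12, 11))  # a Thursday
print(tapes)
print(estimated_recovery_hours(len(tapes), delivery_hours=3, hours_per_tape=1.5))
```

For a failure on Thursday, 11 December 2003, the restore needs the Sunday
full tape plus three incrementals. With the placeholder figures of 3 hours
for delivery and 1.5 hours per tape, that comes to 9 hours before the data
is back, which must fit inside the outage window set by your BIA.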
Backup procedures must be documented with step-by-step details on how the
backups are completed, when they are done, how they are logged and what is
backed up.
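If backups are logged electronically, a structured log entry can capture
the "how, when, and what" in one place. The sketch below shows one
possible shape for such an entry. The field names, sample values, and the
choice of one JSON object per line are assumptions made for illustration,
not a prescribed format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BackupLogEntry:
    """One backup log record; all field names are illustrative."""
    started_at: str     # when the backup was done (ISO 8601 text)
    kind: str           # "full" or "incremental"
    paths: list         # what was backed up
    procedure_ref: str  # which documented procedure was followed
    operator: str       # who ran the backup
    tape_label: str     # which tape holds the data


def to_log_line(entry):
    """Serialize an entry as a single JSON line for an append-only log."""
    return json.dumps(asdict(entry), sort_keys=True)


# Hypothetical sample entry.
entry = BackupLogEntry(
    started_at="2003-12-07T23:00:00",
    kind="full",
    paths=["/data/orders"],
    procedure_ref="Backup Procedure, section 2",
    operator="jdoe",
    tape_label="WK49-FULL",
)
print(to_log_line(entry))
```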
Develop and Test Recovery Plan
You must test and document all steps or instructions associated with
your recovery plan. Perform various types of restoration activities to
confirm that newly developed procedures are correct and accurate.
Recovering from a Disaster
The guidelines that you have developed will help you recover from a
hardware or software failure. You must also plan for an event that
disables the current office or hosting space: develop the necessary steps
and procedures to host the service or application from another location
or facility that can be connected or rerouted using existing network
resources.
Maintaining Readiness
Now that you have documented your tested recovery strategies and tape
backups, you must be diligent in keeping your documentation up to date. If
you change anything about how backups are done, document it!
It is a good idea to develop self-audit checklists to document your
diligence. Every time you do a restore, whether because a user
accidentally deleted a file or because you are refreshing the database,
document your actions. Constant documenting may seem tedious, but ongoing
maintenance of accurate plans is critical to the success of the recovery.
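A self-audit checklist can be as simple as a list of plan documents with
the date each was last reviewed. The sketch below flags the ones that are
overdue. The 90-day review interval, the document names, and the dates
are hypothetical; substitute whatever your own guidelines require.

```python
from datetime import date

REVIEW_INTERVAL_DAYS = 90  # assumption: plan documents are reviewed quarterly


def overdue_items(last_reviewed, today, max_age_days=REVIEW_INTERVAL_DAYS):
    """Return names of documents last reviewed more than max_age_days ago."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if (today - reviewed).days > max_age_days
    )


# Hypothetical checklist: document name -> date of last review.
CHECKLIST = {
    "DR team contact list": date(2003, 11, 1),
    "Backup documentation": date(2003, 6, 1),
    "Restore documentation": date(2003, 10, 1),
}

print(overdue_items(CHECKLIST, today=date(2003, 12, 15)))
```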
It is critical that change management processes be followed and included
as part of plan maintenance. It is equally critical that members of the
DR Team be kept up to date on new issues or procedures that will promote
a successful recovery.
Conclusion
If you think that the ideas in this article look like the tip of an iceberg,
you are completely correct. Disaster Recovery / Business Recovery may be
just applied common sense, but your company needs to take the time to
think through your situation and your particular needs to arrive at an
efficient program – a program that delivers good value and balanced
benefits for modest investment. Outside help can be a powerful asset for
this purpose.
Please take a moment to review the Resources
Page. You will find more detail there.
LINKS
http://trust.ncms.org, select
‘Publications Index’ tab to find:
Business Continuity Planning, a special feature July-Aug 2002
Corner.Office
If you liked Mfg.Trust, please
forward it to a colleague in your company!
For a copy of Mr. Raupp’s template, for questions, comments, or for NCMS
Alliance Partners to request their own FREE subscription to Mfg.Trust,
send email to
johns@sheridansolutions.com
To unsubscribe, please send an email to
listserv@listserv.ncms.org
and insert the words "unsubscribe mfgtrust", without the quotes, in the
BODY of the message. This is a moderated list.