Title: Pareto's Law and IT Business Continuity: Solve 80% of Your Problems, Disaster-Proof the Mission Critical Aspects of Your Infrastructure
1. Pareto's Law and IT Business Continuity: Solve 80% of Your Problems, Disaster-Proof the Mission Critical Aspects of Your Infrastructure
- AITP Meeting, April 22, 2008: IT Disaster Preparedness, Start Planning Yesterday
- Presented by Bob Lamendola
2. Bad Things Do Happen
- Natural disasters
- Man-made disasters
- Pandemics
- System failures
- Scheduled maintenance
- Human error
- Personnel events
- Terrorism
3. Recent Examples
Northeast Blackout - August, 2003
Hurricane Katrina - August, 2005
MTA Strike, NYC - December, 2005
Avian Influenza - Ongoing
4. Recent Examples
Steam Pipe Explosion, NYC - July 19, 2007
5. Revealing Research
- According to a survey of US companies conducted by Info-Tech Research Group, more than 60 percent of IT departments did not have formal plans and procedures in place to deal with the East Coast blackout.
- Although more than 76 percent of companies surveyed said that the blackout had an impact on their organization, most of them admitted that they were not sufficiently prepared.
6. Regulatory Compliance is Not an Option
- Sarbanes-Oxley
- HIPAA
- Gramm-Leach-Bliley
- California Act
- Basel II
- CLERP-9
- New Federal Rules of Civil Procedure
- Email e-discovery rules
7. Definitions
- High Availability
  - High availability refers to a system or component that is continuously operational for a desirably long length of time. Availability can be measured relative to "100% operational" or "never failing." A widely held but difficult-to-achieve standard of availability for a system or product is known as "five 9s" (99.999 percent) availability.
  - Source: TechTarget Data Center Media
- Disaster Recovery
  - Duplicating computer operations after a catastrophe occurs, such as a fire or earthquake. It includes routine off-site backup as well as a procedure for activating vital information systems in a new location.
  - Source: PC Magazine
- Business Continuity
  - Business continuance (sometimes referred to as business continuity) describes the processes and procedures an organization puts in place to ensure that essential functions can continue during and after a disaster. Business continuance planning seeks to prevent interruption of mission-critical services, and to reestablish full functioning as swiftly and smoothly as possible.
  - Source: Bitpipe.com
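To make the "five 9s" figure concrete, the short Python sketch below converts an availability percentage into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration; they are not part of the quoted definition.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for availability in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(availability)
        print(f"{availability}% available -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five 9s leaves a little over five minutes of downtime per year, which is why the definition calls it difficult to achieve.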
8. Business Impact Analysis
- What is a Business Impact Analysis?
  - It is a technique for identifying both tangible and intangible impacts on a business process, function or department, usually over time, based on given criticalities.
- A Business Impact Analysis
  - Provides senior management with the information needed to devise a recovery strategy and recovery prioritization
  - Provides supporting data to define an appropriate DR program budget
  - Identifies who and what are vital to the business's survival
    - Internal: suppliers, customers, shareholders, IT systems, manufacturing processes
    - External: government departments, regulators, trade bodies, competitors, pressure groups
  - Evaluates recovery priorities and time scales
    - Criticality of each function to business survival
  - Assesses the potential cost of disaster
    - Direct and indirect costs of loss of service capability
9. Business Impact Analysis
- Identifies the high-risk areas of the existing infrastructure
  - Single points of failure
  - Recovery time limitations
- Identifies the business-critical applications and the systems they run on
- Identifies the areas of vulnerability within the environment
- Focuses on the delivered service
  - Business applications like CRM, order processing, dispatch, and billing
  - Internal applications like payroll and HR
  - Communications like email and Web sites
- Answers how not having the capability affects the business
  - Is the application critical to the business?
  - Is the function duplicated elsewhere?
  - What viable alternatives exist?
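One way to turn the questions above into a first-pass ranking is sketched below in Python. The `Service` class, the cost figures, and the 8-hour outage are all hypothetical values chosen for illustration; a real analysis would take its numbers from the business units.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """One delivered service examined in the analysis (all figures are hypothetical)."""
    name: str
    direct_cost_per_hour: float    # e.g. lost orders while the service is down
    indirect_cost_per_hour: float  # e.g. overtime, penalties, lost goodwill
    duplicated_elsewhere: bool     # "Is the function duplicated elsewhere?"

    def cost_of_outage(self, hours: float) -> float:
        """Direct plus indirect cost of losing the service for the given hours."""
        return hours * (self.direct_cost_per_hour + self.indirect_cost_per_hour)


def rank_by_impact(services, outage_hours=8.0):
    """Return services ordered from most to least costly outage."""
    return sorted(services, key=lambda s: s.cost_of_outage(outage_hours), reverse=True)


if __name__ == "__main__":
    services = [
        Service("payroll", 100.0, 50.0, duplicated_elsewhere=True),
        Service("order processing", 4000.0, 1000.0, duplicated_elsewhere=False),
        Service("email", 500.0, 700.0, duplicated_elsewhere=False),
    ]
    for s in rank_by_impact(services):
        flag = "" if s.duplicated_elsewhere else "  <- no duplicate elsewhere"
        print(f"{s.name}: {s.cost_of_outage(8.0):,.0f}{flag}")
```

Sorting by cost alone is a deliberate simplification; the duplicate flag is printed next to each entry so that single points of failure stay visible in the ranking.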
10. Contingency Plan Criteria
- Factors to consider
  - The scale of the organization and its IT systems
  - The nature of the operation
    - An online system may need to be restored within hours, whereas a customer billing operation may not be harmed by a few days' delay, if no data is lost
  - The relative costs of different options
    - A company with several linked sites may be able to move operations to an alternative site
  - The perceived likelihood of disaster occurring
    - Companies in earthquake zones are likely to invest more in disaster recovery than average
11. Recovery Objectives
- Recovery Time Objective (RTO)
  - The period of time within which technical services and/or business functions must be recovered and available after an outage (e.g. one business day), measured from the time of disaster to the resumption of production operations.
- Recovery Point Objective (RPO)
  - The acceptable level of data loss exposure following an unplanned event. This is the point in time (prior to the disaster) to which lost data can be restored, typically the last backup taken offsite.
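The two objectives can be checked against a single outage with simple date arithmetic. The Python sketch below is a minimal illustration; the function name, the timestamps, and the 24-hour RPO and 8-hour RTO are hypothetical values, not figures from the presentation.

```python
from datetime import datetime, timedelta


def check_recovery(last_offsite_backup, disaster, production_resumed, rpo, rto):
    """Compare one outage against the RPO and RTO.

    Data loss exposure runs from the last offsite backup to the disaster;
    recovery time runs from the disaster to the resumption of production.
    """
    data_loss = disaster - last_offsite_backup
    recovery_time = production_resumed - disaster
    return {
        "data_loss": data_loss,
        "recovery_time": recovery_time,
        "rpo_met": data_loss <= rpo,
        "rto_met": recovery_time <= rto,
    }


if __name__ == "__main__":
    result = check_recovery(
        last_offsite_backup=datetime(2008, 4, 21, 22, 0),
        disaster=datetime(2008, 4, 22, 10, 0),
        production_resumed=datetime(2008, 4, 22, 16, 0),
        rpo=timedelta(hours=24),  # hypothetical: nightly offsite backup
        rto=timedelta(hours=8),   # hypothetical: "one business day" taken as 8 hours
    )
    print(result)
```

In this example 12 hours of data are exposed and production resumes after 6 hours, so both hypothetical objectives are met. Tightening the RTO to 4 hours would flag the same outage as a miss.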
12. Frequency of Downtime
[Chart: frequency by type of disaster scenario]
13. The Business Critical 80%: Then and Now
- 10 years ago, financial applications were the top priority. Today, an organization's mission-critical areas are:
  - Communications
    - Email, handhelds, telecommunications
  - Revenue-generating systems
    - Order entry systems, payment processing systems
  - Backend operations
    - Financial systems, ERP systems
14. Pareto's Law
- In 1906, Pareto's Principle was born. An Italian economist, Vilfredo Pareto, observed that 20 percent of his country's people owned 80 percent of the wealth. This principle was broadened in the mid-20th century by Dr. Joseph Juran, who penned a universal rule called "the vital few and trivial many": the principle that 20 percent of input is always responsible for 80 percent of the output. Juran's work, although expanding widely on Pareto's, remained known as Pareto's Principle, or the 80/20 rule.
- Pareto's Principle applies to business continuity management. 20 percent of the threats to an organization will result in 80 percent of invocations. The business continuity manager's main task is to identify the "Pareto Principle risks" and mitigate these. These risks will not be the headline grabbers; they will be the mundane threats of fire and flood; of natural disaster; of loss of critical IT and telecoms systems; and of loss of human resources.
- Source: David Honour, Continuity Central
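A simple way to look for the "vital few" in an organization's own incident history is sketched below in Python. The threat names and invocation counts are invented for illustration, and the `vital_few` helper is hypothetical; real data may not split as neatly as 80/20.

```python
def vital_few(invocations_by_threat, threshold=0.80):
    """Return the most frequent threats that together reach the threshold share."""
    total = sum(invocations_by_threat.values())
    if total <= 0:
        raise ValueError("at least one invocation is required")
    selected, running = [], 0
    ranked = sorted(invocations_by_threat.items(), key=lambda kv: kv[1], reverse=True)
    for threat, count in ranked:
        if running / total >= threshold:
            break  # the threats already selected cover the threshold
        selected.append(threat)
        running += count
    return selected


if __name__ == "__main__":
    # Hypothetical plan invocations over several years, by cause.
    history = {
        "loss of IT/telecoms": 45,
        "power outage": 35,
        "fire": 5,
        "flood": 4,
        "loss of staff": 3,
        "severe weather": 2,
        "transit strike": 2,
        "pandemic": 2,
        "terrorism": 1,
        "earthquake": 1,
    }
    print(vital_few(history))
```

With these made-up counts, two of the ten threats account for 80 of the 100 invocations, so they are the ones the plan should address first.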
15. Disaster-Proofing Against the 20%
16. Cold Site
Version 1: Pre-designated equipment resident at an alternate location, not typically used for any purpose other than DR
Version 2: Contract for equipment/facility used on a temporary basis during a declared emergency; several providers offer these services
17. Warm Site
Pre-designated equipment resident at an alternate location, not typically used for any purpose other than DR, periodically refreshed with live data
Data refresh can be accomplished in a number of ways, including leased line and tape
18. Hot Site
Pre-designated equipment resident at an alternate location
May be used for purposes other than DR, with real-time or near real-time replication of data
19. DR Configurations Recap
20. Technical Server Contingency Planning Solutions
21. System Backups
- Servers can be backed up through a distributed system, in which each server has its own drive, or through a centralized backup device. Four types of system backup methods are available to preserve server data:
  - Full
    - Captures all files on disk
  - Incremental
    - Captures files that were created or changed since the last backup
  - Differential
    - Captures stored files that were created or modified since the last full backup
  - Block-Level Backup
    - Works like a differential backup, but the files are backed up at the block level, which reduces the space requirement
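The difference between the first three methods comes down to which cutoff time is used when selecting files. The Python sketch below illustrates this with modification times given as plain numbers (for example, days); the function name and sample data are hypothetical, and block-level backup is not modeled.

```python
def files_to_back_up(modified_at, method, last_full, last_backup):
    """Select files for a backup run.

    modified_at: dict mapping file path -> time the file was created or changed
    last_full:   time of the last full backup
    last_backup: time of the last backup of any kind
    """
    if method == "full":
        return sorted(modified_at)       # every file on disk
    if method == "incremental":
        cutoff = last_backup             # changed since the last backup
    elif method == "differential":
        cutoff = last_full               # changed since the last full backup
    else:
        raise ValueError(f"unknown backup method: {method}")
    return sorted(path for path, t in modified_at.items() if t > cutoff)


if __name__ == "__main__":
    # Hypothetical timeline: full backup on day 0, incremental on day 2.
    files = {"ledger.db": -5, "orders.csv": 1, "mail.pst": 3}
    for method in ("full", "incremental", "differential"):
        print(method, files_to_back_up(files, method, last_full=0, last_backup=2))
```

In this timeline the incremental run picks up only the file changed on day 3, while the differential run also includes the file changed on day 1. The trade-off is that a restore from incrementals needs the full backup plus every incremental since, whereas a restore from a differential needs only the full backup and the latest differential.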
22. RAID: Redundant Array of Independent Disks
- Provides disk redundancy and fault tolerance for data storage and increases the array's effective mean time between failures. RAID is used to mask disk drive and disk controller failures. RAID technology uses three data redundancy techniques and five RAID levels to provide different degrees of redundancy.
  - Mirroring
    - Writes data simultaneously to separate hard drives
  - Parity
    - A technique of determining whether data has been lost or overwritten
  - Striping
    - Improves the performance of the hardware array controller by distributing data across all drives
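Beyond detecting loss, XOR parity also lets an array rebuild the contents of a single failed drive from the surviving drives. The Python sketch below shows the idea on three small byte blocks; it is a simplified illustration with made-up data, not a model of any particular RAID level or controller.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must be non-empty and of equal length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


if __name__ == "__main__":
    # One stripe spread over three hypothetical data drives.
    data = [b"DISK", b"FAIL", b"SAFE"]
    parity = xor_blocks(data)  # stored on a fourth drive

    # Drive 1 is lost: XOR of the survivors and the parity block restores it.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
```

This works because XOR-ing a value with itself cancels out, so combining the parity block with every surviving data block leaves only the missing block.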
23. Standby Servers
- Servers can be pre-built and staged in an off-site location. When a disaster recovery plan is put into effect, the servers are brought into operation and data is restored to bring back the services of the affected resource.
24. Electronic Vaulting and Remote Journaling
- These are similar technologies that provide
additional data backup capabilities, with backups
made to remote tape drives over communication
links. Remote journaling and electronic vaulting
enable shorter recovery times and reduced data
loss should the server be damaged between
backups.
25. Server Load Balancing
- This technology increases server and application availability. Through load balancing, traffic can be distributed dynamically across groups of servers running a common application so that no single server is overwhelmed. With this technique, a group of servers appears as a single server to the network. Using load balancing among different sites can enable the application to continue to operate as long as one or more sites remain operational.
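The behavior described above can be sketched as a round-robin balancer that skips servers marked as down. This Python example is a toy model under stated assumptions: the class, its method names, and the site names are hypothetical, and real load balancers use health probes and more sophisticated distribution policies.

```python
class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that are marked down."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server):
        self._healthy.discard(server)

    def mark_up(self, server):
        if server in self._servers:
            self._healthy.add(server)

    def pick(self):
        """Return the next healthy server, or fail if none remain."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers remain")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["site-a", "site-b", "site-c"])
    print([balancer.pick() for _ in range(3)])
    balancer.mark_down("site-b")  # simulate losing one site
    print([balancer.pick() for _ in range(4)])
```

After `site-b` is marked down, requests keep flowing to the two remaining sites, which is the availability benefit the slide describes. Only when every site is down does `pick` raise an error.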
26. Synchronous Server Replication/Mirroring
- This method uses a disk-to-disk copy and maintains a replica of the database or file system by applying changes to the replicating server at the same time changes are applied to the protected server. With synchronous mirroring, the RTO can be minutes. Mirroring should be used for critical applications that can accept little or no downtime and no data loss.
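The "no data loss" property follows from applying each change to the replica before the write is considered complete. The Python sketch below contrasts that with a deferred (asynchronous) mode, which is included only for comparison and is not described in the presentation; the class and method names are hypothetical and in-memory dictionaries stand in for the two servers.

```python
class MirroredStore:
    """Toy model of a protected server and its replica."""

    def __init__(self, synchronous: bool):
        self.protected = {}
        self.replica = {}
        self.synchronous = synchronous
        self._pending = []  # changes not yet shipped (deferred mode only)

    def write(self, key, value):
        self.protected[key] = value
        if self.synchronous:
            self.replica[key] = value  # applied before the write returns
        else:
            self._pending.append((key, value))

    def ship_pending(self):
        """Apply queued changes to the replica (deferred mode)."""
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()

    def lost_if_protected_fails_now(self):
        """Keys whose latest value exists only on the protected server."""
        return {k for k, v in self.protected.items() if self.replica.get(k) != v}


if __name__ == "__main__":
    sync = MirroredStore(synchronous=True)
    sync.write("order-1", "paid")
    print(sync.lost_if_protected_fails_now())

    deferred = MirroredStore(synchronous=False)
    deferred.write("order-1", "paid")
    print(deferred.lost_if_protected_fails_now())
```

In synchronous mode nothing is at risk after a write returns, while in deferred mode the latest change is exposed until it is shipped. The cost of the synchronous approach is that every write waits for the replica.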
27. Disaster Recovery Planning Cycle
28. Bottom Line
- The threats are very real
- Today's organizations need to prepare for "when"
- Business requirements should drive the plan that disaster-proofs against the 20%
- Compliance is a fact of life
- Periodically test your DR solution
- Sleep well at night!
29. Thank You!
Bob Lamendola, 631-864-0311, Bob.Lamendola_at_mindshift.com