1
Long-Distance Disaster Tolerance Technology,
Challenges, State of the Art, and Directions
  • Keith Parris Systems/Software Engineer, HP
  • Session 1520

2
Topics
  • Terminology
  • Disaster Recovery vs. Disaster Tolerance
  • Metrics
  • Basic technologies and state of the art
  • Historical Context
  • Trends
  • Challenges
  • Promising areas for future directions

3
High Availability (HA)
  • Ability for application processing to continue
    with high probability in the face of common
    (mostly hardware) failures
  • Typical technique: redundancy

4
Fault Tolerance (FT)
  • Ability for a computer system to continue
    operating despite hardware and/or software
    failures
  • Typically requires
  • Special hardware with full redundancy,
    error-checking, and hot-swap support
  • Special software
  • Provides the highest availability possible within
    a single datacenter

5
Disaster Recovery (DR)
  • Disaster Recovery is the ability to resume
    operations after a disaster
  • Foundation: off-site data storage of some sort
  • Typically,
  • There is some delay before operations can
    continue (many hours, possibly days), and
  • Some transaction data may have been lost from IT
    systems and must be re-entered

6
DR Methods
  • Tape Backup
  • Expedited hardware replacement
  • Vendor Recovery Site
  • Data Vaulting
  • Hot Site

7
Disaster Tolerance vs. Disaster Recovery
  • Disaster Recovery is the ability to resume
    operations after a disaster.
  • Disaster Tolerance is the ability to continue
    operations uninterrupted despite a disaster.

8
Disaster Tolerance Ideals
  • Ideally, Disaster Tolerance allows one to
    continue operations uninterrupted despite a
    disaster
  • Without any appreciable delays
  • Without any lost transaction data

9
Quantifying Disaster Tolerance and Disaster
Recovery Requirements
  • Commonly-used metrics
  • Recovery Point Objective (RPO)
  • Amount of data loss that is acceptable, if any
  • Recovery Time Objective (RTO)
  • Amount of downtime that is acceptable, if any

10
Recovery Point Objective (RPO)
  • Recovery Point Objective is measured in terms of
    time
  • RPO indicates the point in time to which one is
    able to recover the data after a failure,
    relative to the time of the failure itself
  • RPO effectively quantifies the amount of data
    loss permissible before the business is adversely
    affected

(Diagram: time line showing the Recovery Point Objective as the interval between the last Backup and the Disaster)
11
Recovery Time Objective (RTO)
  • Recovery Time Objective is also measured in terms
    of time
  • Measures downtime
  • from time of disaster until business can continue
  • Downtime costs vary with the nature of the
    business and with outage length (see the worked
    example after the diagram below)

(Diagram: time line showing the Recovery Time Objective as the interval between the Disaster and the point where Business Resumes)
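A minimal worked example of applying these two metrics (all numbers below are illustrative assumptions, not figures from the presentation): given how often data leaves the site and how long recovery takes, check whether the RPO and RTO targets are met.

    # Hypothetical RPO/RTO check (all values are illustrative assumptions)
    BACKUP_INTERVAL_HOURS = 24     # nightly backup shipped off-site
    RPO_TARGET_HOURS = 4           # at most 4 hours of lost data is acceptable
    RECOVERY_TIME_HOURS = 36       # measured time from disaster until business resumes
    RTO_TARGET_HOURS = 8           # at most 8 hours of downtime is acceptable

    # Worst case: the disaster strikes just before the next backup would have run,
    # so everything written since the previous backup is lost.
    worst_case_data_loss_hours = BACKUP_INTERVAL_HOURS

    print("RPO", "met" if worst_case_data_loss_hours <= RPO_TARGET_HOURS else "violated",
          f"(worst-case loss {worst_case_data_loss_hours} h vs. target {RPO_TARGET_HOURS} h)")
    print("RTO", "met" if RECOVERY_TIME_HOURS <= RTO_TARGET_HOURS else "violated",
          f"(downtime {RECOVERY_TIME_HOURS} h vs. target {RTO_TARGET_HOURS} h)")

With a nightly backup and a 36-hour recovery, both targets are missed, which is exactly the gap that disaster-tolerant (rather than disaster-recovery) designs aim to close.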
12
Disaster Tolerance vs. Disaster Recovery based on
RPO and RTO Metrics
(Chart: Recovery Time Objective plotted against Recovery Point Objective; Disaster Tolerance sits at Zero downtime and Zero data loss, while Disaster Recovery lies toward increasing downtime and increasing data loss)
13
Historical Context
  • 1993 World Trade Center bombing raised awareness
    of DR and prompted some improvements
  • Sept. 11, 2001 has had dramatic and far-reaching
    effects
  • Scramble to find replacement office space
  • Many datacenters moved off Manhattan Island, some
    out of NYC entirely
  • Increased distances to DR sites
  • Induced regulatory responses (in USA and abroad)

14
Trends and Driving Forces
  • BC, DR and DT in a post-9/11 world
  • Recognition of greater risk to datacenters
  • Particularly in major metropolitan areas
  • Push toward greater distances between redundant
    datacenters
  • It is no longer inconceivable that, for example,
    terrorists might obtain a nuclear device and
    destroy the entire NYC metropolitan area

15
Trends and Driving Forces
  • "Draft Interagency White Paper on Sound Practices
    to Strengthen the Resilience of the U.S.
    Financial System
  • http//www.sec.gov/news/studies/34-47638.htm
  • Agencies involved
  • Federal Reserve System
  • Department of the Treasury
  • Securities and Exchange Commission (SEC)
  • Applies to
  • Financial institutions critical to the US economy

16
Draft Interagency White Paper
  • The early concept release inviting input made
    mention of a 200-300 mile limit (only as part of
    an example when asking for feedback as to whether
    any minimum distance value should be specified or
    not)
  • "Sound practices. Have the agencies sufficiently
    described expectations regarding out-of-region
    back-up resources? Should some minimum distance
    from primary sites be specified for back-up
    facilities for core clearing and settlement
    organizations and firms that play significant
    roles in critical markets (e.g., 200 - 300 miles
    between primary and back-up sites)? What factors
    should be used to identify such a minimum
    distance?"

17
Draft Interagency White Paper
  • This induced panic in several quarters
  • NYC feared additional economic damage of
    companies moving out
  • Some pointed out the technology limitations of
    some synchronous mirroring products and of Fibre
    Channel at the time, which typically limited them
    to a distance of 100 miles or 100 km
  • The revised draft contained no specific distance
    numbers, just cautionary wording
  • Ironically, that same non-specific wording now
    often results in DR datacenters 1,000 to 1,500
    miles away

18
Draft Interagency White Paper
  • "Maintain sufficient geographically dispersed
    resources to meet recovery and resumption
    objectives."
  • "Long-standing principles of business continuity
    planning suggest that back-up arrangements should
    be as far away from the primary site as necessary
    to avoid being subject to the same set of risks
    as the primary location."

19
Draft Interagency White Paper
  • "Organizations should establish back-up
    facilities a significant distance away from their
    primary sites."
  • "The agencies expect that, as technology and
    business processes continue to improve and
    become increasingly cost effective, firms will
    take advantage of these developments to increase
    the geographic diversification of their back-up
    sites."

20
Ripple effect of Regulatory Activity within the
USA
  • National Association of Securities Dealers
    (NASD)
  • Rules 3510 and 3520
  • New York Stock Exchange (NYSE)
  • Rule 446

21
Regulatory Activity Outside the USA
  • United Kingdom Financial Services Authority
  • Consultation Paper 142: Operational Risk and
    Systems Control
  • Europe
  • Basel II Accord
  • Australian Prudential Regulation Authority
  • Prudential Standard for business continuity
    management (APS 232) and guidance note AGN 232.1
  • Monetary Authority of Singapore (MAS)
  • Guidelines on Risk Management Practices:
    Business Continuity Management, affecting
    Significantly Important Institutions (SIIs)

22
Resiliency Maturity Model project
  • The Financial Services Technology Consortium
    (FSTC) has begun work on a Resiliency Maturity
    Model
  • Taking inspiration from the Carnegie Mellon
    Software Engineering Institute's Capability
    Maturity Model (CMM) and Networked Systems
    Survivability Program
  • Intent is to develop industry-standard metrics to
    evaluate an institution's business continuity,
    disaster recovery, and crisis management
    capabilities

23
Technologies
  • Inter-site data replication
  • Clustering for availability

24
Data Replication Technologies
  • Hardware
  • Mirroring between disk subsystems
  • Software
  • Host-based mirroring software
  • Database replication or log-shipping
  • Middleware or transaction processing monitor with
    replication functionality (e.g. HP Reliable
    Transaction Router)

25
Data Replication in Hardware
  • HP StorageWorks Continuous Access (CA)
  • EMC Symmetrix Remote Data Facility (SRDF)

26
Data Replication in Software
  • Host software disk mirroring or shadowing
  • Volume Shadowing Software for OpenVMS
  • MirrorDisk/UX for HP-UX
  • Veritas VxVM with Volume Replicator extensions
    for UNIX and Windows

27
Data Replication in Software
  • Database replication or log-shipping
  • Replication within the database software
  • Remote Database Facility (RDF) on NonStop
  • Oracle DataGuard (Oracle Standby Database)
  • Database backups plus Log Shipping

28
Data Replication in Software
  • TP Monitor/Transaction Router
  • e.g. HP Reliable Transaction Router (RTR)
    Software on OpenVMS, UNIX, Linux, and Windows

29
Data Replication in Hardware
  • Data mirroring schemes
  • Synchronous
  • Slower, but no chance of data loss in conjunction
    with a site loss
  • Asynchronous
  • Faster, and works over longer distances
  • But can lose seconds' or minutes' worth of data
    (more under high loads) in a site disaster (see
    the sketch below)
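The sketch below is a simplified illustration of that trade-off, not any vendor's implementation (the 16 ms figure is an assumed inter-site round trip): a synchronous mirror acknowledges a write only after the remote copy is safe, while an asynchronous mirror acknowledges locally and ships the update later, so whatever is still queued can be lost with the site.

    import collections
    import time

    INTER_SITE_RTT_S = 0.016   # assumed ~16 ms round trip between sites

    def synchronous_write(data, local_disk):
        """Write locally, then wait for the remote site's acknowledgement."""
        local_disk.append(data)
        time.sleep(INTER_SITE_RTT_S)   # stand-in for shipping the write and awaiting the ack
        # Only now does the application see the write complete: no data is lost if the
        # site fails, but every write pays the inter-site round trip.

    def asynchronous_write(data, local_disk, remote_queue):
        """Write locally, queue the update for the remote site, return immediately."""
        local_disk.append(data)
        remote_queue.append(data)      # shipped to the remote site in the background
        # Fast, but anything still sitting in remote_queue is lost if the primary
        # site is destroyed before the queue drains.

    local_disk, remote_queue = [], collections.deque()
    synchronous_write(b"txn-41", local_disk)
    asynchronous_write(b"txn-42", local_disk, remote_queue)
    print("updates at risk if the primary site is lost now:", len(remote_queue))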

30
Basic underlying challenges, and technologies to
address them
  • Data protection through data replication
  • Geographic separation for the sake of relative
    safety
  • Careful site selection
  • Application coordination
  • Long-distance multi-site clustering
  • Inter-site link technology choices
  • Inter-site link bandwidth and media types
    available vary widely with location
  • Cost can be very high in some cases
  • Inter-site latency due to the speed of light
  • And its adverse impact on performance

31
Inter-site Link Options
  • Sites linked by
  • DS-3/T3 (E3 in Europe) or ATM circuits from a
    telecommunications vendor
  • Microwave link (DS-3/T3 or Ethernet)
  • Free-Space Optics link (short distance, low cost)
  • Dark fiber, where available: ATM over SONET,
    or
  • Ethernet over fiber (10 Mb, Fast, Gigabit)
  • FDDI
  • Fibre Channel
  • Fiber links between Memory Channel switches (up
    to 3 km)

32
Inter-site Link Options
  • Sites linked by
  • Wave Division Multiplexing (WDM), in either
    Coarse (CWDM) or Dense (DWDM) Wave Division
    Multiplexing flavors
  • Can carry any of the types of traffic that can
    run over a single fiber
  • Individual WDM channel(s) from a vendor, rather
    than entire dark fibers

33
Bandwidth of Inter-Site Link(s)
34
Long-distance Cluster Issues
  • Latency due to the speed of light becomes
    significant at greater distances. Rules of thumb
    (see the sketch below):
  • About 1 ms per 125 miles, one-way, or
  • About 1 ms per 62 miles, round trip
  • Actual circuit path length can be longer than
    highway mileage between sites
  • Latency adversely affects performance
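Applying the rule of thumb above (a sketch using the slide's approximations, not measured circuit data):

    # Rule-of-thumb latency from path length: ~1 ms one-way per 125 miles,
    # ~1 ms round trip per 62 miles. Real circuit paths are usually longer
    # than highway mileage, so treat these as optimistic estimates.

    def one_way_latency_ms(path_miles):
        return path_miles / 125.0

    def round_trip_latency_ms(path_miles):
        return path_miles / 62.0

    for miles in (100, 500, 1000, 1500):
        print(f"{miles:5d} miles: ~{one_way_latency_ms(miles):5.1f} ms one-way, "
              f"~{round_trip_latency_ms(miles):5.1f} ms round trip")

At the 1,000-to-1,500-mile separations now being chosen for DR datacenters, every synchronous round trip costs on the order of 16 to 24 milliseconds.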

35
Round-trip Packet Latencies
36
Inter-site Latency: Actual Customer Measurements
37
Differentiate between latency and bandwidth
  • Can't get around the speed of light and its
    latency effects over long distances
  • A higher-bandwidth link doesn't mean lower latency

38
SAN Extension
  • Fibre Channel distance over fiber is limited to
    about 100-200 kilometers
  • A shortage of buffer-to-buffer credits adversely
    affects Fibre Channel performance above about 50
    kilometers; some vendors provide more credits for
    a price (see the sketch below)
  • Various vendors provide SAN Extension boxes to
    connect Fibre Channel SANs over an inter-site
    link such as SONET, DS-3, ATM, Gigabit Ethernet,
    an IP network, etc.
  • See SAN Design Reference Guide Vol. 4: SAN
    extension and bridging
  • http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00310437/c00310437.pdf
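The buffer-to-buffer credit limit can be reasoned about with a simple bandwidth-delay calculation. The sketch below is an approximation (the frame size, link rate, and fiber propagation delay are assumed values, not figures from the guide above): to keep a link streaming, the sender needs roughly enough credits to cover the frames in flight during one round trip.

    # Rough estimate of buffer-to-buffer credits needed to keep a long Fibre Channel
    # link full. All constants are illustrative assumptions.
    LINK_GBPS = 2.0           # nominal link data rate
    FRAME_BYTES = 2048        # approximate full-size frame payload
    FIBER_US_PER_KM = 5.0     # one-way propagation delay per km of fiber

    def credits_needed(distance_km):
        round_trip_us = 2 * distance_km * FIBER_US_PER_KM
        frame_time_us = FRAME_BYTES * 8 / (LINK_GBPS * 1000)  # time to serialize one frame
        return max(1, round(round_trip_us / frame_time_us))

    for km in (10, 50, 100, 200):
        print(f"{km:4d} km: roughly {credits_needed(km)} credits to keep the link streaming")

Under these assumptions a 100 km link wants on the order of a hundred credits to stay busy, which lines up with the observation above that performance falls off beyond about 50 kilometers unless more credits are bought.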

39
Long-distance Synchronous Host-based Mirroring
Software Tests
  • OpenVMS Host-Based Volume Shadowing (HBVS)
    software (host-based mirroring software)
  • Synchronous mirroring product
  • SAN Extension used to extend SAN using FCIP boxes
  • AdTech box used to simulate distance via
    introduced packet latency
  • No OpenVMS Cluster involved across this distance
    (no OpenVMS node at the remote end, just data
    vaulting to a distant disk controller)

40
Long-distance HBVS Test Results
41
Mitigating the Impact of Distance
  • Do transactions as much as possible in parallel
    rather than serially
  • May have to find and eliminate hidden
    serialization points in applications and system
    software
  • Minimize the number of round trips between sites
    needed to complete a transaction (see the sketch
    below)
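A minimal sketch of the idea (the remote_write routine is hypothetical, and the 16 ms delay stands in for the inter-site round trip): writes that do not depend on each other can be issued together, so the transaction pays roughly one round trip instead of one per write.

    import asyncio

    INTER_SITE_RTT_S = 0.016   # assumed inter-site round trip

    async def remote_write(block):
        """Stand-in for one mirrored write that must cross the inter-site link."""
        await asyncio.sleep(INTER_SITE_RTT_S)
        return block

    async def serial_transaction(blocks):
        # Each write waits for the previous one: latency ~ len(blocks) * RTT
        for b in blocks:
            await remote_write(b)

    async def parallel_transaction(blocks):
        # Independent writes are issued together: latency ~ 1 * RTT
        await asyncio.gather(*(remote_write(b) for b in blocks))

    async def main():
        blocks = list(range(8))
        loop = asyncio.get_running_loop()
        t0 = loop.time()
        await serial_transaction(blocks)
        t1 = loop.time()
        await parallel_transaction(blocks)
        t2 = loop.time()
        print(f"serial:   {t1 - t0:.3f} s   (~{len(blocks)} round trips)")
        print(f"parallel: {t2 - t1:.3f} s   (~1 round trip)")

    asyncio.run(main())

Hidden serialization points in application or system software are precisely what prevents the parallel case from being achieved in practice.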

42
Minimizing Round Trips Between Sites
  • Some vendors have Fibre Channel SCSI-3 protocol
    tricks to do writes in 1 round trip vs. 2
  • e.g. Cisco's Write Acceleration

43
Mitigating Impact of Inter-Site Latency
  • How applications are distributed across a
    multi-site cluster can affect performance when a
    distributed lock manager is involved (e.g. Oracle
    RAC, OpenVMS Cluster or TruCluster)
  • But this represents a trade-off among
    performance, availability, and resource
    utilization

44
Application Scheme 1: Hot Primary/Cold Standby
  • All applications normally run at the primary site
  • The second site is idle, except for volume
    shadowing, until the primary site fails; then it
    takes over processing
  • Performance will be good (all-local locking)
  • Fail-over time will be poor, and risk high
    (standby systems not active and thus not being
    tested)
  • Wastes computing capacity at the remote site

45
Application Scheme 2: Hot/Hot but Alternate
Workloads
  • All applications normally run at one site or the
    other, but not both; data is shadowed between
    sites, and the opposite site takes over upon a
    failure
  • Performance will be good (all-local locking)
  • Fail-over time will be poor, and risk moderate
    (standby systems in use, but specific
    applications not active and thus not being tested
    from that site)
  • The second site's computing capacity is actively used

46
Application Scheme 3: Uniform Workload Across
Sites
  • All applications normally run at both sites
    simultaneously; the surviving site takes all load
    upon failure
  • Performance may be impacted (some remote locking)
    if inter-site distance is large
  • Fail-over time will be excellent, and risk low
    (standby systems are already in use running the
    same applications, thus constantly being tested)
  • Both sites' computing capacity is actively used

47
Work-arounds being used today
  • Multi-hop replication
  • Synchronous to nearby site
  • Asynchronous to far-away site

48
Promising areas for investigation
  • Replicate at higher levels to reduce round-trips
    and replication volumes
  • e.g. replicate the transaction message (a few
    hundred bytes) with Reliable Transaction Router
    instead of replicating all the database page
    updates (often 8 kilobytes or 64 kilobytes per
    page) and journal log file writes behind a
    database (see the arithmetic below)
  • or replicate database transactions instead of
    database disk writes
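Back-of-the-envelope arithmetic for that example (the sizes are illustrative assumptions: a transaction message of a few hundred bytes versus two modified 8 KB pages plus a journal record):

    # Bytes crossing the inter-site link per transaction, by replication level.
    # All sizes are illustrative assumptions.
    TXN_MESSAGE_BYTES = 300          # transaction message replicated by a TP router
    PAGE_BYTES = 8 * 1024            # one database page
    PAGES_TOUCHED = 2                # pages modified by the transaction
    JOURNAL_WRITE_BYTES = 512        # journal/log record for the transaction

    block_level = PAGES_TOUCHED * PAGE_BYTES + JOURNAL_WRITE_BYTES
    txn_level = TXN_MESSAGE_BYTES

    print(f"block-level replication:       {block_level:6d} bytes per transaction")
    print(f"transaction-level replication: {txn_level:6d} bytes per transaction")
    print(f"reduction:                     ~{block_level / txn_level:.0f}x")

Fewer bytes and fewer distinct writes per transaction also mean fewer inter-site round trips to wait for.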

49
Promising areas for investigation
  • Parallelism as a potential solution
  • Rationale
  • Adding 30 milliseconds to a typical transaction
    for a human may not be noticeable
  • Having to wait for many 30-millisecond
    transactions in front of yours slows things down
  • Applications in the future may have to be built
    with greater awareness of inter-site latency
  • Promising direction (see the arithmetic below):
  • Allow many more transactions to be in flight in
    parallel
  • Each will take longer, but overall throughput
    (transaction rate) might be the same or even
    higher than now
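The throughput argument can be made concrete with a little arithmetic (the 30 ms figure comes from the slide above; the other values are illustrative assumptions):

    # Throughput is roughly (transactions in flight) / (per-transaction latency).
    LOCAL_LATENCY_S = 0.010              # assumed local transaction time
    ADDED_INTER_SITE_LATENCY_S = 0.030   # the extra 30 ms cited above

    def throughput_tps(in_flight, latency_s):
        return in_flight / latency_s

    remote_latency = LOCAL_LATENCY_S + ADDED_INTER_SITE_LATENCY_S
    print(f"serial, local only:        {throughput_tps(1, LOCAL_LATENCY_S):7.0f} TPS")
    print(f"serial, with 30 ms added:  {throughput_tps(1, remote_latency):7.0f} TPS")
    print(f"32 in flight, 30 ms added: {throughput_tps(32, remote_latency):7.0f} TPS")

Each transaction is slower, but with enough transactions in flight the overall rate can match or exceed the serial, single-site case.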

50
Useful Resources
  • Tabb Research report
  • "Crisis in Continuity Financial Markets Firms
    Tackle the 100 km Question"
  • available from https//h30046.www3.hp.com/campaign
    s/2005/promo/wwfsi/index.php?mcclanding_pagejump
    idex_R2548_promo/fsipaper_mcc7Clanding_page

51
Useful Resources
  • Disaster Recovery Journal
  • http://www.drj.com/
  • Continuity Insights Magazine
  • http://www.continuityinsights.com/
  • Contingency Planning & Management Magazine
  • http://www.contingencyplanning.com/
  • All are high-quality journals. The first two are
    available free to qualified subscribers
  • All hold conferences as well.

52
Keith Parris
  • Case studies
  • a large Credit Union
  • New York clearing firms

53
Questions?
54
Speaker Contact Info
  • Keith Parris
  • E-mail: Keith.Parris@hp.com or keithparris@yahoo.com
  • Web: http://www2.openvms.org/kparris/

56
get connected
People. Training. Technology.