Going Beyond Recovery to Continuity: Lessons Learned

Subject: EDUCAUSE Live! Sept 6, 2006
Author: David G. Swartz
Last modified by: Colleen Keller

1
Going Beyond Recovery to Continuity: Lessons Learned
  • Dave Swartz
  • Vice President and CIO
  • The George Washington University

2
Brief Background on GW
  • Main campus
  • Washington, DC
  • 100 buildings
  • Blocks from the White House, IMF/World Bank,
    State Dept.
  • 27,000 people
  • 20K students (50% UG and 50% graduate and
    professional students)
  • 7K faculty and staff
  • Of the 20K there are 8K resident students
  • Major medical center: the ER for the leadership
    of our government
  • Two other smaller campuses in region
  • 2.5 Gbps into the Internet and Internet2
  • 15K voice connections and 17K data connections
  • Two major data centers 34 miles apart

[Map: GW's location relative to the White House, IMF/WB, State Dept., and Pentagon]
3
Some Drivers for Business Continuity at GW
  • Explosions in manholes in the street
  • Recurring, unexplained accumulations of flammable
    liquids in the storm drains explode, shutting
    off power to a few buildings for days.
  • Flood hits Academic Center with Data Center
  • A backed-up city sewer system causes a flood in a
    building not designed for a data center.
  • Change Management Issues
  • Our Facilities group is prone to taking
    significant actions without much notice,
    including cutting off power or cooling to a
    building.
  • Email Systems Failure
  • We lost the SAN. Basic email was down for 24
    hours, and it took 3 days to restore the
    archive.
  • Cybersecurity Incidents
  • After a major worm infestation and also a hack on
    a trusted host in 2000, GW creates its
    Information Security Program.
  • 9/11
  • "The tragic events of Sept. 11 and their
    aftermath have resulted in changes in the way all
    of us conduct our lives," said President Stephen
    Joel Trachtenberg. "Just as GW strives for
    academic excellence, we also want to take all
    appropriate steps to ensure the safety and
    well-being of our community and the continued
    operation of the university."
  • GW was close to ground zero that day and all
    land-based phones and cell phones were congested
    for much of the day.
  • Sarbanes-Oxley
  • A risk-conscious Board of Trustees has led to a
    number of initiatives to address BC at GW.

4
Who Owns BC at GW?
  • John Petrie, AVP for Public Safety & Emergency
    Mgmt., holds an AB degree from Villanova
    University and a master's and doctorate from The
    Fletcher School of Law and Diplomacy.
  • A career Naval officer, he was the head of the
    Naval Station at Norfolk, the world's largest
    naval complex, and also professor and head of
    research at the War College.
  • The AVP position was created after 9/11 and was
    designed to broaden, coordinate, and execute the
    University's crisis management, business
    continuity, emergency preparedness, and public
    safety plans and activities.
  • "We need to have people at the local level
    comfortable with what's expected of them and what
    they have the authority to do," Petrie says. "If
    they are confident and comfortable, then the
    chances of their being able to prepare, respond,
    or recover are easier."
  • John's number one priority is the safety and
    welfare of people.
  • He sits on regional and national emergency
    management response groups and represents the
    regional universities in exercises.
  • References
  • BC Plan - http://www.gwu.edu/response/contents.cfm
  • Advisories and Alerts - http://www.gwu.edu/gwalert/

John Petrie, AVP for Public Safety & Emergency Mgmt.
John has helped to lead the development and
administration of BC plans and testing, and an
integrated system of advisories, alerts,
and real-time communications.
5
Role of IT in Campus BC
  • Address the risks of IT failures
  • IT has helped to coordinate and fund the
    development of departmental plans for the 19
    core offices
  • Many core departments needed help to get their
    BC plans done; they felt IT had things under
    control, so why did they have to plan?
  • They also had difficulty freeing themselves from
    other priorities and needed their VP to make BC
    a priority!
  • IT has also helped to deliver:
  • Campus Alerts (web page, portal, email, 3rd party
    call service)
  • Backup web site
  • Redundant email system and broadcast server
    (reflector and Listserv)
  • Alternate routing to different area code for our
    main incoming and outgoing phone lines
  • Emergency intercom broadcasts over speaker phones
  • A network of Blackberries and support for
    management
  • Online directories and BC response plans
  • A fully configured and supported command center.

6
The Planning Process
  • Identify sources of risks and plan accordingly
  • Provide assistance
  • Standard templates and questions to facilitate
    preparation of plans (available on request)
  • Expert assistance to develop plan
  • Review of plans
  • Enlist support
  • Of senior management, the Board and all core
    offices
  • Prioritize efforts
  • Not every department needs a comprehensive plan.
    At GW we identified 19 core offices that needed
    detailed plans.
  • Make the plan easily available
  • Regularly test the plan and the ability to think
    on your feet
  • Keep plans current
  • All plans require periodic review, validation and
    update.

The online plan for GW is called the Incident
Planning, Response, and Recovery Manual;
it includes the individual BC plans.
7
The GW IT Recovery Profile
  • Rebuild & Replace Disaster Recovery
  • Tape backup and priority shipment of equipment
  • Weeks to recovery
  • Hot-Site Disaster Recovery
  • Off site arrangements with a hot-site provider
  • Several days to recovery
  • High Availability Operations
  • Redundant data centers, networks and telecom
  • Less than one day and ideally less than a couple
    of hours to recovery.

[Chart: Hours to Recovery for Rebuild & Replace, Hot-Site, and High-Availability; values shown: 420 (projected), 84, 12, < 2]
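The three tiers above can be read as a lookup from how long a service can tolerate being down to the slowest approach that still meets that tolerance. The following is a minimal sketch in Python, assuming the chart's figures of 420 (projected), 84, and 12 hours are each tier's typical time to recovery; the function name and the example tolerances are hypothetical.

```python
# Typical hours to recovery per tier, taken from the chart on this slide.
# Treating each figure as that tier's upper bound is an assumption.
TIERS = [
    ("Rebuild & Replace", 420),  # projected
    ("Hot-Site", 84),
    ("High-Availability", 12),
]


def slowest_adequate_tier(max_tolerable_hours):
    """Return the slowest listed tier that still recovers within the tolerance."""
    for name, hours in TIERS:  # ordered slowest to fastest
        if hours <= max_tolerable_hours:
            return name
    return None  # no listed figure is fast enough


if __name__ == "__main__":
    # Hypothetical tolerances, in hours, for four different services.
    for tolerance in (720, 96, 24, 1):
        print(tolerance, "->", slowest_adequate_tier(tolerance))
```

A result of None means none of the three listed figures is fast enough, which is where the sub-two-hour end of the high-availability range would matter.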
8
Dealing with Risk: Continuity Rather Than
Recovery
  • Common areas of IT risk were addressed with a
    focus on major risks and points of failure:
  • Data Center
  • Telecommunications
  • Network and ISP
  • Data
  • Security
  • Power and Cooling
  • Change and Service Management
  • Classrooms
  1. Continuity of operations needs to be built into
    the architecture and culture from the bottom up.
  2. If you live with it and use it day to day, then it
    is less of a big deal when a disaster hits.
  3. BC at a comprehensive local level is essential to
    enable IT to deliver the sustainability of data
    and information services.

9
Data Center Redundancy
  • We have created dual data centers
  • separated by 34 miles
  • connected by a DWDM link over a redundant dark
    fiber ring
  • We split Test/Dev from the Prod instances.
  • We also deploy VMware and virtualize servers
    across centers.
  • Production is not all at one site; it is split
    on a 35-65 basis.
  • We mirror data between data centers.
  • We have staff split between centers.
  • We routinely test failover during maintenance and
    upgrades.
  • This design enables continuity of operations
    without the need to recover from most disasters.

10
Telecommunications Redundancy
  • We have several PBX switches (Avaya S8700s)
    interconnected, load balanced, and spatially
    distributed.
  • Two are on the main campus and separated. The
    third is on a remote campus 34 miles away in a
    different area code.
  • We have the ability to re-route incoming and
    outgoing calls through different campuses and
    area codes.
  • There are redundant emergency 911 and analog
    lines as a backup to our main trunks.
  • Some specific phone numbers are protected and
    given regional priority for accessibility and
    sustainability during a major incident.
  • We maintain copper connections for voice to
    permit inline power off of diesel generators to
    15,000 phones.

11
Data Redundancy
  • All enterprise data is mirrored between data
    centers, including ERP, data marts, email,
    one-card, portal, and web systems.
  • The main campus file servers are automatically
    backed up. Legacy departmental systems are slowly
    transitioning to central support and
    sustainability, a difficult political process.
  • Desktops in many core offices have a standard
    image and automatically store to a central suite
    of file servers.
  • Critical documents are being stored online in an
    enterprise document management system and
    archived to tape.
  • We regularly test data backups to make sure we
    can restore from them.
  • One of the most critical aspects of continuity is
    rapid access to the data!

On-site fire-rated vault in addition to off-site
storage
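One way to make "test data backups" concrete is to restore a sample to a scratch location and compare it with the source file by file. The sketch below is an illustration in Python, not GW's procedure: it assumes both copies are reachable as ordinary directories, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or differ after a test restore."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        # A file fails the check if it was not restored or its contents changed.
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            problems.append(str(rel))
    return problems
```

An empty list means every source file came back byte-for-byte identical; anything else is a list of files to investigate before the backup is needed for real.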
12
Information Security
  • Protecting the university from security risks
    that can interrupt operations and cost millions
    of dollars in lost productivity and liability is
    an important priority in BC.
  • The best approach is defense in depth, layered
    like an onion.
  • One of our newest efforts after securing campus
    file servers is our desktop initiative.
  • We now use Novell Patchlinks, Cisco Clean Access
    and IPS to automate updates and to verify
    conformance to standards and freedom from
    infection.
  • As a result, desktop infection problems have
    declined to a trickle.
  • Creating a focused Information Security program,
    setting standards, and centralizing services, are
    critical to success.

"Rounding Up Rogue Servers," article in the July 2005
Chronicle.
13
Power and Cooling
  • Power Redundancy
  • Conditioned Commercial Power
  • 450 kW Diesel Generator w/Maintenance Tap
  • Automatic Transfer Switch
  • Uninterruptible Power Supplies (UPS)
  • Multiple Power supplies in each computer system
  • 48-hour supply of diesel (going to 96 hours),
    with priority shipments possible from three
    regional vendors
  • Redundant Air Conditioning Systems
  • Chilled Water Plant: Two 60-Ton Dry Coolers
  • Glycol Chilled Water Air Handlers

14
Change & Service Management
[Diagram: Change Control via Integration. Labels shown: App. Change Control; Prob Tickets, Service Orders; Remedy; Kintana; Work Requests C3; Asset Management; S/W License Mgmt; Remedy; TBD; Upside; Aperture]
Adoption of integrated change control is one of
the major factors in the improvement and reliability
of operations.
15
Classrooms
  • What happens if we lose some classroom space?
    How could we continue to conduct classes?
  • Using R25i (Resource25 3.3) to complement
    Schedule25 we can identify and reallocate any
    available university space to classrooms
  • Using Blackboard (Bb) and Elluminate we can
    conduct classes virtually from home.
  • We are piloting this approach now for snow days
    and other unscheduled ad hoc gatherings such as
    study sessions.
  • We are also suggesting that faculty teach one
    virtual class every month so they have practice.
  • Podcasting (Apreso, iPods)
  • GW is supporting Podcasting of its non-credit
    lecture series to provide access to recorded
    presentations.
  • Could this be expanded for credit classes?
    Depends on support from faculty.

16
Selling BC: not the WHAT, but the HOW
  • Rational Approach
  • The risk or probability of the event multiplied
    by the potential loss suggests the magnitude of
    the investment warranted to protect a university
    from disaster. Not many use this approach.
  • Peer Group Benchmarks
  • A very common and accepted approach is to
    compare the university against the market basket
    of peer institutions to see what they are doing.
  • Leverage the Crisis
  • The emotional side of living through a crisis
    tends to ease the flow of funds, so capture the
    opportunity when it arises.
  • Partnering with the Board and Audit Team
  • The Board has the ability to drive
    improvements. The External and Internal Audit
    Teams are agents of the Board and should be
    viewed as partners, not as the threat they are
    often taken to be.
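The Rational Approach above is a one-line calculation: probability of the event times the potential loss. Here is a small sketch in Python, with a hypothetical function name and purely illustrative inputs (a 1-in-20-year event costing $5 million); none of these numbers are GW figures.

```python
def expected_annual_loss(annual_probability, potential_loss):
    """Probability of the event in a year multiplied by the loss if it occurs."""
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    return annual_probability * potential_loss


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    guide = expected_annual_loss(0.05, 5_000_000)
    print(f"Suggested magnitude of annual investment: ${guide:,.0f}")
```

The result is only a guide to scale. The hard part in practice is agreeing on the probability and the loss, which may be why few use this approach.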

17
Risks of Complexity
Virtualization, distant centers, and split
operations add complexity, which has its own
attendant risks.
Standardization, documentation, and tight change
control help to reduce risks from complexity.
18
Factors Related to Distance
  • How far away is far enough for a second center?
  • GW has selected 34 miles
  • USC has designated a bunker just a few miles
    away
  • Others are saying 70 miles.
  • It really depends
  • You need to consider the types of risks in your
    region.
  • The greater the distance
  • The greater the cost or lesser the functionality
    and immediacy of response.
  • You may want to
  • Have a secondary high-availability or hot-site
    nearby and a tertiary cold-site much farther
    away.
  • You need to consider
  • The impacts on your staff and their ability to
    make it to the different sites both for routine
    maintenance as well as during a disaster
  • Some types of clustering do not work at a
    distance
  • Real-time mirroring is also adversely affected by
    distance.
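The effect of distance on real-time mirroring can be estimated from propagation delay alone, since a synchronously mirrored write has to wait at least one round trip for the remote acknowledgement. The sketch below is in Python and assumes light travels through fiber at roughly 200,000 km/s. The route_factor parameter and the 500-mile comparison point are hypothetical, and equipment and protocol overhead are ignored.

```python
KM_PER_MILE = 1.609344
# Light in optical fiber travels at roughly 200,000 km/s (about two-thirds of c).
FIBER_KM_PER_SECOND = 200_000.0


def round_trip_ms(distance_miles, route_factor=1.0):
    """Minimum round-trip propagation delay over fiber, in milliseconds.

    route_factor scales the map distance up to the actual fiber path length
    (a hypothetical knob; real routes are longer than the straight line).
    """
    km = distance_miles * KM_PER_MILE * route_factor
    return 2 * km / FIBER_KM_PER_SECOND * 1000.0


if __name__ == "__main__":
    # 34 and 70 miles come from this slide; 500 is an illustrative far site.
    for miles in (34, 70, 500):
        print(f"{miles:>4} miles: at least {round_trip_ms(miles):.2f} ms per mirrored write")
```

At 34 miles the floor is about half a millisecond per write. The delay grows linearly with distance, which is one reason a far-away site tends to suit a cold or asynchronous role.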

19
Support those Blackberries
  • A critical element of the GW BC program is a
    network of Blackberries. All senior management
    at GW have them and use them every day.
  • Blackberries are more like laptops than phones
    and require expert assistance
  • They have cell phone and radio capability
  • They can send and receive email and instant text
    messages
  • They have the ability to surf the web and access
    calendars, directories and online documents that
    can be used to support BC
  • We have a dedicated expert with backup to provide
    support to the Blackberries and the command
    centers.

20
Doesn't it cost a great deal?
  • GW had a hot-site,
  • Costing several hundred thousand dollars per
    year.
  • Went to a high-availability 2nd site.
  • One-time cost of about $1 million
  • The ongoing costs were not more than the previous
    base budget due to the reallocation of the funds
    from the hot-site contract.
  • Increase in base needed was
  • $136K/yr = $1 million loaned at 6% over 10 years
  • To offset costs we are leasing excess space
  • We are recovering the incremental operating costs
    of the 2nd site.
  • More reliable service without large additional
    costs - A NO-BRAINER!

[Chart: Expected Cost Curve vs. GW Cost Curve, cost against Time to Restoration of Operations]
A myth propagated by hot-site vendors is that the
cost of customer-owned high availability is
prohibitive.
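The 136K/yr figure on this slide can be checked with the standard level-payment loan formula. The sketch below is in Python and assumes one payment per year at 6% annual interest, which the slide does not spell out; the function name is hypothetical.

```python
def annual_payment(principal, annual_rate, years):
    """Level annual payment that fully repays a loan (standard annuity formula)."""
    if annual_rate == 0:
        return principal / years
    return principal * annual_rate / (1 - (1 + annual_rate) ** -years)


if __name__ == "__main__":
    # $1 million at 6% over 10 years, as stated on the slide.
    payment = annual_payment(1_000_000, 0.06, 10)
    print(f"Annual payment: ${payment:,.0f}")
```

Under these assumptions the payment rounds to 136K per year, matching the slide. Monthly payments would give a slightly lower annual total.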
21
Partnerships
  • National Capital Regional Emergency Response
    Partnership
  • Emergency Response groups across the region
    coordinate efforts and share experiences
  • First Responder Access Card (FRAC)
  • Regional exercises
  • Information sharing with key groups
  • University Partnerships
  • Cost and resource sharing or exchange programs
  • Georgetown University and GW back one another up
  • MAX (Mid-Atlantic Crossroads gigapop)
  • Vendor Partnerships
  • Have helped GW identify best practices and
    utilize new technology useful to BC.
  • Their support in a disaster can be critical

The FRAC helps to get approved personnel across
roadblocks and barriers.
22
Questions?
Dave Swartz