Ten Minutes on Five Nines - PowerPoint PPT Presentation

About This Presentation
Title:

Ten Minutes on Five Nines

Description:

University of Washington. Computing & Communications. Ten Minutes on Five Nines. Terry Gray ... Terry's 'Works fine, lasts a long time' Low ROI (Risk Of ... – PowerPoint PPT presentation

Slides: 15
Provided by: TEG

Transcript and Presenter's Notes


1
Ten Minutes on Five Nines
  • Terry Gray
  • Associate VP, IT Infrastructure
  • University of Washington
  • <a last-minute recruit for>
  • Common PROBLEMS Group
  • 6 January 2005

2
Vision
  • Systems/Services (and Staff!) characterized as
    Reliable and Responsive
  • Reliability: job one
  • But I.T. = Inevitable Tensions
  • We all want:
  • High MTTF, Performance and Function
  • Low MTTR and support cost
  • The art is to balance those conflicting goals
  • we are jugglers and technology actuaries
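The MTTF/MTTR balance above can be made concrete with the standard steady-state availability formula, A = MTTF / (MTTF + MTTR). The sketch below is an illustration only; the function name and the sample figures are assumptions, not numbers from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up.

    mttf_hours: mean time to failure (average uptime between failures)
    mttr_hours: mean time to repair (average outage length)
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures: one failure per ~999 hours, one hour to repair.
    print(f"{availability(999, 1):.4%}")
    # Halving MTTR helps about as much as doubling MTTF.
    print(f"{availability(999, 0.5):.4%}")
    print(f"{availability(1998, 1):.4%}")
```

Raising MTTF and lowering MTTR both improve the ratio, which is why prevention and fast remediation compete for the same budget.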

3
Success Metrics
  • Tom's
  • Nobody gets hurt
  • Nobody goes to jail
  • Terry's
  • Works fine, lasts a long time
  • Low ROI (Risk Of Interruption)

4
Design Tradeoffs
  • Fault Zone size vs. Economy/Simplicity
  • Reliability vs. Complexity
  • Prevention vs. (Fast) Remediation
  • Security vs. Supportability vs. Functionality
  • Networks = Connectivity; Security = Isolation
  • Balancing priorities (security vs. ops vs.
    function)

5
Context: A Perfect Storm
  • Increased dependency on I.T.
  • Decreased tolerance for outages
  • Deferred maintenance
  • Inadequate infrastructure investment
  • Some extraordinarily fragile applications
  • Fragmented host management
  • Increasingly hostile network environment
  • esp. spam, spyware, social engineering attacks
  • Increasing legal/regulatory liability
  • Highly de-centralized culture
  • Growth of portable devices

6
System Elements
  • Environmentals (Power, A/C, Physical Security)
  • Network
  • Client Workstations (incl. portable devices)
  • Servers
  • Applications
  • Personnel, Procedures, Policy, and Architecture
  • Failures at one level can trigger problems at
    another level; need Total System perspective

7
Dimensions
  • How often is there a user-visible failure?
  • How many people are affected?
  • For how long?
  • How severely?

8
Basics
  • How many nines?
  • Problem one: what to measure?
  • How do you reduce behavior of a complex net to a
    single number?
  • Difficult for either uptime or utilization
    metrics
  • Problem two: data networks are not like phone or
    power services
  • Imagine if phones could assume anyone's number
  • Or place a million calls per second!
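For reference, "N nines" of availability translates into a downtime budget by simple arithmetic. The sketch below assumes a 365-day year and counts every minute of the year as in scope; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines.

    Three nines means 99.9%, five nines means 99.999%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Five nines works out to a little over five minutes of downtime per year, which says nothing about how many users were affected or how severely.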

9
Security vs. Reliability
  • Obviously lack of security is bad, but...
  • Defense in depth is not free
  • Each additional defensive perimeter increases MTTR
  • Defense-in-depth conjecture (for N layers)
  • Security: MTTE (exploit) ∝ N^2
  • Functionality: MTTI (innovation) ∝ N^2
  • Supportability: MTTR (repair) ∝ N^2
  • Next-gen threats: firewalls won't help
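To see what the conjecture implies, the sketch below reads each of its three relations as "proportional to N squared" (an interpretation of the slide's notation, not something the talk spells out) and computes how much such a quantity grows when layers are added. The function name and the baseline parameter are assumptions.

```python
def relative_growth(layers: int, baseline_layers: int = 1) -> float:
    """Growth factor of a quantity assumed proportional to N^2.

    Returns how many times larger the quantity is with `layers`
    defensive layers than with `baseline_layers`.
    """
    if layers < 1 or baseline_layers < 1:
        raise ValueError("layer counts must be at least 1")
    return (layers / baseline_layers) ** 2


if __name__ == "__main__":
    # Under the conjecture, time-to-exploit, time-to-innovate and
    # time-to-repair would all scale by the same factor.
    for n in (1, 2, 3, 4):
        print(f"{n} layers: x{relative_growth(n):.0f} relative to one layer")
```

Under this reading, the same factor that slows an attacker also slows repair and new functionality, which is the tension the slide points at.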

10
Complexity vs. Reliability
  • How do you measure avail in complex systems?
  • Death of the Network Utility Model
  • Organizational vs. geographic networking
  • SAN virtualization
  • Web load-leveler appliances
  • Organizational boundary conditions
  • Networks: from stochastic to non-deterministic
  • Subnets with clients and critical servers
  • Documentation deficiencies

11
Complex System Failures: Inevitable?
  • Jan 2004 (?) IEEE Spectrum on Power Grid
    failures
  • Point: it will happen, so plan for mitigation

12
Work in Progress
  • New trouble-ticket system
  • New network management system
  • Next-generation network architecture
  • Next-generation security architecture
  • Improving change control process
  • Improving DR/BR process
  • Lots of work on improving mon/diag tools

13
In Short
  • Expectations are growing (unrealistically?)
  • Complexity is growing
  • Few are prepared to pay for true HA
  • Cultural barriers to change control
  • Hospitals are a whole other world
  • Biggest SPoF: power/HVAC
  • Organizational complexity undermines HA
  • Both security and lack of it undermine HA
  • Redundancy can mask failures too well!
  • With redundancy, must have better tools
  • Need Ops-centric design, better DR/BR
  • Need application procurement standards

14
Questions? Comments?