1
Software Testing Doesn't Scale
  • James Hamilton
  • JamesRH_at_microsoft.com
  • Microsoft SQL Server

2
Overview
  • The Problem
  • S/W size & complexity inevitable
  • Short cycles reduce S/W reliability
  • S/W testing is the real issue
  • Testing doesn't scale
  • Trading complexity for quality
  • Cluster-based solution
  • The Inktomi lesson
  • Shared-nothing cluster architecture
  • Redundant data & metadata
  • Fault isolation domains

3
S/W Size & Complexity Inevitable
  • Successful S/W products grow large
  • Number of features used by a given user is small
  • But union of per-user feature sets is huge
  • Reality of commodity, high volume S/W
  • Large feature sets
  • Same trend as consumer electronics
  • Example: mid-tier server-side S/W stack
  • SAP: 47 MLOC
  • DB: 2 MLOC
  • NT: 50 MLOC
  • Testing all feature interactions impossible
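
To make the last bullet concrete, here is a rough sketch of how the number of potential feature interactions grows with the feature count. The feature counts used are hypothetical and chosen only for illustration; the slides do not give any.

```python
from math import comb


def interaction_count(features: int, max_order: int) -> int:
    """Count distinct feature subsets of size 2..max_order.

    Each subset is one potential inter-feature interaction to test.
    """
    return sum(comb(features, k) for k in range(2, max_order + 1))


if __name__ == "__main__":
    # Hypothetical feature counts, not taken from the slides.
    for n in (10, 100, 1000):
        pairs = interaction_count(n, 2)
        up_to_triples = interaction_count(n, 3)
        print(f"{n} features: {pairs} pairs, {up_to_triples} pairs+triples")
```

Even when only pairs and triples are counted, the totals grow polynomially in the feature count; counting subsets of every size gives 2^n - n - 1, which is the exponential growth the talk refers to.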

4
Short Cycles Reduce S/W Reliability
  • Reliable TP systems typically evolve slowly &
    conservatively
  • Modern ERP systems can go through 6 minor
    revisions/year
  • Many e-commerce sites change even faster
  • Fast revisions a competitive advantage
  • Current testing and release methodology
  • As much testing as dev time
  • Significant additional beta-cycle time
  • Unacceptable choice
  • Reliable but slow-evolving, or fast-changing yet
    unstable and brittle

5
Testing the Real Issue
  • 15 yrs ago, test teams were a tiny fraction of the
    dev group
  • Now test teams are of similar size to dev &
    growing rapidly
  • Current test methodology improving incrementally
  • Random grammar-driven test case generation
  • Fault injection
  • Code path coverage tools
  • Testing remains effective at feature testing
  • Ineffective at finding inter-feature interactions
  • Only a tiny fraction of Heisenbugs found in
    testing (www.research.microsoft.com/gray/Talks/ISAT_Gray_FT_Avialiability_talk.ppt)
  • Beta testing because test known to be inadequate
  • Test team growth scales exponentially with system
    complexity
  • Test and beta cycles already intolerably long
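
As an illustration of the "random grammar-driven test case generation" bullet, the following is a minimal sketch of the general technique, not the tooling the talk refers to. The toy SQL-like grammar and the `generate` function are hypothetical.

```python
import random

# Toy grammar (hypothetical, for illustration only). Keys are
# nonterminals; each maps to a list of alternative productions.
# The first alternative of every nonterminal is non-recursive.
GRAMMAR = {
    "<query>": [["SELECT ", "<cols>", " FROM ", "<table>", "<where>"]],
    "<cols>": [["*"], ["<col>"], ["<col>", ", ", "<cols>"]],
    "<col>": [["a"], ["b"], ["c"]],
    "<table>": [["t1"], ["t2"]],
    "<where>": [[""], [" WHERE ", "<col>", " = ", "<num>"]],
    "<num>": [["0"], ["1"], ["42"]],
}


def generate(symbol: str, rng: random.Random,
             depth: int = 0, max_depth: int = 8) -> str:
    """Expand `symbol` into a random sentence of the grammar."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, always take the first alternative,
        # which is non-recursive, so the expansion terminates.
        production = alternatives[0]
    else:
        production = rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth)
                   for s in production)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed: failures can be reproduced
    for _ in range(5):
        print(generate("<query>", rng))
```

Seeding the generator means any statement that exposes a bug can be regenerated exactly. Such a generator exercises individual features well, but as the slide notes, it does little to cover inter-feature interactions.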

6
The Inktomi Lesson
  • Inktomi web search engine (SIGMOD '98)
  • Quickly evolving software
  • Memory leaks, race conditions, etc. considered
    normal
  • Don't attempt to test & beta until quality high
  • System availability of paramount importance
  • Individual node availability unimportant
  • Shared nothing cluster
  • Exploit ability to fail individual nodes
  • Automatic reboots avoid memory leaks
  • Automatic restart of failed nodes
  • Fail fast: fail & restart when redundant checks
    fail
  • Replace failed hardware weekly (mostly disks)
  • Dark machine room
  • No panic midnight calls to admins
  • Mask failures rather than futilely attempting to
    avoid them
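
The fail-fast and automatic-restart bullets can be sketched as a small in-process simulation. Everything here is an assumption made for illustration: the `Node` and `Cluster` classes, the simulated memory leak, and the instantaneous restart are not Inktomi's implementation.

```python
from dataclasses import dataclass


class NodeFailure(Exception):
    """Raised when a node fails fast instead of trying to repair itself."""


@dataclass
class Node:
    name: str
    limit: int          # simulated memory budget (hypothetical)
    leaked: int = 0     # simulated leaked memory units
    restarts: int = 0

    def handle(self, request: str) -> str:
        if self.leaked >= self.limit:
            # Fail fast: stop rather than limp along.
            raise NodeFailure(f"{self.name}: out of memory")
        self.leaked += 1  # every request leaks a little (simulated bug)
        return f"{self.name} served {request}"

    def restart(self) -> None:
        self.leaked = 0   # a reboot clears the leak
        self.restarts += 1


class Cluster:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0

    def serve(self, request: str) -> str:
        # Round-robin over the nodes. A failed node is restarted and the
        # request is retried on the next node, so the client never sees
        # the failure. One extra attempt lets a node that was restarted
        # during this call serve the request if all others failed too.
        for _ in range(len(self.nodes) + 1):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            try:
                return node.handle(request)
            except NodeFailure:
                node.restart()
        raise RuntimeError("no node available")


if __name__ == "__main__":
    cluster = Cluster([Node("a", limit=3), Node("b", limit=5)])
    for i in range(12):
        print(cluster.serve(f"q{i}"))
    print({n.name: n.restarts for n in cluster.nodes})
```

The leak is never fixed in this sketch. Individual nodes keep failing, yet every request is answered, which is the sense in which failures are masked rather than avoided.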

7
Apply to High Value TP Data?
  • Inktomi model
  • Scales to 100s of nodes
  • S/W evolves quickly
  • Low testing costs and no-beta requirement
  • Exploits ability to lose individual node without
    impacting system availability
  • Ability to temporarily lose some data w/o
    significantly impacting query quality
  • Can't lose data availability in most TP systems
  • Redundant data allows node loss w/o loss of data
    availability
  • Inktomi model with redundant data & metadata: a
    solution to the exploding test problem
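
A minimal sketch of why redundant data lets a node be lost without losing data availability. The chained placement scheme and the function names are assumptions for illustration; the slides do not specify how copies are placed.

```python
def place(partitions: int, nodes: int, copies: int) -> dict[int, set[int]]:
    """Chained placement: partition p lives on `copies` consecutive nodes."""
    if copies > nodes:
        raise ValueError("need at least as many nodes as copies")
    return {p: {(p + i) % nodes for i in range(copies)}
            for p in range(partitions)}


def available(placement: dict[int, set[int]], failed: set[int]) -> set[int]:
    """Partitions that still have at least one copy on a live node."""
    return {p for p, holders in placement.items() if holders - failed}


if __name__ == "__main__":
    placement = place(partitions=8, nodes=4, copies=2)
    print(sorted(available(placement, failed={1})))     # one node lost
    print(sorted(available(placement, failed={1, 2})))  # two adjacent nodes lost
```

With two copies per partition, any single node loss leaves every partition reachable. Losing both holders of a partition at once still makes it unavailable, which is why the number of copies has to match the failure rate the system must tolerate.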

8
Connection Model/Architecture
[Diagram: Client, Server Node, Server Cloud]
  • All data & metadata multiply redundant
  • Shared nothing
  • Single system image
  • Symmetric server nodes
  • Any client connects to any server
  • All nodes SAN-connected

9
Compilation & Execution Model
[Diagram: Client, Server Cloud; Server Thread stages: Lex analyze, Parse, Normalize, Optimize, Code generate]

10
Node Loss/Rejoin
[Diagram: Client, Server Cloud]
  • Execution in progress
  • Rejoin:
  • Node local recovery
  • Rejoin cluster
  • Recover global data at rejoining node
  • Rejoin cluster

11
Redundant Data Update Model
[Diagram: Client, Server Cloud]
  • Updates are standard parallel plans
  • Optimizer knows all redundant data paths
  • Generated plan updates all
  • No significant new technology
  • Like materialized view & index updates today
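
The update model above can be sketched as a plan generator that emits one update step per redundant copy. This is an illustrative sketch only: the `Catalog` class, the copy names, and the in-memory storage are hypothetical, and a real plan would run the steps in parallel inside one transaction.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    # table name -> names of every redundant structure holding its rows
    paths: dict[str, list[str]] = field(default_factory=dict)


def plan_update(catalog: Catalog, table: str,
                key: str, value: str) -> list[tuple[str, str, str]]:
    """Return one (structure, key, value) step per redundant data path."""
    return [(path, key, value) for path in catalog.paths[table]]


def execute(plan: list[tuple[str, str, str]],
            storage: dict[str, dict[str, str]]) -> None:
    """Apply every step sequentially (a simplification of a parallel plan)."""
    for path, key, value in plan:
        storage[path][key] = value


if __name__ == "__main__":
    # Hypothetical copy names: the same table stored on three nodes.
    copies = ["orders@node1", "orders@node2", "orders@node3"]
    catalog = Catalog({"orders": copies})
    storage = {name: {} for name in copies}
    plan = plan_update(catalog, "orders", "k1", "v1")
    execute(plan, storage)
    print(storage)
```

Because the catalog lists every copy, the plan generator needs no special case for redundancy, much as an optimizer already adds index maintenance to an update plan.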

12
Fault Isolation Domains
  • Trade single-node perf for redundant data checks
  • Fairly common, but complex error recovery is even
    more likely to be wrong than original forward
    processing code
  • Many of the best redundant checks are compiled
    out of retail versions when shipped (when
    needed most)
  • Fail fast rather than attempting to repair
  • Bring down node for mem-based data structure
    faults
  • Never patch inconsistent data; other copies keep
    system available
  • If anything goes wrong, fire the node and
    continue
  • Attempt node restart
  • Auto-reinstall O/S, DB and recreate DB partition
  • Mark node dead for later replacement
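
The last three bullets describe an escalation ladder, sketched below under stated assumptions. The `recover` function and the step callables are hypothetical stand-ins; real restart and reinstall actions would call out to cluster management tooling.

```python
from enum import Enum
from typing import Callable


class NodeState(Enum):
    HEALTHY = "healthy"
    DEAD = "dead"  # left for later hardware replacement


def recover(node: str,
            steps: list[tuple[str, Callable[[str], bool]]]
            ) -> tuple[NodeState, list[str]]:
    """Try each recovery step in order; stop at the first that succeeds.

    Returns the final state and the names of the steps attempted.
    If no step succeeds, the node is marked dead.
    """
    tried = []
    for name, action in steps:
        tried.append(name)
        if action(node):
            return NodeState.HEALTHY, tried
    return NodeState.DEAD, tried


if __name__ == "__main__":
    # Stand-in actions: the restart fails, the reinstall succeeds.
    ladder = [
        ("restart", lambda node: False),
        ("reinstall O/S and DB, recreate partition", lambda node: True),
    ]
    print(recover("node7", ladder))
```

No step tries to repair the faulty state in place. Each one discards more of the node's local state than the last, which is safe only because other copies of the data keep the system available in the meantime.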

13
Summary
  • 100 MLOC of server-side code and growing
  • Can't fight it & can't test it
  • Quality will continue to decline if we don't do
    something different
  • Can't afford 2- to 3-year dev cycles
  • '60s large-system mentality still prevails
  • Optimizing precious machine resources is false
    economy
  • Continuing focus on single-system perf is dead
    wrong
  • Scalability & system perf rather than individual
    node performance
  • Why are we still incrementally attacking an
    exponential problem?
  • Any reasonable alternatives to clusters?

14
Software Testing Doesn't Scale
  • James Hamilton
  • JamesRH_at_microsoft.com
  • Microsoft SQL Server