Title: Berkeley RAD Lab Technical Vision


1
Berkeley RAD Lab Technical Vision
  • Armando Fox, Randy Katz, Michael Jordan, Dave
    Patterson, Scott Shenker, Ion Stoica
  • RADS Retreat, June 2005

2
Outline
  • Overall Vision
  • Internet Services Vision (ServRADS)
  • Network Vision (NetRADS)
  • Internet Services Network architecture
  • Principles and Summary

3
Overarching Mantra
  • Enable a faster pace of network service
    innovation through new distributed system
    architectures that reduce operations cost by
    2-3 orders of magnitude
  • The Challenge:
  • Software systems: too much information -> make
    sense of it through statistical learning and
    control theory
  • Network systems: too little information ->
    exploit better observation and monitoring in the
    network infrastructure to drive management
    processes

4
In practice this means
  • Single person can write, deploy, operate the
    next-generation IT business (the Fortune 1
    million)
  • Do for Internet apps what Web did for individual
    publishing
  • Gray's challenge: planetary-scale distributed
    system operated by a single part-time operator
  • Goal: programmers focus on functionality; put
    the "-ility" in the platform
  • Could be built on utility computing, giving
    access to distributed physical resources
  • Integrated approach to network and server/service
    management
  • Requires 100x-1000x reduction in TCO from today's
    levels

5
What things are like today
  • World-scale services created and operated by
    expert teams
  • Google-sized organization to create a Google
  • Amazon's book browsing, designed by programmers,
    is cumbersome
  • Browsing for housewares, designed by domain
    experts on mature infrastructure, more usable
  • We don't know what the next killer app will be!
  • NOW project didn't predict Internet search as a
    killer app for NOWs
  • If we succeed, the next killer Internet app will
    be written, deployed, operated, at Google-like
    scales, by a single programmer

6
Focusing on lowering cost of ownership
  • Standard way to account for where the money
    goes in operating a deployed distributed
    application
  • Definition independent of who is operating the
    app
  • Operators per byte of storage or per CPU? No,
    doesn't scale with technology changes
  • Operators per end-user served? (This is the
    figure of merit for e-tailers)
  • Operators per geographic region served?
  • Operators per $ spent on capital cost?
  • Operators per $ of revenue?

7
Outline
  • Overall Vision
  • Internet Services Vision (ServRADS)
  • Network Vision (NetRADS)
  • Internet Services Network architecture
  • Principles and Summary

8
Enabling Technologies for Reducing TCO in ServRADS
  • Past successes
  • Microrebooting: fast recovery makes false
    positives tolerable
  • Pinpoint: using SLT (statistical learning
    theory) to detect and localize fine-grain
    failures
  • Visualization + SLT to help earn operators'
    trust
  • Elements of technical vision
  • SLT and machine learning
  • Operator-centric visualization
  • Control theory
  • Open source failures database (sanitized, open
    failures forensics repository)

9
Example scenarios
  • Helping operators make sense of instrumentation
  • Using ML techniques to localize failures (P.
    Bodik, E. Kiciman)
  • Using automatically-induced statistical models to
    identify likely causes of performance problems
    (S. Zhang, I. Cohen et al.)
  • Combining SLT with visualization for
    cross-checking problem reports and rapidly
    spotting potential problems visually
  • Facilitating self-tuning/configuration
  • Using control theory to improve performance of a
    distributed streaming database (W. Xu)
  • Service placement in wide-area distributed system
    (D. Oppenheimer)
  • Microreboots (G. Candea) and microreplacement (S.
    Kawamoto) as low-cost prevention/repair
    strategies
  • If false positive cost can be kept low, automate.
    Otherwise, help operator do her job.
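
The closing rule of thumb on this slide can be read as an expected-cost comparison. The sketch below is one hypothetical way to frame it: the cost model, the function name `should_automate`, and every number are illustrative assumptions, not something stated in the talk.

```python
def should_automate(p_false_positive: float,
                    cost_false_positive: float,
                    cost_of_waiting: float) -> bool:
    """Decide whether to act on a failure alarm without asking the operator.

    Assumed model (not from the slides):
      - the alarm is wrong with probability p_false_positive;
      - acting on a wrong alarm costs cost_false_positive
        (e.g. seconds lost to a needless microreboot);
      - not acting on a correct alarm costs cost_of_waiting
        (e.g. seconds of downtime while a human investigates).
    """
    expected_cost_of_acting = p_false_positive * cost_false_positive
    expected_cost_of_waiting = (1.0 - p_false_positive) * cost_of_waiting
    return expected_cost_of_acting < expected_cost_of_waiting


# Cheap recovery: even a 50% false-positive rate is tolerable.
cheap = should_automate(0.5, cost_false_positive=1.0, cost_of_waiting=60.0)
# Expensive recovery: the same detector should only advise the operator.
costly = should_automate(0.5, cost_false_positive=600.0, cost_of_waiting=60.0)
```

Under these assumptions, a recovery action as cheap as a microreboot favors automation even with a noisy detector, while an expensive action pushes the decision back to the operator.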

10
Services example: combining viz + SLT

11
Reduce TCO via Planetary-scale Abstractions
  • Inspiration: narrowly-focused planetary-scale
    abstractions whose design and implementation...
  • scale well: understand distributed scheduling,
    locality, symptoms of wide-area failures
  • are monitorable and controllable (using SLT and
    linear CT)
  • retain precisely-quantifiable and acceptable
    semantics under partial-failure conditions
  • Examples of existing narrow but powerful
    services:
  • MapReduce in Google: understands data locality
  • Can easily imagine a lossy MapReduce, like
    online aggregation
  • Queues/messaging in Yahoo, Amazon, others
  • User information database in Yahoo
  • Instrumentation collection and analysis services
    using Telegraph-CQ
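
To make the "lossy MapReduce" idea concrete, here is a minimal single-process sketch, assuming a word-count job over partitions that may be unreachable. It returns a partial answer together with the fraction of partitions it covered, which is one simple reading of "precisely-quantifiable semantics under partial failure". The names `lossy_word_count` and `is_available` are hypothetical, and nothing here reflects Google's actual implementation.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence, Tuple


def lossy_word_count(partitions: Sequence[Iterable[str]],
                     is_available: Callable[[int], bool]) -> Tuple[Counter, float]:
    """Count words over the readable partitions and report coverage.

    Unavailable partitions are skipped instead of failing the whole job.
    """
    totals: Counter = Counter()
    used = 0
    for index, lines in enumerate(partitions):
        if not is_available(index):
            continue  # tolerate the loss; it is reflected in coverage
        used += 1
        for line in lines:
            # Map step emits one count per word; Counter.update acts as
            # the reduce step by summing counts per key.
            totals.update(line.split())
    coverage = used / len(partitions) if partitions else 0.0
    return totals, coverage


# Illustrative input only: partition 1 is treated as unreachable.
parts = [["a b", "a"], ["b c"], ["a"]]
counts, coverage = lossy_word_count(parts, lambda i: i != 1)
```

The caller receives both the aggregate and how much of the input it represents, so it can decide whether a two-thirds answer is acceptable.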

12
Outline
  • Overall Vision
  • Internet Services Vision (ServRADS)
  • Network Vision (NetRADS)
  • Internet Services Network architecture
  • Principles and Summary

13
RADS Network Problem
  • Internet routing has proven to be robust
  • But:
  • Poor visibility: hard to determine health of the
    network
  • Routing policy interactions defeat propagation of
    useful diagnostic info: difficult to identify
    root cause problems
  • Slow reaction times to connectivity failures:
    operator intervention (across admin domains)
    increases cost of ownership
  • Key observation: network service failures
    attributed to unexpected traffic surge patterns
  • Approach: identify and protect good traffic
    during surge
  • Mechanism deployed in network edge:
  • It's where the servers and clients are located
  • Greatest need for lowering management costs
  • Administrative scope and responsibility is
    well-defined

14
iBoxes: New Network Element for Observe,
Analyze, Act
[Figure: Enterprise Network Architecture]
  • Inspection-and-Action Boxes: deep multiprotocol
    packet inspection
  • No routing; observation and marking
  • Policing points: drop, fence, block

15
Network-Level Observe-Analyze-Act
  • Observe:
  • Packet, path, protocol, service invocation
    statistical collection and sampling: frequencies,
    latencies, completion rates
  • Construct the collection infrastructure
  • Analyze:
  • Determine correlations among observations
  • Normal model discovery and anomaly detection
  • Exploit SLT
  • Act:
  • Experiment to test correlations
  • Prioritize and throttle
  • Mark and annotate
  • Control theory? Distributed analyses and actions
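
As a rough illustration of the Analyze step ("normal model discovery" followed by anomaly detection), the sketch below learns a mean and standard deviation from baseline observations and flags values that fall far outside them. A z-score test is an assumption chosen for brevity, not the statistical technique the talk commits to, and the request rates are made-up sample values.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def build_normal_model(baseline: Sequence[float]) -> Tuple[float, float]:
    """Summarize 'normal' behavior as (mean, sample standard deviation).

    Needs at least two baseline observations.
    """
    return mean(baseline), stdev(baseline)


def is_anomalous(value: float, model: Tuple[float, float],
                 threshold: float = 3.0) -> bool:
    """Flag a value more than `threshold` standard deviations from the mean."""
    mu, sigma = model
    if sigma == 0:
        return value != mu  # any change from a constant baseline is unusual
    return abs(value - mu) / sigma > threshold


# Hypothetical DNS requests per second observed during normal operation.
model = build_normal_model([100, 102, 98, 101, 99])
normal_reading = is_anomalous(101, model)
surge_reading = is_anomalous(150, model)
```

In an Observe-Analyze-Act loop, a flagged reading would feed the Act step (experiment, throttle, or annotate) rather than trigger an action by itself.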

16
Network Layer Mechanism: Annotations
  • Enhance network visibility: disseminate
    observations, communicate actions, provide
    in-band network management actions, iBox-to-iBox
    communications
  • iBoxes label packets at annotation layer but do
    not rewrite packet contents
  • Annotations stack; they must be removed from
    packets before delivery to end nodes unaware of
    the annotation layer (A-layer)
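
The annotation rules above can be sketched as a small data structure: labels are pushed onto a stack carried beside the payload, the payload is never rewritten, and the stack is stripped before delivery to an unaware end node. The `Packet` type, the label strings, and both function names are hypothetical; the slides do not specify a wire format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Packet:
    payload: bytes  # iBoxes never modify this
    # Annotation stack; the most recent label is the last element.
    annotations: List[str] = field(default_factory=list)


def annotate(packet: Packet, label: str) -> None:
    """Push a label onto the packet's annotation stack."""
    packet.annotations.append(label)


def deliver(packet: Packet, receiver_is_annotation_aware: bool) -> Packet:
    """Return what the end node should receive.

    Annotation-aware receivers get the packet unchanged; unaware
    receivers get a copy with the same payload and no annotations.
    """
    if receiver_is_annotation_aware:
        return packet
    return Packet(payload=packet.payload)


pkt = Packet(b"dns-query")
annotate(pkt, "server-edge: dns surge observed")
annotate(pkt, "internet-edge: low priority")
plain = deliver(pkt, receiver_is_annotation_aware=False)
```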

17
Scenario: Traffic Surge Inhibiting Network
Services
[Figure: enterprise network diagram. Labels:
Internet Edge, Distribution Tier, Access Edge,
Server Edge; iBoxes II, IA, IS; Primary and
Secondary DNS Servers; Mail Server; Spam
Appliance; elements marked R, S, E]
  • DNS Server swamped by excessive request traffic
  • Observe: DNS timeouts, Web access traffic
    slowed, but also higher than normal mail delivery
    latency, implying busy server edge (correlation
    between Mail Server and DNS Server utilization?)
  • Root Cause: High DNS request rates generated by
    Spam Appliance, triggered by mail surge
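
The question raised on this slide, whether Mail Server and DNS Server utilization are correlated, could be checked with a plain Pearson coefficient over paired samples. The sketch below computes it by hand; the utilization figures are made-up illustrative values, not measurements from the scenario.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical utilization samples (percent) taken at the same instants.
mail_util = [10, 20, 30, 40, 50]
dns_util = [12, 22, 29, 41, 52]
r = pearson(mail_util, dns_util)
```

A coefficient near 1 would support the hypothesis but not prove causation, which is why the Act step includes an experiment to test the correlation.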

18
Scenario
[Figure: enterprise network diagram, repeated
from slide 17]
  • How Diagnosed?
  • I-S detects high link utilization but abnormally
    high DNS traffic
  • Stats from I-I: high mail traffic, low outgoing
    web traffic; incoming traffic high but link
    utilization not high
  • Stats from I-A: lower web traffic, no unusual
    mail origination
  • Problem localized to Server Edge, but visibility
    limited: RADS can help

19
Scenario
[Figure: enterprise network diagram, repeated
from slide 17]
  • Possible Action Responses:
  • Experiment: redirect local DNS requests to
    Secondary DNS server; if these complete, can
    infer the server is the problem, not the network
  • Throttle: due to MS-DNS correlation, block/slow
    email traffic at Server Edge; should expect
    reduced DNS server utilization
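
Both responses can be sketched in a few lines. `infer_fault` encodes the experiment's inference from completion rates on the primary and redirected paths, and `TokenBucket` is one common way to slow a traffic class. The 0.95 health threshold, the returned labels, and the choice of a token bucket are assumptions made for illustration; the slides name no specific mechanism.

```python
def infer_fault(primary_completion: float,
                secondary_completion: float,
                healthy: float = 0.95) -> str:
    """Interpret the redirect experiment from DNS completion rates (0..1)."""
    if primary_completion >= healthy:
        return "no fault observed"
    if secondary_completion >= healthy:
        # Redirected requests complete, so the path is fine.
        return "primary DNS server"
    return "network or shared cause"


class TokenBucket:
    """Admit at most `rate_per_s` messages per second, with a burst allowance.

    Time is passed in explicitly so the behavior is deterministic.
    """

    def __init__(self, rate_per_s: float, burst: float) -> None:
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


verdict = infer_fault(primary_completion=0.40, secondary_completion=0.99)
mail_limiter = TokenBucket(rate_per_s=1.0, burst=2)
decisions = [mail_limiter.allow(t) for t in (0.0, 0.0, 0.0, 1.0)]
```

After throttling mail at the Server Edge, the expectation stated on the slide is that DNS server utilization drops; if it does not, the assumed MS-DNS correlation deserves a second look.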

20
Outline
  • Overall Vision
  • Internet Services Vision (ServRADS)
  • Network Vision (NetRADS)
  • Internet Services Network architecture
  • Principles and Summary

21
Embodying principles in a prototype
  • Platform architecture and prototype to enable
    rapid innovation in network services by
    non-experts
  • automatically accommodates scaling, provisioning,
    failure management
  • multi-datacenter (geoplexed)
  • observable networks connecting datacenters
  • potentially planetary scale
  • runs with minimal operator oversight
  • Prototype keeps various research projects focused
    on common goal and allows ongoing testing
  • Participation in standards processes to promote
    best practices in platform as open standards

22
Reliable Adaptive Distributed Systems
[Figure: layered architecture diagram. Labels:
Operator; User; Prototype Applications;
Programming Abstractions for Roll-back and
Wide-area Distributed Computations; SLT Services;
Crash-only Services; Observation Infrastructure
for System SLT; Application-Specific Overlay
Network; Checkable Protocols; Fast Detection,
Route Recovery; Observation Infrastructure for
Network SLT; iBox (x2); Edge Network (x2);
Commodity Internet]

23
Generic iBox Architecture
[Figure: generic iBox block diagram. Legible
labels: Tag Mem, Rules, Programs]

24
Possible architecture of a rack
[Figure: rack-level architecture diagram. Labels:
app. server application, e.g. J2EE; microrecovery
actions; datacenter boundary; from other
datacenters; to other datacenters; high-level
effectors; SLT algorithms; control loops;
high-level sensor data; externally-induced
failures, workload changes, etc.; T-CQ engine;
sanitized data; visualization; preprocessed data;
syndrome identification]

25
Outline
  • Overall Vision
  • Internet Services Vision (ServRADS)
  • Network Vision (NetRADS)
  • Internet Services Network architecture
  • Principles and Summary

26
ServRADS Observations Summary
  • SLT algorithms make sense of large amounts of
    data
  • Classification, outlier/anomaly detection,
    clustering, etc.
  • Viz helps operator use visual pattern
    recognition to quickly spot problems and
    cross-check SLT models
  • Enables operator expertise to be quickly brought
    to bear
  • Builds operators' trust in statistical/machine
    learning models
  • Challenge:
  • Fundamental challenges associated with applying
    SLT to problem determination (coming up next
    session)
  • Unifying many techniques into a coherent approach
    - prototype platform as unifying artifact
  • Idea: capture best practices in TCO-optimized,
    planetary-scale abstractions

27
NetRADS Observations Summary
  • COPS: paradigm for (more) automatically
    protecting critical resources when network is
    under stress
  • Checkable protocols: visible semantics
  • Observe network behavior: good (easy), bad
    (hard), suspicious
  • Protect services: throttle, redirect
  • Network management: major contributor to TCO
  • NetRADS built on:
  • iBoxes: pervasive infrastructure for observation
    and action at the network level
  • Annotation Layer: for marking, control,
    inter-iBox communications
  • Integration with Internet service approach for
    service/server-level visibility and integrated
    management