A Research Program in Reliable Adaptive Distributed Systems RADS presentation

1
A Research Program in Reliable Adaptive Distributed Systems (RADS)

Armando Fox, Michael Jordan, Randy Katz, George Necula, David Patterson, Ion Stoica, Doug Tygar
University of California, Berkeley and Stanford University

2
Presentation Outline

Why We Need a New Approach to Networked Systems
New Design Philosophy for RADS
Applying the Philosophy: Early Experience with Specific Approaches
Approaches for Software and Hardware Dependability
Approaches for Networking
Approaches for Security
Applying SLT to dependability problems
Elements of a unified Experimental Prototype
Summary and Conclusions

3
New Approach for RADS (Reliable Adaptive Distributed Systems)

Dramatically improve the trustworthiness of networked systems
Observe: design observation points throughout the system
Analyze: SLT (statistical learning theory) as an enabling technology
Respond: detect anomalous behavior vs. baseline
Learn: use observations to modify responses to future observations
Act
Reactive: use control points in the system for rapid recovery when something wrong is detected
Proactive/protective: act prophylactically on the system to prevent a predicted impending failure

4
Today's Systems are Too Brittle

Fragile, easily broken, yielding poor trustworthiness (dependability and security)
Amazon: revenue $3.1B, downtime costs $600,000 per hour
Why? Overly focused on performance, performance, and cost-performance
Systems based on fundamentally incorrect assumptions
Humans are perfect
Software will eventually be bug free
Maintenance is free
People/HW/SW failures are facts, not problems
"If a problem has no solution, it may not be a problem, but a fact--not to be solved, but to be coped with over time." Shimon Peres (Peres's Law)

5
If Failure is Inevitable... then Design for Rapid Adaptation

Encompasses rapid server recovery, network rerouting, prophylactic/protective actions...
Blurs the distinction between normal operation and recovery
Elements of the solution
Programming paradigms for robust recovery
Crash-only software design for rapid server recovery
Network protocols designed for rapid detection of assertion violations
Instrumentation and SLT for online analysis, anomaly detection, and diagnosis of failure
Recovery benchmarks to measure progress
What you can't measure, you can't improve
Collect real failure data to drive benchmarks

6
RADS Conceptual Architecture

[Figure: layered architecture diagram. Recoverable labels, top to bottom:]
User, Operator
Prototype applications: e-voting, messaging, e-mail, etc.
Programming abstractions for roll-back (Necula)
Benchmarks, tools for human operators (Patterson)
Crash-only middleware servers, system OC infrastructure (Fox)
SLT services: online statistical learning algorithms (Jordan)
Application-specific overlay network
Protocols enabling fast detection and route recovery, network OC infrastructure (Katz, Stoica)
PNEs at the edge networks; routers; commodity Internet and IP networks

Reduction to practice of online SLT and observe/analyze/act infrastructure
Reusable embeddable components

7
Presentation Outline

Why We Need a New Approach to Networked Systems
New Design Philosophy for RADS
Applying the Philosophy: Early Experience with Specific Approaches
Approaches for Software and Hardware Dependability
Approaches for Networking
Approaches for Security
Applying SLT to dependability problems
Elements of a unified Experimental Prototype
Summary and Conclusions

8
Crash-Only Software: Dramatically Simplifying Recovery

Since robust systems must be crash-safe, make crashes the only supported form of shutdown/restart
Software components: the external "power switch" is independent of the misbehaving component
Recovery becomes inexpensive and safe to try
Simplifies failure detection, since detection can be overly aggressive
Simplifies recovery, since there is only 1 type of recovery action and it is always safe to try
Idea: if something looks anomalous, it's probably wrong
Can machine learning and statistical monitoring approaches be applied during online operations?

9
Crash-Only Software: Practical to Build

refocus on JAGR, talk about relevance of middleware
Case studies: two crash-only state-storage subsystems (for session state and durable state)
OK to crash any node at any time for any reason
Recovery is highly predictable, doesn't impact online performance
Replication provides probabilistic durability and capacity during recovery
Access pattern of workload exploited for consistency guarantees
9 activity and state statistics monitored per storage brick
Metrics compared against those of peer bricks
Basic idea: changes in workload tend to affect all bricks equally
Underlying (weak) assumption: most bricks are doing mostly the right thing most of the time
Anomaly in 6 or more (out of 9) metrics => reboot brick
Simple thresholding and substring frequency used to determine what is anomalous
10
Supporting Crash-Only in Middleware

Add observation and control points to Java application middleware
Observe: capture paths taken through the system by user requests
Analyze: look for highly unlikely, anomalous (therefore probably buggy) paths
Act: micro-reboot suspected-faulty J2EE components transparently to the rest of the system
Result: fast recovery improves overall performability
Micro-reboot is 2-3 orders of magnitude faster than a full application reboot
Improves performability (total amount of work per unit time in the presence of faults)
Minimizes disruption to users of other (non-faulty) parts of the system

Fast, cheap micro-reboots (uRBs) and statistical monitoring provide a degree of application-generic failure detection and recovery
11
Crash-Oriented Software: Systematic Approach

Needed: systematic mechanism for determining when micro-reboots are safe
Programming-language-level support for rollback and state tracking
Needed: better integration with SLT
Which clustering/analysis techniques best correlate anomalous paths to particular observed failure types? (Current prototype uses very simple data clustering techniques.)
Are these techniques suitable for online use? (Current prototype does offline analysis.)

12
Presentation Outline

Why We Need a New Approach to Networked Systems
New Design Philosophy for RADS
Applying the Philosophy: Early Experience with Specific Approaches
Approaches for Software and Hardware Dependability
Approaches for Networking
Approaches for Security
Applying SLT to dependability problems
Elements of a unified Experimental Prototype
Summary and Conclusions

13
Research Challenges

No protection against DoS attacks
MS Blaster inflicted Internet packet loss > 20
Routing protocols blindly believe routes advertised by neighbors
BGP router misconfigurations
200-1200 prefixes affected every day
C&W's (AS3561) misconfiguration caused an outage for > 5000 prefixes for 2 hours (April 2001)
Malicious routers: huge potential threat
Drop packets and render a destination unreachable
Eavesdrop on the traffic to a given destination
Impersonate the destination

14
Observe, Analyze, Act

Observe
Use multiple vantage points to monitor the network
Design protocols whose behaviors can be verified
Analyze
Learn from protocol behavior
Identify bogus information
Act
Contain misbehaving components
Raise flags for network operators
Empower end-hosts (e.g., enable end-hosts to stop unwanted packets in the network infrastructure)
End-hosts know better when they are under attack (flash crowds vs. DoS attacks)
End-hosts can react faster than the infrastructure

[Figure labels: sender, receiver]
15
Case Study: BGP (Listen and Whisper)

Whisper
Use redundancy to check route advertisements for consistency
Listen
Monitor TCP flow progress to detect reachability problems
Results
Whisper: reduces the region of the Internet vulnerable to an isolated adversary to 5
Scalable; implementation can handle 10 times today's BGP load
Listen: detects reachability problems
Probability of false positives 1
Vulnerable to port scans; plan to use SLT

16
Programmable Network Elements

[Figure: PNE pipeline (In-Port, Classify, Transform, Out-Port) placed at edge networks, which connect through routers to commodity Internet and IP networks]

Enabling Technology
Edge network elements for IDS, firewall, traffic shaping, etc.
Next generation: exposed APIs for 3rd-party programming
Location for efficient network-level monitoring and control
Observe: rapid detection of route failure or network attack
Act: e.g., filter intrusions, quarantine propagating worms
Avoid configuration and "latest patch not installed" errors

17
Presentation Outline

Why We Need a New Approach to Networked Systems
New Design Philosophy for RADS
Applying the Philosophy: Early Experience with Specific Approaches
Approaches for Software and Hardware Dependability
Approaches for Networking
Approaches for Security
Applying SLT to dependability problems
Elements of a unified Experimental Prototype
Summary and Conclusions

18
Research Challenge: Self-sensing and Reactive Systems

Internet-scale attacks are fundamentally different from host-scale attacks
Traditional Intrusion Detection Systems (IDS) have had some success with host-scale attacks, but also many false positives
Internet-scale attacks offer opportunity (more evidence of a wide-scale attack) but also more challenge (integrating data from a large number of disparate sources)

19
Observe, Analyze, Act

Observe: what to monitor, how to monitor
Analyze: learning from patterns of messages (not parsing their contents)
Act
How to exchange minimal information (in a system under attack)
Rapidly evolving security protocols (for resilience to attack)
Applications: worm detection, spam detection
Ultimate challenge: beyond detection and into response

20
Security of Networked Systems: Technical Approach

Mechanisms to learn, share, and repair against potential threats to dependability
Strengthen assurance of shared information via lightweight authentication and encryption
TESLA authentication system: replaces public-key crypto with lightweight symmetric encryption; uses time asymmetry to provide assurance
Messages initially encrypted, verification keys revealed later; prevents an attacker from using a received key to forge messages
Variations provide instant authentication
Athena system: generate random instances of secure protocols
Ultra-fast checking software: model-checking and proof-theoretic techniques to verify protocols against stated requirements
Intelligently generate the most efficient secure protocol satisfying requirements, or a random instance of a secure protocol satisfying a given set of requirements
Apply to SLT systems to exchange information more quickly
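
The delayed key disclosure idea behind TESLA, as summarized on this slide, can be illustrated with a small sketch. It assumes a one-way key chain and a message authentication code (HMAC-SHA256) in place of the "encryption" wording on the slide, and it leaves out the clock synchronization and timing checks a real deployment depends on. All function names here are hypothetical.

```python
import hashlib
import hmac

def make_key_chain(seed, length):
    """Return keys K_0..K_length with K_i = H(K_{i+1}); K_0 is the public commitment."""
    chain = [seed]
    for _ in range(length):
        chain.append(hashlib.sha256(chain[-1]).digest())
    chain.reverse()  # chain[0] is the commitment, chain[i] is used in interval i
    return chain

def mac(key, message):
    """Tag a message with the key of the current interval."""
    return hmac.new(key, message, hashlib.sha256).digest()

def key_is_authentic(disclosed, interval, commitment):
    """Hash the disclosed key `interval` times; it must land on the commitment."""
    value = disclosed
    for _ in range(interval):
        value = hashlib.sha256(value).digest()
    return hmac.compare_digest(value, commitment)

def verify(message, tag, disclosed, interval, commitment):
    """Receiver side: run only after the sender has disclosed the interval key."""
    return (key_is_authentic(disclosed, interval, commitment)
            and hmac.compare_digest(mac(disclosed, message), tag))

# Sender tags a message in interval 2 and reveals chain[2] only later.
chain = make_key_chain(b"sender-secret-seed", 4)
tag = mac(chain[2], b"route update")
print(verify(b"route update", tag, chain[2], 2, chain[0]))
```

The asymmetry comes from time rather than from public-key operations: by the time a key is revealed, messages claiming to use it are no longer accepted, so a receiver who learns the key cannot use it to forge traffic.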

21
Presentation Outline

Why We Need a New Approach to Networked Systems
New Design Philosophy for RADS
Applying the Philosophy: Early Experience with Specific Approaches
Approaches for Software and Hardware Dependability
Approaches for Networking
Approaches for Security
Applying SLT to dependability problems
Elements of a unified Experimental Prototype
Summary and Conclusions

22
Statistical Learning Theory

Toolbox for design/analysis of adaptive systems
Algorithms for classification, diagnosis, prediction, novelty detection, outlier detection, quantile estimation, density estimation, feature selection, variable selection, response surface optimization, sequential decision-making
Classification algorithms
Recent scaling breakthroughs: 10K features, millions of data points
Kernel machines: functional analysis and convex optimization
Generalized inner product: similarities among data point pairs
Defined for many data types
Classical linear statistical algorithms "kernelized" for state-of-the-art nonlinear SLT algorithms in many areas

23
Statistical Learning Theory

Novelty Detection Problem
Unlimited observations reflecting normal activity, yet few (or no) instances that reflect an attack or a bug
E.g., intrusion detection, machine diagnostics
Second-order cone program: a convex optimization problem with an efficient solution method
Given a cloud of data in a high-dimensional feature space, place a boundary around these to guarantee that only a small fraction falls outside
Basic problem: find a boundary that encloses a desired fraction of the data, and is as tight as possible
Can be done using the generalized Chebyshev inequality
Using kernels, this is a convex problem
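
As a rough illustration of "a boundary that encloses a desired fraction of the data", the sketch below uses the plain multivariate Chebyshev bound on Mahalanobis distance: for d-dimensional data with mean mu and covariance S, at most a fraction d/t of the mass has (x - mu)^T S^-1 (x - mu) >= t. This is a simpler stand-in, not the second-order cone program or the kernelized method the slide refers to, and the bound is generally not tight. It is restricted to 2-D data to stay dependency-free; all names are hypothetical.

```python
def fit_boundary(points, outside_fraction):
    """Fit mean, inverse covariance and a squared-distance threshold for 2-D data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance is singular")
    inv = (syy / det, -sxy / det, sxx / det)  # entries a, b, c of [[a, b], [b, c]]
    threshold = 2 / outside_fraction          # d / alpha with d = 2
    return (mx, my), inv, threshold

def squared_distance(point, mean, inv):
    """Squared Mahalanobis distance of `point` from `mean`."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    a, b, c = inv
    return a * dx * dx + 2 * b * dx * dy + c * dy * dy

def is_novel(point, model):
    """A point is flagged when it falls outside the fitted ellipse."""
    mean, inv, threshold = model
    return squared_distance(point, mean, inv) > threshold

# Hypothetical "normal activity" observations on a small grid.
normal = [(x, y) for x in (0, 1, 2) for y in (0, 1, 2)]
model = fit_boundary(normal, outside_fraction=0.1)
print(is_novel((1.2, 0.9), model), is_novel((10, 10), model))
```

Only normal observations are needed to fit the boundary, which is the point of the novelty detection setting: no examples of attacks or bugs are required.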

24
Example: Statistical Bug-finding

Programs are buggy, yet many people use them
Instrument programs to take samples of program state at runtime
Collect information over the Internet from many users' runs
Learn a statistical classifier based on successful and failed runs, using feature selection methods to pinpoint the bugs
Example: finding a bug in the Unix bc utility
2908 features instrumented
All top features indicate indx being unusually large in the more_arrays subroutine
storage.c:176 more_arrays(): indx > optopt
storage.c:176 more_arrays(): indx > opterr
storage.c:176 more_arrays(): indx > use_math
Indeed, an array overrun bug in the re-allocation routine more_arrays() was found to cause memory corruption and sometimes an eventual crash
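
The idea of ranking instrumented predicates by how strongly they are associated with failed runs can be sketched as below. The slide describes a learned classifier with feature selection; this sketch swaps in a much simpler score (failure rate of runs where the predicate was true, minus the overall failure rate). The run data is invented for illustration and only the predicate name `indx > optopt` comes from the slide.

```python
from collections import Counter

def rank_predicates(runs):
    """Rank predicates by how much more often they accompany failure than the baseline.

    `runs` is a non-empty list of (true_predicates, failed) pairs, where
    `true_predicates` is the set of instrumented predicates observed true
    in that run.
    """
    total_failed = sum(1 for _, failed in runs if failed)
    baseline = total_failed / len(runs)
    seen, seen_failed = Counter(), Counter()
    for predicates, failed in runs:
        for p in predicates:
            seen[p] += 1
            if failed:
                seen_failed[p] += 1
    scores = {p: seen_failed[p] / seen[p] - baseline for p in seen}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sampled runs: (predicates seen true, did the run fail?)
runs = [
    ({"indx > optopt", "argc > 1"}, True),
    ({"indx > optopt"}, True),
    ({"argc > 1"}, False),
    ({"argc > 1"}, False),
    (set(), False),
]
for predicate, score in rank_predicates(runs):
    print(predicate, round(score, 3))
```

A predicate that is true in many runs regardless of outcome scores near zero, so the ranking favors predicates that are specific to failures.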

25
Example III: Diagnosis

A probabilistic graphical model with 600 disease nodes, 4000 finding nodes
Node probabilities p(f_i | d) were assessed from an expert (Shwe, et al., 1991)
Want to compute posteriors p(d_j | f)
Is this tractable?
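
To see why tractability is in question, the sketch below computes the posteriors p(d_j | f) exactly by summing over every joint disease state. It assumes a noisy-OR form for p(f_i | d), which the slide does not state, and uses made-up numbers. The loop visits 2**n states, so it is only feasible for a handful of diseases; with 600 disease nodes the same enumeration would need 2**600 terms.

```python
from itertools import product

def finding_likelihood(disease_state, findings, link, leak):
    """P(findings | disease_state) under a noisy-OR model (an assumed model form)."""
    likelihood = 1.0
    for i, observed in enumerate(findings):
        p_absent = 1.0 - leak[i]
        for j, present in enumerate(disease_state):
            if present:
                p_absent *= 1.0 - link[i][j]
        likelihood *= (1.0 - p_absent) if observed else p_absent
    return likelihood

def posteriors(priors, findings, link, leak):
    """Exact P(d_j = 1 | findings) by summing over all 2**n disease states."""
    n = len(priors)
    evidence = 0.0
    marginals = [0.0] * n
    for state in product((0, 1), repeat=n):
        prior = 1.0
        for j, present in enumerate(state):
            prior *= priors[j] if present else 1.0 - priors[j]
        joint = prior * finding_likelihood(state, findings, link, leak)
        evidence += joint
        for j, present in enumerate(state):
            if present:
                marginals[j] += joint
    return [m / evidence for m in marginals]

# Hypothetical toy model: two diseases, one positive finding caused mainly by disease 0.
result = posteriors(priors=[0.1, 0.1], findings=[1], link=[[0.9, 0.0]], leak=[0.01])
print([round(p, 3) for p in result])
```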

26
Case Study: Medical Diagnosis

Symbolic complexity
symbolic expressions fill dozens of pages
would take years to compute
Numerical simplicity
Jaakkola and Jordan (1999) describe a variational method based on convexity that computes approximate posteriors in less than a second

27
Challenge for SLT

Challenge: on-line versions of the best algorithms have yet to be developed
Update the learning system's state based on small sets of data
Available for some kernelized problems
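
What "updating the learning system's state based on small sets of data" means in practice can be shown with a deliberately simple stand-in: a running mean and variance updated one observation at a time (Welford's method). This is not one of the learning algorithms the slide has in mind; it only illustrates the online pattern of folding in new data without revisiting old data. The class name is hypothetical.

```python
class RunningStats:
    """Running mean and variance updated one observation at a time (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        """Fold a single observation into the state."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def update_batch(self, values):
        """Fold in a small batch; earlier data is never stored or revisited."""
        for v in values:
            self.update(v)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.count if self.count else 0.0

stats = RunningStats()
stats.update_batch([2, 4, 4, 4])
stats.update_batch([5, 5, 7, 9])
print(stats.mean, stats.variance)
```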

28
Presentation Outline

Why We Need a New Approach to Networked Systems
New Design Philosophy for RADS
Applying the Philosophy: Early Experience with Specific Approaches
Approaches for Software and Hardware Dependability
Approaches for Networking
Approaches for Security
Applying SLT to dependability problems
Elements of a unified Experimental Prototype
Summary and Conclusions

29
System Prototype

Comprehensive system architecture
Reduction of SLT to practical software components embedded within a distributed systems context
Exhibition of an architecture for dramatically improving the reliability and security of important systems through observation-coordination-adaptation mechanisms

30
Messaging as an Application

E-mail is now a mission-critical application
Organizational storage capacity shifting from financial databases to email (email is the fastest-growing storage)
Loss of email is more critical to the continuing operation of an organization than loss of telephony (imagine if the government had no email for a week)
Instant messaging is now a mission-critical application
In a crisis, many communication schemes will be used: land-based telephony, cellular telephony, instant messaging, email, ...
Coordination among first responders during crisis response in the field (administrators and operators)
Demands for dependability, resistance to attack, establishment of trust among interacting entities
Despite attempts by hackers, terrorists, ...

31
Measuring Success

Build email/IM prototype using RADS design principles and tools
Put realistic performance workload on prototype
Subject prototype to increasingly difficult failure workloads and attack workloads
E.g., hardware failures, software failures, operator failures, worm attacks, DDoS attacks, ...
Measure false positive rates, accuracy rates, time to analyze failures, time to act, performance impact of actions, availability of prototype, performability of prototype, ...
Compare results to conventional email/IM systems under similar performance, failure, and attack workloads

32
Disaster Response Messaging Application

[Figure: network diagram. Recoverable labels: DHS/Federal Network; Coalition Internet; Allies Networks; Local Police, Fire, State Police; Trust Relations; Net Failure; Adversary; Active Adversary Service Attacks; Compromised Network With Embedded Adversaries; Incident Reports, Responder Locations, GIS Data, etc.]
33
Presentation Outline

Why We Need a New Approach to Networked Systems
New Design Philosophy for RADS
Applying the Philosophy: Early Experience with Specific Approaches
Approaches for Software and Hardware Dependability
Approaches for Networking
Approaches for Security
Applying SLT to dependability problems
Elements of a unified Experimental Prototype
Summary and Conclusions

34
Old Science vs. New Science

First 50 years of computer science
manually-engineered systems
lack of adaptability, robustness, and security
no concern with closing the loop with the
environment
Next 50 years of computer science
statistical learning systems throughout the infrastructure
self-configuring, adaptive, sentient systems
perception, reasoning, decision-making cycle
systems are always recovering because of this ongoing automatic and dynamic adaptation
New way to think about and design adaptive systems
Makes continuous monitoring and reaction a first-class goal
Provides point of leverage for applying SLT and related techniques

35
Scientific Foundation For Self-* Systems

New design principles and tools for systems that continuously adjust their behavior in response to analysis of online observations
New metrics and benchmarks for evaluating self-adapting networked systems
Advances in Statistical Learning Theory to move from offline to online analysis of large-scale distributed systems

36
BACKUP SLIDES
37
Statistical Learning Theory

Super kernels: combine heterogeneous data via multiple kernels
Semidefinite programs: convex optimization problems with efficient solutions involving efficient decomposition techniques
Useful in fusing evidence at distributed nodes
Problems of interest require combined parameter estimation and optimization
Response surface methodology: building local mappings from configurations to performance, and suggesting gradient directions in configuration space leading to performance improvements
Policy-gradient methods: SLT algorithms that make sequences of decisions, yielding a behavior or policy; successfully developed policies for nonlinear control problems involving high degrees of freedom

38
Statistical Machine Learning

Kernel methods
neural network heritage
convex optimization algorithms
kernels available for strings, trees, graphs, vectors, etc.
state-of-the-art performance in many problem domains
frequentist theoretical foundations
Graphical models
marriage of graph theory and probability theory
recursive algorithms on graphs
modular design
state-of-the-art performance in many problem domains
Bayesian theoretical foundations

39
Self-Verifiable Protocols: BGP Whisper

AS1 advertises its address prefix
Chooses a secret key x, and sends y = h(x)
h(): well-known one-way hash function
Every router forwards y = h(y)
AS4 performs consistency check (y1)3 (y2)3 ?
If yes, assume both routes are correct
If no, at least one route is incorrect (but we don't know which): raise a flag

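[Figure: hash-chain advertisements converging on AS4. AS1 chooses secret key x. Labels along one path: (AS1, y1 = h(x)); (AS1, AS2, y1 = h^2(x)); (AS1, AS2, AS3, y1 = h^3(x)). Labels along the other path: (AS1, y2 = h(x)); (AS1, AS3, y2 = h^2(x)).]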
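
The hash-chain mechanism on this slide can be sketched as follows. The comparison rule is an assumption inferred from the figure labels, where the value carried after k ASes is h applied k times to x: the receiver hashes the value from the shorter path forward by the difference in path lengths and compares it with the value from the longer path. SHA-256 stands in for the unspecified one-way function h, and the function names are hypothetical.

```python
import hashlib

def h(value):
    """Stand-in for the slide's well-known one-way hash function."""
    return hashlib.sha256(value).digest()

def originate(secret, as_name="AS1"):
    """Origin AS: advertise a path containing only itself and y = h(secret)."""
    return [as_name], h(secret)

def forward(advert, as_name):
    """Transit AS: append itself to the path and hash the value once more."""
    path, y = advert
    return path + [as_name], h(y)

def consistent(advert_a, advert_b):
    """Hash the shorter path's value forward by the length difference, then compare."""
    (path_a, y_a), (path_b, y_b) = advert_a, advert_b
    if len(path_a) > len(path_b):
        (path_a, y_a), (path_b, y_b) = (path_b, y_b), (path_a, y_a)
    for _ in range(len(path_b) - len(path_a)):
        y_a = h(y_a)
    return y_a == y_b

# The two routes from the figure, as received by AS4.
origin = originate(b"secret key x")
via_as2_as3 = forward(forward(origin, "AS2"), "AS3")  # (AS1, AS2, AS3, h^3(x))
via_as3 = forward(origin, "AS3")                      # (AS1, AS3, h^2(x))
print(consistent(via_as2_as3, via_as3))
```

An advertisement fabricated by a router that never saw a value derived from x fails this check against a genuine route, which is the condition under which AS4 raises a flag without knowing which of the two routes is bad.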
40
Enabling Technology: Edge Services by Network Appliances

In-the-Network Processing: the Computer IS THE Network

41
Self-Verifiable Protocols: Status and Future Plans

Two examples
BGP verifications (Listen and Whisper)
Can trigger alarms and contain malicious routers
Minimal changes to BGP; incrementally deployable (Listen)
Self-verifiable CSFQ
Per-flow isolation without maintaining per-flow state
Detect and contain malicious flows
Ultimate goal: develop distributed systems able to self-diagnose and self-repair
Eliminate faulty components
Minimum: raise a flag in case of misconfigurations and attackers
Develop set of principles and techniques for robust protocols

42
Enabling Technology: Programmable Networks

Problem
Common programming/control environment for diverse network elements to realize the full power of "inside the network" services and applications
Approach
Software toolkit and VM architecture for PNEs, with retargetable optimized backend for diverse appliance-specific architectures
Current Focus
Network health monitoring, protocol interworking and packet translation services, iSCSI processing and performance enhancement, intrusion and worm detection and quarantining
Potential Impact
Open framework for multi-platform appliances, enabling third-party service development
Provable application properties and invariants; avoidance of configuration and "latest patch not installed" errors

43
Enabling Technology: Programmable Networks

Generalized PNE programming and control model
Generalized "virtual machine" model for this class of devices
Retargetable for different underlying implementations
Edge services of interest
Network measurement and monitoring supporting model formation and statistical anomaly detection
Framework for inside-the-network protocol listening
Selective blocking/filtering/quarantining of traffic
Application-specific routing
Faster detection of and recovery from routing failures than is possible with existing Internet protocols
Implementation of self-verifiable protocols

44
Crash-Only and Statistical Monitoring: Resilience to Real-World Transients

Simple fault model: observed anomalies "coerced" into crash faults
Surprise! Statistical monitoring catches many real-world faults, without a pre-established baseline

45
Self-Verifiable Protocols: Statement of the Problem

Problem: detect and contain network effects of misconfigurations and faulty/malicious components
Approach: design network protocols so each component verifies correct behavior of the protocol
Examples
e2e protocols
routing (BGP) protocols

46
Self-Verifiable Protocols: Case Study BGP

Propagating invalid BGP routes can bring the Internet down
Multiple causes
Router misconfigurations happen daily, yielding outages lasting hours
Malicious routers: huge potential threat
Routers with default passwords
Possible to buy routers' passwords on darknets
Existing solutions
Hard to deploy (e.g., Secure-BGP), or insufficient security
Our solution
Whisper: verify the correctness of router advertisements
Listen: verify the reachability on the data plane

47
Self-Verifiable Protocols: BGP Whisper

Use redundancy to check consistency of peers' information
Whisper game
Group sits in a circle; one person whispers a secret phrase to neighbors
Person at the other end concludes
Phrase is correct if the same phrase arrives from both neighbors
Otherwise, at least one phrase is incorrect

48
Self-Verifiable Protocols: BGP Listen

Monitor progress of TCP flows
If a TCP flow doesn't make progress, it might be because the route is incorrect
Use heuristics to reduce number of false positives and negatives
Still difficult to handle traffic patterns like port scanners
Use SLT techniques to improve the detection accuracy?
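
A minimal sketch of the kind of heuristic Listen might apply is shown below: a destination prefix is flagged only when several observed TCP flows toward it stalled and none made progress. The record format, the `min_stalled` threshold, and the "zero acknowledged bytes means stalled" rule are all assumptions for illustration; the slide does not specify the actual heuristics.

```python
from collections import defaultdict

def unreachable_prefixes(flows, min_stalled=3):
    """Flag prefixes where several flows stalled and none made progress.

    `flows` is an iterable of (prefix, bytes_acked) observations; a flow with
    bytes_acked == 0 is treated as stalled. Both the record format and the
    threshold are hypothetical.
    """
    stalled = defaultdict(int)
    progressing = defaultdict(int)
    for prefix, bytes_acked in flows:
        if bytes_acked > 0:
            progressing[prefix] += 1
        else:
            stalled[prefix] += 1
    return sorted(p for p, n in stalled.items()
                  if n >= min_stalled and progressing[p] == 0)

# Hypothetical observations at a router.
flows = (
    [("10.0.0.0/8", 0)] * 3
    + [("192.168.0.0/16", 0)] * 3 + [("192.168.0.0/16", 1500)]
    + [("172.16.0.0/12", 0)]
)
print(unreachable_prefixes(flows))
```

A port scanner produces many flows that never complete, so a rule this simple would misread scanning as a reachability problem, which is the difficulty the slide points out and the reason it raises SLT techniques as a possible improvement.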

49
Military Messaging Application

[Figure: network diagram. Recoverable labels: US Forces Network; Coalition Internet; Allies Networks; Trust Relations; SitReps; Net Failure; Adversary; Active Adversary Service Attacks; Compromised Network With Embedded Adversaries]
