Microreboot: A Technique for Cheap Recovery

1
Microreboot: A Technique for Cheap Recovery
  • George Candea, ..., Armando Fox

2
1. Introduction
  • Software bugs
  • Pros and cons of reboot
  • Microreboot
  • General conditions for microreboot
  • Gains from microreboot

3
2. Designing Microrebootable Software
  • The character of workloads faced by Internet
    services
  • Three design goals:
  • fast and correct component recovery
  • strongly-localized recovery
  • fast and correct reintegration of recovered
    components
  • Crash-only design approach
  • The complete separation of data recovery from
    application recovery

4
3. A Microrebootable Prototype
  • Prototype based on the JBoss J2EE application
    server and RUBiS
  • Microreboot machinery
  • kills the EJB component and its associated
    threads, resources, and metadata
  • preserves the EJB's classloader
  • A Crash-Only Application
  • State segregation
  • Isolation and decoupling

5
4. Evaluation Framework
  • a client emulator
  • a fault injector
  • a system for automated failure detection,
    diagnosis, and recovery

6
4.1 Client emulator
7
4.2 Fault injector
  • J2EE systems suffer from the following categories
    of software-related failures
  • accidental use of null references (e.g., during
    exception handling) that result in
    NullPointerException
  • hung threads due to deadlocks, interminable
    waits, etc.
  • bug-induced corruption of volatile metadata
  • leak-induced resource exhaustion
  • various other Java exceptions and errors that are
    not handled correctly
  • used both FIG and FAUmachine to inject low-level
    faults beneath the JVM
  • memory and register bit flips
  • disk block errors
  • network packet drops
  • erroneous returns from system calls for memory
    allocation and input/output

8
4.3 Failure detection, diagnosis, and recovery
  • failure detection in the client emulator
  • recovery manager (RM)
  • action-weighted throughput (Taw)

9
5. Evaluation Results
  • Are microreboots effective in recovering from
    failures?
  • Are microreboots any better than JVM restarts?
  • Are microreboots useful in clusters?
  • Do microreboot-friendly architectures incur a
    performance overhead?

10
5.1 Effective in recovering from failures
11
Table continued
12
Table continued
13
5.2 Better than JVM restarts
14
5.2 Continued
  • At t = 10 min, corrupt the transaction method map
    for EntityGroup, the EJB recovery group that
    takes the longest to recover.
  • At t = 20 min, corrupt the JNDI entry for
    RegisterNewUser, the next-slowest to recover.
  • At t = 30 min, inject a transient exception in
    BrowseCategories, the entry point for all
    browsing (thus, the most frequently called EJB in
    our workload).
  • Overall, 11,752 requests (3,101 actions) failed
    when recovering with a process restart (shown in
    the top graph); 233 requests (34 actions) failed
    when recovering by microrebooting one or more
    EJBs. Thus, the average is 3,917 failed requests
    (1,034 actions) per process restart, and 78
    failed requests (11 actions) per microreboot of
    one or more EJBs.
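
The per-recovery averages quoted above can be checked with a few lines of Python. This is only a sanity check of the arithmetic. It assumes that each of the three injected faults (at t = 10, 20, and 30 min) triggered exactly one recovery, which is consistent with the reported figures but not stated on the slide; the names `NUM_FAULTS` and `per_recovery_average` are made up for illustration.

```python
# Totals reported on the slide for the three-fault experiment.
NUM_FAULTS = 3  # assumption: one recovery per injected fault

totals = {
    "process restart": {"requests": 11_752, "actions": 3_101},
    "microreboot": {"requests": 233, "actions": 34},
}


def per_recovery_average(total: int, recoveries: int = NUM_FAULTS) -> int:
    """Average number of failures per recovery, rounded to the nearest integer."""
    return round(total / recoveries)


averages = {
    scheme: {kind: per_recovery_average(count) for kind, count in counts.items()}
    for scheme, counts in totals.items()
}

for scheme, avg in averages.items():
    print(f"{scheme}: {avg['requests']} requests, {avg['actions']} actions per recovery")
```

Under that assumption the rounded results match the slide: 3,917 requests (1,034 actions) per process restart and 78 requests (11 actions) per microreboot.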

15
5.2 Continued
  • Microreboots recover faster
  • recovery time distribution
  • Microreboots reduce functional disruption
  • Microreboots reduce lost work
  • with JVM restarts, session state kept in FastS is
    lost during recovery
  • with SSM, session state survives, but overall
    good Taw is lower
  • microreboots (using FastS) allowed the system to
    both preserve session state across recovery and
    avoid cross-JVM access penalties

16
5.3 Useful in Clusters
  • a cluster of 8 independent application server
    nodes
  • using a client-side load balancer (LB)
  • failover under normal load
  • microreboots preserve cluster load dynamics

17
5.4 Performance Impact
18
6. A New Approach to Failure Management
  • Alternative Failover Schemes
  • microreboot without failover improves
    user-perceived availability over failover and
    microreboot
  • User-Transparent Recovery
  • Tolerating Lax Failure Detection
  • Averting Failure with Microrejuvenation
  • resource leaks are a major problem for many
    large-scale Java applications

19
7. Limitations of Recovery by Microreboot
  • Impact on shared state
  • Interaction with external resources
  • Delaying a full reboot

20
8. Generalizing beyond Prototype
  • Biggest challenges
  • extricating session state handling from
    application logic
  • ensuring that persistent state is updated with
    transactions
  • microreboot systems design aspects
  • Isolation
  • Workload
  • Resources

21
Three-Tiered Architecture
22
EJB Container
23
Software bugs
  • Bugs are hard to eradicate, and hard to track
    down, resolve, and fix at the time of failure.
  • It is mostly application-level failures that
    bring down enterprise-scale software.
  • Many failures can be successfully recovered from
    by rebooting, even when the failure's root cause
    is unknown.

24
Pros and Cons of Reboot
  • Pro: a high-confidence way to reclaim stale or
    leaked resources
  • Pro: does not rely on the correct functioning of
    the rebooted system
  • Pro: easy to implement and automate
  • Pro: returns the software to its start state
  • Con: unexpected reboots can result in data loss
    and unpredictable recovery times

25
Microreboot
  • Individual rebooting of fine-grain application
    components
  • The same benefits as whole-process restarts
  • An order of magnitude faster and less lost work
  • Data recovery is completely separated from
    (reboot-based) application recovery

26
General conditions for microreboot
  • well-isolated
  • stateless components
  • keep all important application state in
    specialized state stores

27
Gains from microreboot
  • Can be attempted first
  • In multi-node clusters, a microreboot may be
    preferable even over node failover
  • Can rejuvenate a system part by part without
    shutting it down
  • Transparent call-level retries can mask a
    microreboot from end users
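
A call-level retry that hides a microreboot from the caller might look roughly like the sketch below. It is not the prototype's mechanism. The exception `ComponentRecovering`, its `retry_after` field, and the helper `call_with_retry` are hypothetical names, and the approach is only safe for operations that can be retried without side effects.

```python
import time


class ComponentRecovering(Exception):
    """Hypothetical error raised while the called component is being microrebooted."""

    def __init__(self, retry_after: float):
        super().__init__(f"component recovering, retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_retry(operation, max_attempts=3, sleep=time.sleep):
    """Call `operation`, retrying transparently while the callee recovers.

    Only suitable for retryable (idempotent) operations.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ComponentRecovering as exc:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            sleep(exc.retry_after)


# Usage: the first call hits a component that is mid-microreboot.
calls = {"count": 0}


def flaky_lookup():
    calls["count"] += 1
    if calls["count"] == 1:
        raise ComponentRecovering(retry_after=0.01)
    return "item-42"


result = call_with_retry(flaky_lookup)
print(result, "after", calls["count"], "calls")
```

The end user sees only a slightly slower response, because the retry happens below the level of the user-visible request.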

28
Crash-only design approach
  • programs that can be safely crashed in whole or
    by parts and recover quickly every time
  • main points of our crash-only design approach
  • Fine-grain components
  • State segregation

29
Complete separation of data recovery from application recovery
  • shifts the burden of data management from the
    often-inexperienced application writers to the
    specialists who develop state stores.
  • conditions
  • Decoupling
  • Retryable requests
  • Leases

30
State segregation
  • Persistent state: MySQL (132K items, 1.5M bids,
    10K users)
  • Session state: FastS (in-memory repository inside
    JBoss) or SSM (maintains state on separate
    machines)
  • Static presentation data (GIFs, HTML, JSPs,
    etc.): Ext3FS filesystem

31
failure detector in the client emulator
  • detects a service's user-visible failures
  • one detector checks whether a client encounters a
    network-level error
  • another detector flags errors by comparing
    results

32
recovery manager (RM)
  • performs simple failure diagnosis and recovery
  • microrebooting EJBs, the WAR → all of eBid → JVM
    → rebooting the operating system
  • simple recursive recovery policy that tries the
    cheapest recovery first
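
The "cheapest recovery first" policy can be sketched as a short recursive escalation. This is an illustration, not the recovery manager's code. The function `recover`, the health check, and the simulated actions are all hypothetical, and the level names simply follow the order listed on the slide.

```python
from typing import Callable, List, Optional, Tuple

RecoveryLevel = Tuple[str, Callable[[], None]]


def recover(levels: List[RecoveryLevel],
            is_healthy: Callable[[], bool],
            level: int = 0) -> Optional[str]:
    """Try the recovery action at `level`; if the system is still unhealthy,
    recurse to the next, more expensive level. Returns the name of the action
    that cured the failure, or None if every level was exhausted."""
    if level == len(levels):
        return None
    name, action = levels[level]
    action()
    if is_healthy():
        return name
    return recover(levels, is_healthy, level + 1)


# Usage: simulate a failure that only a WAR-level microreboot cures.
state = {"healthy": False}
attempted = []


def make_action(name: str, cures: bool) -> Callable[[], None]:
    def action() -> None:
        attempted.append(name)
        if cures:
            state["healthy"] = True
    return action


levels = [
    (name, make_action(name, cures=(name == "microreboot WAR")))
    for name in ("microreboot EJBs", "microreboot WAR", "restart eBid",
                 "restart JVM", "reboot OS")
]

recovered_by = recover(levels, lambda: state["healthy"])
print(recovered_by, attempted)
```

Because escalation stops at the first level that restores health, the expensive JVM restart and OS reboot are never attempted in this run.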

33
action-weighted throughput (Taw)
  • a session consists of actions; an action consists
    of operations
  • an action succeeds or fails atomically
  • if all operations succeed, they count toward good
    Taw
  • if any operation fails, all of the action's
    operations count toward bad Taw
  • both long-running and short-running operations
    must succeed for a user to be happy with the
    service
  • when an action with many operations succeeds, it
    generally means the user did more work than in a
    short action
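
The all-or-nothing accounting behind Taw can be sketched as follows. This is a minimal sketch under simplifying assumptions: each action is modeled as a list of booleans (one per operation, True meaning the operation succeeded), and the function `tally_taw` is a made-up name. It returns raw counts; dividing them by the length of the measurement interval would give a throughput.

```python
from typing import Iterable, List, Tuple


def tally_taw(actions: Iterable[List[bool]]) -> Tuple[int, int]:
    """Return (good, bad) operation counts.

    An action succeeds or fails atomically: if every operation in it
    succeeded, all of them count as good; otherwise all of them count as bad.
    """
    good = bad = 0
    for operations in actions:
        if all(operations):
            good += len(operations)
        else:
            bad += len(operations)
    return good, bad


# Usage: three actions, the second of which contains one failed operation.
actions = [
    [True, True, True],  # succeeds: 3 good operations
    [True, False],       # fails: both operations count as bad
    [True],              # succeeds: 1 good operation
]
good, bad = tally_taw(actions)
print(f"good={good} bad={bad}")
```

Note that the successful first operation of the failed action still counts as bad, which is the point of weighting by action rather than by individual request.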

34
Microreboots recover faster
35
Continued
36
recovery time distribution
37
Microreboots reduce functional disruption
38
Failover under normal load
On average, 2,280 requests failed when recovering
with a JVM restart; with microrebooting, 162
requests failed.
39
microreboots preserve cluster load dynamics
40
Continued
  • requests whose response times exceed 8 seconds

41
User-Transparent Recovery
42
Tolerating Lax Failure Detection
  • Tdet: the time to detect the failure
  • FPdet: the false positive rate
  • FNdet: the false negative rate
  • Cheap recovery relaxes the task of failure
    detection
  • allows for longer Tdet
  • reduces the cost of a false positive

43
continued
44
Averting Failure with Microrejuvenation
  • Available memory during microrejuvenation. Inject
    a 2 KB/invocation leak in Item and a 250
    KB/invocation leak in ViewItem. Malarm is set to
    35% of the 1-GByte heap (thus 350 MB) and
    Msufficient to 80% (800 MB).
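
The two thresholds can be turned into a small sketch of how a rejuvenation loop might use them. The slide gives only the threshold values, so the behavior here is an assumption: rejuvenation starts when available memory drops below Malarm, and components are microrebooted until available memory reaches Msufficient. The function `microrejuvenate`, the choice to reboot the components that free the most memory first, and the per-component figures in the usage example are all hypothetical.

```python
HEAP_MB = 1000  # the slide treats the 1-GByte heap as 1000 MB

M_ALARM_MB = HEAP_MB * 35 // 100       # 350 MB: start rejuvenating below this
M_SUFFICIENT_MB = HEAP_MB * 80 // 100  # 800 MB: stop once this much is available


def microrejuvenate(available_mb, reclaimable_mb):
    """Return (available memory after rejuvenation, components rebooted).

    `reclaimable_mb` maps a component name to the memory (in MB) that
    microrebooting it is expected to free; these are illustrative inputs.
    """
    rebooted = []
    if available_mb >= M_ALARM_MB:
        return available_mb, rebooted  # no alarm, nothing to do
    # Assumption: reboot the components that free the most memory first.
    by_gain = sorted(reclaimable_mb.items(), key=lambda kv: kv[1], reverse=True)
    for name, freed in by_gain:
        if available_mb >= M_SUFFICIENT_MB:
            break
        available_mb += freed
        rebooted.append(name)
    return available_mb, rebooted


# Usage with made-up figures: ViewItem leaks far faster than Item.
result = microrejuvenate(300, {"Item": 60, "ViewItem": 450, "BrowseCategories": 10})
print(result)
```

In this run ViewItem and Item are microrebooted and BrowseCategories is left alone, since the sufficient threshold is already met after the second reboot.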