Title: FIG: A Prototype Tool for On-Line Verification of Recovery Mechanisms
1 FIG: A Prototype Tool for On-Line Verification of Recovery Mechanisms
- Naveen Sastry, Pete Broadwell, Jonathan Traupman, David Patterson
- University of California, Berkeley
2 Presentation Outline
- Introduction
- Objective/Motivation
- Background
- Methods
- Implementation
- Test setup
- Evaluation
- Test results
- Conclusions
3 The Berkeley/Stanford ROC Project
- Purpose: investigating novel techniques for building highly dependable Internet services
- Example techniques
- Advanced support for operator undo
- Stability through targeted restarts
- Integrated root cause analysis
- Online verification of recovery mechanisms
4 FIG Project Objective/Motivation
- Objective
- Develop a lightweight, extensible tool for injecting errors to test recovery code/mechanisms
- Motivation
- Testing and production environments are always different
- Large systems will require recovery code, which should be tested as part of normal operation
5 Software's Invisible Users
User Input
User interface
Application
Other libraries
Other apps
System libraries (libc)
OS
Concept: Jim Whittaker, Florida Institute of Technology
6 Related Testing Methods
- Ballista (DeVale, Koopman, Siewiorek)
- Top-down testing of POSIX-compliant OS and library interfaces
- Fuzz (Miller, Fredriksen, So)
- Tested UNIX applications by feeding them random input streams
- Holodeck (Whittaker et al.)
- Similar approach to ours, but only for Windows 2000/XP
7 FIG Implementation
- Thin stub library between the application and its libraries
- Traps API calls
- Logs them
- Inserts faults
- Can be inserted into any app without modification
- Uses LD_PRELOAD
Application
libfig.so
libc.so, other libs
OS
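The trap, log, and inject steps listed on this slide can be illustrated in miniature. FIG itself is a shared library loaded through LD_PRELOAD; the Python sketch below only mimics the idea at the function level. The names `interpose`, `call_log`, `read_config`, and `stubbed_read` are hypothetical and are not part of FIG.

```python
import errno
import os

call_log = []  # hypothetical in-memory log of (function name, call number)


def interpose(func, should_fail, err):
    """Wrap func so that each call is logged and may fail with OSError(err)."""
    count = 0

    def stub(*args, **kwargs):
        nonlocal count
        count += 1
        call_log.append((func.__name__, count))  # log the trapped call
        if should_fail(count):
            # Simulate the underlying library call failing with this errno.
            raise OSError(err, os.strerror(err))
        return func(*args, **kwargs)  # otherwise pass through unchanged

    return stub


def read_config(path):
    """Stand-in for a real library call; just returns a fixed string."""
    return "contents of " + path


# Fail the third call with EIO; every other call passes through.
stubbed_read = interpose(read_config, lambda n: n == 3, errno.EIO)
```

The caller keeps using the same call signature, which mirrors the slide's point that the stub can be placed in front of an application without modifying it.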
8 Extensibility
- API stubs are automatically generated
- Very easy to add new APIs to log
- Fault injection is under script control
- Can simulate multiple fault models (e.g., memory pressure)
Sample control file:
MALLOC_INDEX
  interval 82 to infinity return 0 errno ENOMEM probability 0.03
OPEN_INDEX
  // device out of space.
  interval 100 to infinity return -1 errno ENOSPC probability 0.001
  // kernel out of memory.
  interval 100 to 120 return -1 errno ENOMEM probability 0.1
  // too many files open.
  callnumber 108 return -1 errno EMFILE probability 1.0
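One way to read the control file entries is as rules that are checked on every trapped call. The sketch below is an interpretation, not FIG's actual parser or data model: the `Rule` class and `pick_fault` function are hypothetical, a `callnumber` entry is modeled as a one-call interval, failed open() calls are assumed to return -1, and the "first matching rule wins" ordering is an assumption.

```python
import errno
import math
import random
from dataclasses import dataclass


@dataclass
class Rule:
    first: int          # first call number the rule applies to
    last: float         # last call number (math.inf for "to infinity")
    retval: int         # value the stubbed call should return
    err: int            # errno value to report
    probability: float  # chance of firing on each call inside the interval


def pick_fault(rules, call_number, rng):
    """Return (retval, errno) for the first rule that fires, else None."""
    for rule in rules:
        in_range = rule.first <= call_number <= rule.last
        # rng.random() is in [0, 1), so probability 1.0 always fires.
        if in_range and rng.random() < rule.probability:
            return rule.retval, rule.err
    return None


# The three open() entries from the sample control file.
open_rules = [
    Rule(100, math.inf, -1, errno.ENOSPC, 0.001),
    Rule(100, 120, -1, errno.ENOMEM, 0.1),
    Rule(108, 108, -1, errno.EMFILE, 1.0),
]
```

Passing in a seeded `random.Random` instance makes a test run repeatable, which helps when comparing how an application reacts to the same fault sequence.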
9 Test Setup: Applications
- GNU file utilities (ls, mv, etc.)
- Emacs 20.7.1 with and without X
- Apache 1.3.22
- Berkeley DB 4.0.14
- Netscape Navigator 4.76
- MySQL server 3.23.36
10 Test Setup: Instrumented Calls and Their Errors
- malloc(): memory exhaustion
- read(): I/O error, system call was interrupted
- write(): I/O error, no space left on device, call interrupted
- open(): memory exhaustion, no space on device, too many files open
- select(): memory exhaustion
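The error descriptions on this slide correspond to standard errno codes, most of which are named elsewhere in the deck (ENOMEM, EIO, EINTR, ENOSPC, EMFILE). The lookup table below is a small sketch of that mapping; the dictionary name `INSTRUMENTED` is hypothetical, and the pairing of each description with a code is an interpretation of the slide.

```python
import errno
import os

# Instrumented call -> errno codes injected for it (interpretation of the slide).
INSTRUMENTED = {
    "malloc": [errno.ENOMEM],
    "read": [errno.EIO, errno.EINTR],
    "write": [errno.EIO, errno.ENOSPC, errno.EINTR],
    "open": [errno.ENOMEM, errno.ENOSPC, errno.EMFILE],
    "select": [errno.ENOMEM],
}

for call, codes in INSTRUMENTED.items():
    # errno.errorcode maps the number back to its symbolic name;
    # os.strerror gives the platform's human-readable message.
    names = ", ".join(f"{errno.errorcode[c]} ({os.strerror(c)})" for c in codes)
    print(f"{call}(): {names}")
```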
11 Test Results: Client Apps
                read()   read()   write()   write()   select()     malloc()
                EINTR    EIO      ENOSPC    EIO       ENOMEM       ENOMEM
Emacs (no X)    o.k.     exit     warn      warn      o.k.         crash
Emacs (w/ X)    o.k.     crash    o.k.      crash     crash/exit   crash
Netscape        warn     exit     exit      exit      n/a          exit
12 Test Results: Server Apps
                        read()       read()        write()      write()      select()   malloc()
                        EINTR        EIO           ENOSPC       EIO          ENOMEM     ENOMEM
Berkeley DB (Xact)      retry        detect        Xact abort   Xact abort   n/a        Xact abort
Berkeley DB (no Xact)   retry        detect        data loss    data loss    n/a        detect, or data loss
MySQL Server            Xact abort   retry, warn   Xact abort   Xact abort   retry      restart process
Apache                  o.k.         req. drop     req. drop    req. drop    o.k.       n/a
13 Netscape Reacts
14 Test Results: Overhead
Configuration            Time (s)   Overhead (%)
No FIG                   33.46      N/A
FIG, no logging          34.28      2.5
Logging w/o timestamps   47.83      42.9
Logging w/ timestamps    61.74      84.5
strace (all syscalls)    112.85     237.3
Timing: using Berkeley DB (non-transactional) to read, sort, and write one million words.
- Note: FIG communicates with a separate logging daemon through shared memory to reduce logging overhead.
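The overhead figures appear to be the percentage slowdown relative to the run without FIG, that is, (time - baseline) / baseline x 100. The short sketch below recomputes them from the reported times; the function name `overhead_percent` is illustrative only.

```python
BASELINE = 33.46  # seconds, the run without FIG

timings = {
    "FIG, no logging": 34.28,
    "Logging w/o timestamps": 47.83,
    "Logging w/ timestamps": 61.74,
    "strace (all syscalls)": 112.85,
}


def overhead_percent(seconds, baseline=BASELINE):
    """Relative slowdown versus the uninstrumented run, in percent."""
    return (seconds - baseline) / baseline * 100.0


for name, seconds in timings.items():
    print(f"{name:24s} {overhead_percent(seconds):6.1f}%")
```

Rounded to one decimal place, the results agree with the reported values of 2.5, 42.9, 84.5, and 237.3.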
15 Strategies for Reliable Services
- Intelligent retry
- ls: bounded retry of malloc()
- Resource preallocation
- Apache: allocates buffer pool at startup
- Degraded service
- Apache: deactivates logging if disk full
- Process pools
- Apache and MySQL
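The first strategy, a bounded retry, can be sketched as follows. This is a generic illustration rather than the code used by ls: the helper `bounded_retry`, its default limits, and the `FlakyAllocator` test double are all hypothetical, and Python's `MemoryError` stands in for a failed malloc().

```python
import time


def bounded_retry(operation, attempts=3, delay=0.0, retry_on=(MemoryError,)):
    """Call operation() up to `attempts` times, re-raising the last failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # give up so the caller can degrade or report the error
            time.sleep(delay)  # brief pause so transient pressure can ease


class FlakyAllocator:
    """Hypothetical allocator that fails a fixed number of times, then works."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def allocate(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise MemoryError("simulated ENOMEM")
        return bytearray(16)
```

Bounding the number of attempts matters: an unbounded loop would hang under sustained memory pressure instead of surfacing the error.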
16 FIG as a Prototype for Online Error Injection
- Low run-time overhead
- Easy to enable/disable
- Easy to configure
- Extensible
- Can simulate multiple fault models
17 A Case for Online Error Injection
- Recovery code is not usually exercised during normal operation
- Deployed environments tend to differ from testing environments
- Can run error injection tests on a subset of deployed systems
- FIG can simulate common environmental errors
18 Conclusions
- FIG exposed a variety of deficiencies in how our test applications handled environmental errors
- Server apps are generally more robust than client applications
- FIG exhibits low overhead
- FIG is suitable for online error injection
20 Future Directions
- Limitations of FIG
- Only for UNIX-like OSes
- Limited to app/library interface (proxy for app/OS interaction)
- Make FIG part of a larger test suite
- Include clock-time and event-based error triggers
- Greater flexibility in configuration file
21 Other Related Work
- Xept (Vo et al.)
- Instruments object code to ensure that error handling code exists
- Processor memory errors
- DOCTOR, HYBRID, DEFINE
- Process memory corruption
- FERRARI, DEFINE