A Software Layer for Disk Fault Injection

1
A Software Layer for Disk Fault Injection
  • Jake Adriaens
  • Dan Gibson

CS 736, Spring 2005. Instructor: Remzi Arpaci-Dusseau
2
Outline
  1. Introduction, Motivation, Challenges
  2. Related Work
  3. Implementation Details: IDE Driver
  4. Fault Model
  5. Methods & Evaluation
  6. Summary

3
Overview - 1
  • Software system for modeling IDE disk faults in
    an x86/Linux-based computer
  • Modification to IDE driver for read/write event
    interception

4
Overview - 2
  • Disk faults described at a high level
  • Faults passed to kernel-level module
  • On read/write event
  • IDE driver calls kernel module to perform request
    modification
  • Before write event, module may modify data
    to-be-written
  • After read event, module may modify data read
    from disk

5
Motivation: Why purposely cause disk failures?
  • Commodity HW (and SW!) fails, usually at
    unexpected times
  • Causing failures at expected times can help
    improve fault tolerance measures
  • Can be used to determine fault tolerance of
    systems
  • Various flavors of RAID need fault injection

6
Motivation
  • Faults can happen at the worst time
  • In the middle of a PowerPoint presentation

8
Challenges
  • Drivers are typically written with reliability in
    mind
  • May have error detection / correction measures
  • Should these be removed? Fooled? Applauded?
  • Low-level drivers critically affect performance
    and stability of the system
  • Disk faults need not be stable, but shouldn't
    have unusual side effects

9
Challenges
  • Failure models difficult to justify
  • Disk manufacturers don't offer details on how/why
    their disks fail
  • Failstop model is widely used; it models complete,
    detected disk failure
  • Other models must be chosen generally to account
    for many different disks, controllers, etc.

10
Outline
  1. Introduction, Motivation, Challenges
  2. Related Work
  3. Implementation Details: IDE Driver
  4. Fault Model
  5. Methods & Evaluation
  6. Summary

11
Related Work
  • Software fault injection
  • Huang et al. (and many others) use software
    fault injection for modifying cached web pages
    (ACM/ProcWWW)
  • Jarboui et al. inject software faults into the
    Linux kernel and observe system behavior
  • Nagaraja et al. inject faults into cluster-based
    systems

12
Related Work
  • Disk Faults, Modeling, Detection
  • Kaaniche et al. inject disk faults to study RAID
    behavior
  • Kari et al. present fault detection and
    diagnosis techniques (separate studies)
  • Various other RAID and/or FS papers use some form
    of fault injection to model failures

13
Related Work
  • Hardware Fault Injection

14
Outline
  1. Introduction, Motivation, Challenges
  2. Related Work
  3. Implementation Details: IDE Driver
  4. Fault Model
  5. Methods & Evaluation
  6. Summary

15
Implementation
  • Core components
  • User-level parser
  • In-kernel injection module
  • In-driver upcalls
  • System calls
  • Added 20 lines to IDE driver code
  • Kernel module is demand-loaded, 250 lines in
    size
  • 2 System calls, inject_fault and getdrivesize,
    120 lines

16
Implementation: User-level Console
  • Used for fault definition
  • Console interface for fault definition
  • Processes batch files
  • Checks faults for validity
  • Sector ranges, probability, etc. (more later)
  • Passes faults to kernel module
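A minimal sketch of the kind of validity check the console might perform before handing a fault to the kernel module. The fault type names come from the fault model slides; the `FaultSpec` fields, the `validate` function, and the specific rules are assumptions made for illustration, not the authors' parser.

```python
from dataclasses import dataclass

# Fault type names are taken from the slides; everything else is hypothetical.
FAULT_TYPES = {"sectorfail", "sectorro", "sectorwrong",
               "transaddr", "transdata", "failstop"}


@dataclass(frozen=True)
class FaultSpec:
    kind: str                 # one of FAULT_TYPES
    first_sector: int         # start of affected logical sector range
    last_sector: int          # end of range, inclusive
    probability: float = 1.0  # chance the fault fires when excited


def validate(spec: FaultSpec, drive_sectors: int) -> list[str]:
    """Return a list of problems; an empty list means the spec is acceptable."""
    problems = []
    if spec.kind not in FAULT_TYPES:
        problems.append(f"unknown fault type: {spec.kind}")
    if not (0 <= spec.first_sector <= spec.last_sector < drive_sectors):
        problems.append("sector range is reversed or outside the drive")
    if not (0.0 <= spec.probability <= 1.0):
        problems.append("probability must be in [0, 1]")
    return problems
```

Returning every problem at once, rather than stopping at the first, suits batch-file processing, where one pass should report all bad entries.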

17
Implementation: IDE Driver Modification
  • Added upcalls to injection module
  • Pass I/O requests to module for modification
  • Provide callback service on I/O completion
  • Added special-purpose code for certain fault
    models
  • Failstop model requires in-driver actions

18
Implementation: Kernel Module
  • Receives fault lists from user-level console
  • Called by IDE driver to perform insertion when
  • LBA sector (SCSI-like) becomes known: sector
    may be modified
  • Write is initiated: data to be written may be
    modified
  • Read completes: data may be modified before
    returning control to I/O initiator
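The three call points listed above can be illustrated with a user-space simulation. This is only a sketch: the real system lives in the Linux IDE driver and a kernel module written in C, and the names `FakeDisk`, `PassThroughInjector`, `on_sector_known`, `before_write`, and `after_read` are hypothetical stand-ins.

```python
class PassThroughInjector:
    """Default hooks: change nothing (no faults installed)."""

    def on_sector_known(self, sector, is_write):
        return sector  # the sector number may be modified here

    def before_write(self, sector, data):
        return data    # data to be written may be modified here

    def after_read(self, sector, data):
        return data    # data read may be modified here


class FlipFirstByteOnRead(PassThroughInjector):
    """Example fault: corrupt the first byte of one sector when it is read."""

    def __init__(self, target):
        self.target = target

    def after_read(self, sector, data):
        if sector == self.target and data:
            return bytes([data[0] ^ 0xFF]) + data[1:]
        return data


class FakeDisk:
    """In-memory stand-in for the driver: calls the injector at each hook."""

    def __init__(self, injector):
        self.blocks = {}
        self.injector = injector

    def write(self, sector, data):
        sector = self.injector.on_sector_known(sector, is_write=True)
        self.blocks[sector] = self.injector.before_write(sector, data)

    def read(self, sector):
        sector = self.injector.on_sector_known(sector, is_write=False)
        data = self.blocks.get(sector, b"")
        return self.injector.after_read(sector, data)


disk = FakeDisk(FlipFirstByteOnRead(target=7))
disk.write(7, b"\x00abc")
disk.write(8, b"\x00abc")
```

Reading sector 7 returns corrupted data while the stored block is untouched, which mirrors the idea that the module alters data on its way back to the I/O initiator rather than on the platter.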

19
Implementation: System Calls
  • Added two system calls
  • inject_faults()
  • Used to pass fault definitions to kernel module
    from user space
  • getsectors()
  • Used to determine raw sector ranges of IDE
    devices by name (there are other ways to do this)

20
Implementation
21
IDE Driver (2.4.26 Linux Kernel)
  • Important structures
  • struct request
  • Information about an IDE request
  • READ / WRITE
  • Number of sectors
  • Etc
  • struct ide_drive_s (_t)
  • Information about a drive
  • Drive name (e.g. hdc)
  • Sizing/addressing information
  • Etc

22
IDE Driver (2.4.26 Linux Kernel)
  • Functions
  • ide_do_rw_disk (3 versions)
  • Common choke-point for reads & writes
  • Many other similar functions, only this one in
    use
  • Two versions, swapped by preprocessor directives
    (one for DMA, one for PIO)

23
Outline
  1. Introduction, Motivation, Challenges
  2. Related Work
  3. Implementation Details
  4. Fault Model
  5. Methods & Evaluation
  6. Summary

24
Failure Model
  • Models selected to represent generic IDE disk
  • No modeling of specific failures (e.g. Western
    Digital's classic servo malfunction)
  • Models based on ranges of affected logical
    sectors (à la SCSI)

25
Failure Model: Fault Types
  • sectorfail
  • Models inability of a given sector (block) or
    sector range to store data reliably
  • Excited on read of sector
  • Data read is permuted in some way
  • Randomized
  • Set to specific value
  • Added to offset
  • Shifted by one or more bytes
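The four data permutations named above could look roughly like this. The slide gives only the names, so the details are assumptions: "added to offset" is read here as adding a constant to every byte modulo 256, and "shifted" as a byte rotation. All function names are hypothetical.

```python
import random


def randomize(data: bytes, rng: random.Random) -> bytes:
    """Replace every byte with a random value of the same length."""
    return bytes(rng.randrange(256) for _ in data)


def set_value(data: bytes, value: int) -> bytes:
    """Overwrite the whole buffer with one specific byte value."""
    return bytes([value]) * len(data)


def add_offset(data: bytes, offset: int) -> bytes:
    """Add a constant to each byte, wrapping modulo 256 (assumed meaning)."""
    return bytes((b + offset) % 256 for b in data)


def shift_bytes(data: bytes, n: int) -> bytes:
    """Rotate the buffer left by n bytes (assumed meaning of 'shifted')."""
    if not data:
        return data
    n %= len(data)
    return data[n:] + data[:n]
```

Each function preserves the buffer length, so a corrupted sector is still a full sector as far as the caller can tell.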

26
Failure Model: Fault Types
  • sectorro
  • Writes to block have no effect on stored value
  • Excited on writes to sector
  • Write requests ignored
  • sectorwrong
  • Traffic to a given block is directed to a
    different block
  • Excited on reads & writes
  • Address permuted, similarly to data
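A small in-memory sketch of how sectorro and sectorwrong behave from the caller's point of view. `FaultyStore` and its fields are hypothetical, and the choice to check the read-only set before applying redirection is an assumption, since the slides do not say how the two faults interact.

```python
class FaultyStore:
    """Dictionary-backed block store with two optional fault behaviours."""

    def __init__(self, read_only=(), redirect=None):
        self.blocks = {}
        self.read_only = set(read_only)       # sectorro: writes are dropped
        self.redirect = dict(redirect or {})  # sectorwrong: sector -> other sector

    def _resolve(self, sector):
        # Traffic to a faulty sector is directed to a different sector.
        return self.redirect.get(sector, sector)

    def write(self, sector, data):
        if sector in self.read_only:
            return  # write request silently ignored, no error reported
        self.blocks[self._resolve(sector)] = data

    def read(self, sector):
        return self.blocks.get(self._resolve(sector), b"")


store = FaultyStore()
store.write(5, b"old")
store.read_only.add(5)      # inject sectorro on sector 5
store.write(5, b"new")      # has no effect on the stored value
after_ro = store.read(5)

store.write(9, b"other")
store.redirect[5] = 9       # inject sectorwrong: sector 5 now maps to 9
after_wrong = store.read(5)
```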

27
Failure Model: Fault Types
  • transaddr
  • Sector number wrong for first fault excitation,
    but right for all others
  • Excited on reads & writes
  • Sector permuted as in sectorwrong
  • transdata
  • Data is wrong for first fault excitation
  • Data permuted as in sectorfail
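The transient variants differ from their permanent counterparts only in firing once. A sketch of that one-shot behaviour, assuming a hypothetical `TransientFault` wrapper around any permutation function (for a sector number or for data):

```python
class TransientFault:
    """Apply a permutation on the first excitation only, then pass through."""

    def __init__(self, permute):
        self.permute = permute
        self.fired = False

    def apply(self, value):
        if self.fired:
            return value          # later excitations see the correct value
        self.fired = True
        return self.permute(value)


# transaddr-style use: the sector number is wrong once, then right.
addr_fault = TransientFault(lambda sector: sector + 1)
first = addr_fault.apply(100)
second = addr_fault.apply(100)

# transdata-style use: the data is wrong once, then right.
data_fault = TransientFault(lambda data: bytes(len(data)))
first_data = data_fault.apply(b"abc")
second_data = data_fault.apply(b"abc")
```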

28
Failure Model: Fault Types
  • failstop
  • Drive is totally unresponsive: performs no reads
    or writes
  • Differs from traditional Failstop in that our
    failstop is invisible
  • Drive does not report any errors, simply fails to
    perform reads or writes to any sector

29
Outline
  1. Introduction, Motivation, Challenges
  2. Related Work
  3. Implementation Details
  4. Fault Model
  5. Methods & Evaluation
  6. Summary

30
Verification of Faults (?)
  • Faults excited and observed by microbenchmarks
    tailored to individual fault types
  • Techniques similar to latent fault detection
    (Kari et al., and other studies)
  • Verification of faults is fault-specific

31
Verification - sectorfail
  • Corrupts data when read from disk
  • Write known data to disk - observe location using
    printk statement
  • Inject sectorfail fault at location of file on
    disk.
  • Unmount/remount FS (flush cache)
  • Attempt to read faulty file (with cat)

32
Verification - sectorro
  • Ignores writes to a given location
  • Write known data to disk
  • Inject sectorro fault
  • Flush file cache
  • Write different data to same location
  • Flush file cache
  • Read the data written in the first step back from disk

33
Verification - sectorwrong
  • Changes address (sector) to another sector number
  • Write known data to disk
  • Flush file cache
  • Inject sectorwrong fault: redirect to known
    location
  • Read from file; observe data from other sector

34
Verification - transdata
  • Data modified after read, but only the first time
  • Verify sectorfail functionality
  • Flush file cache
  • Re-read, expect correct data

35
Verification - transaddr
  • Sector number modified before reads & writes
  • Verify sectorwrong functionality
  • Flush file cache
  • Repeat read, expect correct data

36
Verification - failstop
  • Easy!
  • Install failstop fault
  • Attempt to access any portion of affected drive
  • Expect bad things
  • Usually causes kernel panic

37
Evaluation
  • Execution time overhead of injection SW
  • Overhead << standard dev. of runtime for
    unaffected regions of disk space
  • Overhead << standard dev. of runtime for affected
    regions
  • Averaged over 250 accesses

                    Avg. (ms)   Std. Dev.
No injection        3.025       0.075
Unaffected region   3.020       0.076
Affected region
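A quick check of the "overhead << standard deviation" claim using the two complete rows reported on this slide. The variable names are illustrative, and the affected-region row is left out because its values are not present in the transcript.

```python
# Figures as reported on the slide (average access time in ms, std. dev.).
no_injection_avg, no_injection_std = 3.025, 0.075
unaffected_avg, unaffected_std = 3.020, 0.076

# Difference between the two averages, compared with run-to-run spread.
delta = abs(unaffected_avg - no_injection_avg)
ratio = delta / no_injection_std

print(f"difference of {delta:.3f} ms is {ratio:.2f} of one standard deviation")
```

The two averages differ by about 0.005 ms, well under a tenth of the measured standard deviation, so any injection overhead is lost in measurement noise for unaffected regions.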
38
Outline
  1. Introduction, Motivation, Challenges
  2. Related Work
  3. Implementation Details
  4. Fault Model
  5. Methods & Evaluation
  6. Summary

39
Summary
  • Present five new failure models for disk
    accesses, and the ability to inject them
  • Verified fault manifestation
  • Did not verify potential side effects ?
  • Fault injection has no noticeable effect on
    access times
  • Small SW overhead: much smaller than access time
    to physical device