1
I/O Subsystem (Chapter 8)
N. Guydosh 4/28/04
2
Introduction
  • Amazing variation of characteristics and behaviors
  • Characteristics largely driven by technology
  • Not as elegant as processors or memory systems
  • Traditionally the study of I/O took a back seat
    to processors and memory
  • An unfortunate situation, because a computer
    system is useless without I/O, and Amdahl's law
    tells us that ultimately I/O is the performance
    bottleneck. See example in section 8.1

Typical I/O configuration
Fig. 8.1
3
I/O Performance Metrics
  • A point of confusion: in I/O systems, KB, MB,
    etc. are traditionally powers of 10 (1,000 and
    1,000,000 bytes), but in memory/processor systems
    they are powers of 2 (1,024 and 1,048,576)
  • For simplicity let's ignore the small difference
    and use only one base, say 2.
  • Supercomputer I/O benchmarks
  • Typically for check-pointing the machine, we want
    maximum bytes/sec on output.
  • Transaction processing (TP)
  • Response time and throughput important
  • Lots of small I/O events, thus the number of disk
    accesses per second is more important than
    bytes/sec
  • Reliability very important
  • File system I/O benchmarks
  • These exercise the I/O system with I/O commands,
    e.g. for UNIX: MakeDir, Copy, ScanDir (traverse a
    directory tree), ReadAll (scan every byte in
    every file once), Make (compiling and linking)

4
Types and Characteristics of I/O Devices
  • Again, diversity is the problem here
  • Devices differ significantly in behavior, partner
    (purely machine-interfaced or human-interfaced),
    and data rate, which ranges from a few bytes/sec
    to tens of millions of bytes/sec.
  • See text for descriptions of various devices
    commonly in use
  • Disk access time calculation (sketched below)
  • See book on disk organization
  • Components of access time: average seek time
    (move head to desired track), rotational latency
    (wait for sector to get to head, 0.5
    rotation/RPM), transfer time (time to read or
    write a sector), and sometimes queuing time
    (waiting for a request to get serviced).
  • Disk density and size affect performance and
    usefulness
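Since access time is just the sum of the components
above, a minimal sketch; the drive parameters in the
example call are illustrative assumptions of mine,
not numbers from the text:

```python
# Sketch: average disk access time from its components.
# Drive parameters below are illustrative, not from the text.

def disk_access_time_ms(seek_ms, rpm, sector_bytes, transfer_mb_s, queue_ms=0.0):
    """Access time = seek + rotational latency + transfer (+ optional queuing)."""
    rotational_ms = 0.5 * (60_000.0 / rpm)               # half a rotation, in ms
    transfer_ms = sector_bytes / (transfer_mb_s * 1e6) * 1e3
    return seek_ms + rotational_ms + transfer_ms + queue_ms

# Hypothetical drive: 8 ms seek, 7200 RPM, 512-byte sector, 5 MB/sec media rate.
print(f"{disk_access_time_ms(8.0, 7200, 512, 5.0):.2f} ms")   # ~12.27 ms
```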

5
Connecting the System: Busses
  • A bus connects subsystems together
  • Connects processor, memory, and I/O devices
    together
  • Consists of a set of wires with control logic and
    a well-defined protocol for using the bus
  • Protocol is implemented in hardware
  • A standard bus design was a prime factor in the
    success of the personal computer
  • Purchase a base system and grow it by adding
    off-the-shelf components
  • Historically a very chaotic aspect of the
    computer industry
  • The bus wars ... PCI wins, Micro Channel
    loses!
  • Busses are a key factor in the overall
    performance of a computer system

6
Connecting the System: Busses (cont.)
  • Some bus tradeoffs
  • Advantage: flexibility in adding new devices and
    peripherals
  • Disadvantage: a serially reusable resource => only
    one communication at a time (a bottleneck)
  • Two performance goals
  • High bandwidth (data rate, MB/sec)
  • Low latency
  • Bus consists of a set of data lines and control
    lines
  • Data lines include address and raw data
  • Because the bus is shared, we need a protocol to
    decide who uses it next
  • Bus transaction (send address, receive or send
    data)
  • Terminology is from the point of view of memory
    (confusing!)
  • Input: writes data to memory from I/O
  • Output: reads data from memory to I/O
  • See example in fig. 8.7, 8.8

7
Connecting the System: Busses (cont.)
Fig. 8.7: Output operation: data from memory is
output to the device. Steps: memory read command
and address placed on the bus; data accessed in
memory; data-ready response with the data on the
bus; write to disk.
8
Connecting the System: Busses (cont.)
Fig. 8.8: Input operation: data is input to memory
from the device. Steps: write register command with
the address on the bus; data placed on the bus;
read from disk.
9
Types of Busses
  • Backplane (motherboard) bus
  • Interconnects backplane components
  • Plug-in feature
  • Typical standard busses (ISA, AT, PCI ...)
  • Connects to other busses
  • Processor - memory bus
  • Usually proprietary
  • High speed
  • Direct connection of processor to memory with
    links to other busses
  • I/O bus
  • Typically does not connect directly to memory
  • Usually bridged to the backplane or
    processor-memory bus
  • Examples: SCSI, IDE, EIDE

10
Types of Busses (cont.)
  • A lot of functional overlap in the above 3 types
    of busses
  • Can put memory directly on the backplane bus
  • Logic is needed to interconnect busses (bridge
    chips)
  • Ex: backplane to I/O bus
  • A system may have a single backplane bus
  • Ex: old PCs (ISA/AT)
  • See fig 8.9, p. 659 for examples (next slide)

11
Types of Busses: Example
Fig. 8.9: Single backplane (older PCs); a
processor/memory bus as the main bus (could be a
PCI backplane in modern computers); all 3 types of
busses utilized, e.g. a proprietary
processor-memory bus (old IBM?), an EIDE I/O bus in
a PC, and a PCI backplane.
12
Synchronous vs. Asynchronous Busses
  • Synchronous
  • Bus includes a clock line in the control lines
  • Protocol is not very data dependent
  • Protocol tightly coupled to clock
  • Highly synchronized with clock
  • Completely clock driven
  • The only asynchronous activity is the generation
    of commands or requests
  • Lines must be short due to clock skew
  • A model for this type of bus is an FSM
  • Disadvantages: all devices on the bus must run at
    the same clock speed; lines must be short due to
    the clock skew problem
  • Advantage: can have high performance in special
    applications such as processor-memory bussing
  • Sometimes used for processor-memory bus

13
Synchronous vs. Asynchronous Busses (cont.)
  • Asynchronous
  • Very little clock dependency
  • Event driven
  • Keeps in step via handshaking; see example in
    figure 8.10
  • Very versatile
  • Bus can be arbitrarily long
  • Common for standard busses
  • Ex: SBus (Sun), Micro Channel, PCI
  • Can even connect busses/devices using different
    clocks
  • Disadvantage: lower performance due to
    handshaking
  • A model for this type of bus is a pair of
    interacting FSMs
  • See fig 8.11, p. 664 ... see performance analysis
    on pp. 662-663, based on figure 8.10

14
Handshaking on an Asynchronous Bus
Operation: data from memory to device. Initially the
device raises ReadReq and puts the address on the
data lines.
1. Memory sees ReadReq, reads the address from the
   data bus, and raises Ack.
2. The I/O device sees the Ack line high and
   releases ReadReq and the data lines.
3. Memory sees ReadReq low and drops the Ack line to
   acknowledge the ReadReq signal.
4. Memory puts the data on the data lines and
   asserts DataRdy.
5. The I/O device sees DataRdy, reads the data, and
   signals Ack.
6. Memory sees Ack, drops DataRdy, and releases the
   data lines.
7. The I/O device sees DataRdy drop and drops the
   Ack line.
Note: the bus is bi-directional. Question: what
happens if an Ack fails to get issued?
Color coding (in the figure): colored signals are
from the device, black signals are from memory.
Fig. 8.10
15
FSM Model of an Asynchronous Bus (based on the
example in fig. 8.10)
The numbers in each state correspond to the
numbered steps in fig. 8.10. A software sketch of
the same handshake follows.
Fig. 8.11
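A minimal sketch of the fig. 8.10 handshake in
software, using the signal names from the slide.
Running the two machines as one sequential function
is a simplification; real hardware runs the device
and memory FSMs concurrently:

```python
# Sketch: the fig. 8.10 read handshake as two interacting FSMs
# (device and memory), flattened into one sequential trace.

def handshake_read(address, memory):
    bus = {"ReadReq": 0, "Ack": 0, "DataRdy": 0, "data": None}
    bus["ReadReq"], bus["data"] = 1, address          # device: request + address
    latched = bus["data"]; bus["Ack"] = 1             # 1. memory latches address
    bus["ReadReq"], bus["data"] = 0, None             # 2. device releases lines
    bus["Ack"] = 0                                    # 3. memory acks the release
    bus["data"], bus["DataRdy"] = memory[latched], 1  # 4. memory drives data
    value = bus["data"]; bus["Ack"] = 1               # 5. device reads data, acks
    bus["DataRdy"], bus["data"] = 0, None             # 6. memory releases lines
    bus["Ack"] = 0                                    # 7. device drops Ack
    return value

print(handshake_read(0x10, {0x10: 0xBEEF}))           # 48879
```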
16
An Example (pp. 662-663)
  • Referring to the example in fig 8.10, we will
    compare the asynchronous bandwidth (BW) with a
    synchronous approach
  • Asynchronous
  • 40 ns per handshake (one of the 7 steps)
  • Synchronous
  • Clock cycle 50 ns
  • Each bus transmission takes one clock cycle
  • Both schemes: 32-bit data bus and one-word reads
    from a 200 ns memory
  • Synchronous
  • Send address to memory 50 ns, read memory 200 ns,
    send data to device 50 ns, for a total time of
    300 ns
  • BW = 4 bytes/300 ns = 13.3 MB/sec
  • Asynchronous
  • Can overlap steps 2, 3, and 4 with the memory
    access time
  • Step 1: 40 ns
  • Steps 2, 3, 4: max(3x40 ns, 200 ns) = 200 ns
    (steps 2, 3, 4 hidden by the memory access)
  • Steps 5, 6, 7: 3x40 = 120 ns
  • BW = 4 bytes/(40+200+120) ns = 11.1 MB/sec
  • Observation: synchronous is only 20% faster, due
    to the overlap in handshaking
  • Comment: asynchronous is usually preferred
    because it is more technology independent and
    more versatile in handling different device
    speeds. (The arithmetic is sketched below.)

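The comparison reduces to a few lines of arithmetic;
a sketch using only the numbers given in the example:

```python
# Sketch of the pp. 662-663 comparison: one-word (4-byte) read from a
# 200 ns memory; synchronous bus at 50 ns/cycle vs. asynchronous bus at
# 40 ns/handshake with steps 2-4 overlapped with the memory access.

MEM_NS, WORD_BYTES = 200, 4

sync_ns = 50 + MEM_NS + 50                     # address, access, data
async_ns = 40 + max(3 * 40, MEM_NS) + 3 * 40   # step 1; steps 2-4 hidden; 5-7

for name, ns in [("sync", sync_ns), ("async", async_ns)]:
    print(f"{name}: {ns} ns, {WORD_BYTES / ns * 1e3:.1f} MB/sec")
# sync: 300 ns, 13.3 MB/sec
# async: 360 ns, 11.1 MB/sec
```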
17
An Example (pp. 665-666): The Effect of Block Size
on Synchronous Bus Bandwidth
  • Bus description
  • Two cases to consider: a memory bus system
    supporting access of 4-word blocks (case 1) and
    16-word blocks (case 2), where a word is 32 bits
    in each case
  • 64-bit (2-word) synchronous bus clocked at
    200 MHz (5 ns/cycle), each 64-bit transfer taking
    1 clock cycle, and 1 clock cycle needed to send
    the initial address
  • Two idle clock cycles needed between bus
    operations; the bus is assumed to be idle before
    an access
  • A memory access for the first 4 words is 200 ns
    (40 cycles) and each additional set of 4 words is
    20 ns (4 cycles)
  • Assume that a bus transfer of the most recently
    read data and a read of the next 4 words can be
    overlapped.
  • Summary: memory is accessed 4 words at a time,
    but data must be sent over the bus in two 2-word
    shots (2 cycles), since the bus is only 2 words
    wide.
  • Find the sustained bandwidth, latency (transfer
    time of 256 words), and bus transactions/sec for
    a read of 256 words in the two cases: 4-word
    blocks and 16-word blocks. Note: interpret a bus
    transaction as transferring a (4- or 16-word)
    block.

18
An Example (pp. 665-666): Case 1, 4-word Block
Transfers
  • 1 clock cycle to send the address of the block to
    memory
  • The 200 MHz bus has a 5 ns period (5 ns/cycle);
    memory access time (1st (and only) 4 words) is
    200 ns; cycles to read memory = (memory access
    time)/(clock cycle time) = 200 ns/5 ns = 40
    cycles
  • 2 clock cycles to send the data from memory,
    since we transfer 64 bits = 2 words per cycle and
    a block is 4 words
  • 2 idle cycles between this transfer and the next
  • Note: no overlap here, because the entire block
    is transferred in one access. Overlap occurs only
    within a block for multiple accesses, as in case
    2 (next).
  • Total number of cycles for a block = 45 cycles;
    256 words to be read results in 256/4 = 64
    blocks (transactions); thus 45x64 = 2880 cycles
    are needed for the transfer; latency = 2880
    cycles x 5 ns/cycle = 14,400 ns; bus
    transactions/sec = 64/14,400 ns = 4.44M
    transactions/sec; BW = (256x4) bytes/14,400 ns =
    71.11 MB/sec

19
An Example (pp. 665-666): Case 2, 16-word Block
Transfers
  • Timing for a 1-block (16-word) transfer

(The initial address and first access are
essentially case 1.)
Note: a 16-word block is read in four 4-word
shots, thus there will be overlap.
Total = 1 + 40 + 16 = 57 cycles (was 45 for a
4-word block).
Number of transactions (blocks) needed = 256/16 =
16 transactions (was 64 for a 4-word block). Total
transfer time = 57x16 = 912 cycles (was 2880 for a
4-word block). Latency = 912 cycles x 5 ns/cycle =
4560 ns (was 14,400 ns for a 4-word block).
Transactions/sec = 16/4560 ns = 3.51M
transactions/sec (was 4.44M for a 4-word block).
BW = (256x4) bytes/4560 ns = 224.56 MB/sec (was
71.11 for a 4-word block). Both cases are
recomputed in the sketch below.
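A sketch that recomputes both block-size cases from
the stated bus parameters; the per-block cycle
formula is my reading of the overlap rule (each
4-word group after the first is hidden behind the
previous group's 2-cycle transfer plus 2 idle
cycles):

```python
# Sketch of the pp. 665-666 example: 256-word read over a 64-bit, 200 MHz
# synchronous bus (5 ns/cycle); 1 address cycle, 2 cycles per 2-word
# transfer, 2 idle cycles between operations; first 4-word access takes
# 40 cycles, each further 4-word access 4 cycles, overlapped with the
# transfer of the previous group.

CYCLE_NS, TOTAL_WORDS = 5, 256

def stats(block_words):
    groups = block_words // 4               # memory is read 4 words at a time
    cycles = 1 + 40 + (groups - 1) * 4 + 4  # addr + 1st read + overlapped + last xfr
    txns = TOTAL_WORDS // block_words
    ns = cycles * txns * CYCLE_NS
    return cycles, txns, ns, TOTAL_WORDS * 4 / ns * 1e3   # MB/sec

for blk in (4, 16):
    cycles, txns, ns, bw = stats(blk)
    print(f"{blk:2d}-word: {cycles} cyc/block, {txns} txns, "
          f"{ns} ns, {bw:.2f} MB/sec")
# 4-word: 45 cyc/block, 64 txns, 14400 ns, 71.11 MB/sec
# 16-word: 57 cyc/block, 16 txns, 4560 ns, 224.56 MB/sec
```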
20
Controlling Bus Access
  • Only one master on the bus at a time
  • Bus control: the bus master
  • Controls access to the bus
  • Initiates and controls all bus requests
  • Slave
  • Never generates its own requests
  • Responds to read and write requests
  • The processor is always a master
  • Memory is usually a slave
  • Having a single bus master could create a
    bottleneck
  • The processor would be involved with every bus
    transaction
  • See fig 8.12 for an example

21
Bus Control With a Single Master
The disk makes a request to the processor for a
data transfer from memory to disk.
The processor responds by asserting the read
request line to memory.
The processor acks to the disk that the request is
being processed. The disk now places the desired
address on the bus.
Fig. 8.12
22
Controlling Bus Access: Multiple Masters
  • Bus arbitration: deciding which master gets
    control of the bus (p. 669)
  • A chip (the arbiter) decides which device gets
    the bus next
  • Typically each device has a dedicated line to the
    arbiter for requests
  • The arbiter will eventually issue a grant (on a
    separate line to the device)
  • The device now is master, uses the bus, and then
    signals the arbiter when it is done with the bus.
  • Devices have priorities
  • The bus arbiter may invoke a fairness rule for a
    low-priority device which is waiting
  • Arbitration time is overhead and should be
    overlapped with bus transfers whenever possible
    - maybe use physically separate lines for
    arbitration.

23
Arbitration Schemes (p. 670)
  • Daisy chain
  • Chain from high- to low-priority devices
  • A device making a request takes the grant but
    does not pass it on; the grant is passed on only
    by non-requesting devices - no fairness, possible
    starvation.

24
Arbitration Schemes (p. 670)
  • Centralized, parallel
  • Multiple request lines; the chosen device becomes
    master; requires a central arbiter, a potential
    bottleneck
  • Used by PCI
  • Distributed arbitration - self selection
  • Multiple request lines
  • Requesters place an ID code on the bus - by
    examining the bus, each can determine priority
  • No need for a central arbiter; needs more lines
    for requests (ex: NuBus for Apple/Mac)
  • Distributed arbitration by collision detection
  • Free-for-all: request the bus at will
  • A collision detector then resolves who gets it
  • Ethernet uses this. (A toy fairness-aware arbiter
    is sketched below.)

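To make the fairness idea concrete, a toy software
model of a centralized parallel arbiter; the
starvation threshold and device names are invented,
and real arbiters are hardware FSMs, not code:

```python
# Sketch: centralized parallel arbitration with a simple fairness rule --
# highest priority wins, but a requester that keeps losing eventually
# gets granted first.

losses = {}   # consecutive losses per device, kept across arbitrations

def arbitrate(requests, priority, starved_limit=3):
    """Grant to the highest-priority requester, unless one has starved."""
    starved = [d for d in requests if losses.get(d, 0) >= starved_limit]
    winner = min(starved or requests, key=lambda d: priority[d])
    for d in requests:
        losses[d] = 0 if d == winner else losses.get(d, 0) + 1
    return winner

prio = {"disk": 0, "net": 1, "uart": 2}   # lower number = higher priority
print([arbitrate({"disk", "uart"}, prio) for _ in range(6)])
# ['disk', 'disk', 'disk', 'uart', 'disk', 'disk'] -> uart is not starved
```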
25
I/O to Memory, Processor, and OS Interfaces
  • Questions (p. 673)
  • How do I/O requests transform to device commands
    and get transferred to a device?
  • How are data transfers between device and memory
    done?
  • What is the role of the operating system?
  • The OS
  • Device drivers operating in kernel/supervisory
    mode.
  • Performs interrupt handling and DMA services.
  • Functions: commands to I/O.
  • Respond to I/O signals ... some are interrupts.
  • Control data transfer ... buffers, DMA, other
    algorithms, control priorities.

26
Commands To I/O Devices
  • Two basic approaches
  • Direct I/O (programmed I/O or PIO)
  • Memory-mapped I/O
  • PIO
  • Special I/O instructions (in/out for Intel)
  • The address associated with in/out is put on the
    address bus, but the op-code context causes the
    I/O interface (usually registers) to be accessed,
    causing I/O activity
  • The address is an I/O port
  • Memory mapped => see next slide

27
Commands To I/O Devices (cont)
  • Memory mapped
  • A certain portion of the address space is
    reserved for I/O devices
  • A program communicates with a device in the same
    way it does with memory: memory instructions are
    used
  • If the address is in the device space range, the
    device controller responds with appropriate
    commands to the device ... read/write
  • User programs are not allowed to access memory
    mapped I/O space
  • The address used by the instruction encodes both
    the device identity and the type of data
    transmission
  • Memory mapped is usually faster than PIO because
    DMA is available (a user-space sketch follows)

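As a rough illustration of memory-mapped I/O, a
sketch using mmap on Linux's /dev/mem. The device
base address and register offsets are invented, and
real systems restrict this to privileged code (as
the slide notes), so this is a sketch of the idea,
not a recipe:

```python
# Sketch: memory-mapped I/O from user space via mmap on /dev/mem.
# DEVICE_BASE and the register offsets are hypothetical; running this
# requires root on Linux and a device actually mapped at that address.

import mmap, os, struct

DEVICE_BASE = 0xFE00_0000        # hypothetical device address range
STATUS_REG, DATA_REG = 0x0, 0x4  # hypothetical register offsets

fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
regs = mmap.mmap(fd, 4096, offset=DEVICE_BASE)

# Ordinary load/store instructions now drive the device controller:
status = struct.unpack_from("<I", regs, STATUS_REG)[0]     # read a register
struct.pack_into("<I", regs, DATA_REG, 0xDEADBEEF)         # write a register
```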
28
I/O - Processor Communication: Polling / Memory
Mapped
  • Polling is the simplest way for I/O to
    communicate with the processor
  • Periodically check status bits to see what to do
    next; the I/O device posts status in a special
    register, e.g. "I am busy"
  • The processor continually checks for status using
    either PIO or memory-mapped I/O
  • Wastes a lot of processor time, because
    processors are faster than I/O devices.
  • Many of the polls occur when the waited-for event
    has not yet happened
  • OK for slow devices such as a mouse
  • Under OS control, polls can be limited to periods
    only when the device is active, thus allowing
    polling even for faster devices: cheap I/O!

29
I/O - Example
  • Examples for slow, medium, and high speed
    devices. Determine the impact of polling overhead
    for 3 devices. Assume the number of clock cycles
    per poll is 400 and a 500 MHz clock. In all cases
    no data can be missed.
  • Example 1: a mouse polled 30 times/sec.
    Cycles/sec for polling = 30 polls/sec x 400
    cyc/poll = 12,000 cyc/sec. Percentage of
    processor cycles consumed = 12,000/500 MHz,
    about 0.002%. Negligible impact on performance.
  • Example 2: a floppy disk
  • Transfers data to the processor in 16-bit
    (2-byte) units
  • and has a data rate of 50 KB/sec
  • Polling rate = (50 KB/sec)/(2 bytes/poll) =
    25K polls/sec. Cycles/sec for polling = 25K
    polls/sec x 400 cyc/poll = 10^7 cyc/sec.
    Percentage of processor cycles consumed =
    (10^7 cyc/sec)/500 MHz = 2%. Still tolerable.

30
I/O - Example (cont.)
  • Example 3: a hard drive. Transfers data in
    four-word chunks. Transfer rate is 4 MB/sec. Must
    poll at the data rate in 4-word chunks:
    (4 MB/sec)/(16 bytes/xfr), a polling rate of
    250K polls/sec. Cycles/sec for polling = (250K
    polls/sec) x (400 cyc/poll) = 10^8 cyc/sec.
    Percentage of processor cycles consumed =
    (10^8 cyc/sec)/500 MHz = 20%
  • 1/5 of the processor would be used in polling the
    disk! Not acceptable.
  • The bottom line: polling works OK for low speed
    devices but not for high speed devices. (All
    three cases are recomputed below.)

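A sketch recomputing all three polling examples from
the stated 400 cycles/poll and 500 MHz clock:

```python
# Sketch of the polling-overhead examples (slides above).

CLOCK_HZ, CYC_PER_POLL = 500e6, 400

def poll_overhead(polls_per_sec):
    """Fraction of CPU cycles consumed by polling."""
    return polls_per_sec * CYC_PER_POLL / CLOCK_HZ

print(f"mouse : {poll_overhead(30):.4%}")         # 0.0024%, ~0.002% (slide rounds)
print(f"floppy: {poll_overhead(50e3 / 2):.0%}")   # 2%  (25K polls/sec)
print(f"disk  : {poll_overhead(4e6 / 16):.0%}")   # 20% (250K polls/sec)
```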
31
Interrupt Driven I/O
  • The problem with simple polling is that it must
    be done even when nothing is happening, during a
    waiting period
  • When CPU processing is needed for an I/O event,
    the processor is interrupted.
  • Interrupts are asynchronous
  • Not associated with any particular instruction
  • Allows instruction completion (compare with
    exceptions in chapter 5)
  • An interrupt must convey further information,
    such as the identity of the device and its
    priority.
  • Convey this additional information by using
    vectored interrupts or a cause register.

32
Interrupt Scheme
The granularity of an interrupt is a single
machine instruction. The check for pending
interrupts and the processing of interrupts is done
between instructions being executed, i.e., the
current instruction is completed before a pending
interrupt is processed.
33
Overhead for Interrupt Driven I/O
  • Using the previous example of a hard drive (p.
    676)
  • Data transfers in 4-word chunks; transfer rate
    of 4 MB/sec
  • Assume the overhead for each transfer, including
    the interrupt, is 500 clock cycles
  • Find the % of the processor consumed if the hard
    drive is only transferring data 5% of the time,
    causing CPU interaction.
  • Answer
  • The interrupt rate for a busy disk would be the
    same as the previous polling rate, to match the
    transfer rate: (250K interrupts/sec) x 500
    cycles/interrupt = 125x10^6 cyc/sec
  • % processor consumed during an XFR =
    125x10^6/500 MHz = 25%; assuming the disk is
    transferring data 5% of the time, the % processor
    consumed (average) = 25% x 5% = 1.25%
  • No overhead when the disk is not actually
    transferring data: an improvement over polling.
    (Sketched below.)

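A sketch of the interrupt-overhead arithmetic above:

```python
# Sketch of the p. 676 interrupt-driven example: same 4 MB/sec disk,
# 500 cycles of overhead per 16-byte transfer, 500 MHz CPU.

CLOCK_HZ = 500e6
rate = 4e6 / 16                  # 250K interrupts/sec while transferring
busy = rate * 500 / CLOCK_HZ     # fraction of CPU while the disk is busy
print(f"{busy:.0%}, average {busy * 0.05:.2%}")   # 25%, average 1.25%
```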
34
DMA I/O
  • Polling and interrupt driven I/O are best with
    lower bandwidth devices where cost is more of a
    factor.
  • Both polling and interrupt driven I/O put the
    burden of moving data and managing the transfer
    on the CPU.
  • Even though the processor may continue processing
    during an I/O access, it ultimately must move the
    I/O data from the device when the data becomes
    available, or perhaps from some I/O buffer to
    main memory.
  • In our previous example of an interrupt driven
    hard disk, even though the CPU does not have to
    wait for every I/O event to complete, it would
    still consume 25% of the CPU cycles while the
    disk is transferring data. See p. 680.
  • Interrupt driven I/O for high bandwidth devices
    can be greatly improved if we make a device
    controller transfer data directly to memory
    without involving the processor: DMA (Direct
    Memory Access).

35
DMA I/O (cont.)
  • DMA is a specialized processor that transfers
    data between memory and an I/O device while the
    CPU goes on with other tasks.
  • The DMA unit is external to the CPU and must act
    as a bus master.
  • The CPU first sets up the DMA registers with a
    memory address and the number of bytes to be
    transferred.
  • To the requesting program, this may be seen as
    setting up a control block in memory.
  • DMA is frequently part of the controller for a
    device.
  • Interrupts are still used with DMA, but only to
    inform the processor that the I/O transfer is
    complete or has an error.
  • DMA is a form of multi- or parallel processing;
    not a new idea: IBM channels for mainframes in
    the 60s.
  • Channels are programmable (with channel control
    words), whereas DMA is generally not
    programmable.

36
DMA I/O: How It Works
  • Three steps of DMA
  • Processor sets up the DMA: device id, operation,
    source/destination, number of bytes to transfer
  • DMA controller arbitrates for the bus, supplies
    the correct commands to the device (source,
    destination, etc.), then lets the data rip.
    Fancy buffering may be used ... ping/pong
    buffers. May be multi-channeled.
  • Interrupt the processor on completion of DMA or
    on error
  • DMA can still contend with the processor,
    competing for memory and the bus.
  • Problem: cycle stealing - when there is
    bus/memory contention because the CPU is
    accessing a memory word during a DMA xfr, DMA
    wins out and the CPU pauses instruction execution
    for a memory cycle (the cycle was stolen).

37
Overhead Using DMA
  • Again use the previous disk example on page 676.
  • Assume initial setup of the DMA takes 1000 CPU
    cycles
  • Assume interrupt handling for DMA completion
    takes 500 CPU cycles
  • The hard drive has a transfer rate of 4 MB/sec
    and uses DMA
  • The average transfer size from disk is 8 KB
  • What % of the 500 MHz CPU is consumed if the disk
    is actively transferring 100% of the time?
    Ignore any bus contention between the CPU and the
    DMA controller.
  • Answer: each DMA transfer takes 8 KB/(4 MB/sec) =
    0.002 sec/xfr. When the disk is constantly
    transferring, it takes (1000 + 500) cyc/xfr /
    0.002 sec/xfr = 750,000 clock cyc/sec. Since the
    CPU runs at 500 MHz, the % of processor consumed
    = (750,000 cyc/sec)/500 MHz = 0.0015, about 0.2%.
    (Sketched below.)

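A sketch of the DMA-overhead arithmetic above:

```python
# Sketch of the DMA-overhead example: 1000 cycles setup + 500 cycles
# completion interrupt per 8 KB transfer at 4 MB/sec, 500 MHz CPU.

CLOCK_HZ = 500e6
xfr_sec = 8e3 / 4e6                     # 0.002 sec per 8 KB transfer
cyc_per_sec = (1000 + 500) / xfr_sec    # 750,000 CPU cycles/sec
print(f"{cyc_per_sec / CLOCK_HZ:.2%}")  # 0.15%, i.e. roughly 0.2%
```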
38
DMA: Virtual vs. Physical Addressing (p. 683)
  • In a VM system, should DMA use virtual addresses
    or physical addresses? This topic in the book is
    at best flaky - here is my take on it
  • If virtual addresses are used
  • Contiguous pages in VM may not be contiguous in
    PM.
  • A DMA request is made by specifying the virtual
    address of the starting point of the data to be
    transferred and the number of bytes to be
    transferred.
  • The DMA unit would have to translate VA to PA for
    all reads/writes to/from memory: a performance
    problem. In practice the address translation may
    be done by the OS, which provides the DMA unit
    with the physical addresses: a scatter/gather
    operation. Fancy DMA controllers may be able to
    chain a series of pages for a single request of
    more than one page: the OS provides a list of
    physical page frame addresses corresponding to
    the multi-page DMA block in VM (sketched below).
    Or: restrict the DMA block sizes to integral
    pages and translate the starting address.
  • If physical addresses are used, they may not be
    contiguous in virtual memory if a page boundary
    is crossed. Must constrain all DMA transfers to
    stay within a single page, or requests must be
    for a page at a time.
  • Also the OS must be savvy enough not to relocate
    pages in the target/source region during a DMA
    transfer.

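A toy sketch of the scatter/gather translation
described above, with an invented page table; the
OS would hand the resulting (physical address,
length) list to the DMA controller:

```python
# Sketch: turning one virtual-address DMA request into a scatter/gather
# list of physical segments. The page table here is invented.

PAGE = 4096
page_table = {0x10: 0x7A, 0x11: 0x33, 0x12: 0x5C}   # VPN -> PFN (hypothetical)

def scatter_gather(vaddr, nbytes):
    """Split [vaddr, vaddr+nbytes) into (physical address, length) segments."""
    segs = []
    while nbytes > 0:
        off = vaddr % PAGE
        chunk = min(PAGE - off, nbytes)                      # stop at page edge
        segs.append((page_table[vaddr // PAGE] * PAGE + off, chunk))
        vaddr, nbytes = vaddr + chunk, nbytes - chunk
    return segs

print([(hex(a), hex(n)) for a, n in scatter_gather(0x10F00, 0x1200)])
# [('0x7af00', '0x100'), ('0x33000', '0x1000'), ('0x5c000', '0x100')]
```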
39
DMA Memory Coherency
  • DMA in memory/cached systems
  • Without DMA, all memory access goes through
    address translation and the cache
  • With DMA, data is transferred to/from main memory
    bypassing the cache => coherency problem
  • DMA reads/writes go to main memory
  • No cache between the processor and the DMA
    controller
  • The value of a memory location seen by DMA and by
    the CPU may differ
  • If DMA writes into main memory at a location for
    which there are corresponding entries in the
    cache, the cache data seen by the CPU will be
    obsolete.
  • If the cache is write-back, and the DMA reads a
    value directly from main memory before the cache
    does a write back (due to lazy write backs), then
    the value read by DMA will be obsolete. Remember,
    there is a possibility that DMA will take
    priority over the CPU in accessing memory, to
    the CPU's disadvantage.
  • Possible solutions: see next =>

40
DMA Memory Coherency (cont.)
  • Some solutions (see pp. 683-684)
  • Route all I/O activity through the cache
  • Performance hit, and may be costly
  • May flush out good data needed by the processor
  • ... I/O data may not be that critical to the
    processor at the time it arrives; the working set
    may be messed up.
  • The OS selectively invalidates the cache for an
    I/O-to-memory operation, or forces a write back
    for a memory-to-I/O read, an operation called
    cache flushing. (There may be some read/write
    terminology confusion here!) Some HW support is
    needed here.
  • Hardware mechanism to selectively flush (or
    invalidate) cache entries. This is a common
    mechanism in multiprocessor systems where there
    are many caches for a common main memory (the MP
    cache coherency problem). The same technique
    works for I/O; after all, DMA is a form of
    multiprocessing. (The hazard is illustrated
    below.)

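A toy model to make the stale-data hazard concrete,
with dictionaries standing in for the cache and main
memory:

```python
# Sketch: the stale-cache problem when DMA writes main memory directly.

memory = {0x100: "old"}
cache  = {0x100: "old"}           # CPU previously cached this line

memory[0x100] = "fresh-from-DMA"  # DMA writes memory, bypassing the cache
print(cache[0x100])               # CPU still sees "old" -> stale data

cache.pop(0x100, None)            # solution: OS/HW invalidates the line
print(cache.get(0x100) or memory[0x100])   # now "fresh-from-DMA"
```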
41
Designing an I/O System: The Problem
  • Specifications for a system
  • CPU: maximum instruction rate 300 MIPS; average
    number of CPU instructions per I/O in the OS is
    50,000
  • Bandwidth of the memory backplane bus:
    100 MB/sec
  • SCSI-2 controllers with a transfer rate of 20
    MB/sec; the SCSI bus on each controller can
    accommodate up to 7 disks
  • Disk drives: read/write bandwidth of 5 MB/sec
    and average seek plus rotational latency of 10 ms
  • The workload this system must support
  • 64 KB reads, sequential on a track
  • The user program needs 100,000 instructions per
    I/O operation. This is distinct from instructions
    in the OS.
  • The problem: find the maximum sustainable I/O
    rate and the number of disks and SCSI controllers
    required. Assume that reads can always be done
    on an idle disk if one exists; ignore disk
    conflicts.

42
Designing an I/O System: The Solution
  • Strategy: there are two fixed components in the
    system, the memory bus and the CPU. Find the I/O
    rate that each component can sustain and
    determine which of these is the bottleneck.
  • Each I/O takes 100,000 user instructions and
    50,000 OS instructions. Max I/O rate for the CPU
    = (instruction rate)/(instructions per I/O) =
    (300x10^6) / ((100+50)x10^3) = 2000 I/Os per sec
  • Each I/O transfers 64 KB, thus max I/O rate of
    the backplane bus = (bus BW)/(bytes per I/O) =
    (100x10^6)/(64x10^3) = 1562 I/Os per sec
  • The bus is the bottleneck; design the system to
    support the bus performance of 1562 I/Os per sec.
  • Number of disks needed to accommodate 1562 I/Os
    per sec: time per I/O at the disk =
    seek/rotational latency + transfer time =
    10 ms + 64 KB/(5 MB/sec) = 22.8 ms. Thus each
    disk can complete 1/22.8 ms = 43.9 I/Os per sec.
    To saturate the bus, we need (1562 I/Os per sec)
    / (43.9 I/Os per sec) = 36 disks.
  • How many SCSI busses is this? Required transfer
    rate per disk = xfr size/xfr time = 64 KB/22.8 ms
    = 2.74 MB/sec. Assume we can use all the SCSI bus
    BW. We can place SCSI BW/xfr rate per disk =
    (20 MB/sec)/(2.74 MB/sec) = 7.3 => 7 disks on
    each SCSI bus. Note: a SCSI bus can support a max
    of 7 disks. For 36 disks we need 36/7 = 5.14 =>
    6 buses. (The whole calculation is sketched
    below.)
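A sketch of the whole calculation. It uses decimal
units throughout, so the per-disk rate comes out
near 2.8 MB/sec rather than the slide's
binary-megabyte 2.74 MB/sec; the conclusions (7
disks per bus, 36 disks, 6 buses) are the same:

```python
# Sketch of the slide 41-42 design calculation: find the bottleneck,
# then size disks and SCSI controllers to it. Decimal units throughout.

import math

io_cpu = 300e6 / (100e3 + 50e3)      # CPU limit: 2000 I/Os per sec
io_bus = 100e6 / 64e3                # bus limit: ~1562 I/Os per sec
target = min(io_cpu, io_bus)         # the bus is the bottleneck

io_time = 10e-3 + 64e3 / 5e6         # 10 ms + 12.8 ms = 22.8 ms per disk I/O
disks = math.ceil(target * io_time)  # 1562.5 / 43.9 -> 36 disks

per_disk = 64e3 / io_time                  # ~2.8 MB/sec sustained per disk
per_bus = min(int(20e6 // per_disk), 7)    # SCSI BW allows 7; cap is also 7
print(target, disks, math.ceil(disks / per_bus))   # 1562.5, 36, 6 buses
```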