Title: Chapter 8 CPU and Memory: Design, Implementation, and Enhancement
1 Chapter 8: CPU and Memory: Design, Implementation, and Enhancement
- The Architecture of Computer Hardware and Systems Software: An Information Technology Approach, 3rd Edition, Irv Englander
- John Wiley and Sons ©2003
- Wilson Wong, Bentley College
- Linda Senne, Bentley College
2 CPU Architecture Overview
- CISC: Complex Instruction Set Computer
- RISC: Reduced Instruction Set Computer
- CISC vs. RISC Comparisons
- VLIW: Very Long Instruction Word
- EPIC: Explicitly Parallel Instruction Computer
3 CISC Architecture
- Examples
- Intel x86, IBM Z-Series mainframes, older CPU architectures
- Characteristics
- Few general-purpose registers
- Many addressing modes
- Large number of specialized, complex instructions
- Instructions are of varying sizes
4 Limitations of CISC Architecture
- Complex instructions are infrequently used by programmers and compilers
- Memory references, loads and stores, are slow and account for a significant fraction of all instructions
- Procedure and function calls are a major bottleneck
- Passing arguments
- Storing and retrieving values in registers
5 RISC Features
- Examples
- PowerPC, Sun SPARC, Motorola 88000
- Limited and simple instruction set
- Fixed-length, fixed-format instruction words
- Enable pipelining, parallel fetches and executions
- Limited addressing modes
- Reduce complicated hardware
- Register-oriented instruction set
- Reduce memory accesses
- Large bank of registers
- Reduce memory accesses
- Efficient procedure calls
6 RISC Attempts to Produce More CPU Power
- By reducing the number of data memory accesses through more effective use of registers.
- By simplifying the instruction set, based on the assumption that rarely used instructions add hardware complexity and slow down execution.
7 CISC vs. RISC Processing
8 Studies
- A study by Hopkins in 1987 showed that 10 instructions accounted for 71% of all instructions executed on the IBM System/370. The study showed that optimizing the performance of LOAD, STORE, and BRANCH instructions could result in a substantial increase in CPU performance.
9 Studies
- Several studies observed that both programmers and compilers avoided the use of complex instructions when they were available. In one study, 85% of the statements in 5 different high-level languages consisted of assignment statements, IF statements, and procedure calls.
10 Studies
- Procedure and function calls create huge bottlenecks because of the need to pass parameters and arguments from one procedure to the next.
11 Use of a Circular Register Buffer
- Provides general-purpose registers for program use and also offers a solution to the problem of copying blocks of values from one location to another during procedure transfers and context switching.
12 Circular Register Buffer
- A typical circular register buffer has 168 registers, grouped in 8s; one block of 8 is used for global variables.
- To a program, the machine appears to have 32 registers (24 of them in the window). At any given instant, the registers form a window into the bank. A current window pointer indicates the starting point of the window.
13 Circular Register Buffer
- The window is divided into three parts: the first eight registers store incoming parameters, the middle eight are used for local variables and temporary storage, and the final eight store outgoing parameters (to pass).
- When another procedure call occurs, the current window pointer is shifted by 16 registers. The new procedure's window overlaps the previous one. The output registers of the previous procedure are now viewed as the parameters of the current procedure.
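The overlap just described can be sketched numerically. A minimal simulation, assuming the sizes from the slides (168 registers, 24-register windows, a 16-register shift per call):

```python
# Illustrative sketch (assumed sizes from the slides: 168 registers,
# 24-register windows that shift by 16, giving an 8-register overlap).
BANK_SIZE = 168        # physical registers in the circular buffer
WINDOW = 24            # registers visible to a procedure (besides 8 globals)
SHIFT = 16             # pointer advance per call

def window_registers(cwp):
    """Physical register indices visible at current window pointer cwp."""
    return [(cwp + i) % BANK_SIZE for i in range(WINDOW)]

caller = window_registers(0)
callee = window_registers(SHIFT)   # after a procedure call

# The caller's last 8 (output) registers are the callee's first 8 (inputs):
overlap = set(caller) & set(callee)
print(sorted(overlap))   # -> [16, 17, 18, 19, 20, 21, 22, 23]
```

Because the pointer wraps modulo the bank size, a deep chain of calls eventually reuses registers, which is why the buffer is "circular."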
14 Circular Register Buffer
15 Circular Register Buffer: After Procedure Call
16 CISC vs. RISC Performance Comparison
- RISC → simpler instructions → more instructions → more memory accesses
- RISC → more bus traffic and increased cache memory misses
- More registers would improve CISC performance, but no space is available for them
- Modern CISC and RISC architectures are becoming similar
17 VLIW Architecture
- Transmeta Crusoe CPU
- 128-bit instruction bundle called a molecule
- 4 32-bit atoms (atom = instruction)
- Parallel processing of 4 instructions
- 64 general-purpose registers
- Code morphing layer
- Translates instructions written for other CPUs into molecules
- Instructions are not written directly for the Crusoe CPU
18 EPIC Architecture
- Intel Itanium CPU
- 128-bit instruction bundle
- 3 41-bit instructions
- 5 bits to identify the types of instructions in the bundle
- 128 64-bit general-purpose registers
- 128 82-bit floating-point registers
- Intel x86 instruction set included
- Programmers and compilers follow guidelines to ensure parallel execution of instructions
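The bundle arithmetic on this slide is easy to verify: three 41-bit instruction slots plus a 5-bit template field fill the 128-bit bundle exactly.

```python
# Quick check of the Itanium bundle layout described above:
# three 41-bit instruction slots plus a 5-bit type/template field.
slots, slot_bits, template_bits = 3, 41, 5
bundle_bits = slots * slot_bits + template_bits
print(bundle_bits)   # -> 128, the full bundle width in bits
```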
19 8.2 Paging
- A method by which the computer is able to conceptually separate the addresses used in a program from the addresses that actually identify physical locations in memory.
- Program addresses are referred to as logical addresses. The actual memory addresses are called physical addresses.
20 Paging
- Paging creates a correspondence between the logical and physical addresses so that each logical address is automatically and invisibly transformed into a physical address by the computer system during program execution. This transformation is known as mapping.
- The memory management unit (MMU) sits between the CPU and the memory and provides the mapping capability.
21 Paging
- Managed by the operating system
- Built into the hardware
- Independent of application
22 Logical vs. Physical Addresses
- Logical addresses are relative locations of data, instructions, and branch targets, and are separate from physical addresses
- Logical addresses are mapped to physical addresses
- Physical addresses do not need to be consecutive
23 Logical vs. Physical Addresses
24 Pages and Frames
- Paging divides both logical and physical memory into equally sized blocks.
- Each logical block is called a page.
- Each physical block is called a frame.
25 Page Address Layout
26 Page Translation Process
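A minimal sketch of the translation process named above, assuming 4 KB pages and a made-up page table (the sizes and frame numbers here are illustrative, not from the slides):

```python
# Hedged sketch of page translation: split the logical address into a
# page number and an offset, look up the frame, and reassemble.
PAGE_SIZE = 4096                      # assumed bytes per page (and per frame)

page_table = {0: 7, 1: 3, 2: 9}      # page number -> frame number (made up)

def translate(logical_address):
    """Map a logical address to a physical address via the page table."""
    page   = logical_address // PAGE_SIZE   # which page the address is in
    offset = logical_address %  PAGE_SIZE   # position within that page
    frame  = page_table[page]               # the MMU's lookup step
    return frame * PAGE_SIZE + offset       # offset is unchanged by paging

print(translate(5000))   # page 1, offset 904 -> frame 3 -> 3*4096 + 904 = 13192
```

The key property is that only the page number changes during mapping; the offset within the page carries over untouched.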
27 Memory Enhancements
- Memory is slow compared to CPU processing speeds!
- 2 GHz CPU: 1 cycle in half a billionth of a second
- 70 ns DRAM: 1 access in 70 billionths of a second
- Methods to improve memory access
- Wide path memory access
- Retrieve multiple bytes instead of 1 byte at a time
- Memory interleaving
- Partition memory into subsections, each with its own address register and data register
- Cache memory
28 8.3 Memory Enhancement
- Within the fetch-execute cycle, the slowest steps are those that access memory
- Memory is usually made up of DRAM
- Inexpensive
- Each chip can store millions of bits of data
- Drawback: access time is too slow to keep up with a modern CPU
- Delays must be added into the LOAD/STORE execution pipeline to allow memory to keep up
- Can create a potential bottleneck
29 Alternative: SRAM
- PRO
- Two to three times faster than DRAM
- CON
- Requires a lot of chip real estate because the circuitry is more complex, and it generates a lot of heat
- More expensive
- One or two MB of SRAM requires more space than 64 MB of DRAM
30 Other Alternatives
- Wide path memory access
- Memory Interleaving
- Cache memory
- Expanded memory
31 Wide Path Memory Access
- Widen the data path so it is possible to read or write several bytes or words (between the CPU and memory) in a single access
- Can be accomplished by
- Widening the bus data path
- Using a larger memory data register
32 Memory Interleaving
- Divide memory into parts
- Makes it possible to access more than one location at a time
- Each part has
- Its own address register
- Its own data register
- The ability to be accessed independently
33 n-Way Interleaving
- Dividing memory so that successive access points are in different blocks.
- Example: 2-way interleaving would allow you to access an odd memory address and an even one at the same time. If an 8-byte path width is provided, this would allow access to 16 successive bytes at a time.
34 Example: 4-Way Interleaving
- 4-way interleaving would allow you to access 4 different locations simultaneously.
- Blocks 0, 1, 2, and 3 could be accessed
- Blocks 0 and 2 could be accessed
- Blocks 1 and 5 could not: they occupy the same position in their groups (0,1,2,3) and (4,5,6,7)
- Blocks 1, 6, 88, and 123 could be accessed (Why?)
- Blocks 2, 6, and 15 could not (Why?)
35 Memory Interleaving
36 Cache Memory
- A small amount of high-speed memory between the CPU and main storage
- Invisible to the programmer: it cannot be addressed in the usual way
- Organized in blocks
- Blocks are used to hold an exact reproduction of a corresponding amount of storage from somewhere in memory
37 Cache Memory
- Each block holds a tag
- The tag identifies the location in main memory
- A hardware cache controller checks the tags to determine whether the requested memory location is presently stored within the cache. If it is, the cache is used as if it were main memory.
38 Cache
- If a write instruction occurs, data is stored in the appropriate cache memory location.
- A request (load or store) that is satisfied this way is called a hit.
- The ratio of hits to the total number of requests is called the hit ratio.
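The hit ratio feeds directly into a standard weighted-average estimate of memory performance. A sketch, using assumed access times (5 ns cache, 70 ns DRAM) and a simple model that ignores line-fill overhead:

```python
# Hedged sketch: effective access time as a weighted average of cache
# and memory times. The times below are assumed, illustrative values.
def effective_access_time(hit_ratio, t_cache, t_memory):
    """Average access time: hits go to the cache, misses to main memory."""
    return hit_ratio * t_cache + (1 - hit_ratio) * t_memory

# Assumed example: 90% hit ratio, 5 ns cache, 70 ns DRAM.
print(round(effective_access_time(0.90, 5, 70), 1))   # -> 11.5 (ns)
```

Even a modest cache turns 70 ns memory into roughly 11.5 ns average access under these assumptions, which is where the performance gains cited later come from.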
39 Cache
- If cache memory is full, some blocks are selected for replacement
- Various algorithms for replacement
- LRU: least recently used
- Considerations: read-only vs. updating
- Write-through: writes data back to main memory immediately upon a change in the cache
- Write-back: writes data back to main memory only when a cache line is replaced (faster)
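The write-through vs. write-back distinction can be sketched with a toy dict-based cache (everything here is illustrative, not a real cache API):

```python
# Toy model: write-through updates memory on every store; write-back
# defers the memory update until the line is evicted.
memory = {0: 10, 1: 20}
cache = dict(memory)     # assume both lines are already cached

def store(addr, value, write_through=True):
    cache[addr] = value
    if write_through:
        memory[addr] = value          # write-through: memory updated now

def evict(addr):
    memory[addr] = cache.pop(addr)    # write-back: memory updated on eviction

store(0, 99, write_through=True)      # memory[0] becomes 99 immediately
store(1, 77, write_through=False)     # memory[1] is still 20 (stale)
evict(1)                              # now memory[1] becomes 77
print(memory)   # -> {0: 99, 1: 77}
```

Write-back is faster because repeated stores to the same line touch main memory only once, but between the store and the eviction, memory holds stale data.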
40 Why Cache?
- Even the fastest hard disk has an access time of about 10 milliseconds
- A 2 GHz CPU waiting 10 milliseconds wastes 20 million clock cycles!
41 Cache Memory
- Blocks: 8 or 16 bytes
- Tags: location in main memory
- Cache controller
- hardware that checks tags
- Cache line
- Unit of transfer between storage and cache memory
- Hit ratio: ratio of hits out of total requests
- Synchronizing cache and memory
- Write-through
- Write-back
42 Step-by-Step Use of Cache
43 Step-by-Step Use of Cache
44 Why Cache Works
- Based on the principle of locality of reference
- At any given time, most memory references will be confined to one or a few small regions of memory.
45 Performance Advantages
- Hit ratios of 90% are common
- 50% improved execution speed
- Locality of reference is why caching works
- Most memory references are confined to a small region of memory at any given time
- A well-written program stays in a small loop, procedure, or function
- Data is likely in an array
- Variables are stored together
46 Two-Level Caches
- Why do the sizes of the caches have to be different?
47 Cache vs. Virtual Memory
- Cache speeds up memory access
- Virtual memory increases the amount of perceived storage
- independence from the configuration and capacity of the memory system
- low cost per bit
48 Expanded Memory
- Memory in excess of the base 640K (8 MB under LIM 4.0)
- compatible with the PC/XT base
- required add-on hardware and software (on the 286, just software for LIM 3.2)
- uses a technique called bank switching and utilizes a 64K block of "upper" memory
- software has to be written to take advantage of the expanded memory spec
- on 286, 386 and up machines, extended memory can be mapped as expanded memory
- Several software packages made use of expanded memory: Lotus 1-2-3, WordPerfect, DOS games, Windows 3.x
- (From Dave Rathke's ITK 355 website)
49 8.4 Modern CPU Processing Methods
- Timing Issues
- Separate Fetch/Execute Units
- Pipelining
- Scalar Processing
- Superscalar Processing
50 Timing Issues
- A computer clock is used for timing purposes
- MHz: million steps per second
- GHz: billion steps per second
- Instructions can (and often do) take more than one step
- Data word width can require multiple steps
51 Clock
- Provides master control as to when each step in the instruction cycle takes place.
- The pulses of the clock are separated sufficiently to assure that each step has time to complete, with the data settled down, before the results of that step are required by the next step.
52 Example: Original IBM PC
- The clock ran at 4.77 MHz
- The machine performed 4.77 million steps every second
- If a typical IBM PC instruction requires ten steps, then 4.77/10 million (roughly 0.5 million) instructions could be executed per second.
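The slide's arithmetic as a one-liner, with the ten-step instruction length taken as the slide's assumption:

```python
# Instruction rate = clock rate / steps per instruction.
clock_hz = 4.77e6           # original IBM PC clock, 4.77 MHz
steps_per_instruction = 10  # assumed typical instruction length in steps

instructions_per_second = clock_hz / steps_per_instruction
print(instructions_per_second)   # -> 477000.0, roughly half a million
```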
53 Separate Fetch-Execute Units
- Fetch unit
- Instruction fetch unit
- Instruction decode unit
- Determines the opcode
- Identifies the type of instruction and the operands
- Several instructions are fetched in parallel and held in a buffer until decoded and executed
- IP: Instruction Pointer register
- Execute unit
- Receives instructions from the decode unit
- The appropriate execution unit services the instruction
54 Alternative CPU Organization
55 Instruction Pipelining
- An assembly-line technique that overlaps the fetch-execute cycles of sequences of instructions
- Only one instruction is being executed to completion at a time
- Scalar processing
- Average instruction execution rate is approximately equal to the clock speed of the CPU
- Problems from stalling
- Instructions have different numbers of steps
- Problems from branching
56 Branch Problem Solutions
- Separate pipelines for both possibilities
- Requiring that the following instruction not be dependent on the branch
- Instruction reordering (superscalar processing)
57 Pipelining Example
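The benefit of the overlap described on the previous slide can be quantified under ideal assumptions (no stalls or branches): a k-stage pipeline completes n instructions in k + (n - 1) cycles rather than n * k.

```python
# Ideal pipeline model: no stalls, no branches (both assumed away here).
def unpipelined_cycles(n, k):
    return n * k                   # each instruction runs start to finish

def pipelined_cycles(n, k):
    return k + (n - 1)             # fill the pipe once, then 1 per cycle

n, k = 100, 5                      # assumed: 100 instructions, 5 stages
print(unpipelined_cycles(n, k))    # -> 500
print(pipelined_cycles(n, k))      # -> 104
```

For large n the pipelined rate approaches one instruction per clock cycle, which is exactly the scalar-processing figure quoted on the pipelining slide.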
58 Scalar Processing
- A processor that processes approximately one instruction per clock cycle.
59 Superscalar Processing
- Processes more than one instruction per clock cycle
- Separates fetch and execute cycles as much as possible
- Buffers for fetch and decode phases
- Parallel execution units
60 Scalar vs. Superscalar Processing
61 Superscalar Issues
- Out-of-order processing dependencies (hazards)
- Data dependencies
- Branch (flow) dependencies and speculative execution
- Parallel speculative execution or branch prediction
- Branch history table
- Register access conflicts
- Logical registers
62 8.5 Hardware Implementation
- Hardware operations are implemented by logic gates
- Advantages
- Speed
- RISC designs are simple and typically implemented in hardware
63 Microprogrammed Implementation
- Microcode: tiny programs stored in ROM that implement the CPU's instructions
- Advantages
- More flexible
- Easier to implement complex instructions
- Can emulate other CPUs
- Disadvantage
- Requires more clock cycles