Title: Chapter 8 CPU and Memory: Design, Implementation, and Enhancement
1 Chapter 8: CPU and Memory: Design, Implementation, and Enhancement
- The Architecture of Computer Hardware and Systems Software: An Information Technology Approach, 3rd Edition, Irv Englander
- John Wiley and Sons ©2003
- Wilson Wong, Bentley College
- Linda Senne, Bentley College
2 CPU Architecture Overview
- CISC: Complex Instruction Set Computer
- RISC: Reduced Instruction Set Computer
- CISC vs. RISC Comparisons
- VLIW: Very Long Instruction Word
- EPIC: Explicitly Parallel Instruction Computer
3 CISC Architecture
- Examples
- Intel x86, IBM Z-Series mainframes, older CPU architectures
- Characteristics
- Few general-purpose registers
- Many addressing modes
- Large number of specialized, complex instructions
- Instructions are of varying sizes
4 Limitations of CISC Architecture
- Complex instructions are infrequently used by programmers and compilers
- Memory references, loads and stores, are slow and account for a significant fraction of all instructions
- Procedure and function calls are a major bottleneck
- Passing arguments
- Storing and retrieving values in registers
5 RISC Features
- Examples
- PowerPC, Sun SPARC, Motorola 88000
- Limited and simple instruction set
- Fixed-length, fixed-format instruction words
- Enable pipelining, parallel fetches and executions
- Limited addressing modes
- Reduce complicated hardware
- Register-oriented instruction set
- Reduce memory accesses
- Large bank of registers
- Reduce memory accesses
- Efficient procedure calls
6 RISC Attempts to Produce More CPU Power
- By reducing the number of data memory accesses through more effective use of registers.
- By simplifying the instruction set, based on the assumption that rarely used instructions add hardware complexity and slow down execution.
7 CISC vs. RISC Processing
8 Studies
- A study by Hopkins in 1987 showed that 10 instructions accounted for 71% of all instructions executed on the IBM System/370. The study showed that optimizing the performance of LOAD, STORE, and BRANCH instructions could result in a substantial increase in CPU performance.
9 Studies
- Several studies observed that both programmers and compilers avoided the use of complex instructions when they were available. In one study, 85% of the statements in 5 different high-level languages consisted of assignment statements, IF statements, and procedure calls.
10 Studies
- Procedure and function calls create huge bottlenecks because of the need to pass parameters and arguments from one procedure to the next.
11 Use of a Circular Register Buffer
- Provides general-purpose registers for program use and also offers a solution to the problem of copying blocks of values from one location to another during procedure transfers and context switching.
12 Circular Register Buffer
- A typical circular register buffer has 168 registers, grouped in 8s; one block of 8 is used for global variables.
- To a program, the machine appears to have 32 registers (24 of them in the window). At any given instant, the registers form a window into the bank. A current window pointer indicates the starting point of the window.
13 Circular Register Buffer
- The window is divided into three parts: the first eight registers store incoming parameters, the middle eight are used for local variables and temporary storage, and the final eight store outgoing parameters (to pass).
- When another procedure call occurs, the current window pointer is shifted by 16 registers. The new procedure's window overlaps the previous one. The output registers of the previous procedure are now viewed as the parameters of the current procedure.
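The overlap just described can be sketched numerically. A minimal simulation, assuming the sizes from the slides (168 registers, 24-register windows, a 16-register shift per call):

```python
# Illustrative sketch (assumed sizes from the slides: 168 registers,
# 24-register windows that shift by 16, giving an 8-register overlap).
BANK_SIZE = 168        # physical registers in the circular buffer
WINDOW = 24            # registers visible to a procedure (besides 8 globals)
SHIFT = 16             # pointer advance per call

def window_registers(cwp):
    """Physical register indices visible at current window pointer cwp."""
    return [(cwp + i) % BANK_SIZE for i in range(WINDOW)]

caller = window_registers(0)
callee = window_registers(SHIFT)   # after a procedure call

# The caller's last 8 (output) registers are the callee's first 8 (inputs):
overlap = set(caller) & set(callee)
print(sorted(overlap))   # -> [16, 17, 18, 19, 20, 21, 22, 23]
```

Because the pointer wraps modulo the bank size, a deep chain of calls eventually reuses registers, which is why the buffer is "circular."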
14 Circular Register Buffer
15 Circular Register Buffer: After Procedure Call
16 CISC vs. RISC Performance Comparison
- RISC → simpler instructions → more instructions → more memory accesses
- RISC → more bus traffic and increased cache memory misses
- More registers would improve CISC performance, but no space is available for them
- Modern CISC and RISC architectures are becoming similar
17 VLIW Architecture
- Transmeta Crusoe CPU
- 128-bit instruction bundle called a molecule
- 4 32-bit atoms (atom = instruction)
- Parallel processing of 4 instructions
- 64 general-purpose registers
- Code morphing layer
- Translates instructions written for other CPUs into molecules
- Instructions are not written directly for the Crusoe CPU
18 EPIC Architecture
- Intel Itanium CPU
- 128-bit instruction bundle
- 3 41-bit instructions
- 5 bits to identify the types of instructions in the bundle
- 128 64-bit general-purpose registers
- 128 82-bit floating-point registers
- Intel x86 instruction set included
- Programmers and compilers follow guidelines to ensure parallel execution of instructions
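The bundle arithmetic on this slide is easy to verify: three 41-bit instruction slots plus a 5-bit template field fill the 128-bit bundle exactly.

```python
# Quick check of the Itanium bundle layout described above:
# three 41-bit instruction slots plus a 5-bit type/template field.
slots, slot_bits, template_bits = 3, 41, 5
bundle_bits = slots * slot_bits + template_bits
print(bundle_bits)   # -> 128, the full bundle width in bits
```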
19 8.2 Paging
- A method by which the computer is able to conceptually separate the addresses used in a program from the addresses that actually identify physical locations in memory.
- Program addresses are referred to as logical addresses. The actual memory addresses are called physical addresses.
20 Paging
- Paging creates a correspondence between the logical and physical addresses so that each logical address is automatically and invisibly transformed into a physical address by the computer system during program execution. This transformation is known as mapping.
- The memory management unit (MMU) sits between the CPU and the memory and provides the mapping capability.
21 Paging
- Managed by the operating system
- Built into the hardware
- Independent of application
22 Logical vs. Physical Addresses
- Logical addresses are relative locations of data, instructions, and branch targets, and are separate from physical addresses
- Logical addresses are mapped to physical addresses
- Physical addresses do not need to be consecutive
23 Logical vs. Physical Addresses
24 Pages and Frames
- Paging divides both logical and physical memory into equally sized blocks.
- Each logical block is called a page.
- Each physical block is called a frame.
25 Page Address Layout
26 Page Translation Process
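A minimal sketch of the translation process named above, assuming 4 KB pages and a made-up page table (the sizes and frame numbers here are illustrative, not from the slides):

```python
# Hedged sketch of page translation: split the logical address into a
# page number and an offset, look up the frame, and reassemble.
PAGE_SIZE = 4096                      # assumed bytes per page (and per frame)

page_table = {0: 7, 1: 3, 2: 9}      # page number -> frame number (made up)

def translate(logical_address):
    """Map a logical address to a physical address via the page table."""
    page   = logical_address // PAGE_SIZE   # which page the address is in
    offset = logical_address %  PAGE_SIZE   # position within that page
    frame  = page_table[page]               # the MMU's lookup step
    return frame * PAGE_SIZE + offset       # offset is unchanged by paging

print(translate(5000))   # page 1, offset 904 -> frame 3 -> 3*4096 + 904 = 13192
```

The key property is that only the page number changes during mapping; the offset within the page carries over untouched.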
27 Memory Enhancements
- Memory is slow compared to CPU processing speeds!
- 2 GHz CPU: 1 cycle in half a billionth of a second
- 70 ns DRAM: 1 access in 70 billionths of a second
- Methods to improve memory access
- Wide path memory access
- Retrieve multiple bytes instead of 1 byte at a time
- Memory interleaving
- Partition memory into subsections, each with its own address register and data register
- Cache memory
28 8.3 Memory Enhancement
- Within the fetch-execute cycle, the slowest steps are those that access memory
- Memory is usually made up of DRAM
- Inexpensive
- Each chip can store millions of bits of data
- Drawback: access time is too slow to keep up with a modern CPU
- Delays must be added into the LOAD/STORE execution pipeline to allow memory to keep up
- Can create a potential bottleneck
29 Alternative: SRAM
- PRO
- Two to three times faster than DRAM
- CON
- Requires a lot of chip real estate because the circuitry is more complex, and it generates a lot of heat
- More expensive
- One or two MB of SRAM requires more space than 64 MB of DRAM
30 Other Alternatives
- Wide path memory access
- Memory Interleaving
- Cache memory
- Expanded memory
31 Wide Path Memory Access
- Widen the data path so it is possible to read or write several bytes or words (between the CPU and memory) in a single access
- Can be accomplished by
- Widening the bus data path
- Using a larger memory data register
32 Memory Interleaving
- Divide memory into parts
- Makes it possible to access more than one location at a time
- Each part has
- Its own address register
- Its own data register
- The ability to be accessed independently
33 n-Way Interleaving
- Dividing memory so that successive access points are in different blocks.
- Example: 2-way interleaving would allow you to access an odd memory address and an even one at the same time. If an 8-byte path width is provided, this would allow access to 16 successive bytes at a time.
34 Example: 4-Way Interleaving
- 4-way interleaving would allow you to access 4 different locations simultaneously.
- Blocks 0, 1, 2, and 3 could be accessed
- Blocks 0 and 2 could be accessed
- Blocks 1 and 5 could not: they occupy the same position in their groups (0,1,2,3) and (4,5,6,7)
- Blocks 1, 6, 88, and 123 could be accessed (Why?)
- Blocks 2, 6, and 15 could not (Why?)
35 Memory Interleaving
36 Cache Memory
- A small amount of high-speed memory between the CPU and main storage
- Invisible to the programmer: it cannot be addressed in the usual way
- Organized in blocks
- Blocks are used to hold an exact reproduction of a corresponding amount of storage from somewhere in memory
37 Cache Memory
- Each block holds a tag
- The tag identifies the location in main memory
- A hardware cache controller checks the tags to determine whether the requested memory location is presently stored within the cache. If it is, the cache is used as if it were main memory.
38 Cache
- If a write instruction occurs, data is stored in the appropriate cache memory location.
- A request (load or store) that is satisfied this way is called a hit.
- The ratio of hits to the total number of requests is called the hit ratio.
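The hit ratio feeds directly into a standard weighted-average estimate of memory performance. A sketch, using assumed access times (5 ns cache, 70 ns DRAM) and a simple model that ignores line-fill overhead:

```python
# Hedged sketch: effective access time as a weighted average of cache
# and memory times. The times below are assumed, illustrative values.
def effective_access_time(hit_ratio, t_cache, t_memory):
    """Average access time: hits go to the cache, misses to main memory."""
    return hit_ratio * t_cache + (1 - hit_ratio) * t_memory

# Assumed example: 90% hit ratio, 5 ns cache, 70 ns DRAM.
print(round(effective_access_time(0.90, 5, 70), 1))   # -> 11.5 (ns)
```

Even a modest cache turns 70 ns memory into roughly 11.5 ns average access under these assumptions, which is where the performance gains cited later come from.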
39 Cache
- If cache memory is full, some blocks are selected for replacement
- Various algorithms for replacement
- LRU: least recently used
- Considerations: read-only vs. updating
- Write-through: writes data back to main memory immediately upon a change in the cache
- Write-back: writes data back to main memory only when a cache line is replaced (faster)
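The write-through vs. write-back distinction can be sketched with a toy dict-based cache (everything here is illustrative, not a real cache API):

```python
# Toy model: write-through updates memory on every store; write-back
# defers the memory update until the line is evicted.
memory = {0: 10, 1: 20}
cache = dict(memory)     # assume both lines are already cached

def store(addr, value, write_through=True):
    cache[addr] = value
    if write_through:
        memory[addr] = value          # write-through: memory updated now

def evict(addr):
    memory[addr] = cache.pop(addr)    # write-back: memory updated on eviction

store(0, 99, write_through=True)      # memory[0] becomes 99 immediately
store(1, 77, write_through=False)     # memory[1] is still 20 (stale)
evict(1)                              # now memory[1] becomes 77
print(memory)   # -> {0: 99, 1: 77}
```

Write-back is faster because repeated stores to the same line touch main memory only once, but between the store and the eviction, memory holds stale data.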
40 Why Cache?
- Even the fastest hard disk has an access time of about 10 milliseconds
- A 2 GHz CPU waiting 10 milliseconds wastes 20 million clock cycles!
41 Cache Memory
- Blocks: 8 or 16 bytes
- Tags: location in main memory
- Cache controller
- hardware that checks tags
- Cache line
- Unit of transfer between storage and cache memory
- Hit ratio: ratio of hits out of total requests
- Synchronizing cache and memory
- Write-through
- Write-back
42 Step-by-Step Use of Cache
43 Step-by-Step Use of Cache
44 Why Cache Works
- Based on the principle of locality of reference
- At any given time, most memory references will be confined to one or a few small regions of memory.
45 Performance Advantages
- Hit ratios of 90% are common
- 50% improved execution speed
- Locality of reference is why caching works
- Most memory references are confined to a small region of memory at any given time
- A well-written program stays in a small loop, procedure, or function
- Data is likely in an array
- Variables are stored together
46 Two-Level Caches
- Why do the sizes of the caches have to be different?
47 Cache vs. Virtual Memory
- Cache speeds up memory access
- Virtual memory increases the amount of perceived storage
- independence from the configuration and capacity of the memory system
- low cost per bit
48 Expanded Memory
- Memory in excess of the base 640K (8 MB under LIM 4.0)
- compatible with the PC/XT base
- required add-on hardware and software (on the 286, just software for LIM 3.2)
- uses a technique called bank switching and utilizes a 64K block of "upper" memory
- software has to be written to take advantage of the expanded memory spec
- on 286, 386 and up machines, extended memory can be mapped as expanded memory
- Several software packages made use of expanded memory: Lotus 1-2-3, WordPerfect, DOS games, Windows 3.x
- (From Dave Rathke's ITK 355 website)
49 8.4 Modern CPU Processing Methods
- Timing Issues
- Separate Fetch/Execute Units
- Pipelining
- Scalar Processing
- Superscalar Processing
50 Timing Issues
- A computer clock is used for timing purposes
- MHz: million steps per second
- GHz: billion steps per second
- Instructions can (and often do) take more than one step
- Data word width can require multiple steps
51 Clock
- Provides master control as to when each step in the instruction cycle takes place.
- The pulses of the clock are separated sufficiently to assure that each step has time to complete, with the data settled down, before the results of that step are required by the next step.
52 Example: Original IBM PC
- The clock ran at 4.77 MHz
- The machine performed 4.77 million steps every second
- If a typical IBM PC instruction requires ten steps, then 4.77/10 million (roughly 0.5 million) instructions could be executed per second.
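The slide's arithmetic as a one-liner, with the ten-step instruction length taken as the slide's assumption:

```python
# Instruction rate = clock rate / steps per instruction.
clock_hz = 4.77e6           # original IBM PC clock, 4.77 MHz
steps_per_instruction = 10  # assumed typical instruction length in steps

instructions_per_second = clock_hz / steps_per_instruction
print(instructions_per_second)   # -> 477000.0, roughly half a million
```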
53 Separate Fetch-Execute Units
- Fetch unit
- Instruction fetch unit
- Instruction decode unit
- Determines the opcode
- Identifies the type of instruction and the operands
- Several instructions are fetched in parallel and held in a buffer until decoded and executed
- IP: Instruction Pointer register
- Execute unit
- Receives instructions from the decode unit
- The appropriate execution unit services the instruction
54 Alternative CPU Organization
55 Instruction Pipelining
- An assembly-line technique that overlaps the fetch-execute cycles of sequences of instructions
- Only one instruction is being executed to completion at a time
- Scalar processing
- Average instruction execution rate is approximately equal to the clock speed of the CPU
- Problems from stalling
- Instructions have different numbers of steps
- Problems from branching
56 Branch Problem Solutions
- Separate pipelines for both possibilities
- Requiring that the following instruction not be dependent on the branch
- Instruction reordering (superscalar processing)
57 Pipelining Example
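The benefit of the overlap described on the previous slide can be quantified under ideal assumptions (no stalls or branches): a k-stage pipeline completes n instructions in k + (n - 1) cycles rather than n * k.

```python
# Ideal pipeline model: no stalls, no branches (both assumed away here).
def unpipelined_cycles(n, k):
    return n * k                   # each instruction runs start to finish

def pipelined_cycles(n, k):
    return k + (n - 1)             # fill the pipe once, then 1 per cycle

n, k = 100, 5                      # assumed: 100 instructions, 5 stages
print(unpipelined_cycles(n, k))    # -> 500
print(pipelined_cycles(n, k))      # -> 104
```

For large n the pipelined rate approaches one instruction per clock cycle, which is exactly the scalar-processing figure quoted on the pipelining slide.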
58 Scalar Processing
- A processor that processes approximately one instruction per clock cycle.
59 Superscalar Processing
- Processes more than one instruction per clock cycle
- Separates fetch and execute cycles as much as possible
- Buffers for fetch and decode phases
- Parallel execution units
60 Scalar vs. Superscalar Processing
61 Superscalar Issues
- Out-of-order processing dependencies (hazards)
- Data dependencies
- Branch (flow) dependencies and speculative execution
- Parallel speculative execution or branch prediction
- Branch history table
- Register access conflicts
- Logical registers
62 8.5 Hardware Implementation
- Hardware operations are implemented by logic gates
- Advantages
- Speed
- RISC designs are simple and typically implemented in hardware
63 Microprogrammed Implementation
- Microcode: tiny programs stored in ROM that implement the CPU's instructions
- Advantages
- More flexible
- Easier to implement complex instructions
- Can emulate other CPUs
- Disadvantage
- Requires more clock cycles