Chapter 8 CPU and Memory: Design, Implementation, and Enhancement - PowerPoint PPT Presentation

1
Chapter 8: CPU and Memory: Design, Implementation, and Enhancement
  • The Architecture of Computer Hardware and Systems
    Software: An Information Technology Approach
  • 3rd Edition, Irv Englander
  • John Wiley and Sons ©2003
  • Wilson Wong, Bentley College
  • Linda Senne, Bentley College

2
CPU Architecture Overview
  • CISC: Complex Instruction Set Computer
  • RISC: Reduced Instruction Set Computer
  • CISC vs. RISC comparisons
  • VLIW: Very Long Instruction Word
  • EPIC: Explicitly Parallel Instruction Computer

3
CISC Architecture
  • Examples
  • Intel x86, IBM Z-Series Mainframes, older CPU
    architectures
  • Characteristics
  • Few general purpose registers
  • Many addressing modes
  • Large number of specialized, complex instructions
  • Instructions are of varying sizes

4
Limitations of CISC Architecture
  • Complex instructions are infrequently used by
    programmers and compilers
  • Memory references, loads and stores, are slow and
    account for a significant fraction of all
    instructions
  • Procedure and function calls are a major
    bottleneck
  • Passing arguments
  • Storing and retrieving values in registers

5
RISC Features
  • Examples
  • PowerPC, Sun SPARC, Motorola 68000
  • Limited and simple instruction set
  • Fixed length, fixed format instruction words
  • Enable pipelining, parallel fetches and
    executions
  • Limited addressing modes
  • Reduce complicated hardware
  • Register-oriented instruction set
  • Reduce memory accesses
  • Large bank of registers
  • Reduce memory accesses
  • Efficient procedure calls

6
RISC attempts to produce more CPU power
  • By reducing the number of data memory accesses
    through more effective use of registers.
  • By simplifying the instruction set, based on the
    assumption that rarely used instructions add
    hardware complexity and slow down execution.

7
CISC vs. RISC Processing
8
Studies
  • A study by Hopkins in 1987 showed that 10
    instructions accounted for 71% of all instructions
    executed on the IBM System/370. The study showed
    that optimizing the performance of LOAD, STORE,
    and BRANCH instructions could result in a
    substantial increase in CPU performance.

9
Studies
  • Several studies observed that both programmers
    and compilers avoided complex instructions even
    when they were available. In one study, 85% of
    the statements in 5 different high-level languages
    consisted of assignment statements, IF statements,
    and procedure calls.

10
Studies
  • Procedure and function calls create huge
    bottlenecks because of the need to pass
    parameters and arguments from one procedure to
    the next.

11
Use of circular register buffer
  • Provides general-purpose registers for program
    use and also offers a solution to the problem of
    copying blocks of values from one location to
    another during procedure transfers and context
    switching.

12
Circular Register Buffer
  • A typical circular register buffer has 168
    registers, grouped in blocks of 8; one block of 8
    holds global variables.
  • To a program, the machine appears to have 32
    registers (8 global plus 24 for use). At any given
    instant, those 24 registers form a window into the
    bank; a current window pointer indicates the
    starting point of the window.

13
Circular Register Buffer
  • The window is divided into three parts: the first
    eight registers store incoming parameters, the
    middle eight hold local variables and temporary
    storage, and the final eight store outgoing
    parameters (to pass to the next procedure).
  • When another procedure call occurs, the current
    window pointer shifts by 16 registers. The new
    procedure's window overlaps the previous one, so
    the output registers of the previous procedure are
    now seen as the input parameters of the current
    procedure.
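The window shift described above can be sketched in code. This is a minimal simulation, not from the text: the class, method names, and example values are illustrative, while the 168-register buffer and 16-register shift come from the slides.

```python
TOTAL = 168   # physical registers in the circular buffer (from the slide)
SHIFT = 16    # window pointer advance per procedure call

class RegisterWindows:
    def __init__(self):
        self.regs = [0] * TOTAL
        self.cwp = 0                      # current window pointer

    def phys(self, n):
        """Map window-relative register n (0..23) to a physical index."""
        return (self.cwp + n) % TOTAL

    def call(self):
        # The caller's output registers (16..23) overlap the callee's
        # input registers (0..7) because the pointer moves by only 16.
        self.cwp = (self.cwp + SHIFT) % TOTAL

    def ret(self):
        self.cwp = (self.cwp - SHIFT) % TOTAL

rw = RegisterWindows()
rw.regs[rw.phys(16)] = 99    # caller stores an outgoing parameter
rw.call()
print(rw.regs[rw.phys(0)])   # callee sees it as incoming parameter 0: 99
```

No values are copied on the call: only the pointer moves, which is the point of the overlapping-window design.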

14
Circular Register Buffer
15
Circular Register Buffer- After Procedure Call
16
CISC vs. RISC Performance Comparison
  • RISC → simpler instructions
  • → more instructions
  • → more memory accesses
  • RISC → more bus traffic and increased cache
    memory misses
  • More registers would improve CISC performance but
    no space available for them
  • Modern CISC and RISC architectures are becoming
    similar

17
VLIW Architecture
  • Transmeta Crusoe CPU
  • 128-bit instruction bundle, called a molecule
  • Four 32-bit instructions, called atoms
  • Parallel processing of 4 instructions
  • 64 general purpose registers
  • Code morphing layer
  • Translates instructions written for other CPUs
    into molecules
  • Instructions are not written directly for the
    Crusoe CPU

18
EPIC Architecture
  • Intel Itanium CPU
  • 128-bit instruction bundle
  • Three 41-bit instructions
  • 5 bits to identify the types of instructions in
    the bundle
  • 128 64-bit general purpose registers
  • 128 82-bit floating point registers
  • Intel X86 instruction set included
  • Programmers and compilers follow guidelines to
    ensure parallel execution of instructions

19
8.2 Paging
  • A method by which the computer is able to
    conceptually separate the addresses used in a
    program from the addresses that actually identify
    physical locations in memory.
  • Program addresses are referred to as logical
    addresses. The actual memory addresses are called
    physical addresses.

20
Paging
  • Paging creates a correspondence between the
    logical and physical addresses so that each
    logical address is automatically and invisibly
    transformed into a physical address by the
    computer system during program execution. This
    transformation is known as mapping.
  • The memory management unit (MMU) sits in between
    the CPU and the memory and provides the mapping
    capability.

21
Paging
  • Managed by the operating system
  • Built into the hardware
  • Independent of application

22
Logical vs. Physical Addresses
  • Logical addresses are relative locations of data,
    instructions, and branch targets, and are separate
    from physical addresses
  • Logical addresses mapped to physical addresses
  • Physical addresses do not need to be consecutive

23
Logical vs. Physical Address
24
Page and Frames
  • Paging divides both logical and physical memory
    into equally sized blocks.
  • Each logical block is called a page.
  • Each physical block is called a frame.
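The page-to-frame mapping can be illustrated with a short sketch. The page table contents and the 4 KB page size below are assumptions for illustration; the split of a logical address into page number and offset follows the description above.

```python
PAGE_SIZE = 4096   # assumed page/frame size in bytes

# Hypothetical page table: logical page number -> physical frame number.
# Note that the frames need not be consecutive.
page_table = {0: 7, 1: 2, 2: 9}

def translate(logical_addr):
    """Map a logical address to a physical address, as the MMU does."""
    page, offset = divmod(logical_addr, PAGE_SIZE)
    frame = page_table[page]          # the mapping step, done in hardware
    return frame * PAGE_SIZE + offset

print(translate(4100))   # page 1, offset 4 -> frame 2 -> 8196
```

The offset passes through unchanged; only the page number is remapped, which is why the transformation is invisible to the running program.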

25
Page Address Layout
26
Page Translation Process
27
Memory Enhancements
  • Memory is slow compared to CPU processing speeds!
  • 2 GHz CPU: 1 cycle in half a billionth of a
    second
  • 70 ns DRAM: 1 access in 70 billionths of a second
  • Methods to improve memory access:
  • Wide Path Memory Access
  • Retrieve multiple bytes instead of 1 byte at a
    time
  • Memory Interleaving
  • Partition memory into subsections, each with its
    own address register and data register
  • Cache Memory

28
8.3 Memory Enhancement
  • Within the fetch-execute cycle, the slowest steps
    are those that access memory
  • Memory usually made up of DRAM
  • Inexpensive
  • Each chip can store millions of bits of data
  • Drawback: access time is too slow to keep up with
    modern CPUs
  • Delays must be added into the LOAD/STORE execution
    pipeline to allow memory to keep up.
  • Can create potential bottleneck

29
Alternatives SRAM
  • PRO
  • Two to three times faster than DRAM
  • CON
  • Requires a lot of chip real estate because the
    circuitry is more complex, and it generates a lot
    of heat
  • More expensive
  • One or two MB of SRAM requires more space than 64
    MB of DRAM

30
Other alternatives
  • Wide path memory access
  • Memory Interleaving
  • Cache memory
  • Expanded memory

31
Wide Path Memory Access
  • Widen the data path so it is possible to read or
    write several bytes or words (between the CPU and
    memory) in a single access
  • Can be accomplished by
  • Widening the bus data path
  • Using larger memory data register

32
Memory Interleaving
  • Divide memory into parts
  • Makes it possible to access more than one
    location at a time
  • Each part has
  • Its own address register
  • Its own data register
  • The ability to be accessed independently

33
n-way interleaving
  • Dividing memory so that successive access points
    are in different blocks.
  • Example: 2-way interleaving allows access to an
    odd memory address and an even one simultaneously.
    With an 8-byte path width, this allows access to
    16 successive bytes at a time.

34
Example 4-way interleaving
  • 4-way interleaving would allow you to access 4
    different locations simultaneously.
  • Blocks 0, 1, 2, and 3 could be accessed
  • Blocks 0 and 2 could be accessed
  • Blocks 1 and 5 could not (the banks repeat as
    (0,1,2,3), (4,5,6,7), ..., so 1 and 5 fall in the
    same bank)
  • Blocks 1, 6, 88, and 123 could be accessed (their
    numbers are all different modulo 4, so each is in
    a different bank)
  • Blocks 2, 6, and 15 could not (2 and 6 are equal
    modulo 4, so they share a bank)
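The rule behind these examples fits in a one-line check: blocks are simultaneously accessible only if they fall in different banks, i.e. their numbers are distinct modulo n. A minimal sketch (the function name is illustrative):

```python
def can_access_together(blocks, n=4):
    """True if all blocks map to different banks under n-way interleaving."""
    banks = [b % n for b in blocks]       # bank = block number mod n
    return len(set(banks)) == len(banks)  # no two blocks share a bank

print(can_access_together([0, 1, 2, 3]))      # True
print(can_access_together([1, 6, 88, 123]))   # True: banks 1, 2, 0, 3
print(can_access_together([2, 6, 15]))        # False: 2 and 6 share bank 2
```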

35
Memory Interleaving
36
Cache Memory
  • A small amount of high-speed memory between the
    CPU and main storage
  • Invisible to the programmer; cannot be addressed
    in the usual way
  • Organized in blocks
  • Blocks are used to hold an exact reproduction of
    a corresponding amount of storage from somewhere
    in memory

37
Cache Memory
  • Each block holds a tag
  • The tag identifies the location in main memory
  • A hardware cache controller checks the tags to
    determine whether the requested memory location is
    currently stored in the cache. If it is, the cache
    is used as if it were main memory.

38
Cache
  • If a write instruction occurs, the data is stored
    in the appropriate cache memory location.
  • A request (load or store) that is satisfied this
    way is called a hit.
  • The ratio of hits to the total number of requests
    is called the hit ratio.

39
Cache
  • If cache memory is full, some blocks must be
    selected for replacement
  • Various replacement algorithms, e.g. LRU: least
    recently used
  • Considerations: read-only vs. updating
  • Write-through: writes data back to main memory
    immediately upon a change in the cache
  • Write-back: writes data back to main memory only
    when a cache line is replaced (faster)
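The tag check, hit counting, LRU replacement, and write-through policy from the last few slides can be combined into one toy model. This is an illustrative sketch only (a fully associative cache; real caches are usually set-associative, and the class name and sizes are assumptions):

```python
from collections import OrderedDict

class Cache:
    """Toy fully associative cache with LRU replacement and write-through."""

    def __init__(self, nblocks, memory):
        self.nblocks = nblocks
        self.memory = memory           # backing main memory (a list)
        self.blocks = OrderedDict()    # tag (address) -> value, in LRU order
        self.hits = 0
        self.requests = 0

    def read(self, addr):
        self.requests += 1
        if addr in self.blocks:                  # tag match: a hit
            self.hits += 1
            self.blocks.move_to_end(addr)        # mark most recently used
            return self.blocks[addr]
        if len(self.blocks) == self.nblocks:
            self.blocks.popitem(last=False)      # evict least recently used
        self.blocks[addr] = self.memory[addr]    # load the block on a miss
        return self.blocks[addr]

    def write(self, addr, value):
        self.requests += 1
        if addr in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(addr)
        elif len(self.blocks) == self.nblocks:
            self.blocks.popitem(last=False)
        self.blocks[addr] = value
        self.memory[addr] = value                # write-through: update now

    def hit_ratio(self):
        return self.hits / self.requests

mem = list(range(100))
c = Cache(2, mem)
for a in (5, 5, 6, 7, 5):        # only the second read of 5 is a hit
    c.read(a)
print(c.hit_ratio())             # 1 hit out of 5 requests -> 0.2
```

Switching the last line of `write` to update memory only on eviction would turn this into the write-back policy.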

40
Why Cache?
  • Even the fastest hard disk has an access time of
    about 10 milliseconds
  • A 2 GHz CPU waiting 10 milliseconds wastes 20
    million clock cycles!


41
Cache Memory
  • Blocks: 8 or 16 bytes
  • Tags: location in main memory
  • Cache controller: hardware that checks tags
  • Cache line: unit of transfer between storage and
    cache memory
  • Hit ratio: ratio of hits to total requests
  • Synchronizing cache and memory:
  • Write-through
  • Write-back

42
Step-by-Step Use of Cache
43
Step-by-Step Use of Cache
44
Why cache works
  • Based on the principle of locality of reference
  • At any given time, most memory references will be
    confined to one or a few small regions of memory.

45
Performance Advantages
  • Hit ratios of 90% are common
  • 50% improved execution speed
  • Locality of reference is why caching works
  • Most memory references confined to small region
    of memory at any given time
  • Well-written programs spend most of their time in
    a small loop, procedure, or function
  • Data is likely to be in arrays
  • Variables are stored together
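The performance claim above can be made concrete with the standard effective-access-time formula: hit_ratio * cache_time + (1 - hit_ratio) * memory_time. The 10 ns cache and 70 ns DRAM timings below are assumed for illustration, not taken from the slides.

```python
def effective_access_ns(hit_ratio, t_cache_ns, t_mem_ns):
    """Average memory access time: hits go to cache, misses to memory."""
    return hit_ratio * t_cache_ns + (1 - hit_ratio) * t_mem_ns

# With a 90% hit ratio, average access drops from 70 ns to about 16 ns.
print(round(effective_access_ns(0.9, 10, 70), 3))
```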

46
Two-level Caches
  • Why do the sizes of the caches have to be
    different?

47
Cache vs. Virtual Memory
  • Cache speeds up memory access
  • Virtual memory increases amount of perceived
    storage
  • independence from the configuration and capacity
    of the memory system
  • low cost per bit

48
Expanded Memory
  • Memory in excess of the base 640K (up to 8 MB
    under LIM EMS 4.0)
  • Compatible with the PC/XT base
  • Required add-on hardware and software (on a 286,
    version 3.2 could be handled with software alone)
  • Uses a technique called bank switching, which maps
    pages through a 64K block of "upper" memory
  • Software has to be written to take advantage of
    the expanded memory spec
  • On 286, 386, and later machines, extended memory
    can be mapped as expanded memory
  • Several software packages made use of expanded
    memory: Lotus 1-2-3, WordPerfect, DOS games,
    Windows 3.x
  • (From Dave Rathke's ITK 355 website)

49
8.4 - Modern CPU Processing Methods
  • Timing Issues
  • Separate Fetch/Execute Units
  • Pipelining
  • Scalar Processing
  • Superscalar Processing

50
Timing Issues
  • Computer clock used for timing purposes
  • MHz: million steps per second
  • GHz: billion steps per second
  • Instructions can (and often do) take more than one
    step
  • Data word width can require multiple steps

51
Clock
  • Provides a master control as to when each step in
    the instruction cycle takes place.
  • The pulses of the clock are separated
    sufficiently to assure that each step has time to
    complete, with the data settled down, before the
    results of that step are required by the next
    step.

52
Example Original IBM PC
  • The clock ran at 4.77 MHz
  • Machine performed 4.77 million steps every second
  • If a typical IBM PC instruction requires ten
    steps, then 4.77/10 million (about 0.5 million)
    instructions could be executed per second.

53
Separate Fetch-Execute Units
  • Fetch Unit
  • Instruction fetch unit
  • Instruction decode unit
  • Determine opcode
  • Identify type of instruction and operands
  • Several instructions are fetched in parallel and
    held in a buffer until decoded and executed
  • IP: Instruction Pointer register
  • Execute Unit
  • Receives instructions from the decode unit
  • Appropriate execution unit services the
    instruction

54
Alternative CPU Organization
55
Instruction Pipelining
  • Assembly-line technique to allow overlapping
    between fetch-execute cycles of sequences of
    instructions
  • Only one instruction is being executed to
    completion at a time
  • Scalar processing
  • Average instruction execution rate is
    approximately equal to the clock speed of the CPU
  • Problems from stalling
  • Instructions have different numbers of steps
  • Problems from branching
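The reason pipelining approaches one instruction per clock can be shown with a back-of-the-envelope model: with s stages and no stalls or branches, n instructions finish in s + (n - 1) cycles instead of s * n. A sketch (the stage count is illustrative):

```python
def cycles_unpipelined(n, stages):
    """Each instruction runs all its steps before the next one starts."""
    return n * stages

def cycles_pipelined(n, stages):
    """Fill the pipe once, then one instruction completes per cycle."""
    return stages + (n - 1)

n, stages = 1000, 5
print(cycles_unpipelined(n, stages))   # 5000
print(cycles_pipelined(n, stages))     # 1004: roughly 1 instruction/cycle
```

Stalls and mispredicted branches add cycles on top of this ideal figure, which is what the bullets above refer to.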

56
Branch Problem Solutions
  • Separate pipelines for both possibilities
  • Requiring the following instruction to not be
    dependent on the branch
  • Instruction Reordering (superscalar processing)

57
Pipelining Example
58
Scalar Processing
  • A processor that processes approximately one
    instruction per clock cycle.

59
Superscalar Processing
  • Process more than one instruction per clock cycle
  • Separate fetch and execute cycles as much as
    possible
  • Buffers for fetch and decode phases
  • Parallel execution units

60
Scalar vs. Superscalar Processing
61
Superscalar Issues
  • Out-of-order processing: dependencies (hazards)
  • Data dependencies
  • Branch (flow) dependencies and speculative
    execution
  • Parallel speculative execution or branch
    prediction
  • Branch History Table
  • Register access conflicts
  • Logical registers
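The branch history table mentioned above can be sketched with the classic 2-bit saturating-counter scheme. This is a common design, but the text does not specify one, so the table size, class name, and starting state below are assumptions.

```python
class BranchHistoryTable:
    def __init__(self, size=16):
        self.size = size
        self.counters = [1] * size     # 0-1 predict not taken, 2-3 taken

    def _index(self, pc):
        return pc % self.size          # hash the branch address into the table

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Nudge the counter toward the actual outcome, saturating at 0/3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bht = BranchHistoryTable()
pc = 0x40
for _ in range(3):             # a loop branch, taken repeatedly
    bht.update(pc, True)
print(bht.predict(pc))         # True: predicts taken
```

The two bits give hysteresis: a single not-taken outcome (e.g. a loop exit) does not immediately flip the prediction.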

62
8.5 - Hardware Implementation
  • Hardware operations are implemented by logic
    gates
  • Advantages
  • Speed
  • RISC designs are simple and typically implemented
    in hardware

63
Microprogrammed Implementation
  • Microcode consists of tiny programs stored in ROM
    that implement CPU instructions
  • Advantages
  • More flexible
  • Easier to implement complex instructions
  • Can emulate other CPUs
  • Disadvantage
  • Requires more clock cycles