CS 162 Computer Architecture Lecture 2: Introduction


1
CS 162 Computer Architecture Lecture 2
Introduction Pipelining
  • Instructor L.N. Bhuyan
  • www.cs.ucr.edu/bhuyan/cs162

2
Review of Last Class
  • MIPS Datapath
  • Introduction to Pipelining
  • Introduction to Instruction Level Parallelism
    (ILP)
  • Introduction to VLIW

3
What is Multiprocessing?
  • Parallelism at the instruction level is limited
    because of data dependencies => speedup is
    limited!!
  • Program-level parallelism is abundantly
    available, e.g., loop-level parallelism such as
    Do I = 1, 1000. How about employing multiple
    processors to execute the loops? => Parallel
    processing or Multiprocessing
  • With a billion transistors on a chip, we can put
    a few CPUs in one chip => Chip multiprocessor
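The loop-level parallelism idea above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the 1000 independent iterations of "Do I = 1, 1000" are divided among worker processes, the way a chip multiprocessor would spread them over cores. The loop body shown is a hypothetical placeholder.

```python
# Loop-level parallelism sketch: iterations are independent, so they
# can be split across multiple worker processes (one per CPU core).
from concurrent.futures import ProcessPoolExecutor

def body(i):
    # Hypothetical loop body: any computation with no dependency
    # between iterations.
    return i * i

def parallel_loop(n, workers=4):
    # Divide the n iterations among the workers and merge the results.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(body, range(1, n + 1)))

if __name__ == "__main__":
    # Same result as the sequential loop, computed in parallel.
    print(parallel_loop(1000))
```

Because no iteration reads another iteration's result, the speedup is limited only by the number of processors and the cost of distributing the work, unlike ILP, which is capped by data dependencies.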

4
Memory Latency Problem
  • Even if we increase CPU power, memory is the real
    bottleneck. Techniques to alleviate the memory
    latency problem:
  • Memory hierarchy: Program locality, cache
    memory, multilevel caches, pages and context
    switching
  • Prefetching: Get the instruction/data before the
    CPU needs it. Good for instns because of
    sequential locality, so all modern processors use
    prefetch buffers for instns. What to do with
    data?
  • Multithreading: Can the CPU jump to another
    program when accessing memory? It's like
    multiprogramming!!

5
Hardware Multithreading
  • We need to develop a hardware multithreading
    technique because switching between threads in
    software is very time-consuming (Why?), so it is
    not suitable for main memory (as opposed to I/O)
    access. Ex: Multitasking
  • Develop multiple PCs and register sets on the CPU
    so that thread switching can occur without having
    to store the register contents in main memory
    (on the stack, as is done for context switching).
  • Several threads reside in the CPU simultaneously,
    and execution switches between the threads on
    main memory access.
  • How about both multiprocessing and multithreading
    on a chip? => Network Processor
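The switch-on-memory-access behavior described above can be made concrete with a toy simulation. This is a sketch of the general idea only, not Intel's design; the 3-cycle memory latency and the two instruction kinds ("alu", "mem") are assumptions for illustration.

```python
# Toy hardware-multithreading sketch: each thread keeps its own PC
# (and, in real hardware, register set), and the single-issue CPU
# switches to the next ready thread whenever the current one issues
# a long-latency memory access, hiding that latency.
MEM_LATENCY = 3  # hypothetical cycles for a main memory access

def run(threads):
    # threads: list of instruction lists; "mem" stalls the issuing
    # thread for MEM_LATENCY cycles, "alu" completes in one cycle.
    pcs = [0] * len(threads)        # one PC per thread
    ready_at = [0] * len(threads)   # cycle when each thread is ready
    cycle, done, trace = 0, 0, []
    while done < len(threads):
        for t, prog in enumerate(threads):
            if pcs[t] < len(prog) and ready_at[t] <= cycle:
                op = prog[pcs[t]]
                trace.append((cycle, t, op))
                pcs[t] += 1
                if op == "mem":
                    # Switch away: this thread is busy waiting on memory.
                    ready_at[t] = cycle + MEM_LATENCY
                if pcs[t] == len(prog):
                    done += 1
                break  # single issue: one instruction per cycle
        cycle += 1
    return trace

# Thread 0 stalls on memory at cycle 0; thread 1 fills the gap.
print(run([["mem", "alu"], ["alu", "alu"]]))
```

In the printed trace, cycles 1 and 2 execute thread 1 while thread 0 waits on memory, so no cycle is idle; a single-threaded CPU would have stalled instead.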

6
Architectural Comparisons (cont.)
  [Figure: issue-slot diagrams over time (processor cycles) comparing
   Superscalar, Fine-Grained multithreading, Coarse-Grained
   multithreading, Multiprocessing, and Simultaneous Multithreading;
   shading distinguishes Threads 1-5 and idle slots]
7
Intel IXP1200 Network Processor
  • Initial component of the Intel Exchange
    Architecture - IXA
  • Each microengine is a 5-stage pipeline (no ILP),
    4-way multithreaded
  • 7-core multiprocessing: 6 microengines and a
    StrongARM core
  • 166 MHz fundamental clock rate
  • Intel claims 2.5 Mpps IP routing for 64-byte
    packets
  • Already the most widely used NPU
  • Or more accurately the most widely admitted use

8
IXP1200 Chip Layout
  • StrongARM processing core
  • Microengines introduce new ISA
  • I/O
  • PCI
  • SDRAM
  • SRAM
  • IX PCI-like packet bus
  • On-chip FIFOs
  • 16 entries, 64B each

9
IXP1200 Microengine
  • 4 hardware contexts
  • Single issue processor
  • Explicit optional context switch on SRAM access
  • Registers
  • All are single ported
  • Separate GPR
  • 1536 registers total
  • 32-bit ALU
  • Can access GPR or XFER registers
  • Standard 5 stage pipe
  • 4KB SRAM instruction store (not a cache!)

10
Intel IXP2400 Microengine (New)
  • XScale core replaces StrongARM
  • 1.4 GHz target in 0.13-micron
  • Nearest neighbor routes added between
    microengines
  • Hardware to accelerate CRC operations and random
    number generation
  • 16 entry CAM

11
  • MIPS Pipeline
  • Chapter 6 CS 161 Text

12
Review Single-cycle Datapath for MIPS
  [Figure: single-cycle datapath with Instruction Memory (Imem) and
   Data Memory (Dmem), stages marked through Stage 5]
  • Use datapath figure to represent pipeline

13
Stages of Execution in Pipelined MIPS
  • 5-stage instruction pipeline
  • 1) I-fetch: Fetch instruction, increment PC
  • 2) Decode: Decode instruction, read registers
  • 3) Execute: Mem-reference: calculate address;
    R-format: perform ALU operation
  • 4) Memory: Load: read data from data memory;
    Store: write data to data memory
  • 5) Write Back: Write data to register
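The overlap of these five stages can be captured with a couple of one-line formulas; the following small Python sketch (an illustration added here, not from the slides) assumes an ideal pipeline with no hazards or stalls.

```python
# Ideal 5-stage pipeline timing: instruction i occupies stage s in
# cycle i + s, so a new instruction can be fetched every cycle.
STAGES = ["IFtch", "Dcd", "Exec", "Mem", "WB"]

def timeline(n_instructions):
    # Map each (instruction, stage) pair to the cycle it occupies.
    return {(i, s): i + s
            for i in range(n_instructions)
            for s in range(len(STAGES))}

def total_cycles(n_instructions):
    # The first instruction takes 5 cycles to drain through the pipe;
    # each additional instruction completes one cycle later.
    return len(STAGES) + (n_instructions - 1)

print(total_cycles(4))  # 8 cycles for 4 overlapped instructions
```

Without pipelining, 4 instructions would need 4 x 5 = 20 stage-times; with overlap they finish in 8, which is the source of the throughput gain discussed on the next slides.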

14
Pipelined Execution Representation
Time -->
Instr 1:  IFtch Dcd   Exec  Mem   WB
Instr 2:        IFtch Dcd   Exec  Mem   WB
Instr 3:              IFtch Dcd   Exec  Mem   WB
Instr 4:                    IFtch Dcd   Exec  Mem   WB
Program Flow (down the page)
  • To simplify the pipeline, every instruction takes
    the same number of steps, called stages
  • One clock cycle per stage

15
Datapath Timing Single-cycle vs. Pipelined
  • Assume the following delays for the major
    functional units:
  • 2 ns for a memory access or ALU operation
  • 1 ns for a register file read or write
  • Total datapath delay for single-cycle: 8 ns,
    set by the slowest instruction (lw)
  • In the pipelined machine, each stage = length of
    the longest delay = 2 ns; 5 stages => 2 ns x 5 =
    10 ns

Insn     Insn    Reg    ALU    Data    Reg     Total
Type     Fetch   Read   Oper   Access  Write   Time
beq      2ns     1ns    2ns                    5ns
R-form   2ns     1ns    2ns            1ns     6ns
sw       2ns     1ns    2ns    2ns             7ns
lw       2ns     1ns    2ns    2ns     1ns     8ns
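The table above can be checked mechanically. This Python sketch (added for illustration; the steady-state formula assumes an ideal pipeline with no hazards) recomputes the per-instruction totals, the two clock periods, and the time for a run of instructions in each design:

```python
# Recomputing the slide's timing: the single-cycle clock must fit the
# slowest instruction (lw, 8 ns), while the pipelined clock is set by
# the slowest single stage (2 ns).
DELAYS = {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 2, "reg_write": 1}

INSTRUCTIONS = {  # which functional units each instruction type uses
    "beq":    ["fetch", "reg_read", "alu"],
    "R-form": ["fetch", "reg_read", "alu", "reg_write"],
    "sw":     ["fetch", "reg_read", "alu", "mem"],
    "lw":     ["fetch", "reg_read", "alu", "mem", "reg_write"],
}

totals = {name: sum(DELAYS[u] for u in units)
          for name, units in INSTRUCTIONS.items()}

single_cycle_clock = max(totals.values())  # 8 ns
pipelined_clock = max(DELAYS.values())     # 2 ns

# Time for N instructions, ignoring hazards:
N = 1000
print(N * single_cycle_clock)              # 8000 ns single-cycle
print((5 + N - 1) * pipelined_clock)       # 2008 ns pipelined
```

For long runs the ratio approaches 8 ns / 2 ns = 4x, not the 5x stage count, because the stages are unbalanced: a point the "Pipelining Lessons" slide makes next.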
16
Pipelining Lessons
  • Pipelining doesn't help the latency (execution
    time) of a single task; it helps the throughput
    of the entire workload
  • Multiple tasks operate simultaneously using
    different resources
  • Potential speedup = number of pipe stages
  • Time to fill the pipeline and time to drain it
    reduce the speedup
  • Pipeline rate is limited by the slowest pipeline
    stage
  • Unbalanced lengths of pipe stages also reduce the
    speedup

17
Single Cycle Datapath (From Ch 5)
  [Figure: single-cycle datapath with PC, Imem, Regs, ALU, Dmem,
   sign-extend, shift-left-2 (<< 2), adders, and muxes; instruction
   fields 31-0, 25-21, 20-16, 15-11, and 15-0 feed Read Reg 1, Read
   Reg 2, Write Reg, and the sign extender; control signals: PCSrc,
   MemWrite, MemToReg, RegDst, ALUSrc, RegWrite, MemRead, ALUOp]
18
Required Changes to Datapath
  • Introduce registers to separate the 5 stages by
    putting IF/ID, ID/EX, EX/MEM, and MEM/WB
    registers in the datapath.
  • The next PC value is computed in the 3rd stage,
    but we need to bring in the next instn in the
    next cycle => move the PCSrc mux to the 1st
    stage. The PC is incremented unless there is a
    new branch address.
  • The branch address is computed in the 3rd stage.
    With the pipeline, the PC value has changed by
    then! We must carry the PC value along with the
    instn. Width of IF/ID register = (IR 32) + (PC
    32) = 64 bits.

19
Changes to Datapath Contd.
  • For the lw instn, we need the write register
    address at stage 5. But the IR is now occupied by
    another instn! So, we must carry the IR
    destination field along as we move through the
    stages. See the connection in the fig.
  • Length of ID/EX register = (Reg1 32) + (Reg2 32)
    + (offset 32) + (PC 32) + (destination register
    5) = 133 bits
  • Assignment: What are the lengths of the EX/MEM
    and MEM/WB registers?
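The register-width arithmetic on these two slides can be tallied in a few lines. The snippet below (an added check, not part of the lecture) recomputes the IF/ID and ID/EX widths from their fields; the EX/MEM and MEM/WB widths are left as the assignment asks.

```python
# Pipeline-register width check from the slides:
# IF/ID carries the instruction (IR) plus the PC;
# ID/EX carries four 32-bit values plus the 5-bit destination field.
IR_BITS, PC_BITS = 32, 32
IF_ID_WIDTH = IR_BITS + PC_BITS            # 64 bits

ID_EX_FIELDS = {"Reg1": 32, "Reg2": 32, "offset": 32,
                "PC": 32, "dest_reg": 5}
ID_EX_WIDTH = sum(ID_EX_FIELDS.values())   # 133 bits

print(IF_ID_WIDTH, ID_EX_WIDTH)  # 64 133
```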

20
Pipelined Datapath (with Pipeline Regs)(6.2)
Fetch | Decode | Execute | Memory | Write Back
  [Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB
   registers separating the five stages; components include PC, Imem,
   Regs, ALU, adders, shift-left-2, sign-extend, Dmem, and muxes;
   pipeline register widths: IF/ID = 64 bits, ID/EX = 133 bits,
   EX/MEM = 102 bits, MEM/WB = 69 bits]