Macro-op Scheduling (Ilhyun Kim, MICRO-36) - PowerPoint PPT transcript
1
Macro-op Scheduling: Relaxing Scheduling Loop
Constraints
  • Ilhyun Kim
  • Mikko H. Lipasti
  • PHARM Team
  • University of Wisconsin-Madison

2
It's all about granularity
  • Instruction-centric hardware design
  • HW structures are built to match an instruction's
    specifications
  • Controls occur at every instruction boundary
  • Instruction granularity may impose constraints on
    the hardware design space
  • Relaxing the constraints at different processing
    granularities

[Figure: processing-granularity spectrum, finer to coarser: operand
(Half-price architecture, ISCA '03), instruction (conventional),
macro-op (coarser-granular architecture)]
3
Outline
  • Scheduling loop constraints
  • Overview of coarser-grained scheduling
  • Macro-op scheduling implementation
  • Performance evaluation
  • Conclusions & future work

4
Scheduling loop constraints
  • Loops in out-of-order execution
  • Scheduling atomicity (wakeup / select within a
    single cycle)
  • Essential for back-to-back instruction execution
  • Hard to pipeline in conventional designs
  • Poor scalability
  • Extractable ILP is a function of window size
  • Complexity increases exponentially as the size
    grows
  • Increasing pressure due to deeper pipelining and
    slower memory system

[Figure: loops in out-of-order execution: execution loop (bypass),
scheduling loop (wakeup / select), load latency resolution loop]
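The atomicity constraint can be made concrete with a toy timing model (entirely our sketch, not the paper's simulator): dependent single-cycle ops issue back-to-back only when wakeup and select fit in one cycle.

```python
# Toy model (our sketch, not the paper's simulator): issue cycle of each
# instruction in a chain of dependent single-cycle ALU ops. sched_latency is
# the delay from a producer's issue until a dependent can be selected:
# 1 models atomic (same-cycle) wakeup/select, 2 a pipelined scheduler.

def issue_cycles(n_ops, sched_latency):
    cycles, t = [], 0
    for _ in range(n_ops):
        cycles.append(t)
        t += sched_latency  # the next dependent becomes selectable only then
    return cycles

print(issue_cycles(4, 1))  # atomic: [0, 1, 2, 3], back-to-back
print(issue_cycles(4, 2))  # pipelined: [0, 2, 4, 6], one bubble per edge
```

Pipelining the scheduler over two cycles inserts a bubble on every dependence edge, which is exactly the loss the following slides quantify.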
5
Related Work
  • Scheduling atomicity
  • Speculation / pipelining
  • Grandparent scheduling [Stark], select-free
    scheduling [Brown]
  • Poor scalability
  • Low-complexity scheduling logic
  • FIFO-style window [Palacharla], [H. Kim]
  • Data-flow based window [Canal], [Michaud],
    [Raasch]
  • Judicious window scaling
  • Segmented windows [Hrishikesh], WIB [Lebeck]
  • Issue queue entry sharing
  • AMD K7 (MOP), Intel Pentium M (uop fusion)
  • → Still based on instruction-centric scheduler
    designs
  • Making a scheduling decision at every instruction
    boundary
  • Overcoming atomicity and scalability in isolation

6
Source of the atomicity constraint
  • Minimal execution latency of instruction
  • Many ALU operations have single-cycle latency
  • Schedule should keep up with execution
  • 1-cycle instructions need 1-cycle scheduling
  • Multi-cycle operations do not need atomic
    scheduling
  • → Relax the constraints by increasing the size of
    the scheduling unit
  • Combine multiple instructions into a multi-cycle
    latency unit
  • Scheduling decisions occur at multiple
    instruction boundaries
  • Attack both atomicity and scalability constraints
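The effect of a coarser scheduling unit on a dependent chain can be sketched with a back-of-the-envelope model (assumptions and names are ours):

```python
# Back-of-the-envelope model (our assumptions): completion time of a chain of
# n_ops dependent single-cycle ops, grouped into multi-cycle units. A dependent
# unit can issue max(sched_latency, unit_size) cycles after its producer unit,
# and the final unit takes unit_size cycles to drain.

def chain_exec_time(n_ops, unit_size, sched_latency):
    assert n_ops % unit_size == 0
    n_units = n_ops // unit_size
    gap = max(sched_latency, unit_size)   # issue-to-issue gap between units
    return (n_units - 1) * gap + unit_size

print(chain_exec_time(8, 1, 1))  # atomic scheduler, single insts: 8 cycles
print(chain_exec_time(8, 1, 2))  # 2-cycle scheduler, single insts: 15 cycles
print(chain_exec_time(8, 2, 2))  # 2-cycle scheduler, 2x MOPs: 8 cycles again
```

Under this sketch the pipelined 2-cycle scheduler loses nothing once dependent pairs are fused, because the scheduling latency hides behind the MOP's own 2-cycle execution.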

7
Macro-op scheduling overview
[Figure: pipeline overview. I-cache Fetch / Decode / Rename -> Queue
insert (MOP formation) -> Scheduling (Wakeup / Select) -> Disp -> RF /
Payload RAM -> EXE -> MEM -> WB / Commit. The front end and the back
end (sequencing of original instructions) stay instruction-grained;
only scheduling is coarser, MOP-grained. MOP detection generates MOP
pointers from dependence information; wakeup order information enables
pipelined scheduling.]
8
MOP scheduling(2x) example
[Figure: 2x MOP scheduling example: the same 16-instruction dependence
graph scheduled two ways. Conventional atomic scheduling (select /
wakeup in one cycle): 9 cycles, 16 queue entries. MOP scheduling
(select, then wakeup, pipelined over two cycles): 10 cycles, 9 queue
entries, since grouped pairs share entries.]
  • Pipelined instruction scheduling of multi-cycle
    MOPs
  • Still issues original instructions consecutively
  • Larger instruction window
  • Multiple original instructions logically share a
    single issue queue entry

9
Outline
  • Scheduling loop constraints
  • Overview of coarser-grained scheduling
  • Macro-op scheduling implementation
  • Performance evaluation
  • Conclusions & future work

10
Issues in grouping instructions
  • Candidate instructions
  • Single-cycle instructions: integer ALU, control,
    store address-generation operations
  • Multi-cycle instructions (e.g. loads) do not need
    single-cycle scheduling
  • The number of source operands
  • Grouping two dependent instructions → up to 3
    source operands
  • Allow up to 2 source operands (conventional) / no
    restriction (wired-OR)
  • MOP size
  • Bigger MOP sizes may be more beneficial
  • 2 instructions in this study
  • MOP formation scope
  • Instructions are processed in order before being
    inserted into the issue queue
  • Candidate instructions need to be captured within
    a reasonable scope
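These grouping rules can be summarized as a small predicate (a hedged sketch; the operand classes and field names are our own, not the paper's implementation):

```python
# Illustrative groupability check (our field names). Heads and tails must be
# single-cycle ops, the tail must consume the head's value within the
# formation scope, and the merged pair must not exceed the source-operand
# budget: 2 for a conventional wakeup array, 3 with wired-OR tag lines.

SINGLE_CYCLE = {"alu", "control", "store_agen"}

def can_group(head, tail, distance, max_srcs=3, scope=8):
    if head["cls"] not in SINGLE_CYCLE or tail["cls"] not in SINGLE_CYCLE:
        return False                    # e.g. loads need no atomic scheduling
    if distance > scope:                # outside the MOP formation scope
        return False
    if head["dst"] not in tail["srcs"]: # tail must depend on the head
        return False
    merged = set(head["srcs"]) | (set(tail["srcs"]) - {head["dst"]})
    return len(merged) <= max_srcs      # the MOP's externally visible sources

add_ = {"cls": "alu", "dst": "r1", "srcs": ["r2", "r3"]}
and_ = {"cls": "alu", "dst": "r5", "srcs": ["r1", "r4"]}
print(can_group(add_, and_, distance=2))              # True (3 merged sources)
print(can_group(add_, and_, distance=2, max_srcs=2))  # False under 2-src rule
```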

11
Dependence edge distance (instruction count)
[Chart: dependence edge distance distribution per benchmark; values
shown: 49.2, 50.9, 27.8, 48.7, 37.4, 56.3, 40.2, 47.5, 42.7, 47.7,
37.6, 44.7 (of total insts).]
  • 73% of value-generating candidates (potential MOP
    heads) have dependent candidate instructions
    (potential MOP tails)
  • An 8-instruction scope captures many dependent
    pairs
  • Variability in distances (e.g. gap vs. vortex) →
    remember this
  • → Our configuration: grouping 2 single-cycle
    instructions within an 8-instruction scope

12
MOP detection
  • Finds groupable instruction pairs
  • Dependence matrix-based detection (detailed in
    the paper)
  • Performance is insensitive to detection latency
    (pointers reused repeatedly)
  • A pessimistic 100-cycle latency loses 0.22% of
    IPC
  • Generates MOP pointers
  • 4 bits per instruction, stored in IL1
  • A MOP pointer represents a groupable instruction
    pair

13
MOP detection Avoiding cycle conditions
  • Cycle condition examples (leading to deadlocks)
  • Conservative cycle detection heuristic
  • Precise detection is hard (multiple levels of dep
    tracking)
  • Assume a cycle if both outgoing and incoming
    edges are detected
  • Captures over 90% of MOP opportunities (compared
    to precise detection)
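A minimal version of this heuristic might look like the following (our encoding of dependence edges, not the paper's matrix hardware):

```python
# Sketch of the conservative cycle heuristic. Grouping head H with tail T
# deadlocks if some middle instruction X has H -> X -> T: the MOP would wait
# on X while X waits on the MOP. Instead of tracking dependence chains
# precisely, assume a cycle whenever H has any outgoing edge into the middle
# AND T has any incoming edge from the middle.

def may_form_cycle(head, tail, edges):
    """edges: set of (producer_index, consumer_index) dependence pairs."""
    middle = range(head + 1, tail)
    outgoing = any((head, m) in edges for m in middle)
    incoming = any((m, tail) in edges for m in middle)
    return outgoing and incoming    # conservative: may reject some safe pairs

edges = {(0, 1), (1, 3), (0, 3)}
print(may_form_cycle(0, 3, edges))     # True: 0 -> 1 -> 3 would deadlock
print(may_form_cycle(0, 3, {(0, 3)}))  # False: safe to group
```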

14
MOP formation
  • Locates MOP pairs using MOP pointers
  • MOP pointers are fetched along with instructions
  • Converts register dependences into MOP dependences
  • Architected register IDs → MOP IDs
  • Identical to register renaming
  • Except that it assigns a single ID to two
    groupable instructions
  • Reflects the fact that two instructions are
    grouped into one scheduling unit
  • Two instructions are later inserted into one
    issue entry
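As a sketch (our data layout, not the actual rename hardware), the translation step might look like:

```python
# MOP dependence translation sketched as rename-style table lookups (our data
# layout). Two grouped instructions receive one shared MOP ID, so readers of
# either destination register see a single producer unit.

def translate(insts, pairs):
    """insts: list of (dst_reg, src_regs); pairs: dict head_idx -> tail_idx."""
    head_of = {t: h for h, t in pairs.items()}
    table, mop_of, deps, next_id = {}, {}, {}, 0
    for i, (dst, srcs) in enumerate(insts):
        if i in head_of:                 # tail reuses the head's MOP ID
            mid = mop_of[head_of[i]]
        else:
            mid, next_id = next_id, next_id + 1
        mop_of[i] = mid
        # merged dependences of the shared entry (intra-MOP edge dropped)
        deps.setdefault(mid, set()).update(
            table[s] for s in srcs if s in table and table[s] != mid)
        table[dst] = mid                 # later readers depend on this MOP
    return mop_of, deps

insts = [("r1", ["r9"]), ("r2", ["r1", "r8"]), ("r3", ["r2"])]
mop_of, deps = translate(insts, pairs={0: 1})
print(mop_of)  # {0: 0, 1: 0, 2: 1}: the first two share one entry
print(deps)    # {0: set(), 1: {0}}: the intra-MOP r1 edge disappeared
```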

15
Scheduling MOPs
  • Instructions in a MOP are scheduled as a single
    unit
  • A MOP is a non-pipelined, 2-cycle operation from
    the scheduler's perspective
  • Issued when all source operands are ready; incurs
    one tag broadcast
  • Wakeup / select timings
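From the scheduler's side this could be sketched as follows (the entry layout, helper name, and broadcast timing are our assumptions, not the paper's wakeup logic):

```python
# Sketch of the scheduler's view of one MOP entry (field names assumed). The
# pair occupies one issue-queue entry, waits on its merged source tags, and
# performs a single tag broadcast timed for a non-pipelined 2-cycle unit.

def try_issue(entry, ready_tags, now):
    """Return the cycle of the MOP's one tag broadcast, or None if not ready."""
    if entry["srcs"] <= ready_tags:   # all merged operands available?
        return now + 2                # tail's result arrives two cycles later
    return None

mop = {"tag": 7, "srcs": {3, 5}}
print(try_issue(mop, {3, 5, 6}, now=10))  # 12
print(try_issue(mop, {3}, now=10))        # None
```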

16
Sequencing instructions
sequence original insts
  • A MOP is converted back to two original
    instructions
  • The dual-entry payload RAM sends two original
    instructions
  • Original instructions are sequentially executed
    within 2 cycles
  • Register values are accessed using physical
    register IDs
  • ROB separately commits original instructions in
    order
  • MOPs do not affect precise exception or branch
    misprediction recovery

17
Outline
  • Scheduling loop constraints
  • Overview of coarser-grained scheduling
  • Macro-op scheduling implementation
  • Performance evaluation
  • Conclusions & future work

18
Machine parameters
  • Simplescalar-Alpha-based 4-wide OoO, speculative
    scheduling w/ selective replay, 14 stages
  • Ideally pipelined scheduler
  • conceptually equivalent to atomic scheduling + 1
    extra stage
  • 128 ROB, unrestricted / 32-entry issue queue
  • 4 ALUs, 2 memory ports, 16K IL1 (2), 16K DL1 (2),
    256K L2 (8), memory (100)
  • Combined branch prediction, fetch until the first
    taken branch
  • MOP scheduling
  • 2-cycle (pipelined) scheduling + 2x MOP technique
  • 2 (conventional) or 3 (wired-OR) source operands
  • MOP detection scope: 2 cycles (4-wide x 2-cycle =
    up to 8 insts)
  • SPEC2000 INT, reduced input sets
  • Reference input sets for crafty, eon, gap (up to
    3B instructions)

19
[Chart: percentage of grouped instructions per benchmark, 2-src vs.
3-src configurations]
  • 28-46% of total instructions are grouped
  • 14-23% reduction in the instruction count seen by
    the scheduler
  • Dependent MOP cases enable consecutive issue of
    dependent instructions

20
MOP scheduling performance (relaxed atomicity
constraint only)
Unrestricted IQ / 128 ROB
  • Up to 19% IPC loss with 2-cycle scheduling
  • MOP scheduling restores performance
  • Enables consecutive issue of dependent
    instructions
  • 97.2% of atomic scheduling performance on average

21
Insight into MOP scheduling
  • Performance loss of 2-cycle scheduling
  • Correlated with dependence edge distance
  • Short dependence edges (e.g. gap)
  • → instruction window fills up with chains of
    dependent instructions
  • → a 2-cycle scheduler cannot find enough ready
    instructions to issue
  • MOP scheduling captures short-distance dependent
    instruction pairs
  • They are the important ones
  • Low MOP coverage due to long dependence edges
    does not matter
  • a 2-cycle scheduler can find many instructions to
    issue (e.g. vortex)
  • → MOP scheduling complements 2-cycle scheduling
  • Overall performance is less sensitive to code
    layout

22
MOP scheduling performance (relaxed atomicity &
scalability constraints)
32 IQ / 128 ROB
  • Benefits from both relaxed atomicity and
    scalability constraints
  • → Pipelined 2-cycle MOP scheduling performs
    comparably to or better than atomic scheduling

23
Conclusions & Future work
  • Changing processing granularity can relax the
    constraints imposed by instruction-centric
    designs
  • Constraints in the instruction scheduling loop:
    scheduling atomicity, poor scalability
  • Macro-op scheduling relaxes both constraints at a
    coarser granularity
  • Pipelined, 2-cycle macro-op scheduling can
    perform comparably to or even better than atomic
    scheduling
  • Potential for narrow-bandwidth microarchitectures
  • Extending the MOP idea to the whole pipeline
    (Disp, RF, bypass)
  • e.g. achieving 4-wide machine performance using
    2-wide bandwidth

24
Questions??
25
Select-free (Brown et al.) vs. MOP scheduling
32 IQ / 128 ROB, no extra stage for MOP formation
  • 4.1% better IPC on average over
    select-free scoreboard (best: 8.3%)
  • Select-free cannot outperform atomic
    scheduling
  • Select-free scheduling is speculative and
    requires recovery operations
  • MOP scheduling is non-speculative, leading to
    many advantages

26
MOP detection MOP pointer generation
  • Finding dependent pairs
  • Dependence matrix-based detection (detailed in
    MICRO paper)
  • Insensitive to detection latency (pointers reused
    repeatedly)
  • A pessimistic 100-cycle latency loses 0.22% of
    IPC
  • Similar to instruction preprocessing in trace
    cache lines
  • MOP pointers (4 bits per instruction)

[MOP pointer format: 1 control bit | 3 offset bits]
  • Control bit (1)
  • captures up to 1 control discontinuity
  • Offset bits (3)
  • instruction count from head to tail

    0 011  add r1 ← r2, r3
    0 000  lw  r4 ← 0(r3)
    1 010  and r5 ← r4, r2
    0 000  bez r1, 0xff (taken)
    0 000  sub r6 ← r5, 1

MOP pointers
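The 4-bit encoding can be sketched with a pair of helpers (helper names are ours; the bit layout follows the slide):

```python
# 4-bit MOP pointer: 1 control bit (crosses up to one control discontinuity)
# followed by 3 offset bits (head-to-tail instruction count). Helper names
# are ours; the layout is as described on the slide.

def encode_pointer(crosses_branch, offset):
    assert 0 <= offset <= 7              # 3 bits limit the reachable distance
    return (int(crosses_branch) << 3) | offset

def decode_pointer(p):
    return bool(p >> 3), p & 0b111

# 'and' -> 'sub' in the example crosses the taken bez: control=1, offset=2
print(bin(encode_pointer(True, 2)))      # 0b1010
print(decode_pointer(0b0011))            # (False, 3): 'add' -> 'bez'
```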
27
MOP formation MOP dependence translation
  • Assigns a single ID to two MOPable instructions
  • reflecting the fact that two instructions are
    grouped into one unit
  • The process and the required structure are
    identical to register renaming
  • Register values are still accessed based on the
    original register IDs

[Figure: register rename table (logical reg ID -> physical reg ID)
shown beside the MOP translation table (logical reg ID -> MOP ID); a
single MOP ID is allocated to two grouped instructions.]
28
Inserting MOPs into issue queue
[Figure: pipeline diagram as on slide 7, highlighting issue queue
insertion after MOP formation]
  • Inserting instructions across different groups

29
Performance considerations
  • Independent MOPs
  • Group independent instructions with the same
    source dependences
  • No direct performance benefit, but reduces queue
    contention
  • Last-arriving operands in tail instructions
  • Unnecessarily delay head instructions
  • MOP detection logic filters out harmful groupings
  • Creates an alternative pair if one exists
