Title: Physical Register Inlining (PRI)
1 Physical Register Inlining (PRI)
- Mikko H. Lipasti (1), Brian Mestan (2), and Erika Gunadi (1)
- (1) Department of Electrical and Computer Engineering, University of Wisconsin-Madison
- (2) IBM Microelectronics, IBM Corporation, Austin, TX

http://www.ece.wisc.edu/pharm
2 Demand for Large Register Files
- Instruction Window
- Deeper Pipeline
- Increasing pressure on Register File
- Lots of attention / prior work
 
3 Challenges with Scaling Register Files
- Additional pipe stages needed for access 
 - Increases branch misprediction penalty 
 - Increases scheduling misprediction penalty 
 - Requires additional bypass logic 
 - Further increases pipeline depth 
 - Increases the demand for more registers 
 
4 Physical Register Lifetime
[Figure: physical register lifetime; panels for width 4 and width 8]
5 Prior Work
- Register file caching (Swenson et al. 1988, Zalamea et al. 2000, Postiff et al. 2001, Cruz et al. 2000, Borch et al. 2002)
- Late allocation (Gonzalez et al. 1998, Monreal et al. 1999)
- Efficient management
  - Early deallocation (Moudgill et al. 1993)
  - Program semantics (Martin et al. 1997, Lo et al. 1999)
  - Checkpointing (Martinez et al. 2002, Akkary et al. 2003)
- Value-based optimizations (Jourdan et al. 1998)
 
6 Early Deallocation
- Moudgill et al. 1993
- Focused on last read to release
- Avoid waiting for the next writer to commit
- Deallocate registers as soon as they are
  - Complete (complete flag)
  - Unmapped (unmap flag)
  - No outstanding readers (reference counter)
- Still requires next writer to enter the window
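The release condition above can be sketched as a small state record per physical register. This is only an illustration under assumed names (`PhysRegState`, `can_release` are hypothetical, not from the slides or from Moudgill et al.):

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    """Per-register bookkeeping (hypothetical names, for illustration only)."""
    complete: bool = False  # complete flag: the producer has written its result
    unmapped: bool = False  # unmap flag: a later writer renamed the same logical register
    readers: int = 0        # reference counter: renamed consumers that have not read yet


def can_release(reg: PhysRegState) -> bool:
    """Early-release test: all three conditions must hold at once."""
    return reg.complete and reg.unmapped and reg.readers == 0


reg = PhysRegState()
reg.readers += 1              # a consumer is renamed and points at this register
reg.complete = True           # the producer executes
reg.unmapped = True           # the next writer enters the window
blocked = can_release(reg)    # one reader is still outstanding
reg.readers -= 1              # the consumer reads its operand
released = can_release(reg)   # now all three conditions hold
```

Note that `unmapped` only becomes true once the next writer is renamed, which is the limitation stated in the last bullet.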
 
7 Physical Register Inlining
- Exploits narrow operands: a sizable fraction of operands can be stored in fewer than 8 bits (Canal et al. 2000)
- Often fewer bits than needed to specify a physical register
- Store the value instead of the pointer
- Stores narrow values in map table
- Reduces physical register lifetime
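A minimal sketch of the idea, assuming a signed narrow field and a one-bit flag per map entry that says whether the entry holds a value or a pointer. The names (`NARROW_BITS`, `MapEntry`, `make_entry`) and the signed interpretation are assumptions for illustration, not details taken from the slides:

```python
from dataclasses import dataclass

NARROW_BITS = 7  # assumed width of the inline field


def pointer_bits(num_phys_regs: int) -> int:
    """Bits needed to name one of num_phys_regs physical registers."""
    return (num_phys_regs - 1).bit_length()


def is_narrow(value: int, bits: int = NARROW_BITS) -> bool:
    """True if value fits in a signed two's-complement field of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


@dataclass
class MapEntry:
    inlined: bool  # True: field is the value itself; False: field is a physical register number
    field: int


def make_entry(preg: int, value: int) -> MapEntry:
    """Store the value instead of the pointer whenever the value is narrow."""
    if is_narrow(value):
        return MapEntry(inlined=True, field=value)
    return MapEntry(inlined=False, field=preg)


small = make_entry(preg=12, value=-3)    # inlined: the register is no longer needed
large = make_entry(preg=12, value=1000)  # not narrow: keep the pointer
```

With 128 physical registers the pointer itself needs `pointer_bits(128)` = 7 bits, so a 7-bit value fits in the space the pointer already occupies.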
 
8 Operand Significance
- Also have FP graph in the paper; exploits 0.0/1.0 (54)
9 Outline
- Motivation
- Prior Work
- Physical Register Inlining
- Quick Microarchitectural Review
- Modifications Needed
- PRI + early deallocation
- Experiments
- Conclusions
 
10 Microarchitectural Review
- Register Rename/Map Tables
  - Maps logical names to physical names
  - Removes false name dependences
- Two common types: RAM and CAM
  - CAM map is positional
  - Not suitable for storing values
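To make the "positional" point concrete, here is a rough sketch of the two lookup styles. The table contents and function names are made up for illustration. In the RAM map the entry stores the physical register number, so the same field could hold a value; in the CAM map the physical register number is the position of the matching entry, so there is no field in which a value could be stored.

```python
# RAM map: indexed by logical register; each entry stores a physical register number.
ram_map = [5, 2, 7]  # r0 -> p5, r1 -> p2, r2 -> p7 (made-up mapping)


def ram_lookup(logical):
    return ram_map[logical]


# CAM map: one entry per physical register; each entry stores (logical tag, valid bit).
NUM_PHYS = 8
cam_map = [(None, False)] * NUM_PHYS
cam_map[5] = (0, True)
cam_map[2] = (1, True)
cam_map[7] = (2, True)


def cam_lookup(logical):
    """Search all entries; the answer is the position of the match, not stored data."""
    for preg, (tag, valid) in enumerate(cam_map):
        if valid and tag == logical:
            return preg
    return None
```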
 
[Figure: RAM map vs. CAM map, labeled with logical reg and phys reg]
11 Microarchitectural Review
- Allocating and Freeing Physical Registers
  - Allocates physical register at decode; map table entry is updated
  - Releases physical register when next writer is committed
- Checkpoint and Recovery of Register Map
  - Optimization to reduce branch misprediction penalty
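The conventional scheme reviewed above can be sketched roughly as follows. This is a simplified model with hypothetical names (`Renamer`, `alloc_log`); it handles one checkpoint at a time and ignores source renaming, which a real design would not.

```python
from collections import deque


class Renamer:
    """Baseline RAM-style rename map with a free list (illustrative only)."""

    def __init__(self, num_logical, num_physical):
        self.map = list(range(num_logical))  # logical i starts mapped to physical i
        self.free = deque(range(num_logical, num_physical))
        self.alloc_log = []                  # allocation order, used to undo wrong-path work
        self.checkpoints = {}

    def rename_dest(self, logical):
        """Allocate at decode/rename; return (new, previous) physical registers."""
        new = self.free.popleft()
        prev = self.map[logical]
        self.map[logical] = new
        self.alloc_log.append(new)
        return new, prev

    def commit(self, prev):
        """The next writer has committed, so the previous mapping can be released."""
        self.free.append(prev)

    def checkpoint(self, tag):
        """Snapshot the map at a branch so recovery does not have to walk the ROB."""
        self.checkpoints[tag] = (list(self.map), len(self.alloc_log))

    def recover(self, tag):
        """Restore the map and return registers allocated after the checkpoint."""
        saved_map, log_pos = self.checkpoints.pop(tag)
        self.free.extend(self.alloc_log[log_pos:])
        del self.alloc_log[log_pos:]
        self.map = saved_map


r = Renamer(num_logical=4, num_physical=8)
r.checkpoint("br0")
wrong_new, _ = r.rename_dest(3)  # wrong-path writer of r3
r.recover("br0")                 # misprediction: r3 maps to its old register again
new, prev = r.rename_dest(3)     # correct-path writer of r3
r.commit(prev)                   # old r3 register is released only now
```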
12 Modifications to Data Flow
[Figure: pipeline (Fetch, Dcd, Rnm, Queue, Sched, Disp, RF, Exe, Retire, Commit) with Map, Payload RAM, ALU, and a "Narrow?" check]
- Execution stage must allow both operands to be read from payload RAM
  - Already supports one immediate operand
- Sign extension between payload RAM and the ALU input
- Narrow checking logic to verify whether the operands are narrow
- Narrow datapath back to the map table
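The sign-extension and narrow-check steps can be sketched as bit manipulation on a fixed-width field. The 7-bit width and all function names here are assumptions for illustration; the check is written as a round trip (truncate, sign-extend, compare), which is one possible way to express it.

```python
NARROW_BITS = 7  # assumed inline field width


def to_field(value: int, bits: int = NARROW_BITS) -> int:
    """Truncate a value to the low `bits` bits, as stored in the payload RAM."""
    return value & ((1 << bits) - 1)


def sign_extend(field: int, bits: int = NARROW_BITS) -> int:
    """Widen a `bits`-wide two's-complement field back to a full integer."""
    sign = 1 << (bits - 1)
    return (field ^ sign) - sign


def narrow_check(result: int, bits: int = NARROW_BITS) -> bool:
    """A result is narrow if truncating and sign-extending it loses nothing."""
    return sign_extend(to_field(result, bits), bits) == result


def read_operand(inlined: bool, field: int, prf: list) -> int:
    """Operand mux at the ALU input: inline value from payload RAM, or a PRF read."""
    return sign_extend(field) if inlined else prf[field]


prf = [0, 1000, 2000]
a = read_operand(True, to_field(-1), prf)  # inlined operand, sign-extended
b = read_operand(False, 2, prf)            # ordinary register read
```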
 
13 Modifications to Map Table
- Registers freed from the retire/wb stage and commit stage
- Tolerant of duplicate deallocations of the same physical register
  - Once as narrow, again at next write commit
- Map entries need to be writable from rename stage and retire/wb stage
14 Stale Pointer Problem
[Figure: MAP, Checkpoints (copy), ROB, and IssueQ holding pointers into the PRF]
- Deallocating physical registers early makes these pointers stale
- Equivalent to the garbage collection issue
- Two choices
  - Delay deallocation until no valid pointers remain (refcount)
  - Update all pointers (ideal IPC)
 
15 Map table checkpoints problem
- Map table checkpoints need to be updated when a narrow operand is written
- Lazy update
  - Complex, but not cycle-time critical
- Checkpoint reference counting
  - Similar to Akkary et al.
  - Delays deallocation, reduces IPC benefit slightly
 
16 Example of WAR Violation
Load p1 <- MEM[p7]
And  p2 <- p3 & p4    (narrow)
Add  p5 <- p1 + p2
Or   p2 <- p8 | p9    (WAR violation)
- Rare, but frequent enough to affect performance
- Must have an efficient solution
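One plausible reading of this example, traced step by step: the And result is narrow, so p2 is freed early and handed to the Or, while the Add (still waiting on the Load) holds a pointer to p2. All register contents and the memory value below are made up for illustration.

```python
# Physical register file with made-up contents.
prf = {f"p{i}": 0 for i in range(1, 10)}
prf.update(p3=0b0110, p4=0b0011, p8=0x100, p9=0x001)
MEM_VALUE = 40  # value the Load will eventually return (assumed)

# The Add was renamed before the And executed, so it holds pointers, not values.
add_sources = ("p1", "p2")

# And executes: result is narrow, gets inlined in the map table, and p2 is freed.
and_result = prf["p3"] & prf["p4"]
prf["p2"] = and_result

# Or is renamed, receives the recycled p2, and executes while the Add still waits.
prf["p2"] = prf["p8"] | prf["p9"]

# The Load finally returns; the Add now reads through its stale pointer to p2.
prf["p1"] = MEM_VALUE
add_result = prf[add_sources[0]] + prf[add_sources[1]]

expected = MEM_VALUE + and_result  # what the Add should have computed
```

Here `add_result` differs from `expected` because the Or overwrote p2 before the Add read it, which is the write-after-read violation.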
 
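17 Rename Table WAW Hazards
[Figure: two in-flight writers of r3 (sources r1, r2) across Fetch, Decode, Execute, Retire, Commit, renamed to p4 and p5 (sources p1, p2); the MAP entry for r3 goes p3, p4, p5 and the ROB (Dst) records the destinations; marked "WAW!"]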
- WAW hazards
  - Writes narrow value to a remapped map entry
  - Must ensure that the map entry has not been remapped
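The check in the last bullet can be sketched as a compare before the write: the narrow value is only inlined if the map entry still points at the physical register of the instruction producing it. The tuple encoding and the function name are assumptions for illustration.

```python
# Map entries are ("ptr", physical_register) or ("val", inlined_value); hypothetical encoding.


def inline_if_current(map_table, logical, dest_preg, value):
    """Inline `value` only if `logical` still maps to `dest_preg`; report success."""
    kind, field = map_table[logical]
    if kind != "ptr" or field != dest_preg:
        return False  # a younger writer has remapped the entry: leave it alone
    map_table[logical] = ("val", value)
    return True


# As in the figure: r3 was mapped to p3, then p4, and now p5.
map_table = {3: ("ptr", 5)}

older_ok = inline_if_current(map_table, logical=3, dest_preg=4, value=1)    # stale writer
entry_after_older = map_table[3]
younger_ok = inline_if_current(map_table, logical=3, dest_preg=5, value=9)  # current writer
```

Without the compare, the older writer (p4) would overwrite the mapping that belongs to the younger writer (p5).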
18 Integrating PRI with Early Deallocation
- Not all operands are narrow
- Reduces register lifetime further
- Adds unmap flags and complete flags (Moudgill et al. 1993)
[Figure: register lifetime at width 4 for baseline, PRI, and PRI+ER]
19 Machine Model
- 4-wide fetch, issue, commit 
 - 512 ROB, 256 LSQ 
 - 32-entry scheduler 
 - 64 physical registers 
 - Speculative scheduling with selective recovery 
 - Combined bimodal branch predictor 
 - 32KB IL1, 32KB DL1, 512KB L2 
 - 7 bits PRI for integer, 1 bit PRI for FP
 
20 Speed Up for Integer Benchmarks
- PRI (checkpoint + reference counting) performs substantially better than previous work
- Reference + checkpoint counting scheme performs close to the ideal case (ideal + lazy)
- Combining PRI and ER increases the performance further
21 PRF Occupancy for Int. Benchmarks
- PRI reduces register file pressure more than the previous work (ER)
- Combining PRI and ER reduces the pressure further
 
22 Speed Up for FP Benchmarks
- Ammp benchmark -> physical registers are not the performance bottleneck
- Art benchmark -> a lot of narrow operands to exploit
- Wupwise benchmark -> few narrow operands
 
23 Conclusion
- PRI can lead to substantial performance improvement for both integer and FP benchmarks
- Ideal update of stale pointers provides marginal benefit
- Reference + checkpoint counting is the best choice
 
24 Future Work
- Interaction of PRI with delayed register allocation (virtual physical registers) (Gonzalez et al. 1998)
- Interaction of PRI with software-based techniques to deallocate dead registers
  - PRI enables a binary-compatible mechanism for the compiler to tell the hardware that a register is dead
  - Compiler can simply insert a load immediate of a narrow value to any register that seems dead
25 Questions?