Challenges for High Performance Processors - PowerPoint PPT Presentation
1
Challenges for High Performance Processors
  • Hiroshi NAKAMURA
  • Research Center for Advanced Science and
    Technology, The University of Tokyo

2
What's the challenge?
  • Our Primary Goal: Performance
  • How ?
  • increase the number and/or operating frequency of
    functional units
  • AND
  • supply functional units with sufficient data
    (bandwidth)
  • Problems
  • Memory Wall
  • system performance is limited by poor memory
    performance
  • Power Wall
  • power consumption is approaching cooling
    limitation

3
Memory Wall Problem
  • Performance improvement
  • CPU: 55% / year
  • DRAM: 7% / year

4
Example of Memory Wall: Performance of 2 GHz
Pentium 4 for a(i) = b(i) * c(i)
  • non-blocking cache, out-of-order issue
  • → lack of effective memory throughput
5
Recap: Memory Wall Problem
  • growing gap between processor and memory speed
  • performance is limited by memory capability in High
    Performance Computing (HPC)
  • long access latency of main memory
  • lack of throughput of main memory
  • → making full use of wide-bandwidth local (on-chip)
    memory is indispensable
  • on-chip memory space is a valuable resource
  • not enough for HPC
  • should exploit data locality

6
Does cache work well in HPC?
  • works well in many cases, but not the best for
    HPC
  • data placement and replacement are done by hardware
  • unfortunate line conflicts occur even though most
    data accesses are regular
  • e.g., data used only once flushes out other useful
    data
  • transfer size between cache and off-chip memory is
    fixed
  • for consecutive data, a larger transfer size is
    preferable
  • for non-consecutive data, large line transfers
    incur unnecessary data movement → waste of
    bandwidth
  • Most HPC applications exhibit regularity in data
    access, which is sometimes not well exploited.
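The "data used only once flushes out other useful data" point can be reproduced with a toy direct-mapped cache model. All sizes and the access pattern below are made up for illustration; they are not from the talk:

```python
LINE = 64    # line size in bytes
SETS = 256   # number of sets -> a 16 KiB direct-mapped cache

def run(trace, tags=None):
    """Replay byte addresses through the cache; return (hits, tag array)."""
    tags = [None] * SETS if tags is None else tags
    hits = 0
    for addr in trace:
        s, tag = (addr // LINE) % SETS, addr // (LINE * SETS)
        if tags[s] == tag:
            hits += 1
        else:
            tags[s] = tag          # miss: evict whatever line was there
    return hits, tags

reusable = list(range(0, 8192, 8))                   # 8 KiB array we want to reuse
stream = list(range(1 << 20, (1 << 20) + 65536, 8))  # 64 KiB, used only once

_, warm = run(reusable)                      # first pass: load the array
hits_before, _ = run(reusable, list(warm))   # reuse with a warm cache
_, flushed = run(stream, list(warm))         # one-shot stream sweeps every set
hits_after, _ = run(reusable, flushed)       # reuse after the stream
print(hits_before, hits_after)               # 1024 vs 896: every line refetched
```

Even though the reusable array fits comfortably in the cache, the one-shot stream evicts all 128 of its lines, which is exactly the behavior a software-controlled on-chip memory avoids.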

7
SCIMA (Software Controlled Integrated Memory
Architecture) [Kondo, ICCD 2000]
(joint work with Prof. Boku @ Univ. of Tsukuba
and others)
  • addressable SCM in addition to an ordinary cache
  • a part of the logical address space
  • no inclusion relation with the cache
  • SCM and cache are reconfigurable at the
    granularity of a way

(SCM: Software Controllable Memory)
[Figure: overview of SCIMA and its address space]
8
Data Transfer Instructions
  • load/store
  • register ↔ SCM/cache
  • page-load/page-store (new)
  • SCM ↔ off-chip memory
  • large-granularity transfer
  • wider effective bandwidth by reducing latency
    stall
  • block-stride transfer
  • avoids unnecessary data transfer
  • more effective utilization of on-chip memory

[Figure: data paths among register, cache, SCM, and
off-chip memory; page-load/page-store is the new path]
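Both effects can be put in numbers with a back-of-envelope model. The latency, peak bandwidth, and sizes below are arbitrary assumptions, not SCIMA's parameters:

```python
LINE = 64   # bytes moved per cache-line fill
WORD = 8    # bytes per double actually used

# Block-stride transfer: accessing a matrix column (stride >= line size),
# a cache moves a full line per element; the stride transfer moves only
# the word that is used.
n = 1024
cache_traffic = n * LINE
stride_traffic = n * WORD
saving = cache_traffic // stride_traffic
print(saving)               # 8x less off-chip traffic in this setting

# Large-granularity transfer: one big page-load amortizes the startup
# latency, raising effective bandwidth toward the peak.
def eff_bw(size, latency=100e-9, peak=8e9):
    """Effective bandwidth of a single transfer of `size` bytes."""
    return size / (latency + size / peak)

print(eff_bw(64) / 8e9)     # small line: mostly latency stall
print(eff_bw(4096) / 8e9)   # page-sized transfer: close to peak bandwidth
```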
9
Strategy of Software Control
  • SCM must be controlled by software
  • arrays are classified into 6 groups

[Figure: arrays classified by consecutiveness
(including irregular access) and reusability]
  • prototype of a semi-automatic compiler: users
    specify hints on the reusability of data arrays
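A sketch of how such hints might drive data placement. The 6-way taxonomy and the policy below are my reconstruction for illustration, not the actual SCIMA compiler's rules:

```python
# Hypothetical reconstruction: arrays labeled by consecutiveness and
# reusability (3 x 2 = 6 groups), mapped to SCM or cache.

def classify(consecutiveness, reusable):
    assert consecutiveness in ("consecutive", "stride", "irregular")
    return (consecutiveness, reusable)

def placement(group):
    kind, reusable = group
    if kind == "irregular":
        return "cache"            # let hardware handle irregular accesses
    if reusable:
        return "SCM"              # keep reused regular data on chip
    return "SCM (streamed)"       # page-load in, use once, overwrite

print(placement(classify("stride", True)))
```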
10
Results of Memory Traffic
  • unnecessary memory traffic is suppressed

memory traffic decreases by 1–61% in SCIMA
11
Results of Performance
[Figure: normalized execution time]
  • CPU busy time
  • latency stall: elapsed time due to memory
    latency
  • throughput stall: elapsed time due to lack of
    throughput
  • 1.3–2.5 times faster than cache
  • latency stall reduced by large-granularity
    data transfer
  • throughput stall reduced by suppressing
    unnecessary data transfer

12
Power Wall
  • Next Focus: Power Consumption of Processors
  • Is there any room for power reduction?
  • If yes, then how to reduce it?

[Figure: trends of heat density]
13
Observation (1): Moore's Law
  • Number of transistors doubles every 18 months

14
Observation (2): frequency
  • Frequency doubles every 3 years.
  • Number of transistors doubles every 18 months
  • Number of switchings on a chip: ×8 every 3
    years

15
Observation (3): performance
  • # of switchings on a chip: ×8 every 3 years
  • effective performance: ×4 every 3 years
  • microprocessor performance improved 55% per
    year (from Computer Architecture: A
    Quantitative Approach by J. Hennessy and
    D. Patterson, Morgan Kaufmann)
  • unnecessary switching (the chance of power
    reduction) doubles every 3 years
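The arithmetic behind observations (2) and (3) can be checked directly:

```python
# Transistors double every 18 months, frequency doubles every 3 years;
# over a 3-year window:
transistors = 2 ** (3 / 1.5)          # 4x transistors
frequency = 2 ** (3 / 3)              # 2x frequency
switching = transistors * frequency   # 8x switchings on a chip

# 55%/year compounds to ~3.7x in 3 years; the slide rounds this to 4x.
performance = 4
waste = switching / performance       # unnecessary switching doubles
print(switching, waste)               # 8.0 2.0
```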

16
Evidence for the Observation - unnecessary
switching ×2 / 3 years - [Zyuban @ ISLPED'00]
[Figure: access energy per instruction (nJ) vs. issue
width, broken down into rename map table, bypass
mechanism, load/store window, issue window, register
file, and functional units, for committed and flushed
instructions]
  • energy/instr. increases to exploit ILP for higher
    performance
  • at functional units: no increase
  • at issue window and register file:
    increase
  • flushed instructions due to incorrect prediction:
    increase

→ waste of power
17
Registers
  • Registers consume a lot of power
  • roughly speaking, power ∝ (num. of registers) ×
    (num. of ports)
  • high-performance wide-issue superscalar
    processors → more registers, more read/write
    ports
  • Open Question
  • in HPC, what is the best way to use many functional
    units (or accelerators) from the perspective of
    register file design?
  • scalar registers with SIMD operations
  • vector registers with vector operations
  • Personal Impression
  • since vector registers are accessed in a
    well-organized fashion, it is easy to reduce the
    number of ports by a sub-banking technique
  • can vector operations make good use of local
    on-chip memory? (at least, traditional vector
    processors never could!)
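The power relation and the sub-banking argument can be illustrated numerically. Register counts, port counts, and the unit constant below are arbitrary assumptions:

```python
def regfile_power(n_regs, n_ports, k=1.0):
    """Rough model from the slide: power ~ (num. of registers) x (num. of ports)."""
    return k * n_regs * n_ports

p_narrow = regfile_power(128, 8)    # e.g. a 4-issue core with 8 ports
p_wide = regfile_power(256, 16)     # 8-issue: double both -> 4x the power
print(p_wide / p_narrow)

# Sub-banking: well-organized (vector-style) access lets each of 4 banks
# get by with a quarter of the ports.
p_banked = 4 * regfile_power(256 // 4, 4)
print(p_banked / p_wide)            # 1/4 of the monolithic wide design
```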

18
Dual Core helps
Rule of thumb (in the same process technology):
  • Voltage 1, Freq 1, Area 1 → Power 1, Perf 1
  • Voltage -15%, Freq -15%, Area 2× → Power 1, Perf ~1.8
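These figures can be sanity-checked against the standard dynamic-power model P ∝ V²f. Leakage and other savings are ignored here, so the dual-core number comes out slightly above the slide's "Power 1"; perfect parallel scaling is also assumed:

```python
def power(v, f, cores=1):
    """Dynamic power, normalized: P ~ cores * V^2 * f."""
    return cores * v**2 * f

def perf(f, cores=1):
    return cores * f                # assumes perfect parallel scaling

base_p, base_s = power(1.0, 1.0), perf(1.0)

# Dual core at -15% voltage and frequency:
dual_p = power(0.85, 0.85, cores=2) / base_p   # ~1.23 under this naive model
dual_s = perf(0.85, cores=2) / base_s          # 1.7, vs. the slide's ~1.8

# Four cores at Power 1/4 and Performance 1/2 each (the next slide):
quad_p = power(0.71, 0.5, cores=4) / base_p    # ~1.0 total power
quad_s = perf(0.5, cores=4) / base_s           # 2.0x performance
print(dual_p, dual_s, quad_p, quad_s)
```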
19
Multi-Core helps more
  • per core: Power 1/4, Performance 1/2
  • → four such cores deliver 2× performance at the
    same total power

[Figure: one large core vs. four scaled-down cores]
  • no need for wider instruction issue?
  • Multi-Core: power efficient; better power and
    thermal management
20
Leakage problem
[Figure: leakage trends (IEEE Computer magazine)]
  • How to attack the leakage problem?

21
Introduction of our research
  • Innovative Power Control for Ultra Low-Power and
    High-Performance System LSIs
  • five-year project, started October 2006
  • supported by JST (Japan Science and Technology
    Agency) as a CREST (Core Research for Evolutional
    Science and Technology) program
  • Objective: drastic power reduction of
    high-performance system LSIs by innovative power
    control through tight cooperation of various
    design levels, including circuit, architecture,
    and system software
  • Members
  • Prof. H. Nakamura (U. Tokyo): architecture,
    compiler; leader
  • Prof. M. Namiki (Tokyo Univ. of Agri. & Tech.): OS
  • Prof. H. Amano (Keio Univ.): architecture, F/E
    design
  • Prof. K. Usami (Shibaura I.T.): circuit, B/E
    design

22
How to reduce leakage: Power Gating
  • Focusing on power gating for reducing leakage
  • Inserting a power switch (PS) between VDD and GND
  • Turning the PS off during sleep

[Figure: logic gates between VDD and a virtual GND,
with a power switch between virtual GND and GND]
23
Run-time Power Gating (RTPG)
  • control the power switch at run time
  • Coarse grain: mobile processor by Renesas
    (independent power domains for the BB module, MPEG
    module, ...)
  • Fine grain (our target): power gating within a
    module

24
Fine-grain Run-time Power Gating
  • Longer sleep time is preferable
  • Leakage savings
  • Overheads: power penalty for wakeup
  • Evaluation through a real chip: not yet reported
  • Test vehicle: 32-bit × 32-bit multiplier
  • Either or both operands (input data) are likely to
    be less than 16 bits
  • Circuit portions that compute the upper bits of the
    product need not operate → wasted leakage power

By detecting zeros in the upper 16 bits of the operands,
power-gate the internal multiplier array
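A software model of the detection logic. The real trigger is a hardware zero-detector on the operand upper bits; this just mimics the decision it makes:

```python
def upper_zero(x):
    """True when the upper 16 bits of a 32-bit unsigned operand are zero."""
    return (x >> 16) == 0

def multiply(a, b):
    if upper_zero(a) and upper_zero(b):
        # The upper part of the 32x32 array stays power-gated; a 16x16
        # multiply is enough and no precision is lost.
        return (a & 0xFFFF) * (b & 0xFFFF)
    # Otherwise wake the full array (paying the wakeup power penalty that
    # the compiler scheduling on the next slide tries to amortize).
    return a * b

print(upper_zero(1234), multiply(1234, 9999))
```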
25
Test chip "Pinnacle"
(real measurement)
[Figure: measured power, FG-RTPG not applied vs. applied]
  • Exhibits good power reduction
  • Current Status
  • Designing a pipelined microprocessor with FG-RTPG
  • Compiler (instruction scheduler) to increase
    sleep time

26
Low Power Linux Scheduler based on statistical
modeling
  • Co-optimization of System Software and
    Architecture
  • Objective
  • a process scheduler which reduces power consumption
    by DVFS (dynamic voltage and frequency scaling)
    of each process while satisfying its performance
    constraint
  • How to find the lowest frequency that satisfies the
    performance constraint?
  • it depends on hardware and program
    characteristics
  • the performance ratio differs from the frequency
    ratio
  • hard to find the answer directly
  • → modeling by statistical analysis of hardware
    events
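A sketch of the idea: runtime splits into a frequency-dependent part and a memory-bound part, t(f) ≈ a/f + b, so a couple of profiled points pin down the model and the scheduler can pick the lowest adequate frequency. The model form, numbers, and frequency list are illustrative assumptions; the actual work derives its model statistically from hardware event counters:

```python
def fit(samples):
    """Fit t(f) = a/f + b from two (frequency_hz, runtime_s) samples."""
    (f1, t1), (f2, t2) = samples
    a = (t1 - t2) / (1 / f1 - 1 / f2)   # CPU-bound component (cycles)
    b = t1 - a / f1                     # memory-bound component (seconds)
    return a, b

def lowest_freq(a, b, deadline, freqs):
    """Lowest available frequency whose predicted time meets the deadline."""
    for f in sorted(freqs):             # try the slowest setting first
        if a / f + b <= deadline:
            return f
    return max(freqs)                   # constraint unmeetable: run flat out

a, b = fit([(2.0e9, 1.5), (1.0e9, 2.5)])   # made-up profile of one process
f = lowest_freq(a, b, deadline=2.0, freqs=[1.0e9, 1.33e9, 1.6e9, 2.0e9])
print(f)    # 1.6 GHz is the slowest setting that still meets the deadline
```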

27
Evaluation result
Pentium M 760 (max 2.00 GHz, FSB 533 MHz)
  • Specified threshold: black dotted line
  • Performance is within the threshold in all cases
    except mgrid
  • 3–7% below the threshold
  • An accurate model is obtained
  • A Linux scheduler using this model has been
    developed

May 8, 2007
28
Summary
  • Challenges for high performance processors
  • Memory Wall and Power Wall
  • One solution to the memory wall
  • make good use of on-chip memory with software
    controllability
  • Solutions to the power wall
  • many cores will relax the problem, but
  • leakage current is becoming a big problem
  • new research/approaches are required
  • our project, Innovative Power Control for Ultra
    Low-Power and High-Performance System LSIs, was
    introduced
