1
Data Scale Applications and Architecture
  • Aqeel Mahesri
  • Center for Reliable and High Performance
    Computing
  • University of Illinois, Urbana-Champaign
  • mahesri@crhc.uiuc.edu

2
Outline
  • Introduction
  • Data scale applications and architecture
  • Memory system study
  • Proposed work
  • Related work
  • Conclusion

3
Previous Work
  • Data Scale Architecture
  • Aqeel Mahesri, Nicholas J. Wang, Sanjay J. Patel,
    Tradeoffs in Cache Design and Simultaneous
    Multithreading in Many-Core Architectures,
    submitted to International Conference on
    Supercomputing, July 2007
  • Control Decoupling and NXA
  • Aqeel Mahesri, Nicholas J. Wang, Sanjay J. Patel,
    Hardware Support for Software Controlled
    Multithreading, Workshop on Design, Architecture,
    and Simulation of Chip Multiprocessors, 39th
    International Symposium on Microarchitecture,
    December 2006.
  • Aqeel Mahesri, Sanjay J. Patel, Exploiting
    Parallelism Between Control and Data Computation,
    University of Illinois Technical Report,
    UILU-ENG-05-2214, September 2005.
  • Aqeel Mahesri, Exploiting Control/Data
    Parallelism, M.S. thesis, May 2004.
  • Robust Architecture
  • Nicholas J. Wang, Aqeel Mahesri, and Sanjay J.
    Patel, Examining ACE Analysis Reliability
    Estimates Using Fault Injection, 34th
    International Symposium on Computer Architecture,
    June 2007
  • Power Consumption
  • Aqeel Mahesri and Vibhore Vardhan, Power
    Consumption Breakdown on a Modern Laptop,
    Workshop on Power Aware Computing Systems, 37th
    International Symposium on Microarchitecture,
    December 2004.
  • Dynamic Optimization
  • Brian Fahs, Aqeel Mahesri, Francesco Spadini,
    Sanjay J. Patel, and Steven S. Lumetta, The
    Performance Potential of Trace-based Dynamic
    Optimization, University of Illinois Technical
    Report, UILU-ENG-04-2208, November 2004.

4
Introduction - Data Scale Architecture
  • Motivation for the project
  • architecture shift from single-thread performance
    to parallel performance
  • software shift from sequential apps to parallel
    apps
  • envision a future where trend toward parallelism
    continues
  • Goals of the project
  • select and analyze data scale applications
  • optimize parallel architecture for data scale
    applications
  • evaluate how architecture should evolve as it
    scales further

5
Motivation: Uniprocessor Era
  • single-thread performance is king
  • ever larger, faster uniprocessors
  • exponential performance growth
  • but hitting limits
  • interconnect delays
  • power
  • limited ILP of sequential workloads
  • performance growth of uniprocessors is slowing
    down
  • (chart taken from Mark Horowitz)

6
Motivation: Multicore Era
  • single-thread and parallel performance compete
  • uniprocessors grow slowly
  • but increasing number of cores on chip
  • most applications still sequential
  • slow performance growth for individual apps
  • performance growth for running multiple
    applications or for throughput applications

7
Motivation: Data Scale Era
  • parallel performance is king
  • scaling number of cores rather than performance
    of each core
  • continues to provide exponential performance
    growth
  • BUT the performance growth comes from increasing
    parallelism
  • performance growth for data scale applications

8
Motivation: Emerging Parallel Workloads
  • emerging uses of computers
  • what are people going to be doing with computers
    in 10 years?
  • real-time computer vision, AI, speech and image
    recognition
  • visualization, simulation
  • RMS (Recognition, Mining, Synthesis) applications
  • graphics APIs
  • offers massive parallelism
  • sometimes sequential application tasks can be
    done in parallel
  • compilers

9
Outline
  • Introduction
  • Data scale applications and architecture
  • Memory system study
  • Proposed work
  • Related work
  • Conclusion

10
Parallelism of Workloads
  • An n-core architecture makes sense when the
    available parallelism p > n.
  • But the number of cores n is scaling
    exponentially over time
  • need applications where
  • the required throughput scales over time
  • the available parallelism p scales over time
  • To maintain machine utilization (formalized
    below)
  • available parallelism p in the parallel part must
    grow at least as fast as n
  • sequential portion must grow no faster than the
    performance of one core
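One way to make the utilization requirement precise (the symbols U, p, n, f, and S are our notation, not from the talk): with p parallel tasks available on n cores, utilization is the fraction of cores kept busy, and Amdahl's law bounds the speedup through the sequential fraction.

```latex
U(n, p) = \frac{\min(p, n)}{n}, \qquad
S(n) = \frac{1}{(1 - f) + f/n} \quad \text{(Amdahl's law, parallel fraction } f)
```

Since n grows exponentially, keeping U = 1 requires p to grow at least as fast as n, and keeping S(n) growing requires the sequential work (1 - f) not to grow faster than single-core performance.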

11
Data Scale Applications
  • A data scale application is one where both the
    complexity and the parallelism scale over time
  • Definition (formalized below)
  • Application can be parallelized
  • The achievable parallelism grows with the input
    data set
  • The data set, and hence compute time, grows
    exponentially over time at a rate fast enough to
    require taking advantage of additional
    parallelism
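A compact restatement of the definition (the symbols p, D, and tau are our notation):

```latex
p = p(|D|) \text{ increasing in } |D|, \qquad
|D(t)| \approx |D(0)| \cdot 2^{t/\tau}
```

That is, the achievable parallelism p grows with the input data set size |D|, and the data set itself grows exponentially with some doubling period tau, fast enough that the additional parallelism must actually be exploited.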

12
Workloads with data scale properties
  • 172.mgrid from SPECfp
  • parallelized as part of SPEC OMP
  • ILP study
  • shows parallelism is available
  • grows linearly with input
  • mgrid is a scientific application
  • multi-grid potential field solver
  • domain where we want to solve ever larger problems

13
Workloads with data scale properties
  • 173.applu from SPECfp
  • parallelized as part of SPEC OMP
  • ILP study
  • shows parallelism grows with cube of input
  • applu solves computational fluid dynamics
  • again an application domain where we want to
    solve larger problems

14
What else might be data scale?
  • Visualization
  • Raster-based graphics, ray tracing, global
    illumination, shadow volumes, dynamic texturing
  • Video processing
  • high-definition encoding, transcoding, video
    effects
  • Financial analytics
  • options pricing, ticker stream analysis
  • Physical simulation
  • real-time fluid simulation, rigid bodies,
    mesh-based simulation, facial simulation
  • Artificial intelligence
  • real-time AI, multiple intelligent agents,
    physically aware AI
  • Real-time computer vision
  • for robotics, autonomous cars, facial recognition
  • lots more deep in the bowels of the CS department

15
Architecture for Data Scale Workloads
  • Parallelism in the workload is assumed
  • single thread performance not the focus
  • performance can be increased arbitrarily by
    adding more parallelism in HW
  • hence performance must be measured against
    constraints
  • What should we optimize? (definitions below)
  • performance/area
  • maximize performance given maximum area
  • performance/watt
  • maximize performance given maximum power supply
    or cooling
  • performance/joule
  • minimize energy-delay product for low power
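The three metrics can be written out explicitly (standard definitions; T, A, P, E, and t are our symbols for throughput, area, power, energy, and execution time):

```latex
\frac{\text{perf}}{\text{area}} = \frac{T}{A}, \qquad
\frac{\text{perf}}{\text{watt}} = \frac{T}{P}, \qquad
\text{EDP} = E \cdot t = P \cdot t^2
```

Minimizing the energy-delay product EDP corresponds to the perf/joule goal above: it penalizes designs that save energy only by running much slower.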

16
Architecture for Data Scale Workloads
  • How should we optimize an architecture for data
    scale workloads?
  • core design
  • ISA design
  • Out-of-order vs in-order
  • Issue width
  • SIMD vs scalar
  • Is multithreading worth it?
  • memory system
  • What to do about memory bandwidth
  • Frequency scaling, energy effects, design time,
    architectural scaling

17
Outline
  • Introduction
  • Data scale applications and architecture
  • Memory system study
  • Proposed work
  • Related work
  • Conclusion

18
Memory Latency Problem
  • uniprocessors
  • huge performance bottleneck
  • latency steady as clock rate increases - the
    memory gap
  • long latency memory access can stall machine
  • data scale applications
  • provide a way around memory latency
  • lots of threads can keep running while a long
    latency mem op completes
  • data scale architectures
  • how much chip area to devote to countering memory
    latency?

19
Cache
  • in uniprocessors
  • primary technique for overcoming memory latency
  • cache miss can stall entire machine
  • hierarchies of caches attempt to store entire
    working set
  • large fraction of chip area
  • in data scale architectures
  • cache miss only stalls a single core
  • small fraction of the machine

20
Simultaneous Multithreading
  • in uniprocessors
  • keeps machine running despite cache miss
  • requires small number of threads
  • area cost
  • in data scale architectures
  • keeps core running despite cache miss
  • but lets you put fewer cores on chip

21
CMP Architecture
[Diagram: cores P0 through PN, each with a private L2 cache, all sharing an L3 cache connected to main memory]
22
Methodology - Workload
  • want apps that look like targeted data scale
    workloads
  • want apps with sufficient parallelism to occupy
    all cores
  • use SPECfp and MediaBench apps
  • parallelize loops using perfect information on
    loop-carried dependences (see the sketch after
    this list)
  • from the definition of data scale, we don't want
    constraints from single-thread performance
  • generate performance numbers looking only at the
    parallel portions
  • does not necessarily reflect the parallelization
    from a compiler or programmer
  • but it doesn't matter because data scale apps are
    easy to parallelize
  • in fact a programmer can probably do a better job
  • does accurately represent resource usage for
    those apps
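A minimal sketch of the kind of loop the methodology targets, written with an OpenMP-style annotation (SPEC OMP uses OpenMP, but the study applied perfect dependence information in simulation rather than compiler-generated threading, so this is illustrative only):

```c
/* Each iteration writes a distinct a[i] and reads only b[i], so there is
   no loop-carried dependence and iterations can run on different cores.
   Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp. */
void scale_array(double *a, const double *b, double k, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = k * b[i];
}
```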

23
Methodology - Performance
  • use simulation to measure throughput
  • simple, fast simulation of each core
  • fixed core architecture
  • 8-stage, 2-wide, in-order pipeline
  • 2.4 GHz clock speed
  • cache design
  • vary L2 (per core) cache
  • 8 kB to 2 MB per core
  • vary L3 (shared) cache
  • 8 kB × core count to 512 kB × core count
  • latency based on cache latencies of Intel P4 and
    IBM POWER4
  • roughly proportional to square root of cache size
    (see the model sketch below)
  • 0.45 ns to 7.1 ns for the L2
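The square-root latency model is consistent with the two endpoints quoted above; a one-line version (the interpolation itself is our assumption, anchored at the slide's 0.45 ns figure for the smallest L2):

```c
#include <math.h>

/* L2 latency modeled as proportional to sqrt(cache size), anchored at
   0.45 ns for 8 kB. This reproduces the other quoted endpoint:
   0.45 * sqrt(2048 / 8) = 7.2 ns, close to the 7.1 ns given for 2 MB. */
double l2_latency_ns(double size_kB)
{
    return 0.45 * sqrt(size_kB / 8.0);
}
```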

24
Methodology - Area
  • chip area = core area + cache area (see the
    sketch below)
  • assume 90nm TSMC process
  • core area
  • area of Alpha 21164 scaled from 0.35µm to 90nm
    process
  • 13.4 mm2
  • cache area
  • taken from SRAM area data provided by AGEIA
  • 0.34 mm2 to 23.754 mm2 for each L2
  • SMT area
  • 20% increase in 13.4 mm2 core area
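Putting the slide's numbers together, a sketch of the area accounting (the 13.4 mm2 core and 20% SMT overhead are the slide's figures; ignoring L3 and interconnect area is our simplification):

```c
#include <math.h>

/* Core area in mm^2, with the optional 20% SMT overhead. */
static double core_area_mm2(int smt)
{
    return smt ? 13.4 * 1.20 : 13.4;
}

/* How many (core + private L2) tiles fit in a given area budget. */
int max_cores(double budget_mm2, double l2_mm2_per_core, int smt)
{
    return (int)floor(budget_mm2 / (core_area_mm2(smt) + l2_mm2_per_core));
}

/* e.g. max_cores(400.0, 0.34, 0): cores fitting the 400 mm^2 budget
   with the smallest (0.34 mm^2) per-core L2 from the slide's range. */
```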

25
Cache Area vs. Performance
  • Area budget of 400 mm2 in 90nm process
  • More cores better than more cache
  • especially with SMT

26
Core Count vs. Performance
  • Devote less area for each core

27
Optimize With Process Scaling
  • available transistors grow with each process
    generation
  • model as an increasing area budget (worked
    example below)
  • assume perfect scaling

[Chart: performance vs. area budget at 90nm, 65nm, and 45nm process nodes]
  • Given enough threads, we can achieve nearly
    linear performance growth
  • SMT performance falls behind for larger area
    budgets
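Under perfect scaling, each process shrink is equivalent to a larger area budget at the original feature size; for the 400 mm2 budget used earlier:

```latex
A_{\text{eff}}(65\,\text{nm}) = 400 \times \left(\tfrac{90}{65}\right)^2 \approx 767\ \text{mm}^2,
\qquad
A_{\text{eff}}(45\,\text{nm}) = 400 \times \left(\tfrac{90}{45}\right)^2 = 1600\ \text{mm}^2
```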

28
Scaling Core Count with Process Scaling
  • How did we get that speedup with increasing
    transistor budget?

[Chart: core count vs. area budget at 90nm, 65nm, and 45nm process nodes]
  • Answer: adding more cores

29
Memory System Summary
  • evaluated 2 techniques for countering memory
    latency
  • cache
  • SMT
  • found cores are a better use of area than
    additional cache
  • especially if cores are multithreaded
  • found cores are a better use of area than SMT
  • especially for large area budgets and later
    process nodes
  • main point
  • a highly parallel workload favors more execution
    resources over countering memory latency

30
Outline
  • Introduction
  • Data scale applications and architecture
  • Memory system study
  • Proposed work
  • Related work
  • Conclusion

31
Overview
  • suite of data scale applications
  • modeling CMP architecture
  • hardware design studies

32
Data Scale Benchmarks
  • no standard benchmark suite for many-core
    architectures
  • want to create a benchmark suite for this project
  • data scale applications
  • small enough to perform large state space
    exploration
  • representative of important future apps
  • candidates
  • SPEC OMP benchmarks
  • physics simulation - Open Dynamics Engine
  • ray tracing
  • options pricing

33
Area Model
  • current model
  • area = core area + cache area
  • fixed core design and size
  • cache area based on data and varies with size
  • proposed model (formalized below)
  • area = core area + cache area + interconnect area
  • cache area stays the same
  • core area is a map of core parameters to area
  • add up area of functional units, pipe latches,
    control logic, etc.
  • validate against real designs: Alpha 21064,
    21164, 21264, 21464
  • interconnect area maps core count and area, link
    bandwidth, buffer sizes, and network topology to
    area
  • Kumar, Zyuban, Tullsen, ISCA 2005
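One way to write the proposed model down (the symbols theta, s, b, and T are introduced here for illustration):

```latex
A_{\text{chip}} = n \cdot A_{\text{core}}(\theta) + A_{\text{cache}}(s) + A_{\text{net}}(n, b, T)
```

where theta is the vector of core parameters (functional units, pipe latches, control logic), s the cache size, b the link bandwidth, and T the network topology, following Kumar, Zyuban, and Tullsen (ISCA 2005) for the interconnect term.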

34
Power Model
  • power and energy consumption are additional
    metrics
  • perf/watt
  • maximize performance for a fixed power budget
  • perf/joule
  • minimize energy-delay product due to limited
    energy supply
  • dynamic power model
  • numerous published models
  • Wattch
  • SimplePower
  • adapt for use in our studies

36
Programming Model Support
  • proposals to add HW to make parallel programming
    easier
  • hardware transactional memory
  • proposals to remove HW to improve perf at expense
    of programmer
  • Cell
  • evaluate possible HW support for parallel
    programming
  • HW support for data communication
  • HW support for thread management
  • metrics are perf/area and perf/power
  • complete picture would consider perf/software
    cost
  • beyond scope

37
Hardware Supported Data Communication
  • proposals range from fully SW-managed
    communication to fully HW-managed
  • SW communication imposes SW overhead
  • less HW overhead
  • HW communication requires HW structures
  • eliminates SW overhead
  • measure performance benefit of reduced
    communication overhead
  • . . . vs. cost of extra HW

38
Hardware Thread Management
  • overhead from thread creation and scheduling
  • some massively parallel architectures manage
    threads in HW
  • GPUs: NVIDIA G80 and ATI R5xx series
  • eliminates OS calls for creation and scheduling
    of threads
  • requires HW structure
  • measure performance benefit of less scheduling
    overhead
  • . . . vs. cost of extra HW

39
Outline
  • Introduction
  • Data scale applications and architecture
  • Memory system study
  • Proposed work
  • Related work
  • Conclusion

40
Related Work
  • workloads for CMPs
  • Intel-academic venture to create suite of RMS
    applications (recognition, mining, synthesis)
  • P. Dubey, Recognition, Mining, and Synthesis
    Moves Computers to the Era of Tera
  • similar apps as in our effort
  • suite is not publicly available
  • GPGPU research
  • see Owens et al., A Survey of General-Purpose
    Computation on Graphics Hardware, Computer
    Graphics Forum 2007
  • Fourier transform, dynamics simulation
  • 13 dwarfs
  • Asanovic et al., The Landscape of Parallel
    Computing Research: A View from Berkeley
  • 13 basic algorithms important for future
    performance
  • most are highly parallel
  • not full applications

41
Related Work
  • CMP optimization studies
  • on-chip network studies
  • Balfour and Dally, ICS 2006
  • synthetic workload, various topologies
  • Kumar, Zyuban, Tullsen, ISCA 2005
  • shared bus vs. peer links vs. crossbar
  • core complexity studies
  • Huh, Burger, Keckler, PACT 2001
  • copies of sequential workloads, found preference
    for higher complexity cores
  • Li et al., HPCA 2006
  • copies of sequential workloads, vary pipeline
    with fixed area, power budgets
  • Monchiero, Canal, Gonzalez, ICS 2006
  • small scale shared memory workloads, performance,
    area, and power
  • cache design studies
  • Hsu et al., CAN April 2005
  • server workloads, find shared caches provide
    substantial area savings
  • generally use n copies of sequential apps, or
    server benchmarks
  • still looking at sequential application
    performance/throughput
  • leads to a very different design point

42
Conclusion
  • Microprocessor architecture scaling is changing
  • from scaling single thread performance to scaling
    parallel performance
  • Workloads are changing
  • from sequential workloads to massively parallel
    workloads
  • The rise of data scale workloads
  • size of dataset, required throughput, achievable
    parallelism all grow over time
  • workloads suited for core count scaling
  • Architectures for data scale workloads
  • found additional execution resources a better use
    of area than hiding memory latency
  • will be considering core complexity vs. core
    count, inter-core communication system, hardware
    support for parallel programming

43
Backup
44
Core Count vs. Performance
  • Devote less area for each core

45
Memory System Revisited
  • re-examine previous results with constrained
    memory bandwidth
  • re-examine previous results in context of power
  • cache eases bandwidth usage
  • cache uses less power/area than cores
  • if chip is power constrained
  • limits core count
  • use cache to fill up area budget
  • SMT costs more power per unit performance than
    the baseline when cores idle less
  • adding cache due to power constraint should make
    SMT less desirable

46
Core Complexity
  • dynamic scheduling
  • large performance benefit for uniprocessor
    workloads
  • allows execution to continue past long-latency
    operations
  • finds ILP within thread
  • benefits unclear for data scale applications
  • costs a large area overhead, roughly 2X
  • will mean fewer cores on chip
  • less raw execution bandwidth

47
Core Complexity
  • pipeline depth
  • deeper pipeline provides higher clock speed
  • increases execution bandwidth per core
  • costs power, area for pipeline latches and bypass
    networks
  • pipeline width
  • sequential apps favor narrow pipelines
  • data scale apps have lots of parallelism
  • may favor wider execution per core
  • or may favor more cores

48
Interconnection: Cache Coherence
  • multicore roadmaps feature cache coherent shared
    memory
  • with cache coherence
  • allows caching of writable shared memory
    locations
  • without cache coherence
  • writable shared memory cannot be cached
  • all reads and writes must go to shared higher
    level caches or memory
  • increases memory latency
  • measure perf/area and perf/watt effect of cache
    coherence

49
Interconnection Network
  • data scale application threads may be independent
  • e.g. graphics
  • don't need much interconnection
  • data scale application threads may not be
    independent
  • e.g. physics
  • evaluate perf/area and perf/power
  • dense vs. sparse networks
  • high vs. low bandwidth links

50
Global Optimization
  • four previous design studies provide broad
    exploration of design space
  • also want to examine interaction between
    different parameters
  • unified optimization study
  • find optimal overall design
  • scaling study
  • find optimal design points for different area
    budgets
  • examine how tradeoffs change as architectures
    scale over next decade

51
Chronological Ordering of Projects
  • planned order of proposed work
  • initial data scale suite
  • core area modeling
  • core complexity study
  • final data scale suite
  • interconnect study
  • programming model study
  • power modeling
  • global optimization study

52
NXA
  • conceptual architecture
  • 2 cores
  • connected by spawn queue
  • allows P0 to spawn work to P1 with low overhead
  • communication network
  • ensures P0 and P1 see well-defined architectural
    state
  • automatically communicates shared data

53
NXA Decoupling Approach
  • master/worker approach (sketched in code below)
  • main thread runs on P0
  • master thread
  • spawns off work threads to P1
  • unidirectional flow of dependences
  • allows P1 to run far behind P0
  • a reverse dependence forces P1 and P0 to
    re-synchronize
  • critical thread on P0
  • contains control instructions, miss-prone memory
    accesses, and the dataflow dependence spine
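A software sketch of the master/worker decoupling just described (spawn_push, spawn_pop, and next_work are hypothetical stand-ins; in NXA the spawn queue is a hardware mechanism, not a software API):

```c
typedef struct {
    void (*fn)(void *);   /* data computation to run on P1 */
    void *arg;
} work_item;

work_item next_work(void);     /* hypothetical: master carves out work */
void spawn_push(work_item w);  /* hypothetical: low-overhead spawn to P1 */
work_item spawn_pop(void);     /* hypothetical: blocks until work arrives */

/* P0: critical thread - control instructions, miss-prone memory
   accesses, dataflow dependence spine. Dependences flow one way, into
   the queue, so P1 may run far behind P0. */
void master(void)
{
    for (;;)
        spawn_push(next_work());
}

/* P1: worker - executes the spawned data computation. A reverse
   dependence (P0 needing a P1 result) forces re-synchronization. */
void worker(void)
{
    for (;;) {
        work_item w = spawn_pop();
        w.fn(w.arg);
    }
}
```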

54
NXA Microarchitecture
55
Performance
  • average control decoupling speedup: 1.16x
  • average memory decoupling speedup: 1.14x
  • average critical path decoupling speedup: 1.15x
  • choosing the best decoupling scheme for each
    program, average speedup: 1.20x

56
Multicore NXA
57
Multicore NXA