1
Data Scale Applications and Architecture
  • Aqeel Mahesri
  • Center for Reliable and High Performance
    Computing
  • University of Illinois, Urbana-Champaign
  • mahesri@crhc.uiuc.edu

2
Outline
  • Introduction
  • Data scale applications and architecture
  • Memory system study
  • Proposed work
  • Related work
  • Conclusion

3
Previous Work
  • Data Scale Architecture
  • Aqeel Mahesri, Nicholas J. Wang, Sanjay J. Patel,
    Tradeoffs in Cache Design and Simultaneous
    Multithreading in Many-Core Architectures,
    submitted to International Conference on
    Supercomputing, July 2007
  • Control Decoupling and NXA
  • Aqeel Mahesri, Nicholas J. Wang, Sanjay J. Patel,
    Hardware Support for Software Controlled
    Multithreading, Workshop on Design, Architecture,
    and Simulation of Chip Multiprocessors, 39th
    International Symposium on Microarchitecture,
    December 2006.
  • Aqeel Mahesri, Sanjay J. Patel, Exploiting
    Parallelism Between Control and Data Computation,
    University of Illinois Technical Report,
    UILU-ENG-05-2214, September 2005.
  • Aqeel Mahesri, Exploiting Control/Data
    Parallelism, M.S. thesis, May 2004.
  • Robust Architecture
  • Nicholas J. Wang, Aqeel Mahesri, and Sanjay J.
    Patel, Examining ACE Analysis Reliability
    Estimates Using Fault Injection, 34th
    International Symposium on Computer Architecture,
    June 2007
  • Power Consumption
  • Aqeel Mahesri and Vibhore Vardhan, Power
    Consumption Breakdown on a Modern Laptop,
    Workshop on Power Aware Computing Systems, 37th
    International Symposium on Microarchitecture,
    December 2004.
  • Dynamic Optimization
  • Brian Fahs, Aqeel Mahesri, Francesco Spadini,
    Sanjay J. Patel, and Steven S. Lumetta, The
    Performance Potential of Trace-based Dynamic
    Optimization, University of Illinois Technical
    Report, UILU-ENG-04-2208, November 2004.

4
Introduction - Data Scale Architecture
  • Motivation for the project
  • architecture shift from single-thread performance
    to parallel performance
  • software shift from sequential apps to parallel
    apps
  • envision a future where trend toward parallelism
    continues
  • Goals of the project
  • select and analyze data scale applications
  • optimize parallel architecture for data scale
    applications
  • evaluate how architecture should evolve as it
    scales further

5
Motivation: Uniprocessor Era
  • single-thread performance is king
  • ever larger, faster uniprocessors
  • exponential performance growth
  • but hitting limits
  • interconnect delays
  • power
  • limited ILP of sequential workloads
  • performance growth of uniprocessors is slowing
    down
  • (chart taken from Mark Horowitz)

6
Motivation: Multicore Era
  • single-thread and parallel performance compete
  • uniprocessors grow slowly
  • but increasing number of cores on chip
  • most applications still sequential
  • slow performance growth for individual apps
  • performance growth for running multiple
    applications or for throughput applications

7
Motivation: Data Scale Era
  • parallel performance is king
  • scaling number of cores rather than performance
    of each core
  • continues to provide exponential performance
    growth
  • BUT the performance growth comes from increasing
    parallelism
  • performance growth for data scale applications

8
Motivation: Emerging Parallel Workloads
  • emerging uses of computers
  • what are people going to be doing with computers
    in 10 years?
  • real-time computer vision, AI, speech and image
    recognition
  • visualization, simulation
  • RMS (Recognition, Mining, Synthesis) applications
  • graphics APIs
  • offers massive parallelism
  • sometimes sequential application tasks can be
    done in parallel
  • compilers

9
Outline
  • Introduction
  • Data scale applications and architecture
  • Memory system study
  • Proposed work
  • Related work
  • Conclusion

10
Parallelism of Workloads
  • An n-core architecture makes sense when the
    available parallelism p > n.
  • But the number of cores n is scaling
    exponentially over time
  • need applications where
  • the required throughput scales over time
  • the available parallelism p scales over time
  • To maintain machine utilization (formalized
    below)
  • available parallelism p in the parallel part must
    grow at least as fast as n
  • sequential portion must grow no faster than the
    performance of one core
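One way to make the utilization requirement precise (the symbols U, p, n, f, and S are our notation, not from the talk): with p parallel tasks available on n cores, utilization is the fraction of cores kept busy, and Amdahl's law bounds the speedup through the sequential fraction.

```latex
U(n, p) = \frac{\min(p, n)}{n}, \qquad
S(n) = \frac{1}{(1 - f) + f/n} \quad \text{(Amdahl's law, parallel fraction } f)
```

Since n grows exponentially, keeping U = 1 requires p to grow at least as fast as n, and keeping S(n) growing requires the sequential work (1 - f) not to grow faster than single-core performance.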

11
Data Scale Applications
  • A data scale application is one where both the
    complexity and the parallelism scale over time
  • Definition (formalized below)
  • Application can be parallelized
  • The achievable parallelism grows with the input
    data set
  • The data set, and hence compute time, grows
    exponentially over time at a rate fast enough to
    require taking advantage of additional
    parallelism
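A compact restatement of the definition (the symbols p, D, and tau are our notation):

```latex
p = p(|D|) \text{ increasing in } |D|, \qquad
|D(t)| \approx |D(0)| \cdot 2^{t/\tau}
```

That is, the achievable parallelism p grows with the input data set size |D|, and the data set itself grows exponentially with some doubling period tau, fast enough that the additional parallelism must actually be exploited.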

12
Workloads with data scale properties
  • 172.mgrid from SPECfp
  • parallelized as part of SPEC OMP
  • ILP study
  • shows parallelism is available
  • grows linearly with input
  • mgrid is a scientific application
  • multi-grid potential field solver
  • domain where we want to solve ever larger problems

13
Workloads with data scale properties
  • 173.applu from SPECfp
  • parallelized as part of SPEC OMP
  • ILP study
  • shows parallelism grows with cube of input
  • applu solves computational fluid dynamics
  • again an application domain where we want to
    solve larger problems

14
What else might be data scale?
  • Visualization
  • Raster-based graphics, ray tracing, global
    illumination, shadow volumes, dynamic texturing
  • Video processing
  • high-definition encoding, transcoding, video
    effects
  • Financial analytics
  • options pricing, ticker stream analysis
  • Physical simulation
  • real-time fluid simulation, rigid bodies,
    mesh-based simulation, facial simulation
  • Artificial intelligence
  • real-time AI, multiple intelligent agents,
    physically aware AI
  • Real-time computer vision
  • for robotics, autonomous cars, facial recognition
  • lots more deep in the bowels of the CS department

15
Architecture for Data Scale Workloads
  • Parallelism in the workload is assumed
  • single thread performance not the focus
  • performance can be increased arbitrarily by
    adding more parallelism in HW
  • hence performance must be measured against
    constraints
  • What should we optimize? (definitions below)
  • performance/area
  • maximize performance given maximum area
  • performance/watt
  • maximize performance given maximum power supply
    or cooling
  • performance/joule
  • minimize energy-delay product for low power
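The three metrics can be written out explicitly (standard definitions; T, A, P, E, and t are our symbols for throughput, area, power, energy, and execution time):

```latex
\frac{\text{perf}}{\text{area}} = \frac{T}{A}, \qquad
\frac{\text{perf}}{\text{watt}} = \frac{T}{P}, \qquad
\text{EDP} = E \cdot t = P \cdot t^2
```

Minimizing the energy-delay product EDP corresponds to the perf/joule goal above: it penalizes designs that save energy only by running much slower.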

16
Architecture for Data Scale Workloads
  • How should we optimize an architecture for data
    scale workloads?
  • core design
  • ISA design
  • Out-of-order vs in-order
  • Issue width
  • SIMD vs scalar
  • Is multithreading worth it?
  • memory system
  • What to do about memory bandwidth
  • Frequency scaling, energy effects, design time,
    architectural scaling

17
Outline
  • Introduction
  • Data scale applications and architecture
  • Memory system study
  • Proposed work
  • Related work
  • Conclusion

18
Memory Latency Problem
  • uniprocessors
  • huge performance bottleneck
  • latency steady as clock rate increases - the
    memory gap
  • long latency memory access can stall machine
  • data scale applications
  • provide a way around memory latency
  • lots of threads can keep running while a long
    latency mem op completes
  • data scale architectures
  • how much chip area to devote to countering memory
    latency?

19
Cache
  • in uniprocessors
  • primary technique for overcoming memory latency
  • cache miss can stall entire machine
  • hierarchies of caches attempt to store entire
    working set
  • large fraction of chip area
  • in data scale architectures
  • cache miss only stalls a single core
  • small fraction of the machine

20
Simultaneous Multithreading
  • in uniprocessors
  • keeps machine running despite cache miss
  • requires small number of threads
  • area cost
  • in data scale architectures
  • keeps core running despite cache miss
  • but lets you put fewer cores on chip

21
CMP Architecture
[Diagram: cores P0 through PN, each with a private L2 cache, all sharing an L3 cache connected to main memory]
22
Methodology - Workload
  • want apps that look like targeted data scale
    workloads
  • want apps with sufficient parallelism to occupy
    all cores
  • use SPECfp and MediaBench apps
  • parallelize loops using perfect information on
    loop-carried dependences (see the sketch after
    this list)
  • from the definition of data scale, we don't want
    constraints from single-thread performance
  • generate performance numbers looking only at the
    parallel portions
  • does not necessarily reflect the parallelization
    from a compiler or programmer
  • but it doesn't matter because data scale apps are
    easy to parallelize
  • in fact a programmer can probably do a better job
  • does accurately represent resource usage for
    those apps
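A minimal sketch of the kind of loop the methodology targets, written with an OpenMP-style annotation (SPEC OMP uses OpenMP, but the study applied perfect dependence information in simulation rather than compiler-generated threading, so this is illustrative only):

```c
/* Each iteration writes a distinct a[i] and reads only b[i], so there is
   no loop-carried dependence and iterations can run on different cores.
   Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp. */
void scale_array(double *a, const double *b, double k, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = k * b[i];
}
```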

23
Methodology - Performance
  • use simulation to measure throughput
  • simple, fast simulation of each core
  • fixed core architecture
  • 8-stage, 2-wide, in-order pipeline
  • 2.4 GHz clock speed
  • cache design
  • vary L2 (per core) cache
  • 8 kB to 2 MB per core
  • vary L3 (shared) cache
  • 8 kB × core count to 512 kB × core count
  • latency based on cache latencies of Intel P4 and
    IBM POWER4
  • roughly proportional to square root of cache size
    (see the model sketch below)
  • 0.45 ns to 7.1 ns for the L2
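The square-root latency model is consistent with the two endpoints quoted above; a one-line version (the interpolation itself is our assumption, anchored at the slide's 0.45 ns figure for the smallest L2):

```c
#include <math.h>

/* L2 latency modeled as proportional to sqrt(cache size), anchored at
   0.45 ns for 8 kB. This reproduces the other quoted endpoint:
   0.45 * sqrt(2048 / 8) = 7.2 ns, close to the 7.1 ns given for 2 MB. */
double l2_latency_ns(double size_kB)
{
    return 0.45 * sqrt(size_kB / 8.0);
}
```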

24
Methodology - Area
  • chip area = core area + cache area (see the
    sketch below)
  • assume 90nm TSMC process
  • core area
  • area of Alpha 21164 scaled from 0.35µm to 90nm
    process
  • 13.4 mm2
  • cache area
  • taken from SRAM area data provided by AGEIA
  • 0.34 mm2 to 23.754 mm2 for each L2
  • SMT area
  • 20% increase in 13.4 mm2 core area
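Putting the slide's numbers together, a sketch of the area accounting (the 13.4 mm2 core and 20% SMT overhead are the slide's figures; ignoring L3 and interconnect area is our simplification):

```c
#include <math.h>

/* Core area in mm^2, with the optional 20% SMT overhead. */
static double core_area_mm2(int smt)
{
    return smt ? 13.4 * 1.20 : 13.4;
}

/* How many (core + private L2) tiles fit in a given area budget. */
int max_cores(double budget_mm2, double l2_mm2_per_core, int smt)
{
    return (int)floor(budget_mm2 / (core_area_mm2(smt) + l2_mm2_per_core));
}

/* e.g. max_cores(400.0, 0.34, 0): cores fitting the 400 mm^2 budget
   with the smallest (0.34 mm^2) per-core L2 from the slide's range. */
```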

25
Cache Area vs. Performance
  • Area budget of 400 mm2 in 90nm process
  • More cores better than more cache
  • especially with SMT

26
Core Count vs. Performance
  • Devote less area for each core

27
Optimize With Process Scaling
  • available transistors grow with each process
    generation
  • model as an increasing area budget (worked
    example below)
  • assume perfect scaling

[Chart: performance vs. area budget at 90nm, 65nm, and 45nm process nodes]
  • Given enough threads, we can achieve nearly
    linear performance growth
  • SMT performance falls behind for larger area
    budgets
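Under perfect scaling, each process shrink is equivalent to a larger area budget at the original feature size; for the 400 mm2 budget used earlier:

```latex
A_{\text{eff}}(65\,\text{nm}) = 400 \times \left(\tfrac{90}{65}\right)^2 \approx 767\ \text{mm}^2,
\qquad
A_{\text{eff}}(45\,\text{nm}) = 400 \times \left(\tfrac{90}{45}\right)^2 = 1600\ \text{mm}^2
```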

28
Scaling Core Count with Process Scaling
  • How did we get that speedup with increasing
    transistor budget?

[Chart: core count vs. area budget at 90nm, 65nm, and 45nm process nodes]
  • Answer: adding more cores

29
Memory System Summary
  • evaluated 2 techniques for countering memory
    latency
  • cache
  • SMT
  • found cores are a better use of area than
    additional cache
  • especially if cores are multithreaded
  • found cores are a better use of area than SMT
  • especially for large area budgets and later
    process nodes
  • main point
  • a highly parallel workload favors more execution
    resources over countering memory latency

30
Outline
  • Introduction
  • Data scale applications and architecture
  • Memory system study
  • Proposed work
  • Related work
  • Conclusion

31
Overview
  • suite of data scale applications
  • modeling CMP architecture
  • hardware design studies

32
Data Scale Benchmarks
  • no standard benchmark suite for many-core
    architectures
  • want to create a benchmark suite for this project
  • data scale applications
  • small enough to perform large state space
    exploration
  • representative of important future apps
  • candidates
  • SPEC OMP benchmarks
  • physics simulation - Open Dynamics Engine
  • ray tracing
  • options pricing

33
Area Model
  • current model
  • area = core area + cache area
  • fixed core design and size
  • cache area based on data and varies with size
  • proposed model (formalized below)
  • area = core area + cache area + interconnect area
  • cache area stays the same
  • core area is a map of core parameters to area
  • add up area of functional units, pipe latches,
    control logic, etc.
  • validate against real designs: Alpha 21064,
    21164, 21264, 21464
  • interconnect area maps core count and area, link
    bandwidth, buffer sizes, and network topology to
    area
  • Kumar, Zyuban, Tullsen, ISCA 2005
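One way to write the proposed model down (the symbols theta, s, b, and T are introduced here for illustration):

```latex
A_{\text{chip}} = n \cdot A_{\text{core}}(\theta) + A_{\text{cache}}(s) + A_{\text{net}}(n, b, T)
```

where theta is the vector of core parameters (functional units, pipe latches, control logic), s the cache size, b the link bandwidth, and T the network topology, following Kumar, Zyuban, and Tullsen (ISCA 2005) for the interconnect term.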

34
Power Model
  • power and energy consumption are additional
    metrics
  • perf/watt
  • maximize performance for a fixed power budget
  • perf/joule
  • minimize energy-delay product due to limited
    energy supply
  • dynamic power model
  • numerous published models
  • Wattch
  • SimplePower
  • adapt for use in our studies

36
Programming Model Support
  • proposals to add HW to make parallel programming
    easier
  • hardware transactional memory
  • proposals to remove HW to improve perf at expense
    of programmer
  • Cell
  • evaluate possible HW support for parallel
    programming
  • HW support for data communication
  • HW support for thread management
  • metrics are perf/area and perf/power
  • complete picture would consider perf/software
    cost
  • beyond scope

37
Hardware Supported Data Communication
  • proposals range from fully SW-managed
    communication to fully HW-managed
  • SW communication imposes SW overhead
  • less HW overhead
  • HW communication requires HW structures
  • eliminates SW overhead
  • measure performance benefit of reduced
    communication overhead
  • . . . vs. cost of extra HW

38
Hardware Thread Management
  • overhead from thread creation and scheduling
  • some massively parallel architectures manage
    threads in HW
  • GPUs: NVIDIA G80 and ATI R5xx series
  • eliminates OS calls for creation and scheduling
    of threads
  • requires HW structure
  • measure performance benefit of less scheduling
    overhead
  • . . . vs. cost of extra HW

39
Outline
  • Introduction
  • Data scale applications and architecture
  • Memory system study
  • Proposed work
  • Related work
  • Conclusion

40
Related Work
  • workloads for CMPs
  • Intel-academic venture to create suite of RMS
    applications (recognition, mining, synthesis)
  • P. Dubey, Recognition, Mining, and Synthesis
    Moves Computers to the Era of Tera
  • similar apps as in our effort
  • suite is not publicly available
  • GPGPU research
  • see Owens et al., A Survey of General-Purpose
    Computation on Graphics Hardware, Computer
    Graphics Forum 2007
  • Fourier transform, dynamics simulation
  • 13 dwarfs
  • Asanovic et al., The Landscape of Parallel
    Computing Research: A View from Berkeley
  • 13 basic algorithms important for future
    performance
  • most are highly parallel
  • not full applications

41
Related Work
  • CMP optimization studies
  • on-chip network studies
  • Balfour and Dally, ICS 2006
  • synthetic workload, various topologies
  • Kumar, Zyuban, Tullsen, ISCA 2005
  • shared bus vs. peer links vs. crossbar
  • core complexity studies
  • Huh, Burger, Keckler, PACT 2001
  • copies of sequential workloads, found preference
    for higher complexity cores
  • Li et al., HPCA 2006
  • copies of sequential workloads, vary pipeline
    with fixed area, power budgets
  • Monchiero, Canal, Gonzalez, ICS 2006
  • small scale shared memory workloads, performance,
    area, and power
  • cache design studies
  • Hsu et al., CAN April 2005
  • server workloads, find shared caches provide
    substantial area savings
  • generally use n copies of sequential apps, or
    server benchmarks
  • still looking at sequential application
    performance/throughput
  • leads to a very different design point

42
Conclusion
  • Microprocessor architecture scaling is changing
  • from scaling single thread performance to scaling
    parallel performance
  • Workloads are changing
  • from sequential workloads to massively parallel
    workloads
  • The rise of data scale workloads
  • size of dataset, required throughput, achievable
    parallelism all grow over time
  • workloads suited for core count scaling
  • Architectures for data scale workloads
  • found additional execution resources a better use
    of area than hiding memory latency
  • will be considering core complexity vs. core
    count, inter-core communication system, hardware
    support for parallel programming

43
Backup
44
Core Count vs. Performance
  • Devote less area for each core

45
Memory System Revisited
  • re-examine previous results with constrained
    memory bandwidth
  • re-examine previous results in context of power
  • cache eases bandwidth usage
  • cache uses less power/area than cores
  • if chip is power constrained
  • limits core count
  • use cache to fill up area budget
  • SMT costs more power per unit performance than
    the baseline when cores idle less
  • adding cache due to power constraint should make
    SMT less desirable

46
Core Complexity
  • dynamic scheduling
  • large performance benefit for uniprocessor
    workloads
  • allows execution to continue past long-latency
    operations
  • finds ILP within thread
  • benefits unclear for data scale applications
  • costs a large area overhead, roughly 2X
  • will mean fewer cores on chip
  • less raw execution bandwidth

47
Core Complexity
  • pipeline depth
  • deeper pipeline provides higher clock speed
  • increases execution bandwidth per core
  • costs power, area for pipeline latches and bypass
    networks
  • pipeline width
  • sequential apps favor narrow pipelines
  • data scale apps have lots of parallelism
  • may favor wider execution per core
  • or may favor more cores

48
Interconnection: Cache Coherence
  • multicore roadmaps feature cache coherent shared
    memory
  • with cache coherence
  • allows caching of writable shared memory
    locations
  • without cache coherence
  • writable shared memory cannot be cached
  • all reads and writes must go to shared higher
    level caches or memory
  • increases memory latency
  • measure perf/area and perf/watt effect of cache
    coherence

49
Interconnection Network
  • data scale application threads may be independent
  • e.g. graphics
  • don't need much interconnection
  • data scale application threads may not be
    independent
  • e.g. physics
  • evaluate perf/area and perf/power
  • dense vs. sparse networks
  • high vs. low bandwidth links

50
Global Optimization
  • four previous design studies provide broad
    exploration of design space
  • also want to examine interaction between
    different parameters
  • unified optimization study
  • find optimal overall design
  • scaling study
  • find optimal design points for different area
    budgets
  • examine how tradeoffs change as architectures
    scale over next decade

51
Chronological Ordering of Projects
  • planned order of proposed work
  • initial data scale suite
  • core area modeling
  • core complexity study
  • final data scale suite
  • interconnect study
  • programming model study
  • power modeling
  • global optimization study

52
NXA
  • conceptual architecture
  • 2 cores
  • connected by spawn queue
  • allows P0 to spawn work to P1 with low overhead
  • communication network
  • ensures P0 and P1 see well-defined architectural
    state
  • automatically communicates shared data

53
NXA Decoupling Approach
  • master/worker approach (sketched in code below)
  • main thread runs on P0
  • master thread
  • spawns off work threads to P1
  • unidirectional flow of dependences
  • allows P1 to run far behind P0
  • a reverse dependence forces P1 and P0 to
    re-synchronize
  • critical thread on P0
  • contains control instructions, miss-prone memory
    accesses, and the dataflow dependence spine
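A software sketch of the master/worker decoupling just described (spawn_push, spawn_pop, and next_work are hypothetical stand-ins; in NXA the spawn queue is a hardware mechanism, not a software API):

```c
typedef struct {
    void (*fn)(void *);   /* data computation to run on P1 */
    void *arg;
} work_item;

work_item next_work(void);     /* hypothetical: master carves out work */
void spawn_push(work_item w);  /* hypothetical: low-overhead spawn to P1 */
work_item spawn_pop(void);     /* hypothetical: blocks until work arrives */

/* P0: critical thread - control instructions, miss-prone memory
   accesses, dataflow dependence spine. Dependences flow one way, into
   the queue, so P1 may run far behind P0. */
void master(void)
{
    for (;;)
        spawn_push(next_work());
}

/* P1: worker - executes the spawned data computation. A reverse
   dependence (P0 needing a P1 result) forces re-synchronization. */
void worker(void)
{
    for (;;) {
        work_item w = spawn_pop();
        w.fn(w.arg);
    }
}
```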

54
NXA Microarchitecture
55
Performance
  • average control decoupling speedup: 1.16x
  • average memory decoupling speedup: 1.14x
  • average critical path decoupling speedup: 1.15x
  • choosing the best decoupling scheme for each
    program, average speedup: 1.20x

56
Multicore NXA
57
Multicore NXA