For thousand-core microprocessors - PowerPoint PPT Presentation

1
An IMplicitly PArallel Compiler Technology Based
on Phoenix
  • For thousand-core microprocessors
  • Wen-mei Hwu
  • with
  • Ryoo, Ueng, Rodrigues, Lathara, Kelm, Gelado,
    Stone, Yi, Kidd, Barghsorkhi, Mahesri, Tsao,
    Stratton, Navarro, Lumetta, Frank, Patel
  • University of Illinois, Urbana-Champaign

2
Background
  • Academic compiler research infrastructure is a
    tough business
  • IMPACT, Trimaran, and ORC for VLIW and Itanium
    processors
  • Polaris and SUIF for multiprocessors
  • LLVM for portability and safety
  • In 2001, the IMPACT team moved into many-core
    compilation with MARCO FCRC funding
  • A new implicitly parallel programming model that
    balances the burden between programmers and the
    compiler in parallel programming
  • Infrastructure work has slowed down
    ground-breaking work
  • Timely visit by the Phoenix team in January 2007
  • Rapid progress has since been taking place
  • Future IMPACT research will be built on Phoenix

3
The Next Software Challenge
Big picture
  • Today, multi-core chips make more effective use
    of area and power than large ILP CPUs
  • Scaling from 4-core to 1000-core chips could
    happen in the next 15 years
  • All semiconductor market domains converging to
    concurrent system platforms
  • PCs, game consoles, mobile handsets, servers,
    supercomputers, networking, etc.

We need to make these systems effectively
execute valuable, demanding apps.
4
The Compiler Challenge
Compilers and tools must extend the human's
ability to manage parallelism by doing the heavy
lifting.
  • To meet this challenge, the compiler must
  • Allow simple, effective control by programmers
  • Discover and verify parallelism
  • Eliminate tedious efforts in performance tuning
  • Reduce testing and support cost of parallel
    programs

5
An Initial Experimental Platform
  • A quiet revolution and potential build-up
  • Calculation: 450 GFLOPS (GPU) vs. 32 GFLOPS (CPU)
  • Memory bandwidth: 86.4 GB/s (GPU) vs. 8.4 GB/s
    (CPU)
  • Until last year, programmed only through a
    graphics API
  • A GPU in every PC and workstation: massive volume
    and potential impact

6
GeForce 8800
  • 16 highly threaded SMs, >128 FPUs, 450 GFLOPS,
    768 MB DRAM, 86.4 GB/s memory BW, 4 GB/s BW to CPU
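
The table on the next slide lists thousands of simultaneous threads per kernel. As a rough illustration of how such thread counts arise on this hardware, here is a minimal CUDA sketch; the saxpy kernel, array sizes, and launch configuration are illustrative assumptions, not code from the presentation.

// Minimal CUDA sketch: each thread computes one output element, so a
// single launch puts thousands of threads in flight at once.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the final partial block
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    saxpy<<<blocks, threads>>>(n, 2.0f, dx, dy);
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f\n", hy[0]);              // expect 4.0
    cudaFree(dx); cudaFree(dy);
    delete[] hx; delete[] hy;
    return 0;
}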

7
Some Hand-coded Results
App     | Architectural Bottleneck                       | Simult. Threads | Kernel Speedup | App Speedup
H.264   | Registers, global memory latency               | 3,936           | 20.2x          | 1.5x
LBM     | Shared memory capacity                         | 3,200           | 12.5x          | 12.3x
RC5-72  | Registers                                      | 3,072           | 17.1x          | 11.0x
FEM     | Global memory bandwidth                        | 4,096           | 11.0x          | 10.1x
RPES    | Instruction issue rate                         | 4,096           | 210.0x         | 79.4x
PNS     | Global memory capacity                         | 2,048           | 24.0x          | 23.7x
LINPACK | Global memory bandwidth, CPU-GPU data transfer | 12,288          | 19.4x          | 11.8x
TRACF   | Shared memory capacity                         | 4,096           | 60.2x          | 21.6x
FDTD    | Global memory bandwidth                        | 1,365           | 10.5x          | 1.2x
MRI-Q   | Instruction issue rate                         | 8,192           | 457.0x         | 431.0x
(HKR, HotChips 2007)
8
Computing Q Performance
  • GPU (V8): 96 GFLOPS
  • CPU (V6): 230 MFLOPS
  • Speedup: 446x
9
Lessons Learned
  • Parallelism extraction requires global
    understanding
  • Most programmers only understand parts of an
    application
  • Algorithms need to be re-designed
  • Programmers benefit from a clear view of the
    algorithmic effect on parallelism
  • Real but rare dependencies often need to be
    ignored (see the sketch after this list)
  • Error checking code, etc.; the parallel code is
    often not equivalent to the sequential code
  • Getting more than a small speedup over sequential
    code is very tricky
  • Typically around 20 versions were tried for each
    application to move away from architecture
    bottlenecks
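
As a hedged illustration of the "real but rare dependencies" lesson above (not from the slides), the sketch below shows an otherwise data-parallel loop whose error-reporting path updates shared state; that update is the only obstacle to parallelization, so it must be speculated around or explicitly ignored. All names are invented for illustration.

// Illustrative only: the shared error counter is the rare cross-iteration
// dependence; the squaring work itself is fully independent.
#include <cstdio>
#include <vector>

static int g_error_count = 0;              // shared state: the rare dependence

float process(float v) {
    if (v < 0.0f) {                        // rare error path
        ++g_error_count;                   // write to shared state
        return 0.0f;
    }
    return v * v;                          // common, independent work
}

int main() {
    std::vector<float> data(1024, 2.0f);
    data[100] = -1.0f;                     // a single rare error
    for (size_t i = 0; i < data.size(); ++i)   // parallel except for the counter
        data[i] = process(data[i]);
    printf("errors: %d, data[0] = %f\n", g_error_count, data[0]);
    return 0;
}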

10
Implicitly Parallel Programming Flow
  • Stylized C/C++ or DSL with assertions (a
    hypothetical assertion sketch follows below)
  • Deep analysis with feedback assistance (human in
    the loop): concurrency discovery
  • Visualizable concurrent form (for increased
    composability)
  • Systematic search for best/correct code gen:
    code-gen space exploration
  • Visualizable sequential assembly code with
    parallel annotations (for increased scalability)
  • Parallel HW with sequential state gen: parallel
    execution with sequential semantics (for
    increased supportability)
  • Debugger
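
As a hypothetical sketch of the "stylized C/C++ with assertions" entry point, the code below asserts that two buffers do not overlap so that concurrency discovery may treat the loop iterations as independent. The ASSERT_NO_ALIAS macro is an invented stand-in, not the actual IMPACT or Phoenix annotation syntax.

// Hypothetical programmer assertion: the macro name and form are assumptions.
#include <cassert>

// Asserts that [a, a+n) and [b, b+n) do not overlap; a parallelizing
// compiler could use the same fact to prove loop iterations independent.
#define ASSERT_NO_ALIAS(a, b, n)                              \
    assert((const char *)((a) + (n)) <= (const char *)(b) ||  \
           (const char *)((b) + (n)) <= (const char *)(a))

void scale(float *dst, const float *src, int n, float k) {
    ASSERT_NO_ALIAS(dst, src, n);      // checked at run time, usable as a hint
    for (int i = 0; i < n; ++i)        // parallelizable given the assertion
        dst[i] = k * src[i];
}

int main() {
    float src[8] = {1, 2, 3, 4, 5, 6, 7, 8}, dst[8];
    scale(dst, src, 8, 2.0f);          // buffers are distinct, assertion holds
    return 0;
}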
11
Key Ideas
  • Deep program analyses that extend programmer and
    DSE knowledge for parallelism discovery
  • Key to reduced programmer parallelization efforts
  • Exclusion of infrequent but real dependences
    using HW STU (Speculative Threading with Undo)
    support (a conceptual sketch follows this list)
  • Key to successful parallelization of many real
    applications
  • Rich program information maintained in IR for
    access by tools and HW
  • Key to integrating multiple programming models
    and tools
  • Intuitive, visual presentation to programmers
  • Key to good programmer understanding of algorithm
    effects
  • Managed parallel execution arrangement search
    space
  • Key to reduced programmer performance tuning
    efforts
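
As a conceptual software analogy for the STU idea above (an assumption about the general technique, not the hardware mechanism from the talk), the sketch below runs updates speculatively while recording old values in an undo log, then rolls back if a rare dependence is detected.

// Conceptual undo-log sketch; illustrative only.
#include <cstdio>
#include <utility>
#include <vector>

struct UndoLog {
    std::vector<std::pair<int *, int>> entries;   // (address, old value)
    void record(int *p) { entries.push_back({p, *p}); }
    void rollback() {                             // restore in reverse order
        for (auto it = entries.rbegin(); it != entries.rend(); ++it)
            *it->first = it->second;
        entries.clear();
    }
    void commit() { entries.clear(); }            // speculation succeeded
};

int main() {
    int data[4] = {1, 2, 3, 4};
    UndoLog log;
    bool conflict = true;                         // pretend a rare dependence fired

    for (int i = 0; i < 4; ++i) {                 // speculative updates
        log.record(&data[i]);
        data[i] *= 10;
    }
    if (conflict) log.rollback(); else log.commit();

    printf("data[0] = %d (restored after rollback)\n", data[0]);
    return 0;
}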

12
Parallelism in Algorithms (H.263 motion
estimation example)
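
As a hedged sketch of where the parallelism lies in motion estimation (not the presentation's code), the block-matching routine below computes an independent best motion vector per macroblock, so the two outer macroblock loops can run concurrently.

// Illustrative block-matching motion estimation; parameters are assumptions.
#include <climits>
#include <cstdio>
#include <cstdlib>
#include <vector>

const int MB = 16;       // macroblock size in pixels
const int RANGE = 8;     // motion search range in pixels

// Sum of absolute differences between the current macroblock at (x, y)
// and the reference macroblock displaced by (dx, dy).
int sad(const unsigned char *cur, const unsigned char *ref,
        int w, int x, int y, int dx, int dy) {
    int s = 0;
    for (int j = 0; j < MB; ++j)
        for (int i = 0; i < MB; ++i)
            s += abs(cur[(y + j) * w + (x + i)] -
                     ref[(y + dy + j) * w + (x + dx + i)]);
    return s;
}

// Each macroblock's search is independent: the two outer loops expose
// the macroblock-level parallelism.
void motion_estimate(const unsigned char *cur, const unsigned char *ref,
                     int w, int h, int *mvx, int *mvy) {
    int mb = 0;
    for (int y = RANGE; y + MB + RANGE <= h; y += MB)
        for (int x = RANGE; x + MB + RANGE <= w; x += MB, ++mb) {
            int best = INT_MAX;
            for (int dy = -RANGE; dy <= RANGE; ++dy)
                for (int dx = -RANGE; dx <= RANGE; ++dx) {
                    int s = sad(cur, ref, w, x, y, dx, dy);
                    if (s < best) { best = s; mvx[mb] = dx; mvy[mb] = dy; }
                }
        }
}

int main() {
    const int W = 64, H = 48;
    std::vector<unsigned char> cur(W * H, 128), ref(W * H, 128);
    std::vector<int> mvx(64), mvy(64);
    motion_estimate(cur.data(), ref.data(), W, H, mvx.data(), mvy.data());
    printf("first macroblock motion vector: (%d, %d)\n", mvx[0], mvy[0]);
    return 0;
}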
13
MPEG-4 H.263 Encoder Parallelism Rediscovery
(Figure with panels (a) through (e) not reproduced)
14
Code Gen Space Exploration
15
Moving an Accurate Interprocedural Analysis into
Phoenix
(Figure: unification-based analysis vs. Fulcra)
16
Getting Started with Phoenix
  • Meetings with Phoenix team in January 2007
  • Determined the set of Phoenix API routines
    necessary to support IMPACT analyses and
    transformations
  • Received custom build of Phoenix that supports
    full type information

17
Fulcra to Phoenix Action!
  • Four-step process
  • Convert IMPACT's data structures to Phoenix's
    equivalents, and from C to C++/CLI.
  • Create the initial constraint graph using
    Phoenix's IR instead of IMPACT's IR (a generic
    constraint-graph sketch follows this list).
  • Convert the pointer-analysis solver.
  • Consists of porting from C to C++/CLI and dealing
    with any changes to Fulcra's ported data
    structures.
  • Annotate the points-to information back into
    Phoenix's alias representation.
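
For step 2, here is a generic sketch of the kind of constraint graph an inclusion-based (Andersen-style) points-to analysis builds and solves. This is an assumption about the general technique only; it is not Fulcra's actual data structure or the Phoenix IR interface.

// Generic inclusion-based points-to sketch: nodes hold points-to sets,
// subset edges propagate them until a fixed point is reached.
#include <cstdio>
#include <set>
#include <vector>

struct Node {
    std::set<int> pts;    // ids of abstract memory locations pointed to
    std::set<int> succ;   // subset edges: pts(this) is a subset of pts(succ)
};

void solve(std::vector<Node> &g) {
    bool changed = true;
    while (changed) {                      // iterate to a fixed point
        changed = false;
        for (size_t n = 0; n < g.size(); ++n)
            for (int s : g[n].succ)
                for (int loc : g[n].pts)
                    if (g[s].pts.insert(loc).second)
                        changed = true;
    }
}

int main() {
    // Constraints for "p = &a; q = p;": pts(p) contains a, and pts(p) flows to q.
    enum { A = 0, P = 1, Q = 2 };
    std::vector<Node> g(3);
    g[P].pts.insert(A);
    g[P].succ.insert(Q);
    solve(g);
    printf("q may point to a: %s\n", g[Q].pts.count(A) ? "yes" : "no");
    return 0;
}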

18
Phoenix Support Wish List
  • Access to code across file boundaries
  • LTCG (link-time code generation)
  • Access to multiple files within a pass
  • Full (source-code-level) type information
  • Feed results from Fulcra back to Phoenix
  • Need more information on Phoenix's alias
    representation
  • In the long run, we need a highly extensible IR
    and API for Phoenix

19
Conclusion
  • Compiler research for many-cores will require a
    very high-quality infrastructure with strong
    engineering support
  • New language extensions, new user models, new
    functionalities, new analyses, new
    transformations
  • We chose Phoenix based on its robustness,
    features and engineering support
  • Our current industry partners are also moving
    into Phoenix
  • We also plan to share our advanced extensions
    with other academic Phoenix users