Transcript and Presenter's Notes

Title: Latency vs. Bandwidth: Which Matters More?


1
Latency vs. Bandwidth: Which Matters More?
  • Katherine Yelick
  • U.C. Berkeley and LBNL

Joint work with Xiaoye Li, Lenny Oliker, Brian Gaeke, Parry
Husbands (LBNL); the Berkeley IRAM group: Dave Patterson, Joe Gebis,
Dave Judd, Christoforos Kozyrakis, Sam Williams; and the Berkeley
Bebop group: Jim Demmel, Rich Vuduc, Ben Lee, Rajesh Nishtala
2
Blame the Memory Bus
  • Many scientific applications run at less than 10%
    of hardware peak, even on a single processor
  • The trend is to blame the memory bus
  • Is this accurate?
  • Need to understand bottlenecks to
  • Design better machines
  • Design better algorithms
  • Two parts
  • Algorithm bottlenecks on microprocessors
  • Bottlenecks on a PIM system, VIRAM

Note this is latency, not bandwidth.
3
Memory Intensive Applications
  • Poor performance is especially problematic for
    memory-intensive applications
  • Low ratio of arithmetic operations to memory operations
  • Irregular memory access patterns
  • Example
  • Sparse matrix-vector multiply (dominant kernel of
    NAS CG)
  • From some perspective, many scientific applications do this
  • Compute y = y + A*x
  • Matrix is stored as two main arrays
  • Column index array (int)
  • Value array (floating point)
  • For each element y_i, compute the sum over j of
    value[j] * x[index[j]] (a C sketch follows this list)
  • So latency (to x) dominates, right?
  • Irregular
  • Not necessarily in cache
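A minimal C sketch of this kernel, assuming a compressed-sparse-row layout; the row_ptr array marking where each row starts is an assumption of the sketch, not something stated on the slide.

    /* y = y + A*x for a sparse matrix stored as a column-index array and a
     * value array, as described above. row_ptr (row starts) is assumed. */
    void spmv_csr(int n, const int *row_ptr, const int *index,
                  const double *value, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = y[i];
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
                sum += value[j] * x[index[j]];   /* irregular access to x */
            y[i] = sum;
        }
    }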

4
Performance Model is Revealing
  • A simple analytical model for sparse matvec
    kernel
  • Time = (loads from memory) * (cost of a memory load) +
    (loads from cache) * (cost of a cache load)
  • Two versions
  • Only compulsory misses to source vector, x
  • All accesses to x produce a miss to memory
  • Conclusion
  • Cache misses to the source vector (memory latency) are not
    the dominant cost
  • PAPI measurements confirm
  • So bandwidth to the matrix dominates, right?
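A minimal sketch of such a two-term model in C; the function name and the way loads are split between memory and cache are illustrative, not the exact model used in the talk.

    /* Two-term cost model: loads that go to memory plus loads that hit in
     * cache, each weighted by its cost. All inputs are estimates. */
    double spmv_time(double mem_loads, double cache_loads,
                     double mem_load_cost, double cache_load_cost)
    {
        return mem_loads * mem_load_cost + cache_loads * cache_load_cost;
    }
    /* Version 1: charge only compulsory misses to x to memory.
     * Version 2: charge every access to x to memory.            */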

5
Memory Bandwidth Measurements
  • Yes, but be careful about how you measure
    bandwidth
  • Not a constant

6
An Architectural Probe
  • Sqmat is a tunable probe to measure architectures
  • Stream of small matrices
  • Square each matrix to some power (sets the computational
    intensity)
  • The stream may be direct (dense), or indirect
    (sparse)
  • If indirect, how frequently is there a non-unit-stride
    jump?
  • Parameters
  • Matrix size within stream
  • Computational Intensity
  • Indirection (yes/no)
  • Number of unit strides before a jump
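A rough C sketch of what such a probe can look like; the parameter names, the 8x8 size cap, and the way indirection is expressed through an index array are assumptions of this sketch, not the actual Sqmat code.

    /* Sqmat-style probe: a stream of small NxN matrices, each squared P
     * times (computational intensity). If index != NULL the stream is
     * accessed indirectly; the index contents encode how many unit strides
     * occur before each non-unit-stride jump. Illustrative only. */
    void sqmat_probe(int num_mats, int N, int P, const int *index,
                     const double *stream, double *out)
    {
        double a[64], b[64];                      /* assumes N <= 8 */
        for (int m = 0; m < num_mats; m++) {
            const double *mat = index ? &stream[index[m] * N * N]
                                      : &stream[m * N * N];
            for (int i = 0; i < N * N; i++) a[i] = mat[i];
            for (int p = 0; p < P; p++) {         /* square the matrix P times */
                for (int i = 0; i < N; i++)
                    for (int j = 0; j < N; j++) {
                        double s = 0.0;
                        for (int k = 0; k < N; k++)
                            s += a[i * N + k] * a[k * N + j];
                        b[i * N + j] = s;
                    }
                for (int i = 0; i < N * N; i++) a[i] = b[i];
            }
            for (int i = 0; i < N * N; i++) out[m * N * N + i] = a[i];
        }
    }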

7
Cost of Indirection
  • Adding a second load stream for indexes into the stream has
    a big effect on some machines
  • This is truly a bandwidth issue

8
Cost of Irregularity
[Charts: slowdown due to irregularity on Opteron, Itanium2, Power3, and Power4]
  • Slowdown relative to the previous slide's results
  • Even a tiny bit of irregularity (1/S) can have a
    big effect

9
What Does This Have to Do with PIMs?
  • Performance of Sqmat on PIMs and others for 3x3
    matrices, squared 10 times (high computational
    intensity!)
  • Imagine is much faster for long streams, slower for
    short ones

10
VIRAM Overview
  • Technology: IBM SA-27E
  • 0.18 um CMOS, 6 metal layers
  • 290 mm2 die area
  • 225 mm2 for memory/logic
  • Transistor count: 130M
  • 13 MB of DRAM
  • Power supply: 1.2V for logic, 1.8V for DRAM
  • Typical power consumption: 2.0 W
  • 0.5 W (scalar) + 1.0 W (vector) + 0.2 W (DRAM) + 0.3 W (misc)
  • MIPS scalar core + 4-lane vector unit
  • Peak vector performance
  • 1.6/3.2/6.4 Gops wo. multiply-add (64b/32b/16b
    operations)
  • 3.2/6.4/12.8 Gops w. madd
  • 1.6 Gflops (single-precision)

11
Vector IRAM ISA Summary
[ISA summary diagram]
Scalar: MIPS64 scalar instruction set
Vector ALU: operates on signed/unsigned integer and single/double FP types, with .vv, .vs, and .sv operand forms
Vector memory: unit-stride, constant-stride, and indexed loads and stores for signed/unsigned integer types
ALU operations: integer, floating-point, fixed-point and DSP, convert, logical, vector processing, flag processing
12
VIRAM Compiler
[Compiler structure diagram: C, C++, and Fortran95 frontends feed the optimizer (Cray's PDGCS), with code generators for T3D/T3E, C90/T90/X1, and SV2/VIRAM]
  • Based on Cray's production compiler
  • Challenges
  • narrow data types and scalar/vector memory
    consistency
  • Advantages relative to media-extensions
  • powerful addressing modes and ISA independent of
    datapath width

13
Compiler and OS Enhancements
  • Compiler based on Cray PDGCS
  • Outer-loop vectorization
  • Strided and indexed vector loads and stores
  • Vectorization of loops with if statements
  • Full predicated execution of vector instructions
    using flag registers
  • Vectorization of reductions and FFTs
  • Instructions for simple, intra-register
    permutations
  • Automatic for reductions, manual (or StreamIT)
    for FFTs
  • Vectorization of loops with break statements
  • Software speculation support for vector loads
  • OS development
  • MMU-based virtual memory
  • OS performance
  • Dirty and valid bits for registers to reduce
    context switch overhead

14
HW Resources Visible to Software
[Diagrams of hardware resources in Vector IRAM and the Pentium III]
  • Software (applications/compiler/OS) can control
  • Main memory, registers, execution datapaths

15
VIRAM Chip Statistics
Technology: IBM SA-27E, 0.18um CMOS, 6 layers of copper; deep trench DRAM cell, full-speed logic
Area: 270 mm2 (65 mm2 logic, 140 mm2 DRAM)
Transistors: 130 million (7.5M logic, 122.5M DRAM)
Supply: 1.2V logic, 1.8V DRAM, 3.3V I/O
Clock: 200 MHz
Power: 2W (0.5W MIPS core, 1W vector unit, 0.5W DRAM-I/O)
Package: 304-lead quad ceramic package (125 signal I/Os)
Crossbar BW: 12.8 Gbytes/s per direction (load or store, peak)
Peak performance: integer wo. madd 1.6/3.2/6.4 Gops (64b/32b/16b); integer w. madd 3.2/6.4/12.8 Gops (64b/32b/16b); FP 1.6 Gflops (32b, wo. madd)
16
VIRAM Design Statistics
RTL model: 170K lines of Verilog
Design methodology: synthesized MIPS core, vector unit control, and FP datapath; full-custom vector register file, crossbar, and integer datapaths; DRAM and SRAM macros for caches
IP sources: UC Berkeley (vector coprocessor, crossbar, I/O), MIPS Technologies (MIPS core), IBM (DRAM/SRAM macros), MIT (FP datapath)
Verification: 566K lines of directed tests (9.8M lines of assembly); 4 months of random testing on 20 Linux workstations
Design team: 5 graduate students
Status: place & route, chip assembly
Tape-out: October 2002
Design time: 2.5 years
17
VIRAM Chip
  • Taped out to IBM in October 02
  • Received wafers in June 2003.
  • Chips were thinned, diced, and packaged.
  • Parts were sent to ISI, who produced test boards.

[Die photo: MIPS core, 4 64-bit vector lanes, I/O]
18
Demonstration System
  • Based on the MIPS Malta development board
  • PCI, Ethernet, AMR, IDE, USB, CompactFlash,
    parallel, serial
  • VIRAM daughter-card
  • Designed at ISI-East
  • VIRAM processor
  • Galileo GT64120 chipset
  • 1 DIMM slot for external DRAM
  • Software support and OS
  • Monitor utility for debugging
  • Modified version of MIPS Linux

19
Benchmarks for Scientific Problems
  • Dense and Sparse Matrix-vector multiplication
  • Compare to tuned codes on conventional machines
  • Transitive-closure (small & large data sets)
  • On a dense graph representation
  • NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
  • Fetch-and-increment a stream of random addresses (a C
    sketch follows this list)
  • Sparse matrix-vector product
  • Order 10000, nonzeros 177820
  • Computing a histogram
  • Used for image processing of a 16-bit greyscale
    image 1536 x 1536
  • 2 algorithms: a 64-element sorting kernel, and privatization
  • Also used in sorting
  • 2D unstructured mesh adaptation
  • initial grid 4802 triangles, final grid 24010 triangles
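A hedged C sketch of the GUPS kernel named above: fetch-and-increment at a stream of random table addresses. The table size, update count, and random-index source are placeholders, not the benchmark's exact parameters.

    #include <stdint.h>

    /* GUPS-style kernel: random read-modify-write updates to a large table.
     * random_index is assumed to be pre-generated; names are illustrative. */
    void gups(uint64_t *table, uint64_t table_size,
              const uint64_t *random_index, uint64_t num_updates)
    {
        for (uint64_t i = 0; i < num_updates; i++)
            table[random_index[i] % table_size] += 1;
    }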

20
Sparse MVM Performance
  • Performance is matrix-dependent (lp matrix)
  • compiled for VIRAM using the independent pragma
  • sparse column layout
  • Sparsity-optimized for other machines
  • sparse row (or blocked row) layout

[Bar chart: performance in MFLOPS on each machine]
21
Power and Performance on BLAS-2
  • 100x100 matrix vector multiplication (column
    layout)
  • VIRAM result is compiled; others are hand-coded or
    Atlas-optimized
  • VIRAM performance improves with larger matrices
  • VIRAM power includes on-chip main memory
  • 8-lane version of VIRAM nearly doubles MFLOPS

22
Performance Comparison
  • IRAM designed for media processing
  • Low power was a higher priority than high
    performance
  • IRAM (at 200MHz) is better for apps with
    sufficient parallelism

23
Power Efficiency
  • Same data on a log plot
  • Includes both low power (Mobile PIII) and high performance processors
  • The same picture for operations/cycle

24
Which Problems are Limited by Bandwidth?
  • What is the bottleneck in each case?
  • Transitive and GUPS are limited by bandwidth
    (near 6.4GB/s peak)
  • SPMV and Mesh are limited by address generation and
    bank conflicts
  • For Histogram there is insufficient parallelism

25
Summary of 1-PIM Results
  • Programmability advantage
  • All vectorized by the VIRAM compiler (Cray
    vectorizer)
  • With restructuring and hints from programmers
  • Performance advantage
  • Large on applications limited only by bandwidth
  • More address generators/sub-banks would help
    irregular performance
  • Performance/Power advantage
  • Over both low power and high performance
    processors
  • Both PIM and data parallelism are key

26
Alternative VIRAM Designs
  • VIRAM-4Lane: 4 lanes, 8 Mbytes, 190 mm2, 3.2 Gops at 200MHz
  • VIRAM-2Lanes: 2 lanes, 4 Mbytes, 120 mm2, 1.6 Gops at 200MHz
  • VIRAM-Lite: 1 lane, 2 Mbytes, 60 mm2, 0.8 Gops at 200MHz
27
Compiled Multimedia Performance
[Charts: integer and floating-point multimedia kernel performance vs. number of lanes]
  • Single executable for multiple implementations
  • Linear scaling with number of lanes
  • Remember, this is a 200MHz, 2W processor

28
Third Party Comparison (I)
[Charts: benchmark performance of VIRAM, Imagine, PPC-G4, and Pentium III]
29
Third Party Comparison (II)
[Charts: further benchmark performance of VIRAM, Imagine, PPC-G4, and Pentium III]
30
Vectors vs. SIMD or VLIW
  • SIMD
  • Short, fixed-length, vector extensions
  • Require wide issue or ISA change to scale
  • They don't support vector memory accesses
  • Difficult to compile for
  • Performance wasted for pack/unpack, shifts,
    rotates
  • VLIW
  • Architecture for instruction level parallelism
  • Orthogonal to vectors for data parallelism
  • Inefficient for data parallelism
  • Large code size (3X for IA-64?)
  • Extra work for software (scheduling more
    instructions)
  • Extra work for hardware (decode more
    instructions)

31
Vector vs. Wide Word SIMD Example
  • Vector instruction sets have
  • Strided and scatter/gather load/store operations
  • SIMD extensions load contiguous memory
  • Implementation-independent vector length
  • SIMD extensions change the ISA as the hardware bit width
    changes
  • Simple example: conversion from RGB to YUV
  • Thanks to Christoforos Kozyrakis
  • Y = ( 9798R + 19235G +  3736B) / 32768
  • U = (-4784R -  9437G + 14221B) / 32768 + 128
  • V = (20218R - 16941G -  3277B) / 32768 + 128
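For reference, a scalar C version of the same conversion (one pixel per iteration, strided RGB input); the +14221B coefficient in U is reconstructed so the coefficients sum to zero, and the integer division here only approximates the arithmetic shifts used in the vector code on the next slide.

    #include <stdint.h>

    /* Scalar RGB -> YUV reference matching the integer formulas above. */
    void rgb_to_yuv(const uint8_t *rgb, uint8_t *y, uint8_t *u, uint8_t *v, int n)
    {
        for (int i = 0; i < n; i++) {
            int r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
            y[i] = (uint8_t)(( 9798 * r + 19235 * g +  3736 * b) / 32768);
            u[i] = (uint8_t)((-4784 * r -  9437 * g + 14221 * b) / 32768 + 128);
            v[i] = (uint8_t)((20218 * r - 16941 * g -  3277 * b) / 32768 + 128);
        }
    }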

32
VIRAM Code
  • RGBtoYUV:
  • vlds.u.b   r_v, r_addr, stride3, addr_inc    # load R
  • vlds.u.b   g_v, g_addr, stride3, addr_inc    # load G
  • vlds.u.b   b_v, b_addr, stride3, addr_inc    # load B
  • xlmul.u.sv o1_v, t0_s, r_v                   # calculate Y
  • xlmadd.u.sv o1_v, t1_s, g_v
  • xlmadd.u.sv o1_v, t2_s, b_v
  • vsra.vs    o1_v, o1_v, s_s
  • xlmul.u.sv o2_v, t3_s, r_v                   # calculate U
  • xlmadd.u.sv o2_v, t4_s, g_v
  • xlmadd.u.sv o2_v, t5_s, b_v
  • vsra.vs    o2_v, o2_v, s_s
  • vadd.sv    o2_v, a_s, o2_v
  • xlmul.u.sv o3_v, t6_s, r_v                   # calculate V
  • xlmadd.u.sv o3_v, t7_s, g_v
  • xlmadd.u.sv o3_v, t8_s, b_v
  • vsra.vs    o3_v, o3_v, s_s
  • vadd.sv    o3_v, a_s, o3_v
  • vsts.b     o1_v, y_addr, stride3, addr_inc   # store Y

33
MMX Code (1)
  • RGBtoYUV
  • movq mm1, eax
  • pxor mm6, mm6
  • movq mm0, mm1
  • psrlq mm1, 16
  • punpcklbw mm0, ZEROS
  • movq mm7, mm1
  • punpcklbw mm1, ZEROS
  • movq mm2, mm0
  • pmaddwd mm0, YR0GR
  • movq mm3, mm1
  • pmaddwd mm1, YBG0B
  • movq mm4, mm2
  • pmaddwd mm2, UR0GR
  • movq mm5, mm3
  • pmaddwd mm3, UBG0B
  • punpckhbw mm7, mm6
  • pmaddwd mm4, VR0GR
  • paddd mm0, mm1
  • paddd mm4, mm5
  • movq mm5, mm1
  • psllq mm1, 32
  • paddd mm1, mm7
  • punpckhbw mm6, ZEROS
  • movq mm3, mm1
  • pmaddwd mm1, YR0GR
  • movq mm7, mm5
  • pmaddwd mm5, YBG0B
  • psrad mm0, 15
  • movq TEMP0, mm6
  • movq mm6, mm3
  • pmaddwd mm6, UR0GR
  • psrad mm2, 15
  • paddd mm1, mm5
  • movq mm5, mm7
  • pmaddwd mm7, UBG0B
  • psrad mm1, 15
  • pmaddwd mm3, VR0GR

34
MMX Code (2)
  • paddd mm6, mm7
  • movq mm7, mm1
  • psrad mm6, 15
  • paddd mm3, mm5
  • psllq mm7, 16
  • movq mm5, mm7
  • psrad mm3, 15
  • movq TEMPY, mm0
  • packssdw mm2, mm6
  • movq mm0, TEMP0
  • punpcklbw mm7, ZEROS
  • movq mm6, mm0
  • movq TEMPU, mm2
  • psrlq mm0, 32
  • paddw mm7, mm0
  • movq mm2, mm6
  • pmaddwd mm2, YR0GR
  • movq mm0, mm7
  • pmaddwd mm7, YBG0B
  • movq mm4, mm6
  • pmaddwd mm6, UR0GR
  • movq mm3, mm0
  • pmaddwd mm0, UBG0B
  • paddd mm2, mm7
  • pmaddwd mm4,
  • pxor mm7, mm7
  • pmaddwd mm3, VBG0B
  • punpckhbw mm1,
  • paddd mm0, mm6
  • movq mm6, mm1
  • pmaddwd mm6, YBG0B
  • punpckhbw mm5,
  • movq mm7, mm5
  • paddd mm3, mm4
  • pmaddwd mm5, YR0GR
  • movq mm4, mm1
  • pmaddwd mm4, UBG0B
  • psrad mm0, 15

35
MMX Code (3)
  • pmaddwd mm7, UR0GR
  • psrad mm3, 15
  • pmaddwd mm1, VBG0B
  • psrad mm6, 15
  • paddd mm4, OFFSETD
  • packssdw mm2, mm6
  • pmaddwd mm5, VR0GR
  • paddd mm7, mm4
  • psrad mm7, 15
  • movq mm6, TEMPY
  • packssdw mm0, mm7
  • movq mm4, TEMPU
  • packuswb mm6, mm2
  • movq mm7, OFFSETB
  • paddd mm1, mm5
  • paddw mm4, mm7
  • psrad mm1, 15
  • movq ebx, mm6
  • packuswb mm4,
  • movq ecx, mm4
  • packuswb mm5, mm3
  • add ebx, 8
  • add ecx, 8
  • movq edx, mm5
  • dec edi
  • jnz RGBtoYUV

36
Summary
  • Combination of Vectors and PIM
  • Simple execution model for hardware pushes
    complexity to compiler
  • Low power/footprint/etc.
  • PIM provides bandwidth needed by vectors
  • Vectors hide latency effectively
  • Programmability
  • Programmable from high level language
  • More compact instruction stream
  • Works well for
  • Applications with fine-grained data parallelism
  • Memory intensive problems
  • Both scientific and multimedia applications

37
The End
38
Algorithm Space
[Chart: algorithms plotted along axes of data reuse and regularity: search, two-sided dense linear algebra, FFTs, Gröbner basis (symbolic LU), sorting, sparse iterative solvers, asynchronous discrete event simulation, one-sided dense linear algebra, sparse direct solvers]
39
VIRAM Overview
  • MIPS core (200 MHz)
  • Single-issue, 8 KB instruction and data caches
  • Vector unit (200 MHz)
  • 32 64b elements per register
  • 256b datapaths (16b, 32b, 64b ops)
  • 4 address generation units
  • Main memory system
  • 13 MB of on-chip DRAM in 8 banks
  • 12.8 GBytes/s peak bandwidth
  • Typical power consumption 2.0 W
  • Peak vector performance
  • 1.6/3.2/6.4 Gops wo. multiply-add
  • 1.6 Gflops (single-precision)
  • Fabrication by IBM
  • Tape-out in O(1 month)

40
Benchmarks for Scientific Problems
  • Dense Matrix-vector multiplication
  • Compare to hand-tuned codes on conventional
    machines
  • Transitive-closure (small & large data sets)
  • On a dense graph representation
  • NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
  • Fetch-and-increment a stream of random
    addresses
  • Sparse matrix-vector product
  • Order 10000, nonzeros 177820
  • Computing a histogram
  • Used for image processing of a 16-bit greyscale
    image 1536 x 1536
  • 2 algorithms: a 64-element sorting kernel, and privatization
  • Also used in sorting
  • 2D unstructured mesh adaptation
  • initial grid 4802 triangles, final grid 24010 triangles

41
Power and Performance on BLAS-2
  • 100x100 matrix vector multiplication (column
    layout)
  • VIRAM result is compiled; others are hand-coded or
    Atlas-optimized
  • VIRAM performance improves with larger matrices
  • VIRAM power includes on-chip main memory
  • 8-lane version of VIRAM nearly doubles MFLOPS

42
Performance Comparison
  • IRAM designed for media processing
  • Low power was a higher priority than high
    performance
  • IRAM (at 200MHz) is better for apps with
    sufficient parallelism

43
Power Efficiency
  • Huge power/performance advantage in VIRAM from
    both
  • PIM technology
  • Data parallel execution model (compiler-controlled)

44
Power Efficiency
  • Same data on a log plot
  • Includes both low power (Mobile PIII) and high performance processors
  • The same picture for operations/cycle

45
Which Problems are Limited by Bandwidth?
  • What is the bottleneck in each case?
  • Transitive and GUPS are limited by bandwidth
    (near 6.4GB/s peak)
  • SPMV and Mesh are limited by address generation and
    bank conflicts
  • For Histogram there is insufficient parallelism

46
Summary of 1-PIM Results
  • Programmability advantage
  • All vectorized by the VIRAM compiler (Cray
    vectorizer)
  • With restructuring and hints from programmers
  • Performance advantage
  • Large on applications limited only by bandwidth
  • More address generators/sub-banks would help
    irregular performance
  • Performance/Power advantage
  • Over both low power and high performance
    processors
  • Both PIM and data parallelism are key

47
Analysis of a Multi-PIM System
  • Machine Parameters
  • Floating point performance
  • PIM-node dependent
  • Application dependent, not theoretical peak
  • Amount of memory per processor
  • Use 1/10th of the algorithm data
  • Communication Overhead
  • Time processor is busy sending a message
  • Cannot be overlapped
  • Communication Latency
  • Time across the network (can be overlapped)
  • Communication Bandwidth
  • Single node and bisection
  • Back-of-the-envelope calculations!
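A minimal C sketch of how these parameters can be combined into a per-step time estimate; the formula structure (and the assumption that only latency, not overhead, can be hidden behind compute) is illustrative, not the exact model behind the study.

    /* Back-of-the-envelope time estimate from the parameters listed above.
     * All arguments are application- and machine-dependent estimates. */
    double step_time(double flops, double flop_rate,   /* achieved rate, not peak */
                     double msgs, double overhead,     /* per-message CPU time, not overlappable */
                     double latency,                   /* network latency, overlappable */
                     double bytes, double bandwidth)
    {
        double compute = flops / flop_rate;
        double comm    = msgs * overhead + bytes / bandwidth;
        /* charge only the latency that cannot be hidden behind compute */
        double exposed = (msgs * latency > compute) ? msgs * latency - compute : 0.0;
        return compute + comm + exposed;
    }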

48
Real Data from an Old Machine (T3E)
  • UPC uses a global address space
  • Non-blocking remote put/get model
  • Does not cache remote data

49
Running Sparse MVM on a Pflop PIM
  • 1 GHz x 8 pipes x 8 ALUs/pipe = 64 GFLOPS/node peak
  • 8 Address generators limit performance to 16
    Gflops
  • 500ns latency, 1 cycle put/get overhead, 100
    cycle MP overhead
  • Programmability differences too: packing vs. global
    address space

50
Effect of Memory Size
  • For small memory nodes or smaller problem sizes
  • Low overhead is more important
  • For large memory nodes and large problems, packing
    is better

51
Conclusions
  • Performance advantage for PIMS depends on
    application
  • Need fine-grained parallelism to utilize on-chip
    bandwidth
  • Data parallelism is one model with the usual
    trade-offs
  • Hardware and programming simplicity
  • Limited expressibility
  • Largest advantages for PIMS are power and
    packaging
  • Enables Peta-scale machine
  • Multiprocessor PIMs should be easier to program
  • At least at scale of current machines (Tflops)
  • Can we get rid of the current programming model
    hierarchy?

52
Benchmarks
  • Kernels
  • Designed to stress memory systems
  • Some taken from the Data Intensive Systems
    Stressmarks
  • Unit and constant stride memory
  • Dense matrix-vector multiplication
  • Transitive-closure
  • Constant stride
  • FFT
  • Indirect addressing
  • NSA Giga-Updates Per Second (GUPS)
  • Sparse Matrix Vector multiplication
  • Histogram calculation (sorting); a privatization sketch
    follows this list
  • Frequent branching as well as irregular memory access
  • Unstructured mesh adaptation
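A hedged C sketch of the privatization variant of the histogram benchmark: each virtual lane accumulates into its own copy, then the copies are reduced. The lane count, bucket handling, and memory management are assumptions of this sketch.

    #include <stdlib.h>

    enum { NLANES = 64 };   /* illustrative; echoes the 64-element kernels above */

    /* Privatized histogram: conflict-free updates, then a reduction.
     * Assumes every image value is < nbuckets (65536 for 16-bit greyscale). */
    void histogram_private(const unsigned short *image, long n,
                           long *hist, int nbuckets)
    {
        long *priv = calloc((size_t)NLANES * nbuckets, sizeof(long));
        if (!priv) return;
        for (long i = 0; i < n; i++)                 /* no two lanes share a slot */
            priv[(i % NLANES) * nbuckets + image[i]]++;
        for (int b = 0; b < nbuckets; b++) {         /* reduce the private copies */
            long s = 0;
            for (int l = 0; l < NLANES; l++) s += priv[l * nbuckets + b];
            hist[b] = s;
        }
        free(priv);
    }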

53
Conclusions and VIRAM Future Directions
  • VIRAM outperforms Pentium III on Scientific
    problems
  • With lower power and clock rate than the Mobile
    Pentium
  • Vectorization techniques developed for the Cray PVPs are
    applicable.
  • PIM technology provides low power, low cost
    memory system.
  • Similar combination used in Sony Playstation.
  • Small ISA changes can have large impact
  • Limited in-register permutations sped up 1K FFT
    by 5x.
  • Memory system can still be a bottleneck
  • Indexed/variable-stride accesses are costly, due to address
    generation.
  • Future work
  • Ongoing investigations into impact of lanes,
    subbanks
  • Technical paper in preparation; expected completion 09/01
  • Run benchmark on real VIRAM chips
  • Examine multiprocessor VIRAM configurations

54
Management Plan
  • Roles of different groups and PIs
  • Senior researchers working on a particular class of
    benchmarks
  • Parry: sorting and histograms
  • Sherry: sparse matrices
  • Lenny: unstructured mesh adaptation
  • Brian: simulation
  • Jin and Hyun: specific benchmarks
  • Plan to hire additional postdoc for next year
    (focus on Imagine)
  • Undergrad model used for targeted benchmark
    efforts
  • Plan for using computational resources at NERSC
  • Few resources used, except for comparisons

55
Future Funding Prospects
  • FY2003 and beyond
  • DARPA initiated DIS program
  • Related projects are continuing under Polymorphic
    Computing
  • New BAA coming in High Productivity Systems
  • Interest from other DOE labs (LANL) in general
    problem
  • General model
  • Most architectural research projects need
    benchmarking
  • Work has higher quality if done by people who
    understand apps.
  • Expertise for hardware projects is different: system-level
    design, circuit design, etc.
  • Interest from both the IRAM and Imagine groups shows the
    level of interest

56
Long Term Impact
  • Potential impact on Computer Science
  • Promote research of new architectures and
    micro-architectures
  • Understand future architectures
  • Preparation for procurements
  • Provide visibility of NERSC in core CS research
    areas
  • Correlate applications: DOE vs. large-market problems
  • Influence future machines through research
    collaborations

57
Benchmark Performance on IRAM Simulator
  • IRAM (200 MHz, 2 W) versus Mobile Pentium III
    (500 MHz, 4 W)

58
Project Goals for FY02 and Beyond
  • Use established data-intensive scientific
    benchmarks with other emerging architectures
  • IMAGINE (Stanford Univ.)
  • Designed for graphics and image/signal processing
  • Peak 20 GFLOPS (32-bit FP)
  • Key features: vector processing, VLIW, a streaming memory
    system. (Not a PIM-based design.)
  • Preliminary discussions with Bill Dally.
  • DIVA (DARPA-sponsored USC/ISI)
  • Based on PIM smart memory design, but for
    multiprocessors
  • Move computation to data
  • Designed for irregular data structures and
    dynamic databases.
  • Discussions with Mary Hall about benchmark
    comparisons

59
Media Benchmarks
  • FFT uses in-register permutations and a generalized
    reduction
    reduction
  • All others written in C with Cray vectorizing
    compiler

60
Integer Benchmarks
  • Strided access important, e.g., RGB
  • narrow types limited by address generation
  • Outer loop vectorization and unrolling used
  • helps avoid short vectors
  • spilling can be a problem

61
Status of benchmarking software release
[Status chart: software components ranked from optimized to unoptimized]
Components: optimized vector histogram code, optimized GUPS inner loop, GUPS docs, Pointer Jumping w/ Update, vector histogram code generator, GUPS C codes, Conjugate Gradient (Matrix), Neighborhood, Pointer Jumping, Transitive, Field, standard random number generator, test cases (small and large working sets), build and test scripts (Makefiles, timing, analysis, ...)
  • Future work
  • Write more documentation, add better test cases
    as we find them
  • Incorporate media benchmarks, AMR code, library
    of frequently-used compiler flags & pragmas

62
Status of benchmarking work
  • Two performance models
  • simulator (vsim-p), and trace analyzer (vsimII)
  • Recent work on vsim-p
  • Refining the performance model for
    double-precision FP performance.
  • Recent work on vsimII
  • Making the backend modular
  • Goal: model different architectures w/ the same ISA.
  • Fixing bugs in the memory model of the VIRAM-1
    backend.
  • Better comments in code for better
    maintainability.
  • Completing a new backend for a new decoupled
    cluster architecture.

63
Comparison with Mobile Pentium
  • GUPS: VIRAM gets 6x more GUPS

Data element width:    16-bit   32-bit   64-bit
Mobile Pentium GUPS:    .045     .046     .036
VIRAM GUPS:             .295     .295     .244
Transitive, Pointer, Update: VIRAM is 30-50x faster than the
P-III; execution time for VIRAM rises much more slowly with data
size than for the P-III
64
Sparse CG
  • Solve Ax = b; sparse matrix-vector multiplication dominates.
  • Traditional CRS format requires
  • Indexed load/store for X/Y vectors
  • Variable vector length, usually short
  • Other formats for better vectorization
  • CRS with narrow band (e.g., RCM ordering)
  • Smaller strides for X vector
  • Segmented-Sum (modified from the old code developed
    for the Cray PVP)
  • Long vector lengths, all of the same size
  • Unit stride
  • ELL format: make all rows the same length by padding with
    zeros (sketched in C after this list)
  • Long vector lengths, all of the same size
  • Extra flops
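A small C sketch of the ELL idea referenced above: pad every row to the same length so each pass over the rows is one long, uniform-length vector operation, at the cost of extra flops on the padding. The slot-major layout and array names are assumptions of this sketch.

    /* ELLPACK-style SpMV: every row padded to max_row_len entries; padded
     * slots hold 0.0 and a harmless column index. Entry (row i, slot j) is
     * stored at j*n + i so the inner loop is unit stride in val/col. */
    void spmv_ell(int n, int max_row_len,
                  const int *col, const double *val,
                  const double *x, double *y)
    {
        for (int j = 0; j < max_row_len; j++)        /* one long vector op per slot */
            for (int i = 0; i < n; i++)              /* unit stride, gather from x */
                y[i] += val[j * n + i] * x[col[j * n + i]];
    }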

65
SMVM Performance
  • DIS matrix: N = 10000, M = 177820 (about 17 nonzeros
    per row)
  • IRAM results (MFLOPS)
  • Mobile PIII (500 MHz)
  • CRS: 35 MFLOPS

SubBanks:                1          2          4          8
CRS:                    91        106        109        110
CRS banded:            110        110        110        110
SEG-SUM:               135        154        163        165
ELL (4.6x more flops): 511 (111)  570 (124)  612 (133)  632 (137)
66
2D Unstructured Mesh Adaptation
  • Powerful tool for efficiently solving
    computational problems with evolving physical
    features (shocks, vortices, shear layers, crack
    propagation)
  • Complicated logic and data structures
  • Difficult to achieve high efficiency
  • Irregular data access patterns (pointer chasing)
  • Many conditionals / integer intensive
  • Adaptation is a tool for making the numerical solution
    cost-effective
  • Three types of element subdivision

67
Vectorization Strategy and Performance Results
  • Color elements based on vertices (not edges)
  • Guarantees no conflicts during vector operations
  • Vectorize across each subdivision type (1:2, 1:3, 1:4),
    one color at a time (sketched in C after this list)
  • Difficult: many conditionals, low flops, irregular data
    access, dependencies
  • Initial grid 4802 triangles, Final grid 24010
    triangles
  • Preliminary results demonstrate VIRAM 4.5x faster
    than Mobile Pentium III 500
  • Higher code complexity (requires graph coloring &
    reordering)
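A hedged C sketch of the coloring idea: elements of one color share no vertices, so the loop over them has no write conflicts and can run as vector operations. The data structures and the dummy update are illustrative, not the actual adaptation code.

    /* Process elements one color at a time; within a color, updates touch
     * disjoint vertices, so the inner loop is safely vectorizable. */
    void update_by_color(int num_colors, const int *color_start,  /* element ranges per color */
                         const int *elem_vertex,                  /* 3 vertex ids per triangle */
                         double *vertex_data)
    {
        for (int c = 0; c < num_colors; c++) {
            for (int e = color_start[c]; e < color_start[c + 1]; e++) {
                int v0 = elem_vertex[3 * e];
                int v1 = elem_vertex[3 * e + 1];
                int v2 = elem_vertex[3 * e + 2];
                vertex_data[v0] += 1.0;   /* placeholder for the real subdivision update */
                vertex_data[v1] += 1.0;
                vertex_data[v2] += 1.0;
            }
        }
    }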

Time (ms):  Pentium III 500: 61   1 Lane: 18   2 Lanes: 14   4 Lanes: 13