# CS 594 Spring 2002 Lecture 3


1
CS 594 Spring 2002 Lecture 3
• Jack Dongarra
• University of Tennessee

2
Plan For Today
• Prof. Vassil Alexandrov, Reading University,
England
• Monte Carlo Methods
• Homework Assignment 1
• Lecture Parallel Architectures and Programming

3
Homework 1
• Need to report what level of optimization is
being used and perhaps what effects optimization
has on the performance.
• Need to describe how the data is generated. Is
the computation done with 64-bit floating point?
What kind of random number generator is used?
• Using data where each entry is 1.0 or 0.0 may be
a problem.
• Verifying the result by eye is not a good way to
check the correctness of your numerical results.
For the norm, use a normalized vector:
x = x/norm(x); then norm(x) = 1.
• Need to know the peak performance of the machine.
• Need to know the number of operations that are
being performed. How do you determine it:
analytically or by measurement?
• Resolution of the timer was a big problem. Need
to know the granularity of the time intervals we
can measure accurately.
• Some people repeated the test case a number of
times to get an average. This is a problem since
it doesn't take the effects of cache into
account. You are in effect working with a hot
cache after the first call. How can this be
overcome?
• Because of inaccuracies in the timing itself,
what about repeating the experiments and looking
at error bars?
• Cache is inherent in the machine; you can't
increase it.
• Mflop/s, not Mflops: it's a rate of execution.

4
Homework- Norm, MV, and MM
• Results
• How timed?
• Why
• How to increase performance
• Graphs
• Check results
• PIII 933 MHz
• gcc -O3

5
Types of Parallel Computers
• The simplest and most useful way to classify
modern parallel computers is by their memory
model
• shared memory
• distributed memory

6
Shared vs. Distributed Memory
[Diagram: six processors P on a BUS to a single shared Memory]

Shared memory - single address space; all processors access the same
memory. (Ex: SGI Origin, Sun E10000)

[Diagram: six processors P, each with its own local memory M, connected
by a Network]

Distributed memory - each processor has its own local memory. Must do
message passing to exchange data between processors. (Ex: CRAY T3E,
IBM SP, clusters)
7
Shared Memory UMA vs. NUMA
Uniform memory access (UMA): each processor has uniform access time to
memory; also called symmetric multiprocessors (Sun E10000)

[Diagram: processors P on a single BUS to one Memory]

Non-uniform memory access (NUMA): time for memory access depends on
location of data. Local access is faster than non-local access. Easier
to scale than SMPs (SGI Origin)

[Diagram: two bus-based groups of processors, each with its own Memory,
joined by a Network]
8
Standard Uniprocessor Memory Hierarchy
• Intel Pentium III 1.135 GHz processor (Model 11)
• 16 Kbytes of 4 way assoc. L1 instruction cache
with 32 byte lines.
• 16 Kbytes of 4 way assoc. L1 data cache with 32
byte lines.
• 512 Kbytes of 8 way assoc. L2 cache with 32 byte
lines.

9
Distributed Memory MPPs vs. Clusters
• Processors-memory nodes are connected by some
type of interconnect network
• Massively Parallel Processor (MPP) tightly
integrated, single system image.
• Cluster: individual computers connected by software

[Diagram: nodes, each a CPU with its own MEM, attached to an
Interconnect Network]
10
Processors, Memory, Networks
• Both shared and distributed memory systems have
• processors now generally commodity RISC
processors
• memory now generally commodity DRAM
• network/interconnect between the processors and
memory (bus, crossbar, fat tree, torus,
hypercube, etc.)

11
Interconnect-Related Terms
• Latency How long does it take to start sending a
"message"? Measured in microseconds.
• (Also in processors How long does it take to
output results of some operations, such as
floating point add, divide etc., which are
pipelined?)
• Bandwidth What data rate can be sustained once
the message is started? Measured in Mbytes/sec.
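Latency and bandwidth combine into the usual linear cost model for a message of n bytes. A minimal sketch; the 10 µs latency and 100 Mbytes/s bandwidth in the usage below are illustrative assumptions, not measurements from the lecture.

```c
#include <assert.h>
#include <math.h>

/* Linear cost model: time to move n bytes = latency + n / bandwidth.
   Units: seconds, bytes, and bytes per second. */
double msg_time(double latency_s, double bw_bytes_per_s, double n_bytes) {
    return latency_s + n_bytes / bw_bytes_per_s;
}
```

With these numbers, a 1 MB message costs about 10 ms (bandwidth-dominated), while an 8-byte message costs barely more than the 10 µs latency.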

12
Interconnect-Related Terms
• Topology the manner in which the nodes are
connected.
• Best choice would be a fully connected network
(every processor to every other). Infeasible for
cost and scaling reasons.
• Instead, processors are arranged in some
variation of a grid, torus, or hypercube.

[Diagrams: 2-d mesh, 2-d torus, 3-d hypercube]
13
Shared Memory / Local Memory
• Usually think in terms of the hardware
• What about a software model?
• How about something that works like cache?
• Logically shared memory

14
Parallel Programming Models
• Control
• how is parallelism created
• what orderings exist between operations
• how do different threads of control synchronize
• Naming
• what data is private vs. shared
• how logically shared data is accessed or
communicated
• Set of operations
• what are the basic operations
• what operations are considered to be atomic
• Cost
• how do we account for the cost of each of the
above

15
Trivial Example
• Parallel Decomposition
• Each evaluation and each partial sum is a task
• Assign n/p numbers to each of p procs
• each computes independent private results and
partial sum
• one (or all) collects the p partial sums and
computes the global sum
• => Classes of Data
• Logically Shared
• the original n numbers, the global sum
• Logically Private
• the individual function evaluations
• what about the individual partial sums?
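The "assign n/p numbers to each of p procs" step above can be sketched as a block decomposition. This is a serial simulation - the loop over r stands in for the p processors - and the function f is an illustrative placeholder.

```c
#include <assert.h>

/* Processor `rank` (0..p-1) gets the contiguous index block [lo, hi);
   blocks differ in size by at most one element when p does not divide n. */
void block_range(int n, int p, int rank, int *lo, int *hi) {
    *lo = (int)((long long)n * rank / p);
    *hi = (int)((long long)n * (rank + 1) / p);
}

static double f(double x) { return x; }   /* illustrative function */

/* `local` is logically private to each "processor"; `total` is the
   logically shared global sum that one processor collects. */
double parallel_sum(const double *A, int n, int p) {
    double total = 0.0;
    for (int r = 0; r < p; r++) {
        int lo, hi;
        block_range(n, p, r, &lo, &hi);
        double local = 0.0;           /* independent private partial sum */
        for (int i = lo; i < hi; i++)
            local += f(A[i]);
        total += local;               /* collect the p partial sums */
    }
    return total;
}

/* Demo: sum 1..7 split across 3 "processors". */
double demo_sum(void) {
    double A[7] = {1, 2, 3, 4, 5, 6, 7};
    return parallel_sum(A, 7, 3);
}
```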

16
Programming Model 1
• program consists of a collection of threads of
control,
• each with a set of private variables
• e.g., local variables on the stack
• collectively with a set of shared variables
• e.g., static variables, shared common blocks,
global heap
• threads communicate implicitly by writing and reading shared variables
• threads coordinate explicitly by synchronization
operations on shared variables
• locks, semaphores
• Like concurrent programming on
uniprocessor

[Diagram: shared address space holding A, x = ..., y = ..x...; each
thread has private variables i, res, s]
17
Machine Model 1
• A shared memory machine
• Processors all connected to a large shared memory
• Local memory is not (usually) part of the
hardware
• Sun, DEC, Intel SMPs (symmetric
multiprocessors) in Millennium; SGI Origin
• Cost: much cheaper to access data in cache than in main memory
• Machine model 1a A Shared Address Space Machine
• replace caches by local memories (in abstract
machine model)
• this affects the cost model -- repeatedly
accessed data should be copied
• Cray T3E

18
Shared Memory code for computing a sum
Thread 1:
    s = 0 initially
    local_s1 = 0
    for i = 0 to n/2-1
        local_s1 = local_s1 + f(A[i])
    s = s + local_s1

Thread 2:
    s = 0 initially
    local_s2 = 0
    for i = n/2 to n-1
        local_s2 = local_s2 + f(A[i])
    s = s + local_s2

What could go wrong?
19
Pitfall and solution via synchronization
• Pitfall in computing a global sum s = local_s1 + local_s2

Possible interleaving in time:
    Thread 1: load s from mem to reg (s = 0 initially)
    Thread 2: load s from mem to reg (reads 0)
    Thread 1: s = s + local_s1 (= local_s1, in reg)
    Thread 2: s = s + local_s2 (= local_s2, in reg)
    Thread 1: store s from reg to mem
    Thread 2: store s from reg to mem
• Instructions from different threads can be
interleaved arbitrarily
• What can final result s stored in memory be?
• Race Condition
• Possible solution Mutual Exclusion with Locks

Thread 1:                    Thread 2:
    lock                         lock
    s = s + local_s1             s = s + local_s2
    store s                      store s
    unlock                       unlock
• Locks must be atomic (execute completely without
interruption)
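The race and the lock-based fix can be demonstrated with POSIX threads. A minimal sketch: the array size, thread count, and the function f are illustrative choices, and the slide's local_s1/local_s2 become one private local_s per thread.

```c
#include <assert.h>
#include <pthread.h>

#define N 100000
#define NTHREADS 2

static double A[N];
static double s;                               /* shared global sum */
static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return 2.0 * x; }  /* illustrative f */

/* Each thread accumulates into a private local_s, then folds it into
   the shared s inside a critical section, so the load/add/store of s
   cannot interleave with the other thread's. */
static void *worker(void *arg) {
    long t = (long)arg;
    long lo = t * (N / NTHREADS), hi = lo + N / NTHREADS;
    double local_s = 0.0;
    for (long i = lo; i < hi; i++)
        local_s += f(A[i]);
    pthread_mutex_lock(&s_lock);
    s = s + local_s;          /* without the lock, this line is the race */
    pthread_mutex_unlock(&s_lock);
    return NULL;
}

double shared_sum(void) {
    pthread_t tid[NTHREADS];
    s = 0.0;
    for (long i = 0; i < N; i++)
        A[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return s;
}
```

Removing the mutex calls makes the final s occasionally lose one thread's contribution, exactly the interleaving traced above.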

20
Programming Model 2
• Message Passing
• program consists of a collection of named
processes
• local variables, static variables, common blocks,
heap
• processes communicate by explicit data transfers
• matching pair of send & receive by source and
dest. proc.
• coordination is implicit in every communication
event
• logically shared data is partitioned over local
processes
• Like distributed programming
• Program with standard libraries: MPI, PVM

[Diagram: each process holds its own piece of array A(0:n)]
21
Machine Model 2
• A distributed memory machine
• Cray T3E, IBM SP2, Clusters
• Processors all connected to own memory (and
caches)
• cannot directly access another processor's memory
• Each node has a network interface (NI)
• all communication and synchronization done
through this

22
Computing s = x(1) + x(2) on each processor
• First possible solution:

    Processor 1:                  Processor 2:
        xlocal = x(1)                 xlocal = x(2)
        send xlocal, proc2            receive xremote, proc1
        receive xremote, proc2        send xlocal, proc1
        s = xlocal + xremote          s = xlocal + xremote

• Second possible solution - what could go wrong?

    Processor 1:                  Processor 2:
        xlocal = x(1)                 xlocal = x(2)
        send xlocal, proc2            send xlocal, proc1
        receive xremote, proc2        receive xremote, proc1
        s = xlocal + xremote          s = xlocal + xremote

• What if send/receive act like the telephone
system? The post office?

23
Programming Model 3
• Data Parallel
• Single sequential thread of control consisting of
parallel operations
• Parallel operations applied to all (or defined
subset) of a data structure
• Communication is implicit in parallel operators
and shifted data structures
• Elegant and easy to understand and reason about
• Not all problems fit this model
• Like marching in a regiment

A = array of all data
fA = f(A)
s = sum(fA)
• Think of Matlab

24
Model 3
• Vector Computing
• One instruction executed across all the data in a
pipelined fashion
• Parallel operations applied to all (or defined
subset) of a data structure
• Communication is implicit in parallel operators
and shifted data structures
• Elegant and easy to understand and reason about
• Not all problems fit this model
• Like marching in a regiment

A = array of all data
fA = f(A)
s = sum(fA)
• Think of Matlab

25
Machine Model 3
• A SIMD (Single Instruction Multiple Data)
machine
• A large number of small processors
• A single control processor issues each
instruction
• each processor executes the same instruction
• some processors may be turned off on any
instruction

[Diagram: a control processor broadcasting instructions to an array of
processors over an interconnect]
• Machines not popular (CM2), but programming model
is
• implemented by mapping n-fold parallelism to p
processors
• mostly done in the compilers (HPF = High
Performance Fortran)

26
Machine Model 4
• Since small shared memory machines (SMPs) are the
fastest commodity machine, why not build a larger
machine by connecting many of them with a
network?
• CLUMP Cluster of SMPs
• Shared memory within one SMP, message passing
outside
• Clusters, ASCI Red (Intel), ...
• Programming model?
• Treat machine as flat, always use message
passing, even within an SMP (simple, but ignores
an important part of the memory hierarchy)
• Expose two layers: shared memory (OpenMP) and
message passing (MPI); higher performance, but
ugly to program.

27
Programming Model 5
• Bulk Synchronous Processing (BSP), L. Valiant
• Used within the message passing or shared memory
models as a programming convention
• Phases separated by global barriers
• Compute phases: all operate on local data (in
distributed memory)
• Communication phases: all participate in
rearrangement or reduction of global data
• Generally all doing the same thing in a phase
• all do f, but may all do different things within
f
• Simplicity of data parallelism without
restrictions

28
Summary so far
• Historically, each parallel machine was unique,
along with its programming model and programming
language. Software had to be thrown away and
started over with each new kind of machine - ugh
• Now we distinguish the programming model from the
underlying machine, so we can write portably
correct code that runs on many machines
• MPI now the most portable option, but can be
tedious
• Writing portably fast code requires tuning for
the architecture
• Algorithm design challenge is to make this
process easy
• Example picking a blocksize, not rewriting whole
algorithm

29
Recap
• Parallel Comp. Architecture driven by familiar
technological and economic forces
• application/platform cycle, but focused on the
most demanding applications
• hardware/software learning curve
• More attractive than ever because best building
block - the microprocessor - is also the fastest
BB.
• History of microprocessor architecture is
parallelism
• translates area and density into performance
• The Future is higher levels of parallelism
• Parallel Architecture concepts apply at many
levels
• Communication also on exponential curve
• => Quantitative Engineering approach
30
History
• Parallel architectures tied closely to
programming models
• Divergent architectures, with no predictable
pattern of growth.
• Mid 80s renaissance

[Diagram: application software and system software atop divergent
architectures - systolic arrays, SIMD, message passing, dataflow,
shared memory]
31
Programming Model
• Conceptualization of the machine that programmer
uses in coding applications
• How parts cooperate and coordinate their
activities
• Specifies communication and synchronization
operations
• Multiprogramming
• Independent jobs, no communication or synch. at
program level
• like bulletin board
• Message passing
• like letters or phone calls, explicit point to
point
• Data parallel
• more regimented, global actions on data
• Implemented with shared address space or message
passing

32
Economics
• Commodity microprocessors not only fast but CHEAP
• Development costs: tens of millions of dollars
• BUT, many more are sold compared to
supercomputers
• Crucial to take advantage of the investment, and
use the commodity building block
• Multiprocessors being pushed by software vendors
(e.g. database) as well as hardware vendors
• Standardization makes small, bus-based SMPs
commodity
• Desktop: few smaller processors versus one larger
one?
• Multiprocessor on a chip?

33
Performance Numbers on RISC Processors
• Using Linpack Benchmark

34
Consider Scientific Supercomputing
• Proving ground and driver for innovative
architecture and techniques
• Market smaller relative to commercial as MPs
become mainstream
• Dominated by vector machines starting in 70s
• Microprocessors have made huge gains in
floating-point performance
• high clock rates
• pipelined floating point units (e.g.,
• instruction-level parallelism
• effective use of caches (e.g., automatic
blocking)
• Plus economics
• Large-scale multiprocessors replace vector
supercomputers

35
Performance Development
[TOP500 performance development chart: sums ranging from 60 Gflop/s to
7.2 Tflop/s; entry level 94 Gflop/s; Schwab system at #24; #394 exceeds
100 Gflop/s - growth faster than Moore's law]
36
Performance Development
[Performance development chart with "My Laptop" marked for scale;
extrapolation: entry level reaches 1 Tflop/s in 2005 and 1 Pflop/s in
2010]
37
Architectures
[Chart: architecture types over time; a "constellation" is a system
with #processors per node >= #nodes]
38
Chip Technology
39
Manufacturer
IBM 32%, HP 30%, SGI 8%, Cray 8%, Sun 6%, Fujitsu
4%, NEC 3%, Hitachi 3%
40
High-Performance Computing Directions
Beowulf-class PC Clusters
Definition
• COTS PC Nodes
• Pentium, Alpha, PowerPC, SMP
• COTS LAN/SAN Interconnect
• Ethernet, Myrinet, Giganet, ATM
• Open Source Unix
• Linux, BSD
• Message Passing Computing
• MPI, PVM
• HPF
• Best price-performance
• Low entry-level cost
• Just-in-place configuration
• Vendor invulnerable
• Scalable
• Rapid technology tracking

Enabled by PC hardware, networks and operating
system achieving capabilities of scientific
workstations at a fraction of the cost and
availability of industry standard message passing
libraries. However, much more of a contact sport.
41
• Peak performance
• Interconnection
• http://clusters.top500.org
• Benchmark results to follow in the coming months

42
Distributed and Parallel Systems
[Diagram: spectrum from distributed, heterogeneous systems to massively
parallel, homogeneous systems; examples along it include Grid
Computing, SETI@home, Entropia, Berkeley NOW, SNL Cplant, Beowulf,
parallel distributed-memory machines, and ASCI Tflops]

Distributed systems (heterogeneous):
• Gather (unused) resources
• Steal cycles
• System SW manages resources
• 10 - 20% overhead is OK
• Resources drive applications
• Time to completion is not critical
• Time-shared

Massively parallel systems (homogeneous):
• Bounded set of resources
• Apps grow to consume all cycles
• Application manages resources
• System SW gets in the way
• Apps drive purchase of equipment
• Real-time constraints
• Space-shared

43
Different Parallel Architectures
• Parallel computing: single systems with many
processors working on the same problem
• Distributed computing: many systems loosely
coupled by a scheduler to work on related
problems
• Grid computing: many systems tightly coupled by
software, perhaps geographically distributed, to
work together on single problems or on related
problems

44
Performance Improvements for Scientific Computing
Problems
45
Shared Memory => Shared Addr. Space
• Bottom-up engineering factors
• Programming concepts
• Why it's attractive.

46
• Memory capacity increased by adding modules
• I/O by controllers and devices
• For higher-throughput multiprogramming, or
parallel programs

47
Historical Development
• Mainframe approach
• Motivated by multiprogramming
• Extends crossbar used for Mem and I/O
• Processor cost-limited => crossbar
• Bandwidth scales with p
• High incremental cost
• Minicomputer approach
• Almost all microprocessor systems have bus
• Motivated by multiprogramming, TP
• Used heavily for parallel computing
• Called symmetric multiprocessor (SMP)
• Latency larger than for uniprocessor
• Bus is bandwidth bottleneck
• caching is key: coherence problem
• Low incremental cost

48
Shared Physical Memory
• Any processor can directly reference any memory
location
• Any I/O controller - any memory
• Operating system can run on any processor, or
all.
• OS uses shared memory to coordinate
• Communication occurs implicitly as a result of
ordinary loads and stores
49
• Virtual-to-physical mapping can be established so
that processes share portions of the address space.
• User-kernel or multiple processes
• Popular approach to structuring OSs
• Now standard application capability (ex: POSIX)
• Natural extension of uniprocessors model
• conventional memory operations for communication
• special atomic operations for synchronization

50
• Ad hoc parallelism used in system code
• Most parallel applications have structured SAS
• Same program on each processor
• shared variable X means the same thing to each process

51
• All coherence and multiprocessing glue in
processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth

52
Engineering SUN Enterprise
• Proc mem card - I/O card
• 16 cards of either type
• All memory accessed over bus, so symmetric
• Higher bandwidth, higher latency bus

53
Scaling Up
[Diagrams: "dance hall" - all processors on one side of the network,
all memories on the other - vs. distributed memory - each processor
paired with its own local memory, pairs connected by the network]
• Problem is interconnect cost (crossbar) or
bandwidth (bus)
• Dance-hall: bandwidth still scalable, but lower
cost than crossbar
• latencies to memory uniform, but uniformly large
• Distributed memory or non-uniform memory access
(NUMA)
• Construct shared address space out of simple
message transactions across a general-purpose network
• Caching shared (particularly nonlocal) data?

54
Engineering Cray T3E
• Scale up to 1024 processors, 480MB/s links
• Memory controller generates request message for
non-local references
• No hardware mechanism for coherence
• SGI Origin etc. provide this

55
[Diagram: the divergent approaches - systolic arrays, SIMD, message
passing, dataflow, shared memory - converging on a generic
architecture]
56
Message Passing Architectures
• Complete computer as building block, including
I/O
• Communication via explicit I/O operations
• Programming model
• direct access only to private address space
(local memory),
• communication via explicit messages
• High-level block diagram
• Communication integration?
• Mem, I/O, LAN, Cluster
• Easier to build and scale than SAS
• Programming model more removed from basic
hardware operations
• Library or OS intervention

57
Message-Passing Abstraction
• Send specifies buffer to be transmitted and
receiving process
• Recv specifies sending process and application
storage to receive into
• Optional tag on send and matching rule on receive
• User process names local data and entities in
process/tag space too
• In simplest form, the send/recv match achieves
pairwise synch event
• Other variants too
• Many overheads copying, buffer management,
protection

58
Evolution of Message-Passing Machines
• Early machines FIFO on each link
• HW close to prog. Model
• synchronous ops
• topology central (hypercube algorithms)

CalTech Cosmic Cube (Seitz, CACM Jan 85)
59
Diminishing Role of Topology
• DMA, enabling non-blocking ops
• Buffered by system at destination until recv
• Store-and-forward routing
• Diminishing role of topology
• Any-to-any pipelined routing
• node-network interface dominates communication
time
• Simplifies programming
• Allows richer design space
• grids vs hypercubes

Intel iPSC/1 -> iPSC/2 -> iPSC/860
Store-and-forward vs. pipelined: H × (T0 + n/B) vs. T0 + H·D + n/B
(H hops, per-hop delay D, message size n, bandwidth B)
60
Example Intel Paragon
61
Building on the mainstream IBM SP-2
• Made out of essentially complete RS6000
workstations
• Network interface integrated in I/O bus (bw
limited by I/O bus)

62
Highly Parallel Supercomputing Where Are We?
• Performance
• Sustained performance has dramatically increased
during the last year.
• On most applications, sustained performance per
dollar now exceeds that of conventional
supercomputers. But...
• Conventional systems are still faster on some
applications.
• Languages and compilers
• Standardized, portable, high-level languages such
as OpenMP, HPF, PVM and MPI are available. But
...
• Initial HPF releases are not very efficient.
• Message passing programming is tedious and
hard to debug.
• Programming difficulty remains a major obstacle
to usage by mainstream scientists.

63
Achieving TeraFlops
• In 1991, 1 Gflop/s
• 1000 fold increase
• Architecture
• exploiting parallelism
• Processor, communication, memory
• Moore's Law
• Algorithm improvements
• block-partitioned algorithms

64
Future Petaflops (10^15 fl pt ops/s)
• dynamic redistribution of
• new language and constructs
• role of numerical libraries
• algorithm adaptation to hardware failure
• A Pflop for 1 second ≈ a typical workstation
computing for 1 year.
• From an algorithmic standpoint
• concurrency
• data locality
• latency & sync
• floating point accuracy

65
Petaflop (10^15 flop/s) Computers Within the Next
• Five basis design points
• Conventional technologies
• 4.8 GHz processor, 8000 nodes, each w/16
processors
• Processing-in-memory (PIM) designs
• Reduce memory access bottleneck
• Superconducting processor technologies
• Digital superconductor technology, Rapid
Single-Flux-Quantum (RSFQ) logic hybrid
• Special-purpose hardware designs
• Specific applications e.g. GRAPE Project in Japan
for gravitational force computations
• Schemes utilizing the aggregate computing power
of processors distributed on the web
• SETI@home: 26 Tflop/s

66
Petaflops (10^15 flop/s) Computer Today?
• 1 GHz processor (O(10^9) ops/s)
• 1 Million PCs
• $1B ($1K each)
• 100 Mwatts
• 5 acres
• PC failure every second

67
Outline
• A little history
• IEEE floating point formats
• Error analysis
• Exception handling
• Using exception handling to go faster
• How to get extra precision cheaply
• Cray arithmetic - a pathological example
• Dangers of Parallel and Heterogeneous Computing

68
A little history
• Von Neumann and Goldstine - 1947
• Can't expect to solve most big (n > 15) linear
systems without carrying many decimal digits
(d > 8); otherwise the computed answer would be
completely inaccurate. - WRONG!
• Turing - 1949
• Carrying d digits is equivalent to changing the
input data in the d-th place and then solving
Ax = b. So if A is only known to d digits, the
answer is as accurate as the data deserves.
• Backward Error Analysis
• Rediscovered in 1961 by Wilkinson and publicized
• Starting in the 1960s- many papers doing backward
error analysis of various algorithms
• Many years where each machine did FP arithmetic
slightly differently
• Both rounding and exception handling differed
• Hard to write portable and reliable software
• Motivated search for industry-wide standard,
beginning late 1970s
• First implementation Intel 8087
• ACM Turing Award 1989 to W. Kahan for design of
the IEEE Floating Point Standards 754 (binary)
and 854 (decimal)
• Nearly universally implemented in general purpose
machines

69
Defining Floating Point Arithmetic
• Representable numbers
• Scientific notation: ± d.d...d × r^exp
• sign bit ±
• radix r (usually 2 or 10, sometimes 16)
• significand d.d...d (how many base-r digits d?)
• exponent exp (range?)
• others?
• Operations
• arithmetic: +, -, ×, /, ...
• how to round result to fit in format
• comparison (<, =, >)
• conversion between different formats
• short to long FP numbers, FP to integer
• exception handling
• what to do for 0/0, 2 × largest_number, etc.
• binary/decimal conversion
• for I/O, when radix is not 10
• Language/library support for these operations

70
IEEE Floating Point Arithmetic Standard 754 -
Normalized Numbers
• Normalized nonzero representable numbers:
±1.d...d × 2^exp
• Macheps = machine epsilon = 2^(-#significand bits) =
relative error in each operation
• OV = overflow threshold = largest number
• UN = underflow threshold = smallest number
• ±Zero: ± sign bit, significand and exponent all zero
• Why bother with -0? Later

| Format | bits | significand bits | macheps | exponent bits | exponent range |
| ------ | ---- | ---------------- | ------- | ------------- | -------------- |
| Single | 32 | 23+1 | 2^-24 (~10^-7) | 8 | 2^-126 to 2^127 (~10^±38) |
| Double | 64 | 52+1 | 2^-53 (~10^-16) | 11 | 2^-1022 to 2^1023 (~10^±308) |
| Double Extended | >=80 | >=64 | <=2^-64 (~10^-19) | >=15 | 2^-16382 to 2^16383 (~10^±4932) |

(Double Extended is 80 bits on all Intel machines)
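The macheps column can be checked experimentally. A minimal sketch; note it finds `DBL_EPSILON` = 2^-52, which is twice the table's macheps = 2^-53, because the table quotes the relative rounding error rather than the spacing of representable numbers.

```c
#include <assert.h>
#include <float.h>

/* Find the smallest power of two eps such that fl(1 + eps) > 1.
   The volatile store forces each sum to be rounded to double,
   guarding against wider intermediate registers. */
double find_eps(void) {
    double eps = 1.0;
    for (;;) {
        volatile double sum = 1.0 + eps / 2.0;
        if (sum == 1.0)
            return eps;
        eps /= 2.0;
    }
}
```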
71
IEEE Floating Point Arithmetic Standard 754 -
Denorms
• Denormalized numbers: ±0.d...d × 2^min_exp
• sign bit, nonzero significand, minimum exponent
• Fills in gap between UN and 0
• Underflow Exception
• occurs when exact nonzero result is less than
underflow threshold UN
• Ex: UN/3
• return a denorm, or zero
• Why bother?
• Necessary so that the following code never divides by
zero
• if (a != b) then x = a/(a-b)

72
IEEE Floating Point Arithmetic Standard 754 -
±Infinity
• ±Infinity: sign bit, zero significand,
maximum exponent
• Overflow Exception
• occurs when exact finite result too large to
represent accurately
• Ex: 2 × OV
• return ±infinity
• Divide by zero Exception
• return ±infinity = 1/±0
• sign of zero important!
• Also return ±infinity for
• 3 + infinity, 2 × infinity, infinity + infinity
• Result is exact, not an exception!
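These rules can be verified directly; a minimal check, where `volatile` keeps the compiler from folding the operations away at compile time.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Overflow and divide-by-zero return signed infinities, and the
   computation keeps going instead of aborting. */
int infinity_demo(void) {
    volatile double zero = 0.0;
    volatile double big = DBL_MAX;         /* OV */
    assert(1.0 / zero == INFINITY);        /* divide by zero: 1/+0 = +inf */
    assert(1.0 / -zero == -INFINITY);      /* the sign of zero matters */
    assert(big * 2.0 == INFINITY);         /* overflow: 2*OV -> +inf */
    assert(3.0 + INFINITY == INFINITY);    /* exact result, no exception */
    return 1;
}
```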

73
IEEE Floating Point Arithmetic Standard 754 - NAN
(Not A Number)
• NAN: sign bit, nonzero significand, maximum
exponent
• Invalid Exception
• occurs when exact result not a well-defined real
number
• 0/0
• sqrt(-1)
• infinity - infinity, infinity/infinity, 0 × infinity
• NAN + 3
• NAN > 3?
• Return a NAN in all these cases
• Two kinds of NANs
• Quiet - propagates without raising an exception
• Signaling - generate an exception when touched
• good for detecting uninitialized data
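The NaN rules above can also be checked directly; a minimal sketch, with `volatile` again used to force the invalid operations to happen at run time.

```c
#include <assert.h>
#include <math.h>

/* Invalid operations return NaN; every comparison involving NaN is
   false, which makes NaN the only value unequal to itself. */
int nan_demo(void) {
    volatile double zero = 0.0, inf = INFINITY;
    double n1 = zero / zero;      /* 0/0: invalid -> quiet NaN */
    double n2 = inf - inf;        /* inf - inf: invalid -> NaN */
    assert(isnan(n1) && isnan(n2));
    assert(n1 != n1);
    assert(!(n1 > 3.0) && !(n1 < 3.0) && !(n1 == 3.0));
    assert(isnan(n1 + 3.0));      /* NaN propagates through arithmetic */
    return 1;
}
```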

74
Error Analysis
• Basic error formula
• fl(a op b) = (a op b)(1 + d), where
• op is one of +, -, ×, /
• |d| <= macheps
• assuming no overflow, underflow, or divide by
zero
• fl(x1+x2+x3+x4) = (((x1+x2)(1+d1) + x3)(1+d2) + x4)(1+d3)
•   = x1(1+d1)(1+d2)(1+d3) + x2(1+d1)(1+d2)(1+d3)
      + x3(1+d2)(1+d3) + x4(1+d3)
•   = x1(1+e1) + x2(1+e2) + x3(1+e3) + x4(1+e4)
• where each |ei| <= 3 · macheps (to first order)
• get exact sum of slightly changed summands
xi(1+ei)
• Backward Error Analysis - algorithm called
numerically stable if it gives the exact result
for slightly changed inputs
• Numerical Stability is an algorithm design goal
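The basic error formula can be measured by recomputing a sum in higher precision. A sketch, assuming `long double` is wider than `double` (as on x86); where it is not, the measured d is simply 0 and the bound still holds.

```c
#include <float.h>
#include <math.h>

/* Measure the d in fl(a+b) = (a+b)(1+d) using a long double reference
   for the exact sum.  For two doubles whose exponents are close, the
   80-bit format holds the exact sum, so the quotient isolates d. */
double rel_error_of_add(double a, double b) {
    double computed = a + b;                       /* rounded to double */
    long double exact = (long double)a + (long double)b;
    return (double)fabsl(((long double)computed - exact) / exact);
}
```

For example, 0.1 + 0.3 is not exactly representable, so the measured d is nonzero but bounded by macheps; 1.0 + 3.0 is exact, so d = 0.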

75
Example: polynomial evaluation using Horner's rule
• Horner's rule to evaluate p = sum_{k=0}^{n} c_k x^k
• p = c_n; for k = n-1 down to 0, p = x·p + c_k
• Numerically Stable
• Apply to (x-2)^9 = x^9 - 18x^8 + ... - 512
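Horner's rule in C, applied to the expanded coefficients of (x-2)^9 (computed from the binomial theorem). The checks below use points where the arithmetic is exact; evaluating very close to x = 2, where the true value is tiny, is where rounding errors dominate.

```c
#include <assert.h>

/* Horner's rule: p = c[n]; for k = n-1 down to 0, p = x*p + c[k]. */
double horner(const double *c, int n, double x) {
    double p = c[n];
    for (int k = n - 1; k >= 0; k--)
        p = x * p + c[k];
    return p;
}

/* Coefficients of (x-2)^9 expanded: c9[k] is the coefficient of x^k. */
static const double c9[10] = { -512, 2304, -4608, 5376, -4032,
                                2016, -672, 144, -18, 1 };
```

At x = 3 the polynomial is (3-2)^9 = 1, and at x = 0 it is (-2)^9 = -512; both evaluations stay in exact integer arithmetic.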
76
Example polynomial evaluation (continued)
• (x-2)^9 = x^9 - 18x^8 + ... - 512
• We can compute error bounds using
• fl(a op b) = (a op b)(1+d)
77
• What happens when the exact value is not a real
number, or is too small or too large to represent
accurately?
• You get an exception

78
Exception Handling
• What happens when the exact value is not a real
number, or too small or too large to represent
accurately?
• 5 Exceptions
• Overflow - exact result > OV, too large to
represent
• Underflow - exact result nonzero and < UN, too
small to represent
• Divide-by-zero - nonzero/0
• Invalid - 0/0, sqrt(-1), ...
• Inexact - you made a rounding error (very
common!)
• Possible responses
• Stop with error message (unfriendly, not default)
• Keep computing (default, but how?)

79
Exception Handling User Interface
• Each of the 5 exceptions has the following
features
• A sticky flag, which is set as soon as an
exception occurs
• The sticky flag can be reset and read by the user
• reset overflow_flag and invalid_flag
• perform a computation
• test overflow_flag and invalid_flag to see if
any exception occurred
• An exception flag, which indicates whether a trap
should occur
• Not trapping is the default
• Instead, continue computing returning a NAN,
infinity or denorm
• On a trap, there should be a user-writable
exception handler with access to the parameters
of the exceptional operation
• Trapping or precise interrupts like this are
rarely implemented, for performance reasons.

80
Exploiting Exception Handling to Design Faster
Algorithms
• Quick with high probability
• Assumes exception handling done quickly
• Ex 1: Solving triangular system Tx = b
• Part of BLAS2 - highly optimized, but risky
• If T nearly singular, expect very large x, so
scale inside inner loop: slow but low risk
• Use paradigm with sticky flags to detect nearly
singular T
• Up to 9x faster on DEC Alpha
• Ex 2: Computing eigenvalues, up to 1.5x faster on
CM-5
• Demmel/Li (www.cs.berkeley.edu/xiaoye)

1) Try fast, but possibly risky algorithm
2) Quickly test for accuracy of answer (use
exception handling)
3) In rare case of inaccuracy, rerun using slower
low-risk algorithm

Low risk (slow):
    for k = 1 to n
        d = a_k - s - b_k^2/d
        if |d| < tol, d = -tol
        if d < 0, count = count + 1

Fast but risky:
    for k = 1 to n
        d = a_k - s - b_k^2/d    (ok to divide by 0)
        count = count + signbit(d)
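The fast loop is the eigenvalue-count (Sturm sequence) recurrence for a symmetric tridiagonal matrix, relying on IEEE semantics exactly as suggested: if d becomes 0, the next b²/d is infinite and `signbit` still counts correctly. A C sketch; the 2×2 demo matrix is my own illustration, not from the lecture.

```c
#include <math.h>

/* Count the eigenvalues less than the shift s for a symmetric
   tridiagonal matrix with diagonal a[0..n-1] and off-diagonal
   b[1..n-1] (b[0] unused).  No test on d: a zero d just makes the
   next b*b/d infinite, and signbit of the resulting -inf is right. */
int sturm_count(const double *a, const double *b, int n, double s) {
    int count = 0;
    double d = 1.0;
    for (int k = 0; k < n; k++) {
        d = a[k] - s - (k > 0 ? b[k] * b[k] / d : 0.0);
        count += signbit(d) != 0;
    }
    return count;
}

/* Demo matrix [[2,1],[1,2]], whose eigenvalues are 1 and 3. */
int demo_count(double s) {
    double a[2] = {2.0, 2.0};
    double b[2] = {0.0, 1.0};
    return sturm_count(a, b, 2, s);
}
```

Shifts of 0, 2, and 4 bracket the two eigenvalues, giving counts of 0, 1, and 2; the shift s = 2 path actually exercises the divide-by-zero case.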
81
Summary of Values Representable in IEEE FP
• - Zero
• Normalized nonzero numbers
• Denormalized numbers
• -Infinity
• NANs
• Signaling and quiet
• Many systems have only quiet

| Exponent | Significand | Represents |
| -------- | ----------- | ---------- |
| 0...0 | 0...0 | ±0 |
| 0...0 | nonzero | denormalized |
| not 0 or all 1s | anything | normalized nonzero |
| 1...1 | 0...0 | ±infinity |
| 1...1 | nonzero | NAN |
82
Hazards of Parallel and Heterogeneous Computing
• What new bugs arise in parallel floating point
programs?
• Ex 1: Nonrepeatability
• Makes debugging hard!
• Ex 2: Different exception handling
• Can cause programs to hang
• Ex 3: Different rounding (even on IEEE FP
machines)
• Can cause hanging, or wrong results with no
warning
• See www.netlib.org/lapack/lawns/lawn112.ps