Transcript and Presenter's Notes

Title: CS 594 Spring 2002 Lecture 3:


1
CS 594 Spring 2002 Lecture 3
  • Jack Dongarra
  • University of Tennessee

2
Plan For Today
  • Prof. Vassil Alexandrov, Reading University,
    England
  • Monte Carlo Methods
  • Homework Assignment 1
  • Lecture: Parallel Architectures and Programming

3
Homework 1
  • Need to report what level of optimization is
    being used and perhaps what effects optimization
    has on the performance.
  • Need to describe how the data is generated. Is
    the computation done with 64-bit floating point?
    What kind of random number generator is used?
  • Using data where each entry is 1.0 or 0.0 may be
    a problem.
  • Verifying the result by eye is not a good way to
    check the correctness of your numerical results.
    For the norm, use a normalized vector:
    x = x/norm(x), so that norm(x) = 1.
  • Need to know the peak performance of the machine.
  • Need to know the number of operations that are
    being performed. How to determine this:
    analytically or by measurement?
  • Resolution of the timer was a big problem. Need
    to know the granularity of the time intervals we
    can measure accurately.
  • Some people repeated the test case a number of
    times to get an average. This is a problem
    since it doesn't take into account the effects of
    cache. You are in effect working with a hot
    cache after the first call. How can this be
    overcome? (A timing sketch addressing this
    appears after this list.)
  • Because of inaccuracies in the timing itself,
    what about repeating the experiments and looking
    at error bars?
  • Cache is something inherent in the machine; you
    can't increase it.
  • Mflop/s, not Mflops: it's a rate of execution.
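
As a hedged illustration of the points above (not the assignment's prescribed
method), one might time the first, cold-cache call separately from an average
over repeated, hot-cache calls, after checking that the measured interval
comfortably exceeds the timer's resolution. The kernel, sizes, data, and
repetition count below are illustrative assumptions.

    /* Timing sketch in C: cold-cache vs. hot-cache Mflop/s for y = A*x */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    static double wall_time(void)              /* seconds, ~1 us resolution */
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    static void matvec(int n, const double *A, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {          /* y = A*x, 2*n*n flops */
            double t = 0.0;
            for (int j = 0; j < n; j++)
                t += A[i * (size_t)n + j] * x[j];
            y[i] = t;
        }
    }

    int main(void)
    {
        int n = 1000, reps = 50;               /* illustrative sizes */
        double *A = malloc((size_t)n * n * sizeof *A);
        double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
        for (size_t i = 0; i < (size_t)n * n; i++) A[i] = 1.0 / (i + 1);
        for (int i = 0; i < n; i++) x[i] = 1.0;

        double t0 = wall_time();
        matvec(n, A, x, y);                    /* first call: cold cache */
        double t_cold = wall_time() - t0;

        t0 = wall_time();
        for (int r = 0; r < reps; r++)         /* repeated calls: hot cache */
            matvec(n, A, x, y);
        double t_hot = (wall_time() - t0) / reps;

        double flops = 2.0 * n * n;            /* multiply-add per matrix entry */
        printf("cold: %.1f Mflop/s   hot: %.1f Mflop/s\n",
               flops / t_cold * 1e-6, flops / t_hot * 1e-6);
        free(A); free(x); free(y);
        return 0;
    }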

4
Homework: Norm, MV, and MM
  • Results
  • How timed?
  • Why
  • How to increase performance
  • Graphs
  • Check results
  • PIII 933 MHz
  • gcc -O3

5
Types of Parallel Computers
  • The simplest and most useful way to classify
    modern parallel computers is by their memory
    model
  • shared memory
  • distributed memory

6
Shared vs. Distributed Memory
Shared memory - single address space. All
processors have access to a pool of shared
memory. (Ex: SGI Origin, Sun E10000)
[Diagram: processors P connected by a BUS to a single shared Memory]
Distributed memory - each processor has its own
local memory. Must do message passing to exchange
data between processors. (Ex: CRAY T3E, IBM SP,
clusters)
[Diagram: processor/memory (P/M) node pairs connected by a Network]
7
Shared Memory UMA vs. NUMA
Uniform memory access (UMA): each processor has
uniform access to memory. Also known as symmetric
multiprocessors. (Sun E10000)
[Diagram: processors P sharing one Memory over a single BUS]
Non-uniform memory access (NUMA): time for memory
access depends on location of data. Local access
is faster than non-local access. Easier to scale
than SMPs. (SGI Origin)
[Diagram: two bus-based processor/memory groups joined by a Network]
8
Standard Uniprocessor Memory Hierarchy
  • Intel Pentium III 1.135 GHz processor (Model 11)
  • 16 Kbytes of 4 way assoc. L1 instruction cache
    with 32 byte lines.
  • 16 Kbytes of 4 way assoc. L1 data cache with 32
    byte lines.
  • 512 Kbytes of 8 way assoc. L2 cache with 32 byte
    lines.

9
Distributed Memory MPPs vs. Clusters
  • Processors-memory nodes are connected by some
    type of interconnect network
  • Massively Parallel Processor (MPP): tightly
    integrated, single system image.
  • Cluster: individual computers connected by
    software.

[Diagram: CPU + MEM nodes attached to an Interconnect Network]
10
Processors, Memory, Networks
  • Both shared and distributed memory systems have
  • processors now generally commodity RISC
    processors
  • memory now generally commodity DRAM
  • network/interconnect between the processors and
    memory (bus, crossbar, fat tree, torus,
    hypercube, etc.)

11
Interconnect-Related Terms
  • Latency: how long does it take to start sending a
    "message"? Measured in microseconds.
  • (Also in processors: how long does it take to
    output results of some operations, such as
    floating point add, divide etc., which are
    pipelined?)
  • Bandwidth: what data rate can be sustained once
    the message is started? Measured in Mbytes/sec.
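
For illustration only (the numbers below are assumed, not from the slides),
the two terms combine in the usual first-order cost model

    time(n) = latency + n / bandwidth

so with a latency of 10 microseconds and a bandwidth of 100 Mbytes/sec, a
10-Kbyte message costs roughly 10 us + 100 us = 110 us: latency dominates
short messages, bandwidth dominates long ones.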

12
Interconnect-Related Terms
  • Topology: the manner in which the nodes are
    connected.
  • Best choice would be a fully connected network
    (every processor to every other). Infeasible for
    cost and scaling reasons.
  • Instead, processors are arranged in some
    variation of a grid, torus, or hypercube.

[Diagrams: 2-d mesh, 2-d torus, 3-d hypercube]
13
Shared Memory / Local Memory
  • Usually think in terms of the hardware
  • What about a software model?
  • How about something that works like cache?
  • Logically shared memory

14
Parallel Programming Models
  • Control
  • how is parallelism created
  • what orderings exist between operations
  • how do different threads of control synchronize
  • Naming
  • what data is private vs. shared
  • how logically shared data is accessed or
    communicated
  • Set of operations
  • what are the basic operations
  • what operations are considered to be atomic
  • Cost
  • how do we account for the cost of each of the
    above

15
Trivial Example
  • Parallel Decomposition
  • Each evaluation and each partial sum is a task
  • Assign n/p numbers to each of p procs
  • each computes independent private results and
    partial sum
  • one (or all) collects the p partial sums and
    computes the global sum
  • => Classes of Data
  • Logically Shared
  • the original n numbers, the global sum
  • Logically Private
  • the individual function evaluations
  • what about the individual partial sums?

16
Programming Model 1
  • Shared Address Space
  • program consists of a collection of threads of
    control,
  • each with a set of private variables
  • e.g., local variables on the stack
  • collectively with a set of shared variables
  • e.g., static variables, shared common blocks,
    global heap
  • threads communicate implicitly by writing and
    reading shared variables
  • threads coordinate explicitly by synchronization
    operations on shared variables
  • writing and reading flags
  • locks, semaphores
  • Like concurrent programming on
    uniprocessor

[Diagram: shared variables (array A, x, y) visible to all threads; private variables (i, res, s) local to each thread]
17
Machine Model 1
  • A shared memory machine
  • Processors all connected to a large shared memory
  • Local memory is not (usually) part of the
    hardware
  • Sun, DEC, Intel SMPs (Symmetric
    multiprocessors) in Millennium; SGI Origin
  • Cost: much cheaper to access data in cache than
    in main memory
  • Machine model 1a: A Shared Address Space Machine
  • replace caches by local memories (in abstract
    machine model)
  • this affects the cost model -- repeatedly
    accessed data should be copied
  • Cray T3E

18
Shared Memory code for computing a sum
Thread 1                [s = 0 initially]
    local_s1 = 0
    for i = 0, n/2-1
        local_s1 = local_s1 + f(A[i])
    s = s + local_s1

Thread 2                [s = 0 initially]
    local_s2 = 0
    for i = n/2, n-1
        local_s2 = local_s2 + f(A[i])
    s = s + local_s2
What could go wrong?
19
Pitfall and solution via synchronization
  • Pitfall in computing a global sum s = local_s1 +
    local_s2:

Thread 1 (initially s = 0)
    load s from mem to reg
    s = s + local_s1          [= local_s1, in reg]
    store s from reg to mem

Thread 2 (initially s = 0)
    load s from mem to reg    [initially 0]
    s = s + local_s2          [= local_s2, in reg]
    store s from reg to mem

(Time runs downward; the two instruction streams may overlap in time.)
  • Instructions from different threads can be
    interleaved arbitrarily
  • What can final result s stored in memory be?
  • Race Condition
  • Possible solution: Mutual Exclusion with Locks
    (a threaded sketch follows below)

Thread 1
    lock; load s; s = s + local_s1; store s; unlock

Thread 2
    lock; load s; s = s + local_s2; store s; unlock
  • Locks must be atomic (execute completely without
    interruption)
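
A minimal, hedged sketch of the locked sum in C with POSIX threads (not from
the slides; N, f(), and the two-way split are illustrative assumptions):

    /* Two threads compute private partial sums; the shared s is updated
       inside a mutex so the load/add/store sequence is not interleaved. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double A[N];
    static double s = 0.0;                      /* shared global sum */
    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

    static double f(double x) { return x * x; } /* placeholder for f() */

    static void *partial_sum(void *arg)
    {
        long half = (long)arg;                  /* 0 or 1: which half of A */
        double local = 0.0;                     /* private partial sum */
        for (long i = half * (N/2); i < (half + 1) * (N/2); i++)
            local += f(A[i]);
        pthread_mutex_lock(&s_lock);            /* lock */
        s += local;                             /* load s, add, store s */
        pthread_mutex_unlock(&s_lock);          /* unlock */
        return NULL;
    }

    int main(void)
    {
        for (long i = 0; i < N; i++) A[i] = 1.0;
        pthread_t t1, t2;
        pthread_create(&t1, NULL, partial_sum, (void *)0);
        pthread_create(&t2, NULL, partial_sum, (void *)1);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %g (expect %d)\n", s, N);
        return 0;
    }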

20
Programming Model 2
  • Message Passing
  • program consists of a collection of named
    processes
  • thread of control plus local address space
  • local variables, static variables, common blocks,
    heap
  • processes communicate by explicit data transfers
  • matching pair of send and receive by source and
    dest. proc.
  • coordination is implicit in every communication
    event
  • logically shared data is partitioned over local
    processes
  • Like distributed programming
  • Program with standard libraries: MPI, PVM

[Diagram: two processes, each with a private address space holding its own portion of array A(0:n); data moves only via explicit send/receive]
21
Machine Model 2
  • A distributed memory machine
  • Cray T3E, IBM SP2, Clusters
  • Processors all connected to own memory (and
    caches)
  • cannot directly access another processor's memory
  • Each node has a network interface (NI)
  • all communication and synchronization done
    through this

22
Computing s = x(1) + x(2) on each processor
  • First possible solution

Processor 1                         Processor 2
    xlocal = x(1)                       xlocal = x(2)
    send xlocal, proc2                  receive xremote, proc1
    receive xremote, proc2              send xlocal, proc1
    s = xlocal + xremote                s = xlocal + xremote

  • Second possible solution - what could go wrong?

Processor 1                         Processor 2
    xlocal = x(1)                       xlocal = x(2)
    send xlocal, proc2                  send xlocal, proc1
    receive xremote, proc2              receive xremote, proc1
    s = xlocal + xremote                s = xlocal + xremote
  • What if send/receive act like the telephone
    system? The post office?
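
As a hedged illustration (not from the slides), the first exchange might look
as follows in C with MPI; ranks 0 and 1 stand in for processors 1 and 2, and
whether MPI_Send waits for the receiver ("telephone") or is buffered by the
system ("post office") is exactly what decides the fate of the second,
symmetric ordering.

    /* Run with exactly 2 processes, e.g. mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double xlocal, xremote, s;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int other = 1 - rank;
        xlocal = (rank == 0) ? 1.0 : 2.0;   /* x(1) on rank 0, x(2) on rank 1 */

        /* First solution: one side sends first, the other receives first,
           so the exchange completes even if sends are synchronous.        */
        if (rank == 0) {
            MPI_Send(&xlocal, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
            MPI_Recv(&xremote, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&xremote, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&xlocal, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        }

        /* Second solution (both send first) can deadlock if sends block
           until the receiver is ready; it only works if messages are
           buffered by the system.                                         */

        s = xlocal + xremote;
        printf("rank %d: s = %g\n", rank, s);
        MPI_Finalize();
        return 0;
    }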

23
Programming Model 3
  • Data Parallel
  • Single sequential thread of control consisting of
    parallel operations
  • Parallel operations applied to all (or defined
    subset) of a data structure
  • Communication is implicit in parallel operators
    and shifted data structures
  • Elegant and easy to understand and reason about
  • Not all problems fit this model
  • Like marching in a regiment

A = array of all data
fA = f(A)
s = sum(fA)
  • Think of Matlab
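
A hedged sketch (not from the slides) of how the fragment above might be
expressed in C with OpenMP; N and f() are illustrative assumptions. The
parallel loops and the reduction are the "parallel operators", and the
compiler/runtime maps the n-fold parallelism onto the available processors.

    #include <stdio.h>

    static double f(double x) { return x * x; }   /* placeholder elementwise f */

    int main(void)
    {
        enum { N = 1000000 };
        static double A[N], fA[N];
        double s = 0.0;

        for (int i = 0; i < N; i++) A[i] = 1.0;

        /* one logical operation applied to every element of A */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            fA[i] = f(A[i]);

        /* global sum: communication is implicit in the reduction operator */
        #pragma omp parallel for reduction(+ : s)
        for (int i = 0; i < N; i++)
            s += fA[i];

        printf("s = %g\n", s);
        return 0;
    }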

24
Model 3
  • Vector Computing
  • One instruction executed across all the data in a
    pipelined fashion
  • Parallel operations applied to all (or defined
    subset) of a data structure
  • Communication is implicit in parallel operators
    and shifted data structures
  • Elegant and easy to understand and reason about
  • Not all problems fit this model
  • Like marching in a regiment

A = array of all data
fA = f(A)
s = sum(fA)
  • Think of Matlab

25
Machine Model 3
  • An SIMD (Single Instruction Multiple Data)
    machine
  • A large number of small processors
  • A single control processor issues each
    instruction
  • each processor executes the same instruction
  • some processors may be turned off on any
    instruction

[Diagram: a control processor broadcasting each instruction over an interconnect to an array of small processors]
  • Machines not popular (CM2), but programming model
    is
  • implemented by mapping n-fold parallelism to p
    processors
  • mostly done in the compilers (HPF = High
    Performance Fortran)

26
Machine Model 4
  • Since small shared memory machines (SMPs) are the
    fastest commodity machine, why not build a larger
    machine by connecting many of them with a
    network?
  • CLUMP = Cluster of SMPs
  • Shared memory within one SMP, message passing
    outside
  • Clusters, ASCI Red (Intel), ...
  • Programming model?
  • Treat machine as flat, always use message
    passing, even within an SMP (simple, but ignores
    an important part of the memory hierarchy)
  • Expose two layers: shared memory (OpenMP) and
    message passing (MPI); higher performance, but
    ugly to program (a hybrid sketch follows below).
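
A hedged sketch (not from the slides) of the two-layer style on a CLUMP:
OpenMP threads inside each SMP node, MPI between nodes, computing a global
sum. The per-node data size and its distribution are illustrative assumptions.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        enum { N = 1000000 };                 /* elements per node (assumed) */
        static double A[N];
        for (int i = 0; i < N; i++) A[i] = 1.0;

        /* shared-memory layer: threads on one SMP node share A */
        double local = 0.0;
        #pragma omp parallel for reduction(+ : local)
        for (int i = 0; i < N; i++)
            local += A[i];

        /* message-passing layer: combine the per-node sums across nodes */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %g over %d nodes\n", global, nprocs);
        MPI_Finalize();
        return 0;
    }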

27
Programming Model 5
  • Bulk Synchronous Processing (BSP), L. Valiant
  • Used within the message passing or shared memory
    models as a programming convention
  • Phases separated by global barriers
  • Compute phases: all operate on local data (in
    distributed memory)
  • or read access to global data (in shared memory)
  • Communication phases: all participate in
    rearrangement or reduction of global data
  • Generally all doing the same thing in a phase
  • all do f, but may all do different things within
    f
  • Simplicity of data parallelism without
    restrictions

28
Summary so far
  • Historically, each parallel machine was unique,
    along with its programming model and programming
    language
  • You had to throw away your software and start
    over with each new kind of machine - ugh
  • Now we distinguish the programming model from the
    underlying machine, so we can write portably
    correct code, that runs on many machines
  • MPI now the most portable option, but can be
    tedious
  • Writing portably fast code requires tuning for
    the architecture
  • Algorithm design challenge is to make this
    process easy
  • Example: picking a blocksize, not rewriting the
    whole algorithm

29
Recap
  • Parallel Comp. Architecture driven by familiar
    technological and economic forces
  • application/platform cycle, but focused on the
    most demanding applications
  • hardware/software learning curve
  • More attractive than ever because best building
    block - the microprocessor - is also the fastest
    BB.
  • History of microprocessor architecture is
    parallelism
    translates area and density into performance
  • The Future is higher levels of parallelism
  • Parallel Architecture concepts apply at many
    levels
  • Communication also on exponential curve
  • => Quantitative Engineering approach

[Speedup chart]
30
History
  • Parallel architectures tied closely to
    programming models
  • Divergent architectures, with no predictable
    pattern of growth.
  • Mid 80s renaissance

[Diagram: divergent architectures of the mid-80s (systolic arrays, SIMD, message passing, dataflow, shared memory) beneath common application software and system software]
31
Programming Model
  • Conceptualization of the machine that programmer
    uses in coding applications
  • How parts cooperate and coordinate their
    activities
  • Specifies communication and synchronization
    operations
  • Multiprogramming
  • Independent jobs, no communication or synch. at
    program level
  • Shared address space
  • like bulletin board
  • Message passing
  • like letters or phone calls, explicit point to
    point
  • Data parallel
  • more regimented, global actions on data
  • Implemented with shared address space or message
    passing

32
Economics
  • Commodity microprocessors not only fast but CHEAP
  • Development costs: tens of millions of dollars
  • BUT, many more are sold compared to
    supercomputers
  • Crucial to take advantage of the investment, and
    use the commodity building block
  • Multiprocessors being pushed by software vendors
    (e.g. database) as well as hardware vendors
  • Standardization makes small, bus-based SMPs
    commodity
  • Desktop: few smaller processors versus one larger
    one?
  • Multiprocessor on a chip?

33
Performance Numbers on RISC Processors
  • Using Linpack Benchmark

34
Consider Scientific Supercomputing
  • Proving ground and driver for innovative
    architecture and techniques
  • Market smaller relative to commercial as MPs
    become mainstream
  • Dominated by vector machines starting in 70s
  • Microprocessors have made huge gains in
    floating-point performance
  • high clock rates
  • pipelined floating point units (e.g.,
    multiply-add every cycle)
  • instruction-level parallelism
  • effective use of caches (e.g., automatic
    blocking)
  • Plus economics
  • Large-scale multiprocessors replace vector
    supercomputers

35
Performance Development
[Top500 performance development chart. Annotations: #1 system grew from ~60 Gflop/s to 7.2 Tflop/s, entry level from ~400 Mflop/s; Schwab system (#24) at 94 Gflop/s; 394 systems above 100 Gflop/s; growth faster than Moore's law]
36
Performance Development
[Top500 performance projection chart, with "My Laptop" marked; extrapolated entry-level performance reaches 1 Tflop/s around 2005 and 1 Pflop/s around 2010]
37
Architectures
[Top500 architecture-share chart. Constellation: a system whose number of processors per node is at least the number of nodes]
38
Chip Technology
39
Manufacturer
[Top500 share by manufacturer: IBM 32%, HP 30%, SGI 8%, Cray 8%, Sun 6%, Fujitsu 4%, NEC 3%, Hitachi 3%]
40
High-Performance Computing Directions
Beowulf-class PC Clusters
Definition:
  • COTS PC Nodes
  • Pentium, Alpha, PowerPC, SMP
  • COTS LAN/SAN Interconnect
  • Ethernet, Myrinet, Giganet, ATM
  • Open Source Unix
  • Linux, BSD
  • Message Passing Computing
  • MPI, PVM
  • HPF
Advantages:
  • Best price-performance
  • Low entry-level cost
  • Just-in-place configuration
  • Vendor invulnerable
  • Scalable
  • Rapid technology tracking

Enabled by PC hardware, networks and operating
system achieving capabilities of scientific
workstations at a fraction of the cost and
availability of industry standard message passing
libraries. However, much more of a contact sport.
41
  • Peak performance
  • Interconnection
  • http://clusters.top500.org
  • Benchmark results to follow in the coming months

42
Distributed and Parallel Systems
[Spectrum diagram from distributed, heterogeneous systems to massively parallel, homogeneous systems; examples placed along it include SETI@home, Entropia, Grid Computing, Beowulf, SNL Cplant, Berkeley NOW, parallel distributed-memory machines, and ASCI Tflops]
Distributed systems:
  • Gather (unused) resources
  • Steal cycles
  • System SW manages resources
  • System SW adds value
  • 10 - 20% overhead is OK
  • Resources drive applications
  • Time to completion is not critical
  • Time-shared
Massively parallel systems:
  • Bounded set of resources
  • Apps grow to consume all cycles
  • Application manages resources
  • System SW gets in the way
  • 5% overhead is maximum
  • Apps drive purchase of equipment
  • Real-time constraints
  • Space-shared

43
Different Parallel Architectures
  • Parallel computing: single systems with many
    processors working on the same problem
  • Distributed computing: many systems loosely
    coupled by a scheduler to work on related
    problems
  • Grid Computing: many systems tightly coupled by
    software, perhaps geographically distributed, to
    work together on single problems or on related
    problems

44
Performance Improvements for Scientific Computing
Problems
45
Shared Memory gt Shared Addr. Space
  • Bottom-up engineering factors
  • Programming concepts
  • Why it's attractive.

46
Adding Processing Capacity
  • Memory capacity increased by adding modules
  • I/O by controllers and devices
  • Add processors for processing!
  • For higher-throughput multiprogramming, or
    parallel programs

47
Historical Development
  • Mainframe approach
  • Motivated by multiprogramming
  • Extends crossbar used for Mem and I/O
  • Processor cost-limited => crossbar
  • Bandwidth scales with p
  • High incremental cost
  • use multistage instead
  • Minicomputer approach
  • Almost all microprocessor systems have bus
  • Motivated by multiprogramming, TP
  • Used heavily for parallel computing
  • Called symmetric multiprocessor (SMP)
  • Latency larger than for uniprocessor
  • Bus is bandwidth bottleneck
  • caching is key: coherence problem
  • Low incremental cost

48
Shared Physical Memory
  • Any processor can directly reference any memory
    location
  • Any I/O controller - any memory
  • Operating system can run on any processor, or
    all.
  • OS uses shared memory to coordinate
  • Communication occurs implicitly as result of
    loads and stores
  • What about application processes?

49
Shared Virtual Address Space
  • Process address space plus thread of control
  • Virtual-to-physical mapping can be established so
    that processes share portions of the address space.
  • User-kernel or multiple processes
  • Multiple threads of control on one address space.
  • Popular approach to structuring OSs
  • Now standard application capability (ex POSIX
    threads)
  • Writes to shared address visible to other threads
  • Natural extension of the uniprocessor's model
  • conventional memory operations for communication
  • special atomic operations for synchronization
  • also load/stores

50
Structured Shared Address Space
  • Ad hoc parallelism used in system code
  • Most parallel applications have structured SAS
  • Same program on each processor
  • shared variable X means the same thing to each
    thread

51
Engineering Intel Pentium Pro Quad
  • All coherence and multiprocessing glue in
    processor module
  • Highly integrated, targeted at high volume
  • Low latency and bandwidth

52
Engineering SUN Enterprise
  • Proc + mem card - I/O card
  • 16 cards of either type
  • All memory accessed over bus, so symmetric
  • Higher bandwidth, higher latency bus

53
Scaling Up
[Diagrams: "dance hall" organization (all processors on one side of the network, all memory modules on the other) vs. distributed memory (each processor paired with its own local memory, all attached to the network)]
  • Problem is interconnect cost (crossbar) or
    bandwidth (bus)
  • Dance-hall bandwidth still scalable, but lower
    cost than crossbar
  • latencies to memory uniform, but uniformly large
  • Distributed memory or non-uniform memory access
    (NUMA)
  • Construct shared address space out of simple
    message transactions across a general-purpose
    network (e.g. read-request, read-response)
  • Caching shared (particularly nonlocal) data?

54
Engineering Cray T3E
  • Scale up to 1024 processors, 480MB/s links
  • Memory controller generates request message for
    non-local references
  • No hardware mechanism for coherence
  • SGI Origin etc. provide this

55
Systolic Arrays
SIMD
Generic Architecture
Message Passing
Dataflow
Shared Memory
56
Message Passing Architectures
  • Complete computer as building block, including
    I/O
  • Communication via explicit I/O operations
  • Programming model
  • direct access only to private address space
    (local memory),
  • communication via explicit messages
    (send/receive)
  • High-level block diagram
  • Communication integration?
  • Mem, I/O, LAN, Cluster
  • Easier to build and scale than SAS
  • Programming model more removed from basic
    hardware operations
  • Library or OS intervention

57
Message-Passing Abstraction
  • Send specifies buffer to be transmitted and
    receiving process
  • Recv specifies sending process and application
    storage to receive into
  • Memory to memory copy, but need to name processes
  • Optional tag on send and matching rule on receive
  • User process names local data and entities in
    process/tag space too
  • In simplest form, the send/recv match achieves
    pairwise synch event
  • Other variants too
  • Many overheads copying, buffer management,
    protection

58
Evolution of Message-Passing Machines
  • Early machines FIFO on each link
  • HW close to prog. Model
  • synchronous ops
  • topology central (hypercube algorithms)

CalTech Cosmic Cube (Seitz, CACM Jan 85)
59
Diminishing Role of Topology
  • Shift to general links
  • DMA, enabling non-blocking ops
  • Buffered by system at destination until recv
  • Store-and-forward routing
  • Diminishing role of topology
  • Any-to-any pipelined routing
  • node-network interface dominates communication
    time
  • Simplifies programming
  • Allows richer design space
  • grids vs hypercubes

Intel iPSC/1 -> iPSC/2 -> iPSC/860
H x (T0 + n/B)   vs.   T0 + H*D + n/B
(H = hops, T0 = startup cost, D = per-hop delay, n/B = transfer time at bandwidth B)
60
Example Intel Paragon
61
Building on the mainstream IBM SP-2
  • Made out of essentially complete RS6000
    workstations
  • Network interface integrated in I/O bus (bw
    limited by I/O bus)

62
Highly Parallel Supercomputing: Where Are We?
  • Performance
  • Sustained performance has dramatically increased
    during the last year.
  • On most applications, sustained performance per
    dollar now exceeds that of conventional
    supercomputers. But...
  • Conventional systems are still faster on some
    applications.
  • Languages and compilers
  • Standardized, portable, high-level languages such
    as OpenMP, HPF, PVM and MPI are available. But
    ...
  • Initial HPF releases are not very efficient.
  • Message passing programming is tedious and
    hard to debug.
  • Programming difficulty remains a major obstacle
    to usage by mainstream scientists.

63
Achieving TeraFlops
  • In 1991, 1 Gflop/s
  • 1000 fold increase
  • Architecture
  • exploiting parallelism
  • Processor, communication, memory
  • Moore's Law
  • Algorithm improvements
  • block-partitioned algorithms

64
Future: Petaflops (10^15 fl pt ops/s)
  • dynamic redistribution of
    workload
  • new language and constructs
  • role of numerical libraries
  • algorithm adaptation to hardware failure
  • A Pflop for 1 second ≈ a typical workstation
    computing for 1 year.
  • From an algorithmic standpoint
  • concurrency
  • data locality
  • latency and synchronization
  • floating point accuracy

65
Petaflop (10^15 flop/s) Computers Within the Next
Decade
  • Five basic design points
  • Conventional technologies
  • 4.8 GHz processor, 8000 nodes, each w/16
    processors
  • Processing-in-memory (PIM) designs
  • Reduce memory access bottleneck
  • Superconducting processor technologies
  • Digital superconductor technology, Rapid
    Single-Flux-Quantum (RSFQ) logic hybrid
    technology multi-threaded (HTMT)
  • Special-purpose hardware designs
  • Specific applications, e.g. the GRAPE Project in
    Japan for gravitational force computations
  • Schemes utilizing the aggregate computing power
    of processors distributed on the web
  • SETI@home: 26 Tflop/s

66
Petaflops (10^15 flop/s) Computer Today?
  • 1 GHz processor (O(10^9) ops/s)
  • 1 Million PCs
  • $1B ($1K each)
  • 100 Mwatts
  • 5 acres
  • 1 Million Windows licenses!!
  • PC failure every second

67
Outline
  • A little history
  • IEEE floating point formats
  • Error analysis
  • Exception handling
  • Using exception handling to go faster
  • How to get extra precision cheaply
  • Cray arithmetic - a pathological example
  • Dangers of Parallel and Heterogeneous Computing

68
A little history
  • Von Neumann and Goldstine - 1947
  • Can't expect to solve most big (n > 15) linear
    systems without carrying many decimal digits
    (d > 8); otherwise the computed answer would be
    completely inaccurate. - WRONG!
  • Turing - 1949
  • Carrying d digits is equivalent to changing the
    input data in the d-th place and then solving
    Ax = b. So if A is only known to d digits, the
    answer is as accurate as the data deserves.
  • Backward Error Analysis
  • Rediscovered in 1961 by Wilkinson and publicized
  • Starting in the 1960s- many papers doing backward
    error analysis of various algorithms
  • Many years where each machine did FP arithmetic
    slightly differently
  • Both rounding and exception handling differed
  • Hard to write portable and reliable software
  • Motivated search for industry-wide standard,
    beginning late 1970s
  • First implementation Intel 8087
  • ACM Turing Award 1989 to W. Kahan for design of
    the IEEE Floating Point Standards 754 (binary)
    and 854 (decimal)
  • Nearly universally implemented in general purpose
    machines

69
Defining Floating Point Arithmetic
  • Representable numbers
  • Scientific notation: +/- d.ddd... x r^exp
  • sign bit +/-
  • radix r (usually 2 or 10, sometimes 16)
  • significand d.ddd... (how many base-r digits d?)
  • exponent exp (range?)
  • others?
  • Operations
  • arithmetic: +, -, x, /, ...
  • how to round result to fit in format
  • comparison (<, =, >)
  • conversion between different formats
  • short to long FP numbers, FP to integer
  • exception handling
  • what to do for 0/0, 2 x largest_number, etc.
  • binary/decimal conversion
  • for I/O, when radix not 10
  • Language/library support for these operations

70
IEEE Floating Point Arithmetic Standard 754 -
Normalized Numbers
  • Normalized Nonzero Representable Numbers:
    +- 1.ddd... x 2^exp
  • Macheps = machine epsilon = 2^(-#significand bits)
    = relative error in each operation
  • OV = overflow threshold = largest number
  • UN = underflow threshold = smallest number
  • +- Zero: +-, significand and exponent all zero
  • Why bother with -0? later

Format            bits   significand bits   macheps              exponent bits   exponent range
------            ----   ----------------   -------              -------------   --------------
Single            32     23+1               2^-24 (~10^-7)       8               2^-126 to 2^127 (~10^+-38)
Double            64     52+1               2^-53 (~10^-16)      11              2^-1022 to 2^1023 (~10^+-308)
Double Extended   >=80   >=64               <=2^-64 (~10^-19)    >=15            2^-16382 to 2^16383 (~10^+-4932)
(Double Extended is 80 bits on all Intel machines)
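
A small, hedged C check (not from the slides) of the single and double rows
of the table, using the standard <float.h> constants; note that C's
FLT_EPSILON and DBL_EPSILON equal 2^(1-p), i.e. twice the macheps = 2^-p
convention used above.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* macheps (slide convention) = C epsilon / 2 */
        printf("single: macheps ~ %e  max ~ %e  min normalized ~ %e\n",
               (double)FLT_EPSILON / 2.0, (double)FLT_MAX, (double)FLT_MIN);
        printf("double: macheps ~ %e  max ~ %e  min normalized ~ %e\n",
               DBL_EPSILON / 2.0, DBL_MAX, DBL_MIN);
        return 0;
    }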
71
IEEE Floating Point Arithmetic Standard 754 -
Denorms
  • Denormalized Numbers: +- 0.ddd... x 2^min_exp
  • sign bit, nonzero significand, minimum exponent
  • Fills in gap between UN and 0
  • Underflow Exception
  • occurs when exact nonzero result is less than
    underflow threshold UN
  • Ex: UN/3
  • return a denorm, or zero
  • Why bother?
  • Necessary so that following code never divides by
    zero
  • if (a != b) then x = a/(a-b)
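
A hedged C demonstration (not from the slides) of why gradual underflow makes
this test safe: two distinct numbers near UN have a nonzero, denormalized
difference, so the division cannot be by zero (on a flush-to-zero machine the
difference could instead round to 0). The particular values are illustrative.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        double a = 1.25 * DBL_MIN;     /* just above the underflow threshold */
        double b = DBL_MIN;
        if (a != b) {
            double d = a - b;          /* 0.25*DBL_MIN: a denormalized number */
            printf("a-b = %g (nonzero), a/(a-b) = %g\n", d, a / d);
        }
        return 0;
    }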

72
IEEE Floating Point Arithmetic Standard 754 -
+- Infinity
  • +- Infinity: sign bit, zero significand,
    maximum exponent
  • Overflow Exception
  • occurs when exact finite result too large to
    represent accurately
  • Ex: 2 x OV
  • return +- infinity
  • Divide by zero Exception
  • return +- infinity = 1/+-0
  • sign of zero important!
  • Also return +- infinity for
  • 3 + infinity, 2 x infinity, infinity + infinity
  • Result is exact, not an exception!

73
IEEE Floating Point Arithmetic Standard 754 - NAN
(Not A Number)
  • NAN: sign bit, nonzero significand, maximum
    exponent
  • Invalid Exception
  • occurs when exact result not a well-defined real
    number
  • 0/0
  • sqrt(-1)
  • infinity - infinity, infinity/infinity,
    0 x infinity
  • NAN + 3
  • NAN > 3?
  • Return a NAN in all these cases
  • Two kinds of NANs
  • Quiet - propagates without raising an exception
  • Signaling - generate an exception when touched
  • good for detecting uninitialized data

74
Error Analysis
  • Basic error formula
  • fl(a op b) = (a op b)(1 + d), where
  • op is one of +, -, x, /
  • |d| <= macheps
  • assuming no overflow, underflow, or divide by
    zero
  • Example: adding 4 numbers
  • fl(x1+x2+x3+x4) = {[(x1+x2)(1+d1) + x3](1+d2) + x4}(1+d3)
                    = x1(1+d1)(1+d2)(1+d3) + x2(1+d1)(1+d2)(1+d3)
                      + x3(1+d2)(1+d3) + x4(1+d3)
                    = x1(1+e1) + x2(1+e2) + x3(1+e3) + x4(1+e4)
    where each |ei| <= 3*macheps
  • get exact sum of slightly changed summands
    xi(1+ei)
  • Backward Error Analysis - algorithm called
    numerically stable if it gives the exact result
    for slightly changed inputs
  • Numerical Stability is an algorithm design goal
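
A hedged C illustration (not from the slides) of the error formula: sum four
single-precision numbers and compare against a double-precision reference.
The summands are illustrative and chosen positive (no cancellation), so the
forward relative error is also bounded by about 3*macheps.

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void)
    {
        float  x[4] = { 1.0f, 1.0f/3.0f, 1.0f/7.0f, 1.0f/9.0f };
        float  s_f = 0.0f;               /* rounded after each addition */
        double s_d = 0.0;                /* (nearly) exact reference    */
        for (int i = 0; i < 4; i++) { s_f += x[i]; s_d += (double)x[i]; }
        printf("float sum  = %.8e\n", (double)s_f);
        printf("double sum = %.8e\n", s_d);
        /* macheps = 2^-24 = FLT_EPSILON/2 in the slide's convention */
        printf("rel. error = %.2e  (3*macheps ~ %.2e)\n",
               fabs((double)s_f - s_d) / s_d, 3.0 * 0.5 * (double)FLT_EPSILON);
        return 0;
    }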

75
Example: polynomial evaluation using Horner's rule
  • Horner's rule to evaluate p = sum_{k=0..n} c_k x^k
  • p = c_n; for k = n-1 down to 0, p = x*p + c_k
  • Numerically Stable
  • Apply to (x-2)^9 = x^9 - 18x^8 + ... - 512
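
A hedged C sketch (not from the slides) of Horner's rule applied to the
expanded coefficients of (x-2)^9 near x = 2. The algorithm is backward
stable, yet cancellation makes the computed values wander around zero at the
level of roundoff instead of tracking the tiny exact values, which is the
point of the next slide. The sample points are illustrative.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* coefficients of (x-2)^9 = sum c[k] x^k, k = 0..9 */
        double c[10] = { -512, 2304, -4608, 5376, -4032,
                          2016, -672, 144, -18, 1 };
        for (int i = 0; i <= 16; i++) {
            double x = 1.92 + 0.01 * i;
            double p = c[9];
            for (int k = 8; k >= 0; k--)      /* Horner: p = x*p + c[k] */
                p = x * p + c[k];
            printf("x = %.2f   Horner p = % .3e   exact (x-2)^9 = % .3e\n",
                   x, p, pow(x - 2.0, 9.0));
        }
        return 0;
    }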
76
Example: polynomial evaluation (continued)
  • (x-2)^9 = x^9 - 18x^8 + ... - 512
  • We can compute error bounds using
  • fl(a op b) = (a op b)(1+d)

77
  • What happens when the exact value is not a real
    number, or is too small or too large to represent
    accurately?
  • You get an exception

78
Exception Handling
  • What happens when the exact value is not a real
    number, or too small or too large to represent
    accurately?
  • 5 Exceptions
  • Overflow - exact result > OV, too large to
    represent
  • Underflow - exact result nonzero and < UN, too
    small to represent
  • Divide-by-zero - nonzero/0
  • Invalid - 0/0, sqrt(-1), ...
  • Inexact - you made a rounding error (very
    common!)
  • Possible responses
  • Stop with error message (unfriendly, not default)
  • Keep computing (default, but how?)

79
Exception Handling User Interface
  • Each of the 5 exceptions has the following
    features
  • A sticky flag, which is set as soon as an
    exception occurs
  • The sticky flag can be reset and read by the user
  • reset overflow_flag and invalid_flag
  • perform a computation
  • test overflow_flag and invalid_flag to see if
    any exception occurred
  • An exception flag, which indicates whether a trap
    should occur
  • Not trapping is the default
  • Instead, continue computing, returning a NAN,
    infinity or denorm
  • On a trap, there should be a user-writable
    exception handler with access to the parameters
    of the exceptional operation
  • Trapping or precise interrupts like this are
    rarely implemented for performance reasons.
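
A hedged C99 sketch (not from the slides) of the sticky-flag pattern just
described, using the standard <fenv.h> routines feclearexcept and
fetestexcept; the particular operations that raise the flags are illustrative.

    #include <stdio.h>
    #include <fenv.h>

    int main(void)
    {
        feclearexcept(FE_OVERFLOW | FE_INVALID);     /* reset the sticky flags */

        volatile double big = 1e308, zero = 0.0;
        volatile double x = big * big;               /* overflow -> +infinity  */
        volatile double y = zero / zero;             /* invalid  -> NAN        */

        /* perform a computation ... then test which exceptions occurred */
        if (fetestexcept(FE_OVERFLOW)) printf("overflow flag set (x = %g)\n", x);
        if (fetestexcept(FE_INVALID))  printf("invalid flag set  (y = %g)\n", y);
        return 0;
    }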

80
Exploiting Exception Handling to Design Faster
Algorithms
  • Paradigm
  • Quick with high probability
  • Assumes exception handling done quickly
  • Ex 1: Solving triangular system Tx = b
  • Part of BLAS2 - highly optimized, but risky
  • If T is nearly singular, expect very large x, so
    scale inside the inner loop: slow but low risk
  • Use paradigm with sticky flags to detect nearly
    singular T
  • Up to 9x faster on Dec Alpha
  • Ex 2: Computing eigenvalues, up to 1.5x faster on
    CM-5
  • Demmel/Li (www.cs.berkeley.edu/xiaoye)

1) Try fast, but possibly risky algorithm
2) Quickly test for accuracy of answer (use
   exception handling)
3) In rare case of inaccuracy, rerun using slower
   low-risk algorithm

Slow but safe inner loop:
    for k = 1 to n
        d = a_k - s - b_k^2 / d
        if |d| < tol, d = -tol
        if d < 0, count++
vs.
Fast but risky inner loop (relies on IEEE semantics):
    for k = 1 to n
        d = a_k - s - b_k^2 / d      (ok to divide by 0)
        count += signbit(d)
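
A hedged C sketch (not from the slides) of the "fast but risky" loop: with
IEEE semantics, dividing by a zero pivot just yields +-infinity and signbit()
still counts negative pivots correctly, so no test is needed inside the loop.
The tridiagonal test matrix is an illustrative assumption.

    /* Count eigenvalues of a symmetric tridiagonal matrix (diagonal a[],
       off-diagonal b[], with b[k] between rows k-1 and k) less than s.   */
    #include <stdio.h>
    #include <math.h>

    static int count_eigs_less_than(int n, const double *a, const double *b,
                                    double s)
    {
        int count = 0;
        double d = a[0] - s;
        count += signbit(d) ? 1 : 0;
        for (int k = 1; k < n; k++) {
            d = a[k] - s - b[k] * b[k] / d;   /* ok to divide by 0: gives inf */
            count += signbit(d) ? 1 : 0;
        }
        return count;
    }

    int main(void)
    {
        /* tiny example: 2 on the diagonal, -1 off the diagonal (4 x 4) */
        double a[4] = { 2, 2, 2, 2 }, b[4] = { 0, -1, -1, -1 };
        printf("eigenvalues < 2.0: %d\n", count_eigs_less_than(4, a, b, 2.0));
        return 0;
    }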
81
Summary of Values Representable in IEEE FP
  • +- Zero
  • Normalized nonzero numbers
  • Denormalized numbers
  • +- Infinity
  • NANs
  • Signaling and quiet
  • Many systems have only quiet

                        exponent            significand
+- Zero                 0...0               0...0
Normalized nonzero      not 0s or all 1s    anything
Denormalized            0...0               nonzero
+- Infinity             1...1               0...0
NANs                    1...1               nonzero
82
Hazards of Parallel and Heterogeneous Computing
  • What new bugs arise in parallel floating point
    programs?
  • Ex 1 Nonrepeatability
  • Makes debugging hard!
  • Ex 2 Different exception handling
  • Can cause programs to hang
  • Ex 3 Different rounding (even on IEEE FP
    machines)
  • Can cause hanging, or wrong results with no
    warning
  • See www.netlib.org/lapack/lawns/lawn112.ps