LogP Model - PowerPoint PPT Presentation

Slides: 12
Provided by: cse6
Learn more at: http://www.cse.msu.edu

Transcript and Presenter's Notes

Title: LogP Model


1
LogP Model
  • Motivation
  • BSP Model
  • Limited to BW of Network (g) and Load of PE
  • Requires a large load per superstep.
  • Need Better Models for Portable Algorithms
  • Converging Hardware
  • Independent from Network Topology
  • Programming Models
  • Assumption
  • Number of data elements much bigger than number of PEs

2
Parameters
  • L: latency
  • delay on the network
  • o: overhead on the PE
  • g: gap
  • minimum interval between consecutive messages
    (due to bandwidth)
  • P: number of PEs
  • Note:
  • L, o, g are independent of P and of node distances
  • Message length: short messages
  • L, o, g are per word or per message of fixed
    length
  • A k-word message is sent as k short messages (k·o overhead)
  • L is independent of message length

3
Parameters (continued)
  • Bandwidth: 1/g (unit-length messages per unit time)
  • Number of messages in transit to or from each
    PE: at most ⌈L/g⌉
  • Send-to-receive total time: L + 2o
  • If o >> g, ignore g (sends are already spaced by at least o)
  • Similar to BSP except no synchronization step
  • Overlapping communication with computation is not modeled
  • overlap gives a speed-up factor of at most two
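
The parameter relations on the two slides above can be collected in a short Python sketch. The class name LogP and its method names are hypothetical, chosen only for illustration. The cost of a k-word message assumes that the k short messages are injected back to back, one every max(g, o) time units, which is a common reading of the model rather than something the slides state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    """Hypothetical container for the four LogP parameters."""
    L: float  # latency: delay on the network
    o: float  # overhead on the PE for one send or one receive
    g: float  # gap: minimum interval between consecutive messages
    P: int    # number of PEs

    def bandwidth(self) -> float:
        # Unit-length messages per unit time, per PE.
        return 1.0 / self.g

    def capacity(self) -> int:
        # Upper bound on messages in transit to or from one PE.
        return math.ceil(self.L / self.g)

    def point_to_point(self) -> float:
        # Send overhead + network latency + receive overhead.
        return self.L + 2 * self.o

    def k_word_message(self, k: int) -> float:
        # Assumption: k short messages, one injected every max(g, o).
        return self.o + (k - 1) * max(self.g, self.o) + self.L + self.o


m = LogP(L=6, o=2, g=4, P=8)
print(m.point_to_point(), m.capacity(), m.k_word_message(4))
```

The parameter values in the usage line are the ones shown on the broadcast slide (P = 8, L = 6, g = 4, o = 2).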

4
Broadcast
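  • Optimal broadcast tree
[Figure: optimal broadcast tree for P = 8, L = 6, g = 4, o = 2. Node labels give the time at which each PE receives the datum: 0 at the root, then 10, 14, 18, 22, 20, 24, 24.]
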
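
The node times in the broadcast tree for P = 8, L = 6, g = 4, o = 2 can be reproduced with a greedy rule: every PE that already holds the datum forwards it to a new PE as early as it can. The sketch below is one way to simulate that rule. The function name is hypothetical, and it assumes that a PE can start a new send every max(g, o) time units and that a message sent at time t is usable at the receiver at t + o + L + o.

```python
import heapq


def broadcast_receive_times(P, L, o, g):
    """Times at which each of P PEs holds the datum under greedy forwarding."""
    times = [0]        # the root holds the datum at time 0
    ready = [0]        # min-heap of the earliest next send start, one per informed PE
    step = max(g, o)   # assumed spacing between consecutive sends of one PE
    while len(times) < P:
        start = heapq.heappop(ready)       # earliest possible send anywhere
        arrival = start + o + L + o        # send overhead, latency, receive overhead
        times.append(arrival)
        heapq.heappush(ready, start + step)  # the sender may send again
        heapq.heappush(ready, arrival)       # the new PE may start forwarding
    return times


print(broadcast_receive_times(P=8, L=6, o=2, g=4))
```

For these parameters the simulation yields the same multiset of times as the figure labels (0, 10, 14, 18, 20, 22, 24, 24), so the last PE is reached at time 24.
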
5
Optimal Sum
  • Given time T, how many items can we add?
  • Approach: recursive
  • At the root, if T < L + 2o, use a single PE (can add
    T + 1 items)
  • If T > L + 2o:
  • the root should have the data ready at T,
  • and the sender must have its sum ready at T - L - 2o - 1
  • Recursively construct the sum tree at the sender
  • If T - g > L + 2o, the root can also receive data from a
    further sender, and compute the
  • sum tree with T - g in place of T at the root.
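
The recursion above can be sketched in Python as follows. This is an illustration under stated assumptions, not the presenter's code: one addition takes one time unit, there are enough PEs for every useful sender, g >= o + 1, and each received partial sum costs the receiver o + 1 time units (receive overhead plus one addition) that it could otherwise spend on local items. The function name and the parameter values in the usage line are hypothetical.

```python
from functools import lru_cache


def max_summed_items(T, L, o, g):
    """Largest number of inputs that can be summed by time T in this sketch."""
    if g < o + 1:
        raise ValueError("this sketch assumes g >= o + 1")

    @lru_cache(maxsize=None)
    def items(t):
        # A single PE performs t additions, combining t + 1 inputs.
        total = t + 1
        # The k-th sender (k = 0, 1, ...) must finish by t - (L + 2o + 1) - k*g.
        deadline = t - (L + 2 * o + 1)
        while deadline >= 0:
            # Receiving costs o + 1 local items; keep the sender only if it pays off.
            gain = items(deadline) - (o + 1)
            if gain > 0:
                total += gain
            deadline -= g
        return total

    return items(T)


print(max_summed_items(T=28, L=5, o=2, g=4))
```

When T < L + 2o + 1 the loop body never runs and the result is T + 1, which matches the single-PE case on the slide.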

6
Applications
  • FFT on the Butterfly network
  • Data Placement
  • cyclic layout: first log(n/P) stages use local comm, last
    log P stages global
  • blocked layout: first log P stages use global comm,
    remaining stages local
  • hybrid: start cyclic; after log(n/P) iterations, re-map to
    blocked so that
  • the remaining stages are also local
  • Communication time: g (n/P^2)(P - 1) + L
  • each PE has n/P data, and 1/P of it goes to each
    other PE
  • Total time is within a factor of (1 + g/log n) of optimal
  • All-to-all communication schedule
  • Approach 1: each PE sends to PE1, PE2, ... in this
    order => bottleneck at PE1, then PE2, ...
  • Approach 2 (staggered re-map): no congestion
  • PE1 sends to PE2, PE3, ...
  • PE2 sends to PE3, PE4, etc.
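
The difference between the two all-to-all schedules can be made concrete by counting how many messages target the same PE in one step. The sketch below uses hypothetical function names and 0-based PE numbers, and it models a schedule simply as a list of steps, each step listing the destination chosen by every PE.

```python
from collections import Counter


def naive_schedule(P):
    """Approach 1: every PE walks the same destination order 0, 1, 2, ... (skipping itself)."""
    return [[[d for d in range(P) if d != src][step] for src in range(P)]
            for step in range(P - 1)]


def staggered_schedule(P):
    """Approach 2: in step s, PE i sends to PE (i + s) mod P."""
    return [[(src + s) % P for src in range(P)] for s in range(1, P)]


def max_contention(schedule):
    """Largest number of messages aimed at a single PE within one step."""
    return max(max(Counter(step).values()) for step in schedule)


P = 8
print(max_contention(naive_schedule(P)), max_contention(staggered_schedule(P)))
```

With the fixed order, P - 1 PEs target the same destination in the first step, while the staggered order gives every destination exactly one incoming message per step.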

7
Implementation on CM5
  • CM-5
  • 33 MHz
  • Fat trees
  • Global control for scan/prefix/broadcast
  • one CM-5 node: 3.2 MFLOPS
  • local FFT: 2.8 to 2.2 MFLOPS (cache effect)
  • each cycle
  • multiply and add: 4.5 µs
  • o: 2 µs
  • L: 6 µs
  • g: 4 µs
  • load and store overhead per cycle: 1 µs
  • communication time: (n/P) max(1 µs + 2o, g) + L
  • bottleneck: processing and overhead, not bandwidth
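
The communication-time estimate on this slide can be evaluated directly from the measured parameters. In the sketch below the function name and the problem size (n = 2^20 points on P = 128 PEs) are hypothetical; the values of o, g, L and the 1 µs load/store cost are the ones listed above.

```python
def remap_time_us(n, P, o=2.0, g=4.0, L=6.0, load_store=1.0):
    """Return (total time, per-point cost) in microseconds for (n/P) max(1 + 2o, g) + L."""
    per_point = max(load_store + 2 * o, g)  # processing + overhead vs. bandwidth gap
    return (n / P) * per_point + L, per_point


total, per_point = remap_time_us(n=2**20, P=128)
print(per_point, total)
```

Because 1 µs + 2o = 5 µs exceeds g = 4 µs, the per-point cost is set by processing and overhead rather than by the gap, which is the bottleneck statement on the slide.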

8
LU decomposition
  • Data arrangement critical

9
Matching the model with real machines
  • Average distance
  • topology independence usually works for n = 1024
    nodes.
  • Average distance and maximum distance do not
    differ by much
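
As an illustration of the average-versus-maximum distance point, the sketch below computes both for two 1024-node networks. The choice of topologies (a 10-dimensional hypercube and a 32 x 32 mesh without wraparound) is an assumption made for this example and does not come from the slides; distances are hop counts between distinct nodes.

```python
def hypercube_distances(dim):
    """(average, maximum) hop distance in a hypercube with 2**dim nodes."""
    # The hypercube is vertex-transitive, so distances from node 0 suffice.
    hops = [bin(v).count("1") for v in range(1, 2 ** dim)]
    return sum(hops) / len(hops), max(hops)


def mesh2d_distances(side):
    """(average, maximum) Manhattan distance in a side x side mesh, no wraparound."""
    n = side * side
    # Sum of |a - b| over all ordered coordinate pairs in one dimension.
    one_dim = sum(abs(a - b) for a in range(side) for b in range(side))
    # Each dimension contributes one_dim for every choice of the other coordinates.
    total = 2 * n * one_dim
    return total / (n * (n - 1)), 2 * (side - 1)


print(hypercube_distances(10))
print(mesh2d_distances(32))
```

In both cases the average is within a small constant factor of the maximum (about one half for the hypercube and about one third for the mesh).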

10
Potential Concerns
  • Algorithmic concern
  • Theory?
  • Too complex?
  • Communication concerns
  • how to use trivial comm such as local exchange
  • topology dependencies?

11
Comparison with BSP
  • Length of superstep
  • message not usable till next step
  • special hardware for sync
  • if the ratio of virtual to physical processors is large, context switching may be
    expensive