Optimizing SMP Message-Passing Systems

Transcript and Presenter's Notes
1
Optimizing SMP Message-Passing Systems
  • Dave Turner
  • In collaboration with Adam Oline, Xuehua Chen, and Troy Benjegerdes
  • This project is funded by the Mathematical, Information, and
    Computational Sciences division of the Department of Energy.

2
The MP_Lite message-passing library
  • A lightweight MPI implementation
  • Highly efficient for the architectures supported
  • Designed to be very user-friendly
  • Ideal for performing message-passing research
  • http://www.scl.ameslab.gov/Projects/MP_Lite/

3
SMP message-passing using a shared-memory segment
(Diagram: two processors with caches; Process 0 and Process 1 exchange messages through a shared-memory segment in main memory)
  • 2-copy mechanism
  • Variety of methods to manage the segment
  • Variety of methods to signal the destination
    process

4
The MP_Lite lock-free approach
Each process has its own section for outgoing messages. Other processes
only flip a cleanup flag. Having no lockouts provides excellent scaling.
Doubly linked lists allow very efficient searches, or combine with the
shared-memory FIFO method.
(Diagram: the shared-memory segment is divided into per-process outgoing sections, e.g. 0→1 in Process 0's section; 1→3, 1→0, 1→2 in Process 1's; 2→0 in Process 2's; 3→2 in Process 3's)
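A toy illustration of this lock-free idea, assuming a single outgoing slot and two processes sharing an mmap'd segment (MP_Lite itself uses per-pair sections with doubly linked lists, and real code would add memory barriers):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct slot {
    volatile int ready;     /* set by the sender when the message is posted */
    volatile int cleanup;   /* flipped by the receiver when it has the data */
    int          nbytes;
    char         data[64];
};

int main(void)
{
    /* Shared-memory segment visible to both processes after fork(). */
    struct slot *seg = mmap(NULL, sizeof *seg, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memset(seg, 0, sizeof *seg);

    if (fork() == 0) {                 /* receiver                          */
        while (!seg->ready)            /* wait for the posted message       */
            ;
        printf("received: %s\n", seg->data);
        seg->cleanup = 1;              /* the only flag the receiver flips  */
        _exit(0);
    }

    /* Sender: fill our own outgoing slot, then post it. */
    strcpy(seg->data, "hello from 0");
    seg->nbytes = (int)strlen(seg->data) + 1;
    seg->ready = 1;

    while (!seg->cleanup)              /* receiver done, slot is reusable   */
        ;
    wait(NULL);
    munmap(seg, sizeof *seg);
    return 0;
}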
5
1-copy SMP message-passing for Linux
(Diagram: a kernel put or kernel get moves data between Process 0 and Process 1 with a single kernel copy through main memory)
  • This should double the throughput
  • Unclear at inception what the latency would be
  • Relatively easy to implement a full MPI
    implementation
  • Similar work was done by the BIP-SMP project

6
The kcopy.c Linux module
  • kcopy_open()
  • Does nothing, since it is difficult to pass data in
  • kcopy_ioctl()
  • KCOPY_INIT: initialize the synchronization arrays
  • KCOPY_PUT/GET: call copy_user_to_user to transfer message data
  • KCOPY_SYNC: out-of-band synchronization
  • KCOPY_BCAST/GATHER: support routines for exchanging initial pointers
  • kcopy_release()
  • copy_user_to_user()
  • kmap destination pages to kernel space, then copy 1 page at a time

(Diagram: source and destination virtual address spaces (user 0-3 GB, kernel 3-4 GB) and physical memory (kernel below 1 GB); destination pages are kmapped into kernel space for the copy)
7
Programming for the kcopy.o Linux module
  • dd = open( "/dev/kcopy", O_WRONLY )
  • Open the connection to the kcopy.o module and return a device descriptor
  • ioctl( dd, KCOPY_INIT, hdr )
  • Pass myproc and nprocs to the module
  • Initialize the Sync arrays
  • ioctl( dd, KCOPY_PUT, hdr )
  • hdr: sbuf, dbuf, myproc, nprocs, dest, nbytes, comm (put/get)
  • Checks for write access using access_ok()
  • copy_user_to_user copies nbytes from sbuf to/from dbuf
  • ioctl( dd, KCOPY_SYNC, hdr )
  • All processes increment their sync element, then wait for a go signal from proc 0
  • ioctl( dd, KCOPY_GATHER, hdr )
  • hdr: myproc, nprocs, sizeof(element), myelement
  • → array of elements
  • ioctl( dd, KCOPY_BCAST, hdr )
  • hdr: myproc, nprocs, sizeof(element), myelement
  • → element broadcast

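A hedged sketch of this calling sequence from user space. The kcopy_hdr layout and the KCOPY_* command values below are illustrative assumptions; the real definitions live in the kcopy.c sources.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Assumed header layout, following the fields named on the slide. */
struct kcopy_hdr {
    void  *sbuf;     /* source buffer        */
    void  *dbuf;     /* destination buffer   */
    int    myproc;   /* rank of this process */
    int    nprocs;   /* number of processes  */
    int    dest;     /* destination rank     */
    size_t nbytes;   /* message size         */
    int    comm;     /* put/get direction    */
};

/* Hypothetical command numbers; the module defines the real ones. */
enum { KCOPY_INIT = 1, KCOPY_PUT, KCOPY_GET, KCOPY_SYNC, KCOPY_BCAST, KCOPY_GATHER };

int main(void)
{
    struct kcopy_hdr hdr = { 0 };
    char msg[] = "hello";

    /* Open the connection to the kcopy.o module and get a device descriptor. */
    int dd = open("/dev/kcopy", O_WRONLY);
    if (dd < 0) { perror("open /dev/kcopy"); return 1; }

    hdr.myproc = 0;
    hdr.nprocs = 2;
    if (ioctl(dd, KCOPY_INIT, &hdr) < 0) perror("KCOPY_INIT");

    /* Put nbytes from sbuf in this process to dbuf in the destination process. */
    hdr.sbuf   = msg;
    hdr.dbuf   = NULL;   /* destination address exchanged via KCOPY_BCAST/GATHER */
    hdr.dest   = 1;
    hdr.nbytes = sizeof msg;
    if (ioctl(dd, KCOPY_PUT, &hdr) < 0) perror("KCOPY_PUT");

    /* Out-of-band barrier: everyone increments a sync element, proc 0 signals go. */
    if (ioctl(dd, KCOPY_SYNC, &hdr) < 0) perror("KCOPY_SYNC");

    close(dd);
    return 0;
}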
8
The MP_Lite kput.c module
  • 1-sided MP_Put/MP_Get are not implemented yet
  • 2-sided communications use both gets and puts
  • 4 separate circular queues are maintained to
    manage the 2-sided communications
  • MP_Send
  • If receive is pre-posted, put the data to the
    destination process
  • Then post a completion flag
  • If blocking on a send, malloc a buffer, copy
    data, and post the header to the destination
  • The destination process will then get the data
    from the source process and post a completion
    flag
  • The source process can use the completion flag to
    free the send buffer
  • MP_Recv
  • If a buffered send is posted, get the data from
    the source process
  • Then post a completion flag so the source process
    can free the send buffer
  • Else pre-post a receive header
  • Then block until a completion notification flag
    is posted in return

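The send side of this protocol can be sketched roughly as follows; the helper functions are hypothetical stand-ins for the kput.c internals, stubbed out so the sketch runs.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the kput.c internals, stubbed so this runs. */
static int  recv_is_preposted(int dest)      { (void)dest; return 0; }
static void kernel_put(int dest, const void *buf, size_t n)
{ (void)buf; printf("put %zu bytes to %d\n", n, dest); }
static void post_completion_flag(int dest)   { printf("completion flag -> %d\n", dest); }
static void post_header(int dest, const void *buf, size_t n)
{ (void)buf; printf("header -> %d (%zu bytes buffered)\n", dest, n); }
static int  completion_posted(void)          { return 1; }

static void MP_Send_sketch(const void *buf, size_t nbytes, int dest)
{
    if (recv_is_preposted(dest)) {
        /* Receive already posted: put the data straight into the
         * destination buffer, then post a completion flag.        */
        kernel_put(dest, buf, nbytes);
        post_completion_flag(dest);
    } else {
        /* Blocking send with no matching receive yet: buffer the data
         * and post the header so the destination can get it later.   */
        void *send_buf = malloc(nbytes);
        memcpy(send_buf, buf, nbytes);
        post_header(dest, send_buf, nbytes);

        /* The destination gets the data and posts a completion flag,
         * which tells the source it is safe to free the send buffer. */
        while (!completion_posted())
            ;                          /* real code would not busy-spin */
        free(send_buf);
    }
}

int main(void)
{
    char msg[] = "example payload";
    MP_Send_sketch(msg, sizeof msg, 1);
    return 0;
}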
9
SMP message-passing between caches
All MPI libraries can be made to achieve a 1 µs
latency. 2-copy methods can still show some
possible benefits in the cache range
(measurements with codes are needed). The 1-copy
mechanism in MP_Lite nearly doubles the
throughput for large messages. Adding an
optimized non-temporal memory copy routine to the
Linux kernel module nearly doubles the
performance again, but only for Pentium 4
systems.
10
SMP message-passing performance without cache
effects
The benefits of the 1-copy method and the
optimized memory copy routine are seen more
clearly when the messages do not start in cache.
There may still be room to improve the throughput
by another 10-20%:
  - Copy contiguous pages
  - Get rid of the spin-lock in kmap
Are optimized memory copy routines available for
other processors? This Linux module is the
perfect place to put any optimized memory routines.
11
0-copy SMP message-passing using a copy-on-write
mechanism
(Diagram: a kernel virtual copy maps the same physical pages 1-3 into both Process 0 and Process 1, marked copy-on-write)
  • If the source and destination buffers are similarly aligned
  • If the message contains full pages to copy
  • Give access to both processes and mark the pages as copy-on-write
  • Pages will only be copied if either process writes to them
  • - Both processes could easily trample the buffers much later
  • Effective use would require the user to align buffers
    (posix_memalign; see the example below)
  • The user must also protect against trampling (malloc before, free right after)
  • - This still does not help with transferring writable buffers

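A minimal example of the alignment step asked of the user: round the message buffer up to whole pages and page-align it with posix_memalign.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long   page   = sysconf(_SC_PAGESIZE);     /* typically 4096 bytes */
    size_t nbytes = 1 << 20;                   /* 1 MB message buffer  */
    void  *buf    = NULL;

    /* Round the allocation up to whole pages and align it on a page. */
    size_t rounded = ((nbytes + page - 1) / page) * page;
    if (posix_memalign(&buf, (size_t)page, rounded) != 0) {
        perror("posix_memalign");
        return 1;
    }
    printf("buffer %p, %zu bytes, page size %ld\n", buf, rounded, page);

    /* ... fill buf, hand it to the message-passing layer, then ... */
    free(buf);
    return 0;
}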
12
0-copy SMP message-passing by transferring
ownership of the physical pages
MP_SendFree() MP_MallocRecv()
(Diagram: a kernel virtual copy re-assigns physical pages 1-3 from Process 0 to Process 1 instead of copying them)
  • If the source node does not need to retain a
    copy, do an MP_SendFree().
  • The kernel can re-assign the physical pages to
    the destination process.
  • The destination process then does an
    MP_MallocRecv().
  • Only the partially filled pages would need to be
    copied.
  • Can also provide alignment help to page-align
    and pad buffers
  • Destination process owns the new pages (they
    are writable)
  • - Requires extensions to the MPI standard
  • Must modify the code, but should be fairly
    easy (but what about Fortran?)
  • Can be combined with copy-on-write to share
    read-only data

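A usage sketch of the proposed calls. The prototypes and the single-process stand-in stubs below are assumptions for illustration, since the slide only names MP_SendFree() and MP_MallocRecv().

#include <stdio.h>
#include <stdlib.h>

/* Assumed prototypes for the proposed calls (not part of any current MPI). */
int   MP_SendFree(void *buf, int nbytes, int dest, int tag);
void *MP_MallocRecv(int *nbytes, int src, int tag);

/* Toy single-process stand-ins so the sketch runs; a real implementation
 * would re-assign the physical pages in the kernel instead of copying.  */
static void *in_flight;
static int   in_flight_n;
int MP_SendFree(void *buf, int nbytes, int dest, int tag)
{ (void)dest; (void)tag; in_flight = buf; in_flight_n = nbytes; return 0; }
void *MP_MallocRecv(int *nbytes, int src, int tag)
{ (void)src; (void)tag; *nbytes = in_flight_n; return in_flight; }

int main(void)
{
    /* Sender side: fill a buffer, then give the pages away.              */
    int   n   = 4096 * 4;              /* whole pages move without copying */
    char *buf = malloc(n);
    buf[0] = 42;
    MP_SendFree(buf, n, /*dest=*/1, /*tag=*/0);
    /* buf must not be touched again here: ownership has moved.           */

    /* Receiver side: adopt the pages as a freshly "malloced" buffer.     */
    int   rn;
    char *rbuf = MP_MallocRecv(&rn, /*src=*/0, /*tag=*/0);
    printf("received %d bytes, first byte %d\n", rn, rbuf[0]);

    free(rbuf);                        /* receiver now owns the pages      */
    return 0;
}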
13

... with or without fence calls. Measure performance or do an integrity test.
http://www.scl.ameslab.gov/Projects/NetPIPE/
NetPIPE_4.x CVS repository on AFS
14
NetPIPE 4.x options
  • -H hostfile → Choose a host file name other than the default nphosts
  • -B → bidirectional mode
  • -a → asynchronous sends and receives
  • -S → Do synchronous sends and receives
  • -O s,d → Offset the source buffer by s bytes and the destination buffer by d bytes
  • -I → Invalidate cache (no cache effects)
  • -i → Do an integrity check (do not measure performance)
  • -s → Stream data in one direction only
15
Aggregate Measurements
  • Overlapping pair-wise ping-pong tests.
  • Measure switch performance or line speed
    limitations
  • Use bidirectional mode to fully saturate and
    avoid synchronization concerns

(Diagram: nodes n0-n3 communicating in overlapping pairs through an Ethernet switch; line speed vs end-point limited)
  • Investigate other methods for testing the
    global network.
  • Evaluate the full range from simultaneous
    nearest neighbor communications to all-to-all.

16
The IBM SP at NERSC
  • 416 16-way SMP nodes connected by an SP switch
  • 380 IBM Nighthawk compute nodes → 6080 compute processors
  • 16-processor SMP node
  • 375 MHz Power3 processors
  • 4 Flops / cycle → peak of 1500 MFlops/processor
  • ALCMD gets around 150 MFlops/processor
  • 16 GB RAM / node (some 32 and 64 GB nodes)
  • Limited to 2 GB / process
  • AIX with MPI or OpenMP
  • IBM Colony switch connecting the SMP nodes
  • 2 network adaptors per node
  • MPI
  • http://hpcf.nersc.gov/computers/SP/

17
SMP message-passing performance on an IBM SP node
The aggregate bi-directional bandwidth is 4500
Mbps between one pair of processes on the same
SMP node with a latency of 14 us. The bandwidth
scales ideally for two pairs communicating
simultaneously. Efficiency drops 80% when 4 pairs
are communicating, saturating the main memory
bandwidth on the node. Communication bound codes
will suffer when run on all 16 processors due to
this saturation.
18
Message-passing performance between IBM SP nodes
The aggregate bi-directional bandwidth is 3 Gbps
between one pair of processes between nodes with
a latency of 23 us. The bandwidth scales ideally
for two pairs communicating simultaneously, which
makes sense given that there are two network
adapter cards per node. Efficiency drops 70% when
4 pairs are communicating, just about saturating
the off-node communication bandwidth.
Communication bound codes will suffer when run on
more than 4 processors due to this saturation.
19
12-processor Cray XD1 chassis
  • 6 dual-processor Opteron nodes → 12 processors
  • 2-processor SMP nodes
  • 2.2 GHz Opteron processors
  • 2 GB RAM / node
  • SuSE Linux with the 2.4.21 kernel
  • RapidArray interconnect
  • 2 network chips per node
  • MPICH 1.2.5

20
Message-passing performance on a Cray XD1
TCP performance reaches 2 Gbps with a 10 us
latency. The MPI performance between processors
on a dual-processor SMP node on the Cray XD1
reaches 7.5 Gbps with a 1.8 us latency. MPI
performance between nodes reaches 10 Gbps with a
1.8 us latency. With MPI, it is currently faster
to communicate between processors on separate
nodes than within the same SMP node.
21
Message-passing performance between nodes on a
Cray XD1
The MPI performance between 2 processors on a
dual-processor SMP node on the Cray XD1 reaches
7.5 Gbps with a 1.8 us latency. There are
severe dropouts starting at 8 kB. Similar
dropouts have been seen in an InfiniBand module
by D.K. Panda based on MPICH. That module may be
the source of the Cray XD1 MPI version. We need
to work with Cray, D.K. Panda, and Argonne to
resolve this problem.
22
Message-passing performance on a Cray XD1 SMP node
The MPI performance between 2 processors
on a dual-processor SMP node on the Cray XD1
reaches 7.5 Gbps with a 1.8 us latency. The
characteristics show no sign of a special SMP
module, so data probably goes out to the network
chip and back to the 2nd processor. There are
severe dropouts starting at 8 kB similar to
off-node performance.
23
Aggregate message-passing performance between
nodes
The MPI performance between 2 nodes reached a
maximum of 10 Gbps. The aggregate performance for
bidirectional communications between the same 2
nodes reaches 15 Gbps, showing a loss of 50%. The
aggregate performance across the same link using
both processors on each node reaches the same
maximum of 15 Gbps. This is measuring the
limitation of a single link. We do not have
access to a large enough system to test for
saturation of the RapidArray switch.
24
Aggregate message-passing performance across the
switch
The aggregate performance across one link using
both processors on each node reaches 15 Gbps. The
same measurement using 4 nodes (4 pairs of
processors communicating simultaneously) reaches
an ideal doubling, showing no saturation of the
switch.
25
  • Most applications do not take advantage of the network topology
  • → There are many different topologies, which can even vary at run-time
  • → Schedulers usually do not allow requests for a given topology
  • → No portable method for mapping to the topology
  • → Loss of performance and scalability
  • NodeMap will automatically determine the network topology at run-time
    and pass the information to the message-passing library.

26
  • Network modules identify the topology using a
    variety of discovery techniques.
  • Hosts and switches are stored from nearest
    neighbors out.
  • Topology analysis routines identify regular
    topologies.
  • Latency, maximum and aggregate bandwidth tests
    provide more accurate performance measurements.
  • The best mapping is provided through the
    MPI_Init or MPI_Cart_create functions.

27
NodeMap network modules
  • Run-time use of NodeMap means it must operate
    very quickly (seconds, not minutes).
  • Use brute force ping-pong measurements as a
    last resort.
  • Reduce the complexity of the problem at each
    stage.
  • Always identify SMP processes first using
    gethostname.
  • Identify each node's nearest neighbors, then work outwards.
  • Store the neighbor and switch information in
    the manner it is discovered.
  • Store local neighbors, then 2nd neighbors, etc.
  • This data structure based on discovery makes it
    easy to write the topology analysis routines.
  • The network type or types should be known from
    the MPI configuration.
  • This identifies which network modules to run.
  • The MPI configuration provides paths to the
    native software libraries.

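One possible layout for storing the discovery results nearest-neighbors-out, as described above. The names and fields are illustrative assumptions, not the actual NodeMap structures.

#include <stdio.h>

#define MAX_NEIGHBORS 64

struct nm_host {
    char hostname[64];      /* from gethostname() on that rank        */
    int  rank;              /* MPI rank                               */
    int  distance;          /* 0 = same SMP node, 1 = 1st neighbor... */
    int  switch_id;         /* switch this host hangs off, if known   */
};

struct nm_topology {
    struct nm_host neighbor[MAX_NEIGHBORS];   /* stored nearest-out */
    int            nneighbors;
};

/* Append a discovered host, preserving the discovery order. */
static void nm_add_neighbor(struct nm_topology *t, const char *name,
                            int rank, int distance, int switch_id)
{
    struct nm_host *h = &t->neighbor[t->nneighbors++];
    snprintf(h->hostname, sizeof h->hostname, "%s", name);
    h->rank = rank;
    h->distance = distance;
    h->switch_id = switch_id;
}

int main(void)
{
    struct nm_topology topo = { .nneighbors = 0 };
    nm_add_neighbor(&topo, "n0", 1, 0, -1);   /* same SMP node     */
    nm_add_neighbor(&topo, "n1", 2, 1,  0);   /* nearest neighbor  */
    nm_add_neighbor(&topo, "n2", 3, 2,  1);   /* second neighbor   */
    for (int i = 0; i < topo.nneighbors; i++)
        printf("%s rank %d distance %d\n", topo.neighbor[i].hostname,
               topo.neighbor[i].rank, topo.neighbor[i].distance);
    return 0;
}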
28
Myrinet static.map file
s - "s0" 15 0 s - "s1" 14 1 s - "s2" 14 2 s -
"s3" 14 3 s - "s4" 14 4 s - "s5" 14 5 s - "s6"
14 6 s - "s7" 14 7 s - "s8" 14 9 h - "a18" 0 10 h
- "a19" 0 11 h - "4pack" 0 12 h - "m22" 0 13 h -
"m19" 0 14 h - "m18" 0 15 h - "m20" 0   s -
"s1" 8 0 s - "s9" 0 1 s - "s10" 0 6 s - "s11"
0 11 s - "s12" 0 12 s - "s13" 0 13 s - "s14" 0 14
s - "s0" 0 15 s - "s15" 0
h - "m22" 1 0 s - "s0" 12 number 0 address
0060dd7fb1f9 gmId 78 hostType 0   h - "m27" 1 0 s
- "s14" 14 number 0 address 0060dd7fb0e8 gmId
62 hostType 0
  • Parse the gm/sbin/static.map file
  • Each host has an entry
  • -- Connected to what switch?
  • Each switch has an entry
  • -- Lists all hosts connected
  • -- Lists all switches connected
  • Internal switches have no hosts
  • Determine the complete topology
  • Determine our topology

29
Myrinet module for NodeMap
4pack> gm_board_info
lanai_cpu_version = 0x0900 (LANai9.0)
max_lanai_speed   = 134 MHz (should be labeled "M3M-PCI64B-2-59521")

gmID  MAC Address    gmName  Route
----  -------------  ------  ---------------
   2  0060dd7fb1c0   m26     ba bd 88
   3  0060dd7fb0e0   m18     83
   4  0060dd7fb1f6   m29     ba b3 89
  55  0060dd7facb2   m25     ba b2 88
  56  0060dd7fb0ec   m24     ba bf 86
  58  0060dd7fb106   m20     84
  59  0060dd7fb0e3   m19     82
  61  0060dd7fb1bd   m28     ba be 86
  62  0060dd7fb0e8   m27     ba bf 89
  67  0060dd7faca8   m23     ba be 89
  77  0060dd7facb0   4pack   80 (this node)
  78  0060dd7fb1f9   m22     ba be 88
  80  0060dd7fb0ed   m32     ba b3 88
  93  0060dd7fb0e1   m31     ba be 87
  • Probe using gm_board_info.
  • Use header info to ID board.
  • Verify clock rate with dmesg
  • Provides exact routing
  • - Not really needed
  • Measure the latency and bandwidth
  • Provide the best mapping

30
InfiniBand module for NodeMap
opteron1> minism InfiniHost0
minism> d

New Discovered Node
New Node - Type=CA NumPorts=02 LID=0003
New Discovered Node
New Link 4x FromLID=0003 FromPort=01 ToLID=0004 ToPort=08

New Node - Type=Sw NumPorts=08 LID=0004
No Link 1x FromLID=0004 FromPort=01
No Link 1x FromLID=0004 FromPort=02
No Link 1x FromLID=0004 FromPort=03
No Link 1x FromLID=0004 FromPort=04
New Discovered Node
New Link 4x FromLID=0004 FromPort=05 ToLID=0002 ToPort=06
New Link 4x FromLID=0004 FromPort=06 ToLID=0002 ToPort=05
New Link 4x FromLID=0004 FromPort=07 ToLID=0009 ToPort=01
New Link 4x FromLID=0004 FromPort=08 ToLID=0003 ToPort=01

New Discovered Node
New Node - Type=CA NumPorts=02 LID=0009
New Link 4x FromLID=0009 FromPort=01 ToLID=0004 ToPort=07
  • Probe the subnet manager (minism or other)
  • Identify my LID
  • Exchange LIDs
  • Parse and store the links, switches, and other
    HCAs
  • This is all that is needed to determine the
    topology

31
IP module for NodeMap
  • The IP interface can sit on top of many types of network hardware.
  • Ethernet, ATM, IP over Myrinet, IP over
    InfiniBand, etc.
  • How to determine what network cards are
    present?
  • ifconfig provides a list of active interfaces
  • Does tell what type of network
  • May tell what speed for Ethernet
  • lspci, hinv provide a description of what is
    plugged into the PCI slots
  • Sometimes helpful, but may require a database
  • Can measuring latency identify the number of
    switches in between?
  • This may require many point-to-point
    measurements
  • OS, driver, and NIC can affect the latency
    themselves
  • It may identify which nodes are on a local
    switch
  • → Can simultaneous measurements be done to make this efficient?
  • Use aggregate measurements to probe higher
    level switches?

32
MPP modules for NodeMap
  • Use uname or compiler definitions to identify
    the MPP type
  • - _CRAYT3E is defined for the Cray T3E
  • -- _AIX for IBM SP (anything better?)
  • → These identify the global network topology
  • Use gethostname to identify SMP processes (see the sketch below)
  • This reduces the complexity of the problem
  • Use vendor functions if available (getphysnode
    on the Cray T3E)
  • Do we need a module for each new type of MPP???

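A minimal sketch of the gethostname step: each rank gathers every hostname and counts how many ranks share its node. This is illustrative scaffolding, not NodeMap code.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NAME_LEN 64

int main(int argc, char **argv)
{
    int rank, nprocs, on_my_node = 0;
    char myname[NAME_LEN] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    gethostname(myname, NAME_LEN - 1);

    /* Gather all hostnames so each rank can identify its SMP partners. */
    char *allnames = malloc((size_t)nprocs * NAME_LEN);
    MPI_Allgather(myname, NAME_LEN, MPI_CHAR,
                  allnames, NAME_LEN, MPI_CHAR, MPI_COMM_WORLD);

    for (int i = 0; i < nprocs; i++)
        if (strncmp(myname, allnames + (size_t)i * NAME_LEN, NAME_LEN) == 0)
            on_my_node++;

    printf("rank %d on %s shares its node with %d other rank(s)\n",
           rank, myname, on_my_node - 1);

    free(allnames);
    MPI_Finalize();
    return 0;
}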
33
A Generic MPI Module
  • How quickly can a brute force analysis be done?
  • Run NodeMap once to generate a static map file of
    the topology?
  • Identify SMP processes first using gethostname
  • Gather information for a single host
  • Measure the latency and maximum throughput to
    one or more nodes
  • Probe using overlapping pair-wise bandwidth measurements
  • Try to measure the congestion factor
  • Increase the number of pairs until the performance improves → X-dimension
  • Repeat in the Y-dimension
  • Use 2D global shifts for several regular arrangements
  • Measure the bi-sectional bandwidth → Can identify fat trees
  • Additional tricks needed!!!

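A sketch of one overlapping pair-wise probe: ranks are paired 0-1, 2-3, ... and all pairs exchange a large message at the same time, so contention on shared links shows up as reduced per-pair bandwidth. The message size and pairing policy are assumptions.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20;                /* 1 MB probe message       */
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(nbytes);
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

    MPI_Barrier(MPI_COMM_WORLD);               /* start all pairs together */
    double t0 = MPI_Wtime();

    if (partner >= 0 && partner < nprocs) {
        if (rank % 2 == 0) {                   /* even ranks send first    */
            MPI_Send(buf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
        }
        double elapsed = MPI_Wtime() - t0;
        double mbps = 2.0 * nbytes * 8.0 / elapsed / 1.0e6;  /* both directions */
        printf("rank %d <-> %d : %.0f Mbps\n", rank, partner, mbps);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}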
(Diagram: nodes n0-n3 probed with overlapping pair-wise measurements)
34
Topological Analysis Routines
  • Initially concentrate only on regular
    arrangements
  • N-dimensional mesh/torus, trees, SMP nodes
  • Identify the topology from the host/switch
    information gathered
  • How many unique 1st, 2nd, 3rd, ... neighbors → mesh/torus
  • Determine whether a tree is fully connected
  • Eventually handle more irregular arrangements
    of nodes.
  • Identify the general type of network
  • - May have an irregular arrangement of nodes on
    a mesh/torus
  • Identify which nodes are irregular

35
Performance Measurements
  • Performance may be limited by many factors
  • Measure latency and max bandwidth for each
    network layer
  • Measure aggregate performance across switches
    or a given link
  • Performance data can help determine the
    topology
  • A tree with fewer internal switches may still
    be a fat tree
  • Feed the performance data to the Mapper along
    with topological data

36
The Mapper
  • NodeMap will initially be run from MPI_Cart_create(..., reorder=1)
  • User must determine which direction requires the optimal passing
  • The Mapper will take the topology and performance data and provide the
    best mapping of the desired 2D (or eventually 3D or tree) algorithm.
  • Concentrate on regular arrangements first
  • Use Gray codes for 2D onto N-dimensional topologies to guarantee nearest
    neighbors are directly connected (see the sketch below)
  • NodeMap may also be run from MPI_Init()
  • Provide optimal mappings for global operations
    (mainly binary trees)

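A sketch of the classic Gray-code embedding referred to above: a 2^m x 2^n process mesh is mapped onto an (m+n)-dimensional hypercube by concatenating the reflected Gray codes of the row and column indices, so mesh neighbors differ in exactly one bit (i.e. they are directly connected).

#include <stdio.h>

/* Reflected binary Gray code of i. */
static unsigned gray(unsigned i) { return i ^ (i >> 1); }

/* Hypercube node for mesh coordinate (row, col) on a 2^m x 2^n mesh. */
static unsigned mesh_to_cube(unsigned row, unsigned col, unsigned n)
{
    return (gray(row) << n) | gray(col);
}

int main(void)
{
    const unsigned m = 2, n = 2;            /* 4 x 4 mesh onto a 4-cube */
    for (unsigned r = 0; r < (1u << m); r++) {
        for (unsigned c = 0; c < (1u << n); c++)
            printf("(%u,%u)->%2u  ", r, c, mesh_to_cube(r, c, n));
        printf("\n");
    }
    return 0;
}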
37
Conclusions
  • Codes must be mapped to the network topology to
    scale well
  • Many codes can use a 2D decomposition
  • 2D algorithms can be mapped ideally to most
    network topologies
  • NodeMap will provide automatic mapping to the
    topology
  • Portable means of taking advantage of the network
    topology
  • Questions
  • How well will NodeMap handle irregular networks?
  • Will it be difficult to provide a reasonable
    mapping?
  • Can a generic MPI module effectively discover a
    topology?
  • If so, how quickly can this be done?
  • Will NodeMap need to generate static map files
    ahead of time?

38
Contact information
  • Dave Turner - turner@ameslab.gov
  • http://www.scl.ameslab.gov/Projects/MP_Lite/
  • http://www.scl.ameslab.gov/Projects/NetPIPE/

39
Ames Lab Classical Molecular Dynamics code
Embedded atom method, Lennard-Jones, and Tersoff potentials.
Uses cubic spline functions for optimal performance.
Local interactions only, typically 5-6 Å → 50-60 neighbors per atom.
2D decomposition of the 3D simulation space.
Map neighboring regions to neighboring nodes to
localize communications. Shift coordinates and
accumulators to all nodes above and to the right
to calculate all pair interactions within the
cutoff range. Large systems require just 5
communications, while systems spread across more
nodes may involve many more passes in a
serpentine fashion around half the
interaction range.
40
ALCMD scaling on the IBM SP
Proper mapping of the columns to SMP nodes helps
greatly. Parallel efficiency goes from 50% to 70%
for 10,000,000 atoms on 1024 processors. Scaling
beyond 1024 processors will be difficult. On-node
and off-node communications are saturated even at
1024 processors (16 x 16-way SMPs).
41
2D decomposition of algorithms
  • Many codes can be naturally decomposed onto a 2D
    mesh.
  • Atomistic codes: Classical and Order-N Tight-Binding
  • Many matrix operations
  • Finite difference and finite element codes
  • Grid and multi-grid approaches
  • 2D decompositions map well to most network topologies.
  • 2D and 3D meshes and toruses, hypercubes, trees, fat trees
  • Direct connections to nearest neighbor nodes prevent contention
  • Writing algorithms using a 2D decomposition can
    provide the initial step to taking advantage of
    the network topology.

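A small example of the 2D decomposition using MPI's Cartesian topology support, with reorder enabled so the library (or NodeMap) can remap ranks onto the network.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nprocs, rank;
    int dims[2] = {0, 0}, periods[2] = {1, 1};   /* periodic 2D torus */
    int coords[2], left, right, down, up;
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Dims_create(nprocs, 2, dims);            /* pick a near-square mesh */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, /*reorder=*/1, &cart);

    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);

    /* Nearest neighbors in each dimension for the shift operations. */
    MPI_Cart_shift(cart, 0, 1, &left, &right);
    MPI_Cart_shift(cart, 1, 1, &down, &up);

    printf("rank %d at (%d,%d): x-neighbors %d/%d, y-neighbors %d/%d\n",
           rank, coords[0], coords[1], left, right, down, up);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}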
42
Point-to-point Performance
InfiniBand can deliver 4500 - 6500 Mbps at a 7.5
us latency. Atoll delivers 1890 Mbps with a 4.7
us latency. SCI delivers 1840 Mbps with only a
4.2 us latency. Myrinet performance reaches 1820
Mbps with an 8 us latency. Channel-bonded GigE
offers 1800 Mbps for very large messages. Gigabit
Ethernet delivers 900 Mbps with a 25-62
us latency. 10 GigE only delivers 2 Gbps with a
75 us latency.
43
Evaluating network performance using ping-pong
tests
  • Bounce messages back and forth between
    processes to measure the performance
  • Messages are bounced many times for each test
    to provide an accurate timing
  • The latency is the minimum time needed to send
    a small message (1/2 round trip time)
  • The throughput is the communication rate
  • Start with a small message size (1 byte) and
    increase exponentially up to 8 MB
  • Use perturbations from the factors of 2 to
    fully test the system

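A bare-bones version of the ping-pong loop described above; NetPIPE does this far more carefully, with perturbed message sizes and cache control.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1;          /* small message -> latency measurement  */
    const int reps   = 1000;       /* bounce many times for accurate timing */
    int rank, nprocs;
    char *buf = calloc(nbytes, 1);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) { MPI_Finalize(); return 1; }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double half_rtt = (MPI_Wtime() - t0) / reps / 2.0;   /* latency estimate */
    if (rank == 0)
        printf("latency %.2f us, throughput %.1f Mbps\n",
               half_rtt * 1e6, nbytes * 8.0 / half_rtt / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}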
(Diagram: nodes n0-n3 connected through a switch)
44
Using NetPIPE 4.x
mpirun -np nprocs -hostfile nphosts NPmpi [NetPIPE options]
The -H hostfile option to NPmpi can be used to change the order of the pairings.
The nplaunch script launches the other NetPIPE executables.
The default host file name is nphosts, listing hosts with the first and last
communicating, the 2nd and 2nd-to-last, etc.
nplaunch NPtcp [NetPIPE options]
For aggregate measurements, use -B for bidirectional mode.
For NPmpi, -a may be needed for asynchronous sends and receives.