Title: Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet. CAC Workshop, Santa Fe, NM, 2004. Sameer Kumar* and Laxmikant V. Kale, Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
1Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet
CAC Workshop, Santa Fe, NM, 2004
Sameer Kumar and Laxmikant V. Kale
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
2Outline
- Processor virtualization
- QsNet
- Opportunities
- Performance Evaluation of QsNet
- Challenges of QsNet
- Summary
3Processor Virtualization
- Basic idea of processor virtualization
- User specifies interaction between objects (VPs)
- RTS maps VPs onto physical processors
- Typically, number of virtual processors > number of processors
- Embodied in Charm++ and AMPI
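The mapping step can be illustrated with a minimal sketch. It assumes a static round-robin placement, which is only one possible policy; the function name `map_vps_round_robin` is hypothetical and is not part of Charm++ or AMPI, whose runtime chooses and adjusts placements itself.

```python
from collections import defaultdict


def map_vps_round_robin(num_vps, num_procs):
    """Assign virtual processor (VP) ids to physical processors round-robin.

    Hypothetical static policy for illustration only; a real runtime
    system may use a different placement and migrate VPs at run time.
    """
    placement = defaultdict(list)
    for vp in range(num_vps):
        placement[vp % num_procs].append(vp)
    return dict(placement)


# Eight VPs on two physical processors: a virtualization ratio of 4.
placement = map_vps_round_robin(num_vps=8, num_procs=2)
print(placement)
```

The user code only names VPs; which physical processor hosts each VP is decided by the mapping function, so the policy can change without touching the application.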
4QsNet
- Popular interconnect from Quadrics
- Several parallel systems in top500 use QsNet
- Pittsburgh's Lemieux (6 TF)
- ASCI-Q (20 TF)
- Elite network
- Elan adaptor
5Elite Network
- 320 MB/s each way after protocol
- Reliable fat-tree network
- Multiple routes provide fault tolerance
- Adaptive wormhole routing
- 35 ns per hop
6Elan Network Adaptor
- Features
- Low latency (4.5 µs for MPI)
- High bandwidth (320MB/s/node)
- Components
- Sparc processor
- DMA Engine
- 64 MB RAM
- On chip cache
7Low CPU Overhead
CPU Overhead is small and does not change much
with the message size
8Traditional Message Passing
[Figure: timeline of processors P0 and P1 over time, marking send overhead and receive overhead]
Traditional Message Passing does not utilize low
CPU overhead of Elan
9Adaptive Overlap
[Figure: timeline of processor P0 running virtual processors VP0 and VP1 over time, marking send overhead and receive overhead]
Processor Virtualization takes full advantage of
the low CPU overhead of Elan
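A toy model may help show why more VPs per processor hide communication time. It is an assumption made for illustration, not a measurement or the authors' analysis: each iteration has a total compute time split evenly across the VPs, each VP waits one network latency for its data, CPU overhead is taken as zero, and the function name `iteration_time` is hypothetical.

```python
def iteration_time(compute, latency, vps_per_proc):
    """Estimate the time of one iteration on one physical processor.

    Toy model (assumption): the first VP finishes its share of the work
    after compute / vps_per_proc and its data arrives `latency` later.
    Meanwhile the processor keeps computing for the other VPs, so the
    next iteration starts when both the processor and the data are ready.
    """
    per_vp = compute / vps_per_proc
    return max(compute, per_vp + latency)


# Arbitrary time units, chosen only to show the shape of the effect.
no_virtualization = iteration_time(compute=10.0, latency=4.0, vps_per_proc=1)
with_virtualization = iteration_time(compute=10.0, latency=4.0, vps_per_proc=8)
print(no_virtualization, with_virtualization)
```

With one VP the latency adds directly to the iteration time; with eight VPs in this model it is fully overlapped with the computation of the other VPs.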
10Benefit of Adaptive Overlap
Problem setup: 3D stencil calculation of size 240^3 run on Lemieux. Shows AMPI with
virtualization ratios of 1 and 8.
11Charm++ Message-Driven Execution
[Figure: components shown are Handler, Scheduler, Pump, Garbage Collection, and Send]
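Message-driven execution can be sketched as a scheduler that repeatedly picks a queued message and runs the handler it names. The class and handler names below are hypothetical and the sketch runs on a single queue in one process; it is not the Charm++ or Converse scheduler, and it omits the network pump and garbage collection.

```python
from collections import deque


class Scheduler:
    """Minimal single-process message-driven scheduler (illustrative only)."""

    def __init__(self):
        self.queue = deque()   # pending (handler name, payload) messages
        self.handlers = {}     # handler name -> function
        self.log = []          # record of handler invocations

    def register(self, name, fn):
        self.handlers[name] = fn

    def send(self, name, payload):
        # A send only enqueues the message; it never waits for the receiver.
        self.queue.append((name, payload))

    def run(self):
        # Dispatch messages until none are left.
        while self.queue:
            name, payload = self.queue.popleft()
            self.handlers[name](self, payload)


def on_ping(sched, count):
    sched.log.append(("ping", count))
    if count > 0:
        sched.send("pong", count - 1)


def on_pong(sched, count):
    sched.log.append(("pong", count))
    if count > 0:
        sched.send("ping", count - 1)


sched = Scheduler()
sched.register("ping", on_ping)
sched.register("pong", on_pong)
sched.send("ping", 2)
sched.run()
print(sched.log)
```

Because no handler blocks waiting for a particular message, the processor always has work as long as any message is available.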
12NAMD: A Production MD System
- Written in Charm++
- Fully featured program
- NIH-funded development
- Distributed free of charge (5000 downloads so far)
- Binaries and source code
- Installed at NSF centers
- Large published simulations (e.g., aquaporin
simulation featured in keynote)
13Scaling NAMD
- Several QsNet challenges had to be overcome to
scale NAMD
14QsNet Challenge: Latency
Applications need to post receives for messages
of different sizes
15Latency Bottlenecks
- Latency
- Slow NIC processor with a 100 MHz clock
- Cache size only 8 KB
- Traversing a large loop flushes it
Cache Misses vs Number of Receives Posted
Receives posted   Cache misses
1                 86017
5                 92475
9                 103037
13                174060
17                1008003
16Managing Latency Message Combining
Organize processors in a 2D (virtual) Mesh
Message from (x1,y1) to (x2,y2) goes via (x1,y2)
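The routing rule above can be sketched as follows. The function names are hypothetical, and the message counts assume an all-to-all exchange on a full `nx` by `ny` virtual mesh in which messages that share an intermediate hop are combined into one.

```python
def mesh_route(src, dst):
    """Return the hops of a message from src=(x1, y1) to dst=(x2, y2).

    The message first travels to the intermediate processor (x1, y2),
    then on to (x2, y2). Hops that coincide are listed only once.
    """
    (x1, _y1), (_x2, y2) = src, dst
    via = (x1, y2)
    hops = [src]
    if via != src:
        hops.append(via)
    if dst != hops[-1]:
        hops.append(dst)
    return hops


def messages_per_processor(nx, ny):
    """Messages each processor sends in an all-to-all exchange.

    Direct: one message to every other processor.
    Mesh: one combined message to each of the (ny - 1) processors sharing
    its x coordinate, then one to each of the (nx - 1) sharing its y.
    """
    direct = nx * ny - 1
    mesh = (ny - 1) + (nx - 1)
    return direct, mesh


print(mesh_route((0, 0), (2, 3)))
print(messages_per_processor(4, 4))
```

Each message travels up to two hops instead of one, but every processor sends far fewer messages, which matters when the per-message cost dominates, as it does for short messages.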
17NAMD PME Performance
Performance of NAMD with the ATPase molecule. The PME
step in NAMD involves a 192 x 144 processor
collective operation with 900-byte messages
18QsNet Challenge Bandwidth
QsNet network bandwidth: 320 MB/s
           Bandwidth (MB/s)
One Way    290
Two Way    128
PCI/DMA contention restricts bandwidth on Alpha
servers
19Improving Bandwidth
Node bandwidth (MB/s) for different placements of
source and destination
           Main-Main   Elan-Main   Elan-Elan
One Way    290         305         319
Two Way    128         305         319
Sending messages from Elan memory is faster
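As a rough worked example, the bandwidth figures above can be turned into transfer times. The sketch assumes 1 MB = 10^6 bytes and ignores latency and protocol overhead, so the times are lower bounds; the names are hypothetical.

```python
# Node bandwidth in MB/s, taken from the figures above.
BANDWIDTH_MB_S = {
    ("main-main", "one-way"): 290,
    ("main-main", "two-way"): 128,
    ("elan-main", "one-way"): 305,
    ("elan-main", "two-way"): 305,
    ("elan-elan", "one-way"): 319,
    ("elan-elan", "two-way"): 319,
}


def transfer_time_us(nbytes, mb_per_s):
    """Bandwidth-only transfer time in microseconds (1 MB = 10**6 bytes)."""
    return nbytes / (mb_per_s * 1e6) * 1e6


main_two_way = transfer_time_us(1_000_000, BANDWIDTH_MB_S[("main-main", "two-way")])
elan_two_way = transfer_time_us(1_000_000, BANDWIDTH_MB_S[("elan-elan", "two-way")])
speedup = main_two_way / elan_two_way
print(main_two_way, elan_two_way, speedup)
```

Under these assumptions, two-way traffic between Elan memories moves the same data roughly 2.5 times faster than two-way traffic between main memories.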
20QsNet Challenge Stretched Handlers
NAMD Timeline
- Stretched Sends
- Green superscripts
- Similar stretches observed in the middle of entry
methods
[Figure: NAMD timeline; axes: processors and time]
21Stretching Solution
- Stretched Sends
- Elan Isend blocked when the rendezvous for the previous Isend to any destination had not been acknowledged
- Solved the problem by closely working with Quadrics and obtaining a patch
- Isend only blocks on the rendezvous of the previous message to the same destination
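The difference between the two behaviours can be expressed as a small predicate. This is a model of the rule as stated above, not the Elan library's API or implementation; `isend_blocks` and its arguments are hypothetical.

```python
def isend_blocks(dest, pending_rendezvous, patched):
    """Decide whether a new Isend to `dest` would block.

    pending_rendezvous: set of destinations whose previous rendezvous
    has not yet been acknowledged.
    patched=False models the original behaviour (block on any pending
    rendezvous); patched=True models the behaviour after the patch
    (block only on a pending rendezvous to the same destination).
    """
    if patched:
        return dest in pending_rendezvous
    return len(pending_rendezvous) > 0


pending = {3}  # a rendezvous to processor 3 is still unacknowledged

before_patch = isend_blocks(dest=5, pending_rendezvous=pending, patched=False)
after_patch = isend_blocks(dest=5, pending_rendezvous=pending, patched=True)
same_dest = isend_blocks(dest=3, pending_rendezvous=pending, patched=True)
print(before_patch, after_patch, same_dest)
```

In this model a slow receiver delays only the sends addressed to it after the patch, instead of stalling every send from the processor.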
22Stretching Solution Contd.
- Stretches in the middle of entry methods
- Caused by OS daemons
- Using blocking receives minimized these stretches
- Daemons can be scheduled when processor is idle
23NAMD With Blocking Receives
[Figure: NAMD timeline with blocking receives; axes: processors and time]
24NAMD Performance on Lemieux
25Summary
- QsNet is an excellent network
- NIC co-processor is ideal for message-driven execution
- Programming guidelines
- Send messages from Elan memory
- Post a limited number of receives, and post them before the sends
- Use blocking receives to avoid stretching
26Future Work
- One sided communication
- Barrier?
- Persistent one sided communication
- Reserve buffers on destination
27Fat Tree Topology
28Elan3 Adapter
- DMA Engine
- Thread Processor
- On chip shared cache
- 64-bit, 66 MHz PCI interface
- 64 MB RAM
29Object Based Communication Framework
Application
AMPI
Charm++
Comm. Framework Object Layer
Performs Object Level Optimizations
Converse
Comm. Framework Processor Layer
Optimizes Inter-Processor Communication
Communication Layer
30AAPC Processor Overhead
[Figure: completion time and CPU overhead for the Mesh and Direct strategies]
Lower CPU overhead enables applications using
Mesh to perform better even for large messages