Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet. CAC Workshop, Santa Fe, NM, 2004. Sameer Kumar* and Laxmikant V. Kale, Parallel Programming Laboratory, University of Illinois at Urbana Champaign


Learn more at: http://charm.cs.uiuc.edu


Transcript and Presenter's Notes

1
Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet
CAC Workshop, Santa Fe, NM, 2004
Sameer Kumar and Laxmikant V. Kale
Parallel Programming Laboratory
University of Illinois at Urbana Champaign
2
Outline

Processor virtualization
QsNet
Opportunities
Performance Evaluation of QsNet
Challenges of QsNet
Summary

3
Processor Virtualization

Basic idea of processor virtualization
User specifies interaction between objects (VPs)
RTS maps VPs onto physical processors
Typically, number of virtual processors > number of physical processors
Embodied in Charm++ and AMPI
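
The mapping of VPs onto physical processors can be illustrated with a minimal Python sketch. It assumes a simple round-robin placement, which is an assumption made here for illustration; the runtime system may place VPs differently. `map_vps_round_robin` is a hypothetical helper, not a Charm++ or AMPI call.

```python
def map_vps_round_robin(num_vps, num_procs):
    """Assign each virtual processor (VP) to a physical processor.

    Round-robin placement is an assumed policy for illustration only.
    """
    if num_procs <= 0:
        raise ValueError("need at least one physical processor")
    placement = {proc: [] for proc in range(num_procs)}
    for vp in range(num_vps):
        placement[vp % num_procs].append(vp)
    return placement


# Virtualization ratio of 8: 32 VPs on 4 physical processors.
placement = map_vps_round_robin(32, 4)
print(placement[0])  # VPs hosted by physical processor 0
```

With several VPs per processor, the runtime has other work to schedule while one VP waits for a message.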

4
QsNet

Popular interconnect from Quadrics
Several parallel systems in top500 use QsNet
Pittsburgh's Lemieux (6 TF)
ASCI-Q (20TF)
Elite network
Elan adaptor

5
Elite Network

320 MB/s each way after protocol
Reliable fat-tree network
Multiple routes provide fault tolerance
Adaptive wormhole routing
35 ns per hop

6
Elan Network Adaptor

Features
Low latency (4.5 µs for MPI)
High bandwidth (320 MB/s/node)
Components
Sparc processor
DMA Engine
64 MB RAM
On chip cache

7
Low CPU Overhead
CPU Overhead is small and does not change much
with the message size
8
Traditional Message Passing
[Figure: timelines of processors P0 and P1 against time, marking send overhead and receive overhead]
Traditional message passing does not utilize the low CPU overhead of Elan
9
Adaptive Overlap
[Figure: timeline of processor P0 running virtual processors VP0 and VP1 against time, marking send overhead and receive overhead]
Processor virtualization takes full advantage of the low CPU overhead of Elan
10
Benefit of Adaptive Overlap
Problem setup: 3D stencil calculation of size 240^3, run on Lemieux. Shows AMPI with virtualization ratios of 1 and 8.
11
Charm++ Message Driven Execution
[Figure: diagram with components labeled Handler, Scheduler, Pump, Garbage Collection, Send]
12
NAMD: A Production MD System

Written in Charm++
Fully featured program
NIH-funded development
Distributed free of charge (5000 downloads so
far)
Binaries and source code
Installed at NSF centers
Large published simulations (e.g., aquaporin
simulation featured in keynote)

13
Scaling NAMD

Several QsNet challenges had to be overcome to
scale NAMD

14
QsNet Challenge: Latency
Applications need to post receives for messages
of different sizes
15
Latency Bottlenecks

Latency
Slow NIC processor with a 100 MHz clock
Cache size is only 8 KB
Traversing a large loop flushes the cache

Cache Misses vs Number of Receives Posted

Receives posted    Cache misses
1                  86017
5                  92475
9                  103037
13                 174060
17                 1008003
16
Managing Latency: Message Combining
Organize processors in a 2D (virtual) Mesh
Message from (x1,y1) to (x2,y2) goes via (x1,y2)
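
The routing rule above can be sketched in Python. This is a minimal illustration, not the Charm++ implementation: `mesh_route` and `first_phase_groups` are hypothetical helper names, and processors are assumed to be identified by (x, y) coordinates in the virtual mesh.

```python
def mesh_route(src, dst):
    """Return the processors a message visits in the 2D virtual mesh.

    A message from (x1, y1) to (x2, y2) goes via (x1, y2). If the
    intermediate coincides with the source or the destination, the
    duplicate hop is dropped.
    """
    (x1, _y1), (_x2, y2) = src, dst
    route = [src]
    for hop in ((x1, y2), dst):
        if hop != route[-1]:
            route.append(hop)
    return route


def first_phase_groups(src, destinations):
    """Group destinations by the intermediate processor that forwards them.

    Data for every destination in one group can be combined into a single
    first-phase message to that intermediate. Destinations whose
    intermediate is the source itself are sent directly and are skipped.
    """
    groups = {}
    for dst in destinations:
        via = (src[0], dst[1])
        if via != src:
            groups.setdefault(via, []).append(dst)
    return groups


# Example: processor (0, 0) sends to every other processor of a 4 x 4 mesh.
src = (0, 0)
others = [(x, y) for x in range(4) for y in range(4) if (x, y) != src]
groups = first_phase_groups(src, others)
print(mesh_route(src, (2, 3)))
print(len(others), len(groups))
```

In this example, data for 12 of the 15 destinations travels in 3 combined first-phase messages; the remaining 3 destinations share the sender's y coordinate and are reached directly.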
17
NAMD PME Performance
Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 x 144 processor collective operation with 900-byte messages.
18
QsNet Challenge: Bandwidth
QsNet network bandwidth: 320 MB/s

Bandwidth (MB/s)
One Way    290
Two Way    128

PCI/DMA contention restricts bandwidth on Alpha servers
19
Improving Bandwidth
Node bandwidth (MB/s) for different placements of source and destination

           Main-Main   Elan-Main   Elan-Elan
One Way    290         305         319
Two Way    128         305         319
Sending messages from Elan memory is faster
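
As a quick check of how much the placement matters, the short Python sketch below computes the two-way speedup of each placement relative to Main-Main. The numbers are copied from this slide; the dictionary layout and variable names are only illustrative.

```python
# Node bandwidth in MB/s, copied from the slide.
bandwidth = {
    "Main-Main": {"one_way": 290, "two_way": 128},
    "Elan-Main": {"one_way": 305, "two_way": 305},
    "Elan-Elan": {"one_way": 319, "two_way": 319},
}

baseline = bandwidth["Main-Main"]["two_way"]
speedups = {
    placement: bw["two_way"] / baseline
    for placement, bw in bandwidth.items()
}

for placement, speedup in speedups.items():
    print(f"{placement}: two-way bandwidth {speedup:.2f}x of Main-Main")
```

The two-way case benefits most: with the source buffer in Elan memory, two-way bandwidth matches the one-way figure instead of dropping to 128 MB/s.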
20
QsNet Challenge: Stretched Handlers
NAMD Timeline

Stretched sends
Green superscripts
Similar stretches observed in the middle of entry methods

[Figure: NAMD timeline; axes: Processors vs. Time]
21
Stretching Solution

Stretched Sends
Elan Isend blocked whenever the rendezvous for the previous Isend, to any destination, had not yet been acknowledged
Solved the problem by working closely with Quadrics and obtaining a patch
With the patch, Isend blocks only on the rendezvous of the previous message to the same destination
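
The difference between the two blocking rules can be expressed as a toy Python model. This is not Elan library code: `isend_blocks` is a hypothetical function, and the set of destinations with an unacknowledged rendezvous is an assumed way of representing the state described above.

```python
def isend_blocks(unacked_dests, dest, patched):
    """Decide whether a new Isend to `dest` would block (toy model).

    unacked_dests: destinations of earlier Isends whose rendezvous has
                   not been acknowledged yet (assumed representation).
    patched:       False models the original behavior, True the
                   behavior after the patch.
    """
    if not patched:
        # Original rule: any outstanding rendezvous blocks the next Isend.
        return len(unacked_dests) > 0
    # Patched rule: only an outstanding rendezvous to the same destination blocks.
    return dest in unacked_dests


outstanding = {3}  # rendezvous to processor 3 not yet acknowledged
print(isend_blocks(outstanding, 5, patched=False))  # True
print(isend_blocks(outstanding, 5, patched=True))   # False
print(isend_blocks(outstanding, 3, patched=True))   # True
```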

22
Stretching Solution Contd.

Stretches in the middle of entry methods
Caused by OS daemons
Using blocking receives minimized these stretches
Daemons can be scheduled when processor is idle

23
NAMD With Blocking Receives
[Figure: NAMD timeline with blocking receives; axes: Processors vs. Time]
24
NAMD Performance on Lemieux
25
Summary

QsNet is an excellent network
NIC co-processor is ideal for message driven execution
Programming guidelines
Send messages from Elan memory
Post a limited number of receives, and post them before the sends
Use blocking receives to avoid stretching

26
Future Work

One sided communication
Barrier?
Persistent one sided communication
Reserve buffers on destination

27
Fat Tree Topology
28
Elan3 Adapter

DMA Engine
Thread Processor
On chip shared cache
64-bit, 66 MHz PCI interface
64 MB RAM

29
Object Based Communication Framework
[Figure: software layers]
Application
AMPI
Charm++
Comm. Framework Object Layer: performs object-level optimizations
Converse
Comm. Framework Processor Layer: optimizes inter-processor communication
Communication Layer
30
AAPC Processor Overhead
[Figure: plot with series Mesh Completion Time, Direct Completion Time, Direct CPU Overhead, Mesh CPU Overhead]
Lower CPU overhead enables applications using Mesh to perform better even for large messages
