Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet. CAC Workshop, Santa Fe, NM, 2004. Sameer Kumar* and Laxmikant V. Kale, Parallel Programming Laboratory, University of Illinois at Urbana-Champaign


1
Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet
CAC Workshop, Santa Fe, NM, 2004
Sameer Kumar and Laxmikant V. Kale
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
2
Outline
  • Processor virtualization
  • QsNet
  • Opportunities
  • Performance Evaluation of QsNet
  • Challenges of QsNet
  • Summary

3
Processor Virtualization
  • Basic idea of processor virtualization
  • User specifies interaction between objects (VPs)
  • RTS maps VPs onto physical processors
  • Typically, # virtual processors > # processors
  • Embodied in Charm++ and AMPI
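
As a rough illustration of the mapping step, the sketch below places virtual processors (VPs) on physical processors round-robin. The function name `map_vps` and the round-robin policy are assumptions made for illustration only; the slides do not say which placement strategy the Charm++/AMPI runtime uses.

```python
def map_vps(num_vps, num_procs):
    """Return a dict: physical processor id -> list of VP ids (round-robin).

    Hypothetical helper for illustration; not the Charm++ runtime's policy.
    """
    if num_procs <= 0:
        raise ValueError("num_procs must be positive")
    placement = {p: [] for p in range(num_procs)}
    for vp in range(num_vps):
        # VP i goes to processor i mod P, so each processor hosts several VPs
        placement[vp % num_procs].append(vp)
    return placement


if __name__ == "__main__":
    # Virtualization ratio of 8: 32 VPs on 4 physical processors.
    for proc, vps in map_vps(32, 4).items():
        print(proc, vps)
```

With more VPs than processors, the runtime has several objects to choose from on each processor, which is what later slides rely on for overlapping communication with computation.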

4
QsNet
  • Popular interconnect from Quadrics
  • Several parallel systems in the Top500 use QsNet
  • Pittsburgh's Lemieux (6 TF)
  • ASCI-Q (20TF)
  • Elite network
  • Elan adaptor

5
Elite Network
  • 320 MB/s each way after protocol
  • Reliable fat-tree network
  • Multiple routes provide fault tolerance
  • Adaptive wormhole routing
  • 35 ns per hop

6
Elan Network Adaptor
  • Features
  • Low latency (4.5 µs for MPI)
  • High bandwidth (320 MB/s/node)
  • Components
  • Sparc processor
  • DMA Engine
  • 64 MB RAM
  • On chip cache
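
To put the latency and bandwidth figures in perspective, the sketch below uses a simple first-order cost model: transfer time = latency + size / bandwidth. The model itself, the function name `transfer_time_us`, and the convention 1 MB = 10^6 bytes are assumptions for illustration; the slides quote only the 4.5 µs and 320 MB/s figures.

```python
LATENCY_US = 4.5         # MPI latency quoted on the slide
BANDWIDTH_MB_S = 320.0   # per-node bandwidth quoted on the slide


def transfer_time_us(message_bytes, latency_us=LATENCY_US,
                     bandwidth_mb_s=BANDWIDTH_MB_S):
    """Estimated one-way transfer time in microseconds.

    Assumes 1 MB = 10**6 bytes, so 1 MB/s equals 1 byte per microsecond.
    """
    return latency_us + message_bytes / bandwidth_mb_s


if __name__ == "__main__":
    for size in (0, 900, 32_000, 1_000_000):
        print(f"{size:>9} bytes: {transfer_time_us(size):10.2f} us")
```

Under this model, small messages are dominated by the fixed latency and large ones by bandwidth, which is why the later slides treat latency and bandwidth as separate challenges.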

7
Low CPU Overhead
CPU Overhead is small and does not change much
with the message size
8
Traditional Message Passing
[Figure: timeline for processors P0 and P1 showing send overhead and receive overhead]
Traditional message passing does not exploit the low CPU overhead of Elan
9
Adaptive Overlap
[Figure: timeline for processor P0 with virtual processors VP0 and VP1, showing send overhead and receive overhead]
Processor virtualization takes full advantage of the low CPU overhead of Elan
10
Benefit of Adaptive Overlap
Problem setup: 3D stencil calculation of size 240^3 run on Lemieux.
Shows AMPI with virtualization ratios of 1 and 8.
11
Charm++ Message-Driven Execution
[Figure: message-driven execution; labels: Handler, Scheduler, Pump, Garbage Collection, Send]
12
NAMD: A Production MD System
  • Written in Charm++
  • Fully featured program
  • NIH-funded development
  • Distributed free of charge (5000 downloads so
    far)
  • Binaries and source code
  • Installed at NSF centers
  • Large published simulations (e.g., aquaporin
    simulation featured in keynote)

13
Scaling NAMD
  • Several QsNet challenges had to be overcome to
    scale NAMD

14
QsNet Challenge: Latency
Applications need to post receives for messages
of different sizes
15
Latency Bottlenecks
  • Latency
  • Slow NIC processor with a 100 MHz clock
  • Cache size only 8 KB
  • Traversing a large loop flushes it

Cache Misses vs Number of Receives Posted
Receives Posted  Cache Misses
1                86017
5                92475
9                103037
13               174060
17               1008003
16
Managing Latency: Message Combining
Organize processors in a 2D (virtual) Mesh
Message from (x1,y1) to (x2,y2) goes via (x1,y2)
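
The two-hop routing rule above can be sketched as follows. The function names `mesh_route` and `messages_per_processor` are hypothetical, and the message counts assume an all-to-all exchange on a full `rows x cols` virtual mesh in which each processor combines everything bound for the same intermediate into one message; the slides do not give these details.

```python
def mesh_route(src, dst):
    """Path of a message from src=(x1, y1) to dst=(x2, y2) via (x1, y2)."""
    (x1, _y1), (_x2, y2) = src, dst
    intermediate = (x1, y2)
    path = [src]
    if intermediate != src:
        path.append(intermediate)   # first hop: keep x, change y
    if dst != path[-1]:
        path.append(dst)            # second hop: keep y, change x
    return path


def messages_per_processor(rows, cols):
    """Messages each processor sends in an all-to-all exchange.

    Direct: one message to every other processor.
    Mesh:   one combined message per peer sharing its x coordinate,
            then one per peer sharing its y coordinate.
    """
    return {"direct": rows * cols - 1, "mesh": (rows - 1) + (cols - 1)}


if __name__ == "__main__":
    print(mesh_route((0, 0), (2, 3)))
    print(messages_per_processor(32, 32))
```

Each message travels at most two hops, but the number of sends per processor drops from roughly P to roughly 2*sqrt(P) on a square mesh, so the per-message cost is paid far fewer times.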
17
NAMD PME Performance
Performance of NAMD with the ATPase molecule. The PME
step in NAMD involves a 192 x 144 processor
collective operation with 900-byte messages
18
QsNet Challenge: Bandwidth
QsNet network bandwidth: 320 MB/s
Direction  MB/s
One Way    290
Two Way    128
PCI/DMA contention restricts bandwidth on Alpha servers
19
Improving Bandwidth
Node bandwidth (MB/s) for different placements of source and destination
         Main-Main  Elan-Main  Elan-Elan
One Way  290        305        319
Two Way  128        305        319
Sending messages from Elan memory is faster
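
The size of the gain can be read off the table with a little arithmetic. The sketch below copies the bandwidth figures from the slide; the dictionary layout and the helper name `speedup_over_main` are illustrative assumptions.

```python
# Node bandwidth in MB/s, copied from the slide's table.
BANDWIDTH_MB_S = {
    "Main-Main": {"one_way": 290, "two_way": 128},
    "Elan-Main": {"one_way": 305, "two_way": 305},
    "Elan-Elan": {"one_way": 319, "two_way": 319},
}


def speedup_over_main(placement, direction):
    """Bandwidth of a placement relative to Main-Main for one direction."""
    baseline = BANDWIDTH_MB_S["Main-Main"][direction]
    return BANDWIDTH_MB_S[placement][direction] / baseline


if __name__ == "__main__":
    for placement in ("Elan-Main", "Elan-Elan"):
        for direction in ("one_way", "two_way"):
            ratio = speedup_over_main(placement, direction)
            print(f"{placement:10} {direction:8} {ratio:.2f}x")
```

The one-way improvement is modest (about 1.1x for Elan-Elan), while two-way traffic improves by roughly 2.5x, consistent with PCI/DMA contention being the bottleneck for main-memory transfers.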
20
QsNet Challenge: Stretched Handlers
NAMD Timeline
  • Stretched Sends
  • Green superscripts
  • Similar stretches observed in the middle of entry
    methods

[Figure axes: Processors vs Time]
21
Stretching Solution
  • Stretched Sends
  • Elan Isend blocked when the rendezvous for the
    previous Isend to any destination had not been
    acknowledged
  • Solved the problem by working closely with
    Quadrics and obtaining a patch
  • Isend only blocks on the rendezvous of the
    previous message to the same destination
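
The difference between the two blocking rules can be seen in a toy model. The sketch below counts how many Isends in a sequence would block under each rule, under the worst-case assumption that no rendezvous is acknowledged while the sequence is issued. The function name `count_blocked_sends` and the policy labels are hypothetical; this does not model the Elan library itself.

```python
def count_blocked_sends(destinations, policy):
    """Count Isends that would block if no acknowledgment arrives in time.

    policy "any":  a send blocks while the previous send (to any
                   destination) is still unacknowledged (original behavior).
    policy "same": a send blocks only while the previous send to the
                   same destination is unacknowledged (patched behavior).
    """
    if policy not in ("any", "same"):
        raise ValueError("policy must be 'any' or 'same'")
    blocked = 0
    pending = set()   # destinations with an unacknowledged send
    for i, dest in enumerate(destinations):
        if policy == "any" and i > 0:
            blocked += 1
        elif policy == "same" and dest in pending:
            blocked += 1
        pending.add(dest)
    return blocked


if __name__ == "__main__":
    sends = [1, 2, 3, 1, 4]   # destination of each successive Isend
    print("any :", count_blocked_sends(sends, "any"))
    print("same:", count_blocked_sends(sends, "same"))
```

In this model, a processor that sends to many distinct destinations almost never blocks under the patched rule, whereas under the original rule every send after the first can stall.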

22
Stretching Solution Contd.
  • Stretches in the middle of entry methods
  • Caused by OS daemons
  • Using blocking receives minimized these stretches
  • Daemons can be scheduled when processor is idle

23
NAMD With Blocking Receives
[Figure: NAMD timeline with blocking receives; axes: Processors vs Time]
24
NAMD Performance on Lemieux
25
Summary
  • QsNet is an excellent network
  • NIC co-processor ideal for message driven
    execution
  • Programming guidelines
  • Send messages from Elan memory
  • Post a limited number of receives, and post them
    before the sends
  • Use blocking receives to avoid stretching

26
Future Work
  • One sided communication
  • Barrier?
  • Persistent one sided communication
  • Reserve buffers on destination

27
Fat Tree Topology
28
Elan3 Adapter
  • DMA Engine
  • Thread Processor
  • On chip shared cache
  • 64-bit 66 MHz PCI interface
  • 64 MB RAM

29
Object Based Communication Framework
[Figure: software layers, in the order listed on the slide]
  • Application
  • AMPI
  • Charm++
  • Comm. Framework Object Layer: performs object-level optimizations
  • Converse
  • Comm. Framework Processor Layer: optimizes inter-processor communication
  • Communication Layer
30
AAPC Processor Overhead
[Figure: completion time and CPU overhead for the Mesh and Direct strategies]
Lower CPU overhead enables applications using
Mesh to perform better even for large messages