1
High-Performance Networking (HPN) Group
The HPN Group was formerly known as the SAN
group.
  • Distributed Shared-Memory Parallel Computing with
    UPC on SAN-based Clusters
  • Appendix for Q3 Status Report
  • DOD Project MDA904-03-R-0507
  • February 5, 2004

2
Outline
  • Objectives and Motivations
  • Background
  • Related Research
  • Approach
  • Results
  • Conclusions and Future Plans

3
Objectives and Motivations
  • Objectives
  • Support advancements for HPC with Unified
    Parallel C (UPC) on cluster systems exploiting
    high-throughput, low-latency system-area networks
    (SANs) and LANs
  • Design and analysis of tools to support UPC on
    SAN-based systems
  • Benchmarking and case studies with key UPC
    applications
  • Analysis of tradeoffs in application, network,
    service and system design
  • Motivations
  • Increasing demand in sponsor and scientific
    computing community for shared-memory parallel
    computing with UPC
  • New and emerging technologies in system-area
    networking and cluster computing
  • Scalable Coherent Interface (SCI)
  • Myrinet (GM)
  • InfiniBand
  • QsNet (Quadrics Elan)
  • Gigabit Ethernet and 10 Gigabit Ethernet
  • PCI Express (3GIO)
  • Clusters offer excellent cost-performance
    potential

4
Background
  • Key sponsor applications and developments toward
    shared-memory parallel computing with UPC
  • More details from sponsor are requested
  • UPC extends the C language to exploit parallelism
    (a minimal example appears at the end of this
    slide)
  • Currently runs best on shared-memory
    multiprocessors (notably HP/Compaq's UPC
    compiler)
  • First-generation UPC runtime systems becoming
    available for clusters (MuPC, Berkeley UPC)
  • Significant potential advantage in
    cost-performance ratio with COTS-based cluster
    configurations
  • Leverage economy of scale
  • Clusters exhibit low cost relative to
    tightly-coupled SMP, CC-NUMA, and MPP systems
  • Scalable performance with commercial
    off-the-shelf (COTS) technologies
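
As a minimal illustration of how UPC extends C (a generic
sketch, not one of the sponsor applications), the example below
declares an array shared across all threads and uses upc_forall
so each thread updates the elements it owns; shared, upc_forall,
MYTHREAD, THREADS, and upc_barrier are the UPC additions to C.

    #include <upc.h>     /* UPC declarations: upc_forall, upc_barrier, ... */
    #include <stdio.h>

    #define N 1024

    shared int a[N];     /* one logical array, spread cyclically over threads */

    int main(void) {
        int i;
        /* Each thread runs only the iterations whose affinity expression
           &a[i] refers to an element it owns, so work follows data. */
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] = i * i;
        upc_barrier;     /* wait until every thread has finished its updates */
        if (MYTHREAD == 0)
            printf("a[%d] = %d, computed by %d threads\n",
                   N - 1, a[N - 1], THREADS);
        return 0;
    }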

5
Related Research
  • University of California at Berkeley
  • UPC runtime system
  • UPC to C translator
  • Global-Address Space Networking (GASNet) design
    and development
  • Application benchmarks
  • George Washington University
  • UPC specification
  • UPC documentation
  • UPC testing strategies, testing suites
  • UPC benchmarking
  • UPC collective communications
  • Parallel I/O
  • Michigan Tech University
  • Michigan Tech UPC (MuPC) design and development
  • UPC collective communications
  • Memory model research
  • Programmability studies
  • Test suite development
  • Ohio State University
  • UPC benchmarking
  • HP/Compaq
  • UPC compiler
  • Intrepid
  • GCC UPC compiler

6
Approach
  • Collaboration
  • HP/Compaq UPC Compiler V2.1 running in the lab on
    the new ES80 AlphaServer (Marvel)
  • Support of testing by OSU, MTU, UCB/LBNL, UF, et
    al. with leading UPC tools and systems for
    functional and performance evaluation
  • Field test of newest compiler and system
  • Benchmarking
  • Use and design of applications in UPC to grasp
    key concepts and understand performance issues
  • Exploiting SAN Strengths for UPC
  • Design and develop new SCI Conduit for GASNet in
    collaboration with UCB/LBNL
  • Evaluate DSM for SCI as an option for executing
    UPC
  • Performance Analysis
  • Network communication experiments
  • UPC computing experiments
  • Emphasis on SAN Options and Tradeoffs
  • SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE,
    etc.

7
GASNet - Experimental Setup and Analysis
  • Experimental Results
  • Throughput
  • Elan shows best performance with approx. 300MB/s
    in both put and get operations
  • Myrinet and SCI very close with 200MB/s on put
    operations
  • Myrinet obtains nearly the same performance with
    get operations
  • SCI suffers from the reference extended API in
    get operations (approx. 7 MB/s) due to greatly
    increased latency
  • get operations will benefit the most from an
    extended API implementation
  • Currently being addressed in UF's design of the
    extended API for SCI
  • MPI suffers from high latency but still performs
    well on GigE with almost 50 MB/s
  • Latency
  • Elan again performs best: put/get (8 µs)
  • Myrinet put (20 µs), get (33 µs)
  • SCI put and get (both 25 µs) better than
    Myrinet get for small messages
  • Larger messages suffer from the AM RPC protocol
  • MPI latency too high to show (250 µs)
  • Elan is the best performer in low-level API tests
  • Testbed
  • Elan, MPI and SCI conduits
  • Dual 2.4 GHz Intel Xeon, 1GB DDR PC2100 (DDR266)
    RAM, Intel SE7501BR2 server motherboard with
    E7501 chipset
  • Specs: 667 MB/s (300 MB/s sustained) Dolphin SCI
    D337 (2D/3D) NICs, using PCI 64/66, in a 4x2 torus
  • Specs: 528 MB/s (340 MB/s sustained) Elan3, using
    PCI-X in two nodes with a QM-S16 16-port switch
  • RedHat 9.0 with gcc compiler V3.3.2
  • GM (Myrinet) conduit (c/o access to cluster at
    MTU)
  • Dual 2.0 GHz Intel Xeon, 2GB DDR PC2100 (DDR266)
    RAM
  • Specs: 250 MB/s Myrinet 2000, using PCI-X, on
    8 nodes connected with a 16-port M3F-SW16 switch
  • RedHat 7.3 with Intel C compiler V7.1
  • Experimental Setup
  • Elan, GM conduits executed with extended API
    implemented
  • SCI, MPI executed with the reference API (based
    on AM in core API)
  • GASNet Conduit experiments
  • Berkeley GASNet Test suite
  • Average of 1000 iterations
  • Each uses bulk transfers to take advantage of the
    implemented extended APIs
  • Latency results use testsmall

Testbed made available by Michigan Tech
8
GASNet Throughput on Conduits
For get operations, the requesting node must wait
for an RPC to execute on the remote node before the
data can be pushed back, as sketched below
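
The sketch below shows that control flow in plain C. The helpers
am_request, am_reply, and poll_until are hypothetical stand-ins
for the core-API active-message calls (the real GASNet functions
have different names and argument conventions); the point is only
that a reference-API get costs a full request/reply round trip
before any data lands locally.

    #include <string.h>   /* memcpy */
    #include <stddef.h>   /* size_t */

    /* Hypothetical active-message primitives standing in for the GASNet
       core API; real GASNet AM calls have different names/signatures. */
    void am_request(int node, void (*handler)(int, void *, size_t, void *),
                    void *remote_src, size_t nbytes, void *dest);
    void am_reply(int node, void (*handler)(void *, size_t, void *),
                  void *payload, size_t nbytes, void *dest);
    void poll_until(volatile int *flag);

    static volatile int get_done;              /* set by the reply handler */

    /* Runs back on the requesting node once the payload arrives. */
    void get_reply_handler(void *payload, size_t nbytes, void *dest) {
        memcpy(dest, payload, nbytes);         /* data finally lands locally */
        get_done = 1;
    }

    /* Runs on the node that owns the data: push it back to the requester. */
    void get_request_handler(int requester, void *remote_src,
                             size_t nbytes, void *dest) {
        am_reply(requester, get_reply_handler, remote_src, nbytes, dest);
    }

    /* Reference-API get: one full round trip per call. */
    void reference_get(void *dest, int node, void *remote_src, size_t nbytes) {
        get_done = 0;
        am_request(node, get_request_handler, remote_src, nbytes, dest);
        poll_until(&get_done);                 /* wait for the reply RPC */
    }

A conduit-specific extended API can avoid this round trip by
using the network's native remote-memory access, which is why
get operations stand to gain the most from the SCI extended-API
work.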
9
GASNet Latency on Conduits
Despite not yet having a conduit-specific extended
API, which would allow better hardware exploitation,
the SCI conduit still manages to keep pace with the
GM conduit for throughput and most small-message
latencies. The Q1 report identifies 10 µs latency as
a possible target.
SCI results based on generic GASNet version of
extended API, which limits performance.
10
UPC Benchmarks: IS from the NAS Benchmarks
  • Class A executed with Berkeley UPC runtime system
    V1.1, with gcc V3.3.2 for Elan and MPI; Intel
    V7.1 for GM
  • IS (Integer Sort): lots of fine-grain
    communication, low computation (access pattern
    sketched at the end of this slide)
  • Communication layer should have greatest effect
    on performance
  • Single thread shows performance without use of
    communication layer
  • Poor performance in the GASNet communication
    system does NOT necessarily indicate poor
    performance in the UPC application
  • MPI results poor for GASNet but decent for UPC
    applications
  • Application may need to be larger to confirm this
    assertion
  • GM conduit shows greatest gain in parallelization
    (could be partly due to better compiler)

Only two nodes were available with Elan, so
scalability cannot be determined at this point
TCP/IP overhead outweighs the benefit of
parallelization
Code developed at GWU
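
A hedged sketch (generic code, not the GWU benchmark) of the
kind of access pattern that makes IS communication-bound in UPC:
each iteration reads one element of a shared array that usually
lives on another thread, so the run is dominated by many small
remote gets with almost no computation per access.

    #include <upc.h>
    #include <stdio.h>

    #define NKEYS 65536

    shared int key[NKEYS];          /* keys spread cyclically over all threads */

    int main(void) {
        long local_sum = 0;
        int i;
        upc_forall (i = 0; i < NKEYS; i++; &key[i])
            key[i] = i & 1023;      /* each thread fills the elements it owns */
        upc_barrier;
        /* Scattered reads: most key[...] references are remote, so runtime is
           dominated by fine-grained communication, not by the additions. */
        for (i = MYTHREAD; i < NKEYS; i += THREADS)
            local_sum += key[(i * 7919) % NKEYS];
        upc_barrier;
        printf("thread %d partial sum %ld\n", MYTHREAD, local_sum);
        return 0;
    }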
11
Network Performance Tests
  • Detailed understanding of high-performance
    cluster interconnects
  • Identifies suitable networks for UPC over
    clusters
  • Aids in smooth integration of interconnects with
    upper-layer UPC components
  • Enables optimization of network communication,
    both unicast and collective
  • Various levels of network performance analysis
  • Low-level tests
  • InfiniBand based on Virtual Interface Provider
    Library (VIPL)
  • SCI based on Dolphin SISCI and SCALI SCI
  • Myrinet based on Myricom GM
  • QsNet based on Quadrics Elan Communication
    Library
  • Host architecture issues (e.g. CPU, I/O, etc.)
  • Mid-level tests
  • Sockets (a ping-pong timing sketch follows this
    list)
  • Dolphin SCI Sockets on SCI
  • BSD Sockets on Gigabit and 10 Gigabit Ethernet
  • GM Sockets on Myrinet
  • SOVIA on InfiniBand
  • MPI
  • InfiniBand and Myrinet based on MPI/PRO
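
The mid-level socket tests boil down to a ping-pong loop over an
established connection. The sketch below is a generic example,
not the actual test code: it assumes a TCP socket fd that is
already connected (setup omitted) and a peer running the
mirror-image recv-then-send loop, and it reports half the
average round-trip time as the one-way latency.

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    #define ITERS 1000

    /* One-way latency estimate (microseconds) over an already-connected
       socket 'fd'; 'buf' must hold at least 'msg_size' bytes. */
    double pingpong_latency_us(int fd, char *buf, size_t msg_size) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++) {
            send(fd, buf, msg_size, 0);              /* ping */
            size_t got = 0;
            while (got < msg_size) {                 /* pong may arrive in pieces */
                ssize_t n = recv(fd, buf + got, msg_size - got, 0);
                if (n <= 0)
                    return -1.0;                     /* connection error */
                got += (size_t)n;
            }
        }
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        return us / ITERS / 2.0;                     /* half of a round trip */
    }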

12
Network Performance Tests
  • Tests run on two Elan3 cards connected by QM-S16
    16-port switch
  • Quadrics dping used for raw tests
  • GASNet testsmall used for latency, testlarge for
    throughput
  • Utilizes extended API
  • Results obtained from put operations (a put-timing
    sketch follows this list)
  • Elan conduit for GASNet more than doubles
    hardware latency, but still maintains sub-10 µs
    for small messages
  • Conduit throughput matches hardware
  • Elan conduit does not add appreciably to
    performance overhead
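
The conduit-side measurement is essentially a timed loop of
blocking one-sided puts, in the spirit of Berkeley's
testsmall/testlarge. A minimal sketch follows, assuming GASNet
has already been initialized and attached elsewhere, that
remote_dst points into the peer's registered segment, and that
the GASNet-1 blocking put gasnet_put(node, dest, src, nbytes) is
available; the real test programs add warm-up, message-size
sweeps, and barrier synchronization.

    #include <gasnet.h>
    #include <stddef.h>
    #include <sys/time.h>

    #define ITERS 1000

    /* Average latency (us) of a blocking put of 'nbytes' to node 'peer'.
       Assumes gasnet_init()/gasnet_attach() were called earlier and that
       'remote_dst' lies inside the peer's attached segment. */
    double put_latency_us(gasnet_node_t peer, void *remote_dst,
                          void *local_src, size_t nbytes) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++)
            gasnet_put(peer, remote_dst, local_src, nbytes);
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        return us / ITERS;
    }

For throughput, the same loop over large messages gives bytes
transferred divided by elapsed time.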

13
Low Level vs. GASNet Conduit
  • Tests run on two Myrinet 2000 cards connected by
    M3F-SW16 switch
  • Myricom gm_allsize used for raw tests
  • GASNet testsmall used for latency, testlarge for
    throughput
  • Utilizes extended API
  • Results obtained from puts
  • GM conduit almost doubles the hardware latency,
    with latencies of 19 µs for small messages
  • Conduit throughput follows the trend of the
    hardware but differs by an average of 60 MB/s for
    messages larger than 1024 bytes
  • Conduit peaks at 204MB/s compared to 238MB/s for
    hardware
  • GM conduit adds a small amount to performance
    overhead

14
Architectural Performance Tests
  • Opteron
  • Features
  • 64-bit processor
  • Real-time support of 32-bit OS
  • On-chip memory controllers
  • Eliminates 4 GB memory barrier imposed by 32-bit
    systems
  • 19.2 GB/s I/O bandwidth per processor
  • Future plans
  • UPC and other parallel application benchmarks
  • Pentium 4 Xeon
  • Features
  • 32-bit processor
  • Hyper-Threading technology
  • Increased CPU utilization
  • Intel NetBurst microarchitecture
  • RISC processor core
  • 4.3 GB/s I/O bandwidth
  • Future Plans
  • UPC and other parallel application benchmarks

15
CPU Performance Results
  • NAS Benchmarks
  • Computationally intensive
  • EP, FT, and IS
  • Class A problem set size
  • Opteron and Xeon comparable for floating-point
    operations (FT)
  • For integer operations, Opteron performs better
    than Xeon (EP, IS)
  • DOD Seminumeric Benchmark 2
  • Radix sort
  • Measures setup time, sort time, and time to
    verify the sort (phase structure sketched at the
    end of this slide)
  • Sorting is the dominant component of execution
    time
  • Results Analysis
  • Opteron architecture outperforms the Xeon in all
    tests performed, for all iterations
  • Setup and verify times are roughly half those of
    the Xeon architecture
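
A hedged sketch (not the DOD benchmark code) of the three timed
phases described above, using a base-256 LSD radix sort on
32-bit keys: set up the keys, sort them, then verify the
ordering.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>
    #include <sys/time.h>

    #define N (1 << 20)                      /* number of 32-bit keys */

    static double seconds(void) {            /* wall-clock time in seconds */
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    /* Stable LSD radix sort, one byte (base 256) per pass. */
    static void radix_sort(uint32_t *a, uint32_t *tmp, size_t n) {
        for (int shift = 0; shift < 32; shift += 8) {
            size_t count[257] = {0};
            for (size_t i = 0; i < n; i++)           /* histogram this byte */
                count[((a[i] >> shift) & 0xFF) + 1]++;
            for (int b = 0; b < 256; b++)            /* prefix sums -> offsets */
                count[b + 1] += count[b];
            for (size_t i = 0; i < n; i++)           /* stable scatter */
                tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
            memcpy(a, tmp, n * sizeof *a);
        }
    }

    int main(void) {
        uint32_t *a = malloc(N * sizeof *a), *tmp = malloc(N * sizeof *tmp);

        double t0 = seconds();
        for (size_t i = 0; i < N; i++)               /* set up: random keys */
            a[i] = (uint32_t)rand();
        double t1 = seconds();
        radix_sort(a, tmp, N);                       /* sort: dominant phase */
        double t2 = seconds();
        int ok = 1;                                  /* verify: nondecreasing? */
        for (size_t i = 1; i < N; i++)
            if (a[i - 1] > a[i]) { ok = 0; break; }
        double t3 = seconds();

        printf("setup %.3f s  sort %.3f s  verify %.3f s  (%s)\n",
               t1 - t0, t2 - t1, t3 - t2, ok ? "sorted" : "NOT sorted");
        free(a); free(tmp);
        return 0;
    }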

16
Memory Performance Results
  • Lmbench-3.0-a3 (a simplified pointer-chasing
    sketch of this kind of measurement follows this
    list)
  • Opteron latency/throughput worsen as expected at
    size 64KB (L1 cache size) and 1MB (L2 cache size)
  • Xeon latency/throughput show the same trend for
    L1 (8KB) but start earlier for L2 (256KB instead
    of 512KB)
  • Cause under investigation
  • Between CPU / L1 / L2, Opteron outperforms Xeon,
    but Xeon outperforms Opteron when loading data
    from disk into main memory
  • Write throughput for Xeon stays relatively
    constant for sizes < L2 cache size, suggesting a
    write-through policy between L1 and L2
  • Xeon read > Opteron write > Opteron read > Xeon
    write
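
The cache-size knees above come from latency curves like the one
a pointer-chasing loop produces. The sketch below is a rough
stand-in for lmbench's lat_mem_rd, not its actual code: it links
a working set of the requested size into one random cycle and
times dependent loads through it, so the average time per load
steps up each time the working set outgrows a cache level.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define ACCESSES 10000000UL

    static double seconds(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    /* Average load-to-use latency (ns) for a working set of 'bytes' bytes.
       Each load depends on the previous one, so latency cannot be hidden. */
    static double chase_latency_ns(size_t bytes) {
        size_t n = bytes / sizeof(void *), i, j;
        void **ring = malloc(n * sizeof(void *));
        size_t *order = malloc(n * sizeof(size_t));

        for (i = 0; i < n; i++) order[i] = i;
        for (i = n - 1; i > 0; i--) {        /* shuffle the visit order */
            j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (i = 0; i < n; i++)              /* one cycle through every slot */
            ring[order[i]] = &ring[order[(i + 1) % n]];

        void **p = &ring[order[0]];
        double t0 = seconds();
        for (unsigned long k = 0; k < ACCESSES; k++)
            p = (void **)*p;                 /* dependent, mostly-random loads */
        double t1 = seconds();

        volatile void *sink = p;             /* keep the chain from being optimized away */
        (void)sink;
        free(ring); free(order);
        return (t1 - t0) * 1e9 / ACCESSES;
    }

    int main(void) {
        for (size_t kb = 8; kb <= 8192; kb *= 2)
            printf("%6zu KB : %5.1f ns per load\n", kb, chase_latency_ns(kb * 1024));
        return 0;
    }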

17
File I/O Results
  • Bonnie
  • 10 iterations of writing and reading a 2GB file
    using per character functions and efficient block
    functions
  • stdio overhead is large for per-character
    functions (compared in the sketch at the end of
    this slide)
  • Efficient block reads and writes greatly reduce
    the CPU utilization
  • Throughput results were directly proportional to
    CPU utilization
  • Shows the same trend as observed in the memory
    performance testing
  • Xeon read > Opteron write > Opteron read > Xeon
    write
  • Suggesting memory access and I/O access might use
    the same mechanism
  • AIM 9
  • 10 iterations using 5MB files testing sequential
    and random reads, writes, and copies
  • Opteron consistently outperforms Xeon by a wide
    margin
  • Large increase in performance for disk reads as
    compared to writes
  • Xeon read speeds are very high for all results,
    with much lower write performance
  • Opteron read speeds are also very high, and its
    write performance greatly outperforms the Xeon's
    in all cases
  • Xeon sequential read is actually worse than
    Opteron, but still comparable
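
A hedged sketch (Bonnie-like in spirit, not Bonnie's code) of
the two write styles compared above: putc issues one library
call per byte, while fwrite pushes 64 KB blocks, which is why
the per-character path is dominated by stdio overhead and CPU
utilization. The file size here is deliberately smaller than the
2 GB used in the actual runs.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define FILE_BYTES (64UL * 1024 * 1024)   /* 64 MB; the real tests use 2 GB */
    #define BLOCK      (64 * 1024)

    static double seconds(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void) {
        char *block = calloc(1, BLOCK);
        unsigned long i;

        FILE *f = fopen("perchar.dat", "wb");
        if (!f) return 1;
        double t0 = seconds();
        for (i = 0; i < FILE_BYTES; i++)        /* one stdio call per byte */
            putc((int)(i & 0xFF), f);
        fclose(f);
        double t1 = seconds();

        f = fopen("block.dat", "wb");
        if (!f) return 1;
        for (i = 0; i < FILE_BYTES; i += BLOCK) /* one call per 64 KB block */
            fwrite(block, 1, BLOCK, f);
        fclose(f);
        double t2 = seconds();

        printf("per-char write: %.1f MB/s   block write: %.1f MB/s\n",
               FILE_BYTES / 1e6 / (t1 - t0), FILE_BYTES / 1e6 / (t2 - t1));
        free(block);
        return 0;
    }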

18
Conclusions and Future Plans
  • Accomplishments to date
  • Baselining of UPC on shared-memory
    multiprocessors
  • Evaluation of promising tools for UPC on clusters
  • Leverage and extend communication and UPC layers
  • Conceptual design of new tools
  • Preliminary network and system performance
    analyses
  • Completed V1.0 of the GASNet Core API SCI conduit
    for UPC
  • Key insights
  • Inefficient communication system does not
    necessarily translate to poor UPC application
    performance
  • Xeon cluster suitable for applications with a
    high read/write ratio
  • Opteron cluster suitable for generic applications
    due to comparable read/write capability
  • Future Plans
  • Comprehensive performance analysis of new SANs
    and SAN-based clusters
  • Evaluation of UPC methods and tools on various
    architectures and systems
  • UPC benchmarking on cluster architectures,
    networks, and conduits
  • Continuing effort in stabilizing/optimizing
    GASNet SCI Conduit
  • Cost/Performance analysis for all options