The IBM Blue Gene/L System Architecture

1
The IBM Blue Gene/L System Architecture
  • Presented by Sabri KANTAR

2
What is Blue Gene/L?
  • Blue Gene is an IBM Research project dedicated to
    exploring the frontiers in supercomputing.
  • In November 2004, the IBM Blue Gene computer
    became the fastest supercomputer in the world.
  • The system is designed to scale to 65,536
    dual-processor nodes, with a peak performance of
    360 teraFLOPS.
  • Example usage:
    • hydrodynamics
    • quantum chemistry
    • molecular dynamics
    • climate modeling
    • financial modeling

3
A High-Level View of the BG/L Architecture
  • Within node:
    • Low-latency, high-bandwidth memory system.
    • Strong floating-point performance: 4
      floating-point operations/cycle.
  • Across nodes:
    • Low-latency, high-bandwidth networks.
  • Many nodes:
    • Low power/node.
    • Low cost/node.
    • RAS (reliability, availability and
      serviceability).
  • Familiar SW API:
    • C, C++, Fortran, MPI, POSIX subset, ...

4
Main Design Principles for Blue Gene/L
  • Some science and engineering applications scale
    up to and beyond 10,000 parallel processes.
  • Improve computing capability, holding total
    system cost.
  • Reduce cost/FLOP.
  • Reduce complexity and size.
  • 25 kW/rack is the maximum for air cooling in a
    standard room.
  • Need to improve the performance/power ratio.
  • The 700 MHz PowerPC 440 for ASICs has excellent
    FLOPS/watt.
  • Maximize integration:
    • On chip: ASIC with everything except main
      memory.
    • Off chip: maximize the number of nodes in a
      rack.
  • Large systems require excellent reliability,
    availability, serviceability (RAS)

5
Main Design Principles (contd)
  • Make cost/performance trade-offs considering the
    end use.
  • Applications <-> Architecture <-> Packaging
  • Examples:
    • 1 or 2 differential signals per torus link,
      i.e., 1.4 or 2.8 Gb/s.
    • Maximum of 3 or 4 neighbors on the collective
      network, i.e., the depth of the network and
      thus the global latency.
  • Maximize the overall system efficiency.
  • A small team designed all of Blue Gene/L.
  • Example: chose the ASIC die and chip pin-out to
    ease circuit card routing.
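
The two trade-offs on this slide can be put into numbers. The sketch below is illustrative only: it takes the 1.4 Gb/s per-signal rate from the slide, and it assumes that "3 or 4 neighbors" means one parent plus 2 or 3 children per node, which the slide does not state. The function names are hypothetical.

```python
SIGNAL_RATE_GBPS = 1.4  # per differential signal, as quoted on the slide


def link_bandwidth_gbps(signals: int) -> float:
    """Bandwidth of one torus link built from `signals` differential signals."""
    return signals * SIGNAL_RATE_GBPS


def min_tree_depth(nodes: int, max_neighbors: int) -> int:
    """Smallest depth of a tree that can hold `nodes` nodes.

    Assumption: each node has one parent and (max_neighbors - 1) children.
    """
    fan_out = max_neighbors - 1
    depth, capacity, level_size = 0, 1, 1
    while capacity < nodes:
        level_size *= fan_out   # nodes on the next level down
        capacity += level_size  # total nodes the tree can now hold
        depth += 1
    return depth


if __name__ == "__main__":
    for signals in (1, 2):
        print(signals, "signal(s):", link_bandwidth_gbps(signals), "Gb/s")
    for neighbors in (3, 4):
        print(neighbors, "neighbors: depth", min_tree_depth(65_536, neighbors))
```

Under these assumptions, allowing a fourth neighbor shortens the tree noticeably for a 65,536-node machine, which is the latency side of the trade-off; the cost side is the extra link per node.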

6
Reducing Cost and Complexity
  • Cables are bigger, costlier and less reliable
    than traces.
  • So want to minimize the number of cables.
  • So 3-dimensional torus is chosen as main BG/L
    network, with each node connected to 6 neighbors.
  • Maximize number of nodes connected via circuit
    card(s) only.
  • BG/L midplane has 8 × 8 × 8 = 512 nodes.
  • (Number of cable connections) / (all connections)
    = (6 faces × 8 × 8 nodes) /
      (6 neighbors × 8 × 8 × 8 nodes)
    = 1 / 8
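
The ratio above generalizes to any cubic midplane of side n: the fraction of connections that must leave the cube over cables is 1/n. A minimal sketch, with a hypothetical function name, that reproduces the slide's arithmetic:

```python
from fractions import Fraction


def cable_fraction(side: int, neighbors: int = 6) -> Fraction:
    """Fraction of a cubic midplane's connections that leave it over cables."""
    face_links = 6 * side * side        # one outgoing link per node on each of 6 faces
    all_links = neighbors * side ** 3   # every node has `neighbors` connections
    return Fraction(face_links, all_links)


if __name__ == "__main__":
    for side in (2, 4, 8):
        print(f"{side}x{side}x{side} midplane: {cable_fraction(side)} of links are cables")
```

Larger midplanes keep more of the torus on circuit cards, which is why the slide aims to maximize the number of nodes connected via cards only.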

7
Blue Gene/L Architecture
  • Up to 32 × 32 × 64 = 65,536 nodes (3D torus).
  • Max 360 teraFLOPS computation power.
  • Each processor can perform 4 floating point
    operations per cycle (in the form of two 64-bit
    floating point multiply-adds per cycle)
  • Five networks connect the nodes to each other and
    to the outside world.
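
The quoted peak figure can be checked from numbers given in the deck: the node count and operations per cycle on this slide, plus the 700 MHz clock and two processors per node from the node slides. This is a back-of-the-envelope sketch, not a benchmark, and the constant names are made up for illustration.

```python
NODES = 32 * 32 * 64        # full 3D torus
PROCESSORS_PER_NODE = 2
FLOPS_PER_CYCLE = 4         # two 64-bit multiply-adds per cycle
CLOCK_HZ = 700e6            # 700 MHz PowerPC 440


def peak_flops(nodes: int = NODES) -> float:
    """Theoretical peak, assuming both processors compute on every cycle."""
    return nodes * PROCESSORS_PER_NODE * FLOPS_PER_CYCLE * CLOCK_HZ


if __name__ == "__main__":
    print(f"{NODES} nodes, peak {peak_flops() / 1e12:.1f} teraFLOPS")
```

The product comes out at roughly 367 teraFLOPS, consistent with the rounded 360 teraFLOPS on the slide. It assumes both processors do floating-point work; if the second processor is used only for message passing, the usable peak is half of this.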

8
Node Architecture
  • Uses IBM PowerPC embedded CMOS processors,
    embedded DRAM, and system-on-a-chip techniques.
  • 11.1-mm square die size, allowing for a very high
    density of processing.
  • The ASIC uses IBM CMOS CU-11 0.13 micron
    technology.
  • 700 MHz processor speed, close to memory speed.
  • Two processors per node.
  • Second processor is intended primarily for
    handling message passing operations

9
The BG/L node ASIC includes
  • Two standard PowerPC 440 processing cores,
    each with a PowerPC 440 FP2 core (an enhanced
    "double" 64-bit floating-point unit)
  • The two cores are not L1 cache coherent.
  • Each core has a small 2 KB L2 cache
  • 4 MB L3 cache made from embedded DRAM
  • An integrated external DDR memory controller
  • A gigabit Ethernet adapter
  • A JTAG interface

10
[Figure: BlueGene/L node diagram]

11
Link ASIC
  • In addition to the compute ASIC, there is a
    link ASIC.
  • When crossing a midplane boundary, BG/L's torus,
    global combining tree, and global interrupt
    signals pass through the BG/L link ASIC.
  • It redrives signals over the cables between BG/L
    midplanes.
  • The link ASIC can redirect signals between its
    different ports.
  • This enables BG/L to be partitioned into multiple,
    logically separate systems with no traffic
    interference between them.

12
The PowerPC 440 FP2 core
  • It consists of a primary side and a secondary
    side
  • Each side has
  • its own 64-bit by 32 element register file
  • a double-precision computational datapath and
  • a double-precision storage access datapath
  • The primary side is capable of executing standard
    PowerPC floating-point instructions
  • An enhanced set of instructions includes those
    that are executed solely on the secondary side,
    and those that are simultaneously executed on
    both sides.
  • Enhanced set includes SIMD operations

13
The FP2 core (contd)
  • This enhanced set goes beyond the capabilities of
    traditional SIMD architectures.
  • A single instruction can initiate a different but
    related operation on different data.
  • Single Instruction Multiple Operation Multiple
    Data (SIMOMD).
  • Either side can access data from the other
    side's register file.
  • This saves a lot of swapping when working purely
    on complex arithmetic operations.
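
To see why cross-register access and "different but related" paired operations help with complex arithmetic, here is a small software model. It is only a sketch: each value pair stands for (primary, secondary) registers holding (real, imaginary) parts, and the two step functions are hypothetical stand-ins, not actual FP2 instruction names.

```python
from typing import Tuple

Pair = Tuple[float, float]  # (primary side, secondary side) = (real, imaginary)


def paired_multiply(a: Pair, b: Pair) -> Pair:
    """Step 1: both sides multiply their own b value by a's primary (real) value."""
    return (a[0] * b[0], a[0] * b[1])


def cross_multiply_add(acc: Pair, a: Pair, b: Pair) -> Pair:
    """Step 2: each side reads b from the OTHER side's register file.

    The primary side subtracts while the secondary side adds: one
    instruction, two different but related operations.
    """
    return (acc[0] - a[1] * b[1], acc[1] + a[1] * b[0])


def complex_multiply(a: Pair, b: Pair) -> Pair:
    """(a_re + i a_im) * (b_re + i b_im) in two paired steps, with no swaps."""
    return cross_multiply_add(paired_multiply(a, b), a, b)


if __name__ == "__main__":
    print(complex_multiply((1.0, 2.0), (3.0, 4.0)))
```

Without the cross access in step 2, the real and imaginary parts of `b` would have to be swapped between register files before the second multiply, which is the swapping the slide refers to.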

14
Memory System
  • It is designed for high bandwidth, low latency
    memory and cache accesses.
  • An L2 hit returns in 6 to 10 processor cycles
  • An L3 hit in about 25 cycles
  • An L3 miss in about 75 cycles
  • The system has a 16-byte interface to nine 256 Mb
    SDRAM-DDR devices, operating at one half or one
    third of the processor speed.
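
The latencies above can be combined into an expected access time once hit rates are known. The sketch below assumes the quoted figures are total latencies for each outcome, uses 8 cycles as the midpoint of the 6 to 10 cycle L2 range, and takes hit rates as inputs because the slides do not give any. It covers only accesses that reach L2.

```python
def average_latency_cycles(
    l2_hit_rate: float,
    l3_hit_rate: float,
    l2_cycles: float = 8.0,     # midpoint of the 6-10 cycle range (assumption)
    l3_cycles: float = 25.0,    # L3 hit, from the slide
    miss_cycles: float = 75.0,  # L3 miss, from the slide
) -> float:
    """Expected latency of an access that reaches L2.

    l3_hit_rate is the hit rate among accesses that missed L2.
    """
    beyond_l2 = l3_hit_rate * l3_cycles + (1.0 - l3_hit_rate) * miss_cycles
    return l2_hit_rate * l2_cycles + (1.0 - l2_hit_rate) * beyond_l2


if __name__ == "__main__":
    # Hypothetical hit rates, chosen only to show the shape of the curve.
    for l2, l3 in ((0.9, 0.9), (0.5, 0.9), (0.5, 0.5)):
        print(f"L2 {l2:.0%}, L3 {l3:.0%}: {average_latency_cycles(l2, l3):.1f} cycles")
```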

15
3D Torus Network
  • It is used for general-purpose, point-to-point
    message passing and multicast operations to a
    selected class of nodes.
  • The topology is a three-dimensional torus
    constructed with point-to-point, serial links
    between routers embedded within the BlueGene/L
    ASICs.
  • Each ASIC has six nearest-neighbor connections
  • Virtual cut-through routing with multipacket
    buffering on collision
  • Minimal, Adaptive, Deadlock Free

16
Torus Network (contd)
  • Class Routing Capability (Deadlock-free Hardware
    Multicast)
  • Packets can be deposited along route to specified
    destination.
  • Allows for efficient one-to-many communication in
    some instances.
  • Active messages allow for fast transposes, as
    required in FFTs.
  • Independent on-chip network interfaces enable
    concurrent access.

17
Other Networks
  • A global combining/broadcast tree for collective
    operations
  • A Gigabit Ethernet network for connection to
    other systems, such as hosts and file systems.
  • A global barrier and interrupt network
  • And another Gigabit Ethernet to JTAG network for
    machine control

18
Collective Network
  • It has tree structure
  • One-to-all broadcast functionality
  • Reduction operations functionality
  • 2.8 Gb/s of bandwidth per link
  • Latency of tree traversal: 2.5 µs
  • 23 TB/s total binary tree bandwidth (64k machine)
  • Interconnects all compute and I/O nodes (1024)

19
Gb Ethernet Disk/Host I/O Network
  • I/O nodes are leaves on the collective network.
  • Compute and I/O nodes use the same ASIC, but:
    • An I/O node has Ethernet, not torus. This
      separates I/O from the application.
    • A compute node has torus, not Ethernet, so
      there is no need for 65,536 cables.
  • Configurable ratio of I/O to compute nodes:
    1:8, 1:16, 1:32, 1:64, 1:128.
  • The application runs on compute nodes, not I/O
    nodes.
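
The configurable ratio translates directly into an I/O node count, and therefore into the number of Ethernet connections the machine needs. A small sketch for the 65,536-node configuration, with hypothetical names:

```python
COMPUTE_NODES = 65_536
COMPUTE_PER_IO = (8, 16, 32, 64, 128)  # the configurable ratios from the slide


def io_nodes_needed(compute_nodes: int, compute_per_io: int) -> int:
    """I/O nodes required, rounding up so every compute node is served."""
    return -(-compute_nodes // compute_per_io)  # ceiling division


if __name__ == "__main__":
    for ratio in COMPUTE_PER_IO:
        count = io_nodes_needed(COMPUTE_NODES, ratio)
        print(f"1:{ratio:<3} -> {count} I/O nodes")
```

Even the densest ratio needs an eighth as many Ethernet cables as one cable per compute node would.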

20
Fast Barrier/Interrupt Network
  • Four Independent Barrier or Interrupt Channels
  • Independently Configurable as "or" or "and"
  • Asynchronous Propagation
  • Halt operation quickly (current estimate is
    1.3 µs worst-case round trip)
  • 3/4 of this delay is time-of-flight.
  • Sticky bit operation
  • Allows global barriers with a single channel.
  • User Space Accessible
  • System selectable
  • It is partitioned along same boundaries as Tree,
    and Torus
  • Each user partition contains its own set of
    barrier/interrupt signals
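
Logically, a channel configured as "and" behaves like a barrier (it fires when every node has signaled) and one configured as "or" behaves like an interrupt (it fires when any node has signaled). The class below is a purely logical model under that reading; the class name is hypothetical, and "sticky" is modeled simply as a bit that stays set until reset, which is an assumption about the hardware's behavior.

```python
class BarrierChannel:
    """Logical model of one global barrier/interrupt channel."""

    def __init__(self, nodes: int, mode: str) -> None:
        if mode not in ("and", "or"):
            raise ValueError("mode must be 'and' or 'or'")
        self.mode = mode
        self.bits = [False] * nodes

    def raise_signal(self, node: int) -> None:
        """Set this node's bit; it stays set (sticky) until reset()."""
        self.bits[node] = True

    def output(self) -> bool:
        """Global result seen by every node in the partition."""
        return all(self.bits) if self.mode == "and" else any(self.bits)

    def reset(self) -> None:
        self.bits = [False] * len(self.bits)


if __name__ == "__main__":
    barrier = BarrierChannel(nodes=4, mode="and")
    for node in range(4):
        barrier.raise_signal(node)
        print(f"node {node} arrived, barrier complete: {barrier.output()}")
```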

21
Control Network
  • JTAG interface to 100Mb Ethernet
  • direct access to all nodes.
  • boot, system debug availability.
  • runtime noninvasive RAS support.
  • non-invasive access to performance counters
  • direct access to shared SRAM in every node
  • Control, configuration and monitoring
  • Make all active devices accessible through JTAG,
    I2C, or another simple bus. (Only clock buffers
    and DRAM are not accessible.)

22
Packaging
  • 2 nodes per compute card.
  • 16 compute cards per node board.
  • 16 node boards per 512-node midplane.
  • Two midplanes in a 1024-node rack.
  • For compiling, diagnostics, and analysis, a host
    computer is required.
  • An I/O node handles communication between a
    compute node and other systems, including the
    host and file servers.
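
The packaging hierarchy multiplies out to the node counts quoted on this slide and to the rack count of the full machine. A short sketch of that arithmetic, with constant names chosen for illustration:

```python
NODES_PER_COMPUTE_CARD = 2
COMPUTE_CARDS_PER_NODE_BOARD = 16
NODE_BOARDS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2


def nodes_per_midplane() -> int:
    return (NODES_PER_COMPUTE_CARD
            * COMPUTE_CARDS_PER_NODE_BOARD
            * NODE_BOARDS_PER_MIDPLANE)


def nodes_per_rack() -> int:
    return nodes_per_midplane() * MIDPLANES_PER_RACK


def racks_for(nodes: int) -> int:
    """Racks needed to hold `nodes` compute nodes, rounding up."""
    return -(-nodes // nodes_per_rack())


if __name__ == "__main__":
    print("nodes per midplane:", nodes_per_midplane())
    print("nodes per rack:", nodes_per_rack())
    print("racks for 65,536 nodes:", racks_for(65_536))
```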

23
[Figure: BlueGene/L packaging]

24
Science Application
  • Study of protein folding and dynamics.
  • Aim is to obtain a microscopic view of the
    thermodynamics and kinetics of the folding
    process
  • Simulating longer and longer time-scales is the
    key challenge
  • Focus is on improving the speed of execution for
    a fixed size system by utilizing additional CPUs.
  • Understanding the logical limits to concurrency
    within the application is very important.
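
Speeding up a fixed-size problem by adding CPUs is strong scaling, and Amdahl's law is one standard way to express the "logical limit to concurrency" mentioned above. The slides do not name this model; it is brought in here as an illustration, and the parallel fractions used are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, cpus: int) -> float:
    """Speedup of a fixed-size problem when only part of it runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


if __name__ == "__main__":
    for fraction in (0.99, 0.999, 0.9999):
        for cpus in (1_024, 65_536):
            speedup = amdahl_speedup(fraction, cpus)
            print(f"parallel {fraction:.2%}, {cpus:>6} CPUs: {speedup:8.1f}x")
```

In this model the speedup can never exceed 1 / (serial fraction), however many CPUs are added, so a machine of this size only pays off for a fixed-size simulation whose serial portion is extremely small.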

25
Conclusion
  • The Blue Gene/L supercomputer is designed to
    improve cost/performance for a relatively broad
    class of applications with good scaling behavior.
  • This is achieved through parallelism and
    system-on-chip technology.
  • The functionality of a node is contained within
    a single ASIC chip.
  • BG/L has significantly lower cost in terms of
    power, space, and service, while performing no
    worse than competing systems.

26
The End
  • Questions?