Berkeley NOW Project - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Berkeley NOW Project


1
Berkeley NOW Project
  • David E. Culler
  • culler@cs.berkeley.edu
  • http://now.cs.berkeley.edu/
  • Sun Visit
  • May 1, 1998

2
Project Goals
  • Make a fundamental change in how we design and
    construct large-scale systems
  • market reality
  • 50%/year performance growth => cannot allow 1-2
    year engineering lag
  • technological opportunity
  • single-chip Killer Switch => fast, scalable
    communication
  • Highly integrated building-wide system
  • Explore novel system design concepts in this new
    cluster paradigm

3
Remember the Killer Micro
Linpack Peak Performance
  • Technology change in all markets
  • At many levels: Arch, Compiler, OS, Application

4
Another Technological Revolution
  • The Killer Switch
  • single chip building block for scalable networks
  • high bandwidth
  • low latency
  • very reliable
  • if it's not unplugged
  • => System Area Networks

5
One Example: Myrinet
  • 8 bidirectional ports of 160 MB/s each way
  • < 500 ns routing delay
  • Simple - just moves the bits
  • Detects connectivity and deadlock

Tomorrow: gigabit Ethernet?
6
Potential: Snap together large systems
  • incremental scalability
  • time / cost to market
  • independent failure => availability

[Chart: Node Performance in Large System vs. Engineering Lag Time]
7
Opportunity: Rethink O.S. Design
  • Remote memory and processor are closer to you
    than your own disks!
  • Networking Stacks ?
  • Virtual Memory ?
  • File system design ?

8
Example: Traditional File System
[Figure: clients, each with a local private file cache, reach a central
server over a fast channel (HiPPI); the server holds the global shared
file cache and RAID disk storage, and is the bottleneck]
  • Expensive
  • Complex
  • Non-Scalable
  • Single point of failure
  • Server resources at a premium
  • Client resources poorly utilized

9
Truly Distributed File System
[Figure: every node contributes a local cache (cluster caching) and its
disks (network RAID striping) over a scalable low-latency communication
network; G = Node Comm BW / Disk BW]
  • VM page to remote memory

10
Fast Communication Challenge
[Figure: Killer Platform and Killer Switch, with the ns / µs / ms time
scales of crossing between them]
  • Fast processors and fast networks
  • The time is spent in crossing between them

11
Opening Intelligent Network Interfaces
  • Dedicated Processing power and storage embedded
    in the Network Interface
  • An I/O card today
  • Tomorrow on chip?

[Figure: the Myricom NIC, with its own processor and memory, sits
between the Myrinet network (160 MB/s) and the S-Bus I/O bus (50 MB/s)
of a Sun Ultra 170]

12
Our Attack: Active Messages
[Figure: request and reply messages, each invoking a handler on arrival]
  • Request / Reply: small active messages (RPC)
  • Bulk-Transfer (store & get)
  • Highly optimized communication layer on a range
    of HW (toy sketch of the handler style below)
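The handler-on-arrival idea behind request/reply can be shown with a tiny single-process toy; the am_send / am_poll names, the in-memory queue, and the node numbers are illustrative assumptions, not the Berkeley AM API. A request names a handler that runs when the message arrives and immediately issues the reply, so no server thread has to be scheduled.

    /* Toy, single-process simulation of the request/reply active-message
     * style.  Names and the queue "network" are illustrative only. */
    #include <stdio.h>

    typedef void (*handler_t)(int src_node, int arg);
    typedef struct { int dst; handler_t h; int arg; int src; } msg_t;

    static msg_t queue[16];
    static int   qlen = 0;

    /* "Send": enqueue a message naming the handler to run on arrival. */
    static void am_send(int src, int dst, handler_t h, int arg) {
        queue[qlen++] = (msg_t){ dst, h, arg, src };
    }

    /* Reply handler: runs back on the requesting node. */
    static void reply_handler(int src, int value) {
        printf("node 0: reply %d from node %d\n", value, src);
    }

    /* Request handler: runs "on" node 1, computes, replies immediately. */
    static void request_handler(int src, int arg) {
        am_send(1, src, reply_handler, arg * 2);
    }

    /* "Network": deliver queued messages by running their handlers; a
     * request's handler may enqueue a reply, delivered by the same loop. */
    static void am_poll(void) {
        for (int i = 0; i < qlen; i++)
            queue[i].h(queue[i].src, queue[i].arg);
    }

    int main(void) {
        am_send(0, 1, request_handler, 21);   /* small request to node 1 */
        am_poll();                            /* run request, then reply */
        return 0;
    }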

13
NOW System Architecture
[Figure: layered system architecture]
  • Applications: Parallel Apps, Large Seq. Apps
  • Programming layers: Sockets, Split-C, MPI, HPF, vSM
  • Global Layer UNIX: Process Migration, Distributed Files,
    Network RAM, Resource Management
  • Per node: UNIX Workstation + Comm. SW + Net Inter. HW
  • Fast Commercial Switch (Myrinet)
14
Outline
  • Introduction to the NOW project
  • Quick tour of the NOW lab
  • Important new system design concepts
  • Conclusions
  • Future Directions

15
First HP/fddi Prototype
  • FDDI on the HP/735 graphics bus.
  • First fast message layer on an unreliable network

16
SparcStation ATM NOW
  • ATM was going to take over the world.

The original INKTOMI
Today: www.hotbot.com
17
100 node Ultra/Myrinet NOW
18
Massive Cheap Storage
  • Basic unit
  • 2 PCs double-ending four SCSI chains

Currently serving Fine Art at http://www.thinker.org/imagebase/
19
Cluster of SMPs (CLUMPS)
  • Four Sun E5000s
  • 8 processors
  • 3 Myricom NICs
  • Multiprocessor, Multi-NIC, Multi-Protocol

20
Information Servers
  • Basic Storage Unit
  • Ultra 2, 300 GB RAID, 800 GB tape stacker, ATM
  • scalable backup/restore
  • Dedicated Info Servers
  • web,
  • security,
  • mail,
  • VLANs project into dept.

21
What's Different about Clusters?
  • Commodity parts?
  • Communications Packaging?
  • Incremental Scalability?
  • Independent Failure?
  • Intelligent Network Interfaces?
  • Complete System on every node
  • virtual memory
  • scheduler
  • files
  • ...

22
Three important system design aspects
  • Virtual Networks
  • Implicit co-scheduling
  • Scalable File Transfer

23
Communication Performance => Direct Network Access
[Chart: Latency and 1/BW]
  • LogP: Latency, Overhead, and Bandwidth
    (small-message cost recalled below)
  • Active Messages lean layer supporting
    programming models
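For reference, the standard LogP accounting for a small message (a property of the model itself, not a number read off this chart; o is the per-message send/receive overhead, L the network latency, g the gap between consecutive sends):

    T_{\mathrm{small}} = o_{\mathrm{send}} + L + o_{\mathrm{recv}} \approx 2o + L,
    \qquad \text{per-processor message rate} \le 1/g

Active Messages aims to keep the overhead term o, the layer's own contribution to this cost, as small as possible.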

24
Example: NAS Parallel Benchmarks
  • Better node performance than the Cray T3D
  • Better scalability than the IBM SP-2

25
General purpose requirements
  • Many timeshared processes
  • each with direct, protected access
  • User and system
  • Client/Server, Parallel clients, parallel servers
  • they grow, shrink, handle node failures
  • Multiple packages in a process
  • each may have own internal communication layer
  • Use communication as easily as memory

26
Virtual Networks
  • Endpoint abstracts the notion of being "attached
    to the network"
  • Virtual network is a collection of endpoints that
    can name each other.
  • Many processes on a node can each have many
    endpoints, each with its own protection domain
    (data-structure sketch below).
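A rough data-structure sketch of those two abstractions; the field names, queue depth, and 64-byte message slots are assumptions for illustration, not the NOW implementation. Each endpoint owns its own send/receive queues, and a virtual network is just the set of endpoints allowed to name one another.

    /* Sketch of endpoints and a virtual network; sizes and names assumed. */
    #include <stdio.h>
    #include <stdlib.h>

    #define QDEPTH 64

    typedef struct {              /* one "attachment" to the network       */
        int   vnet_id;            /* which virtual network it belongs to   */
        int   name;               /* its name within that virtual network  */
        char *send_q;             /* queues the process reads and writes   */
        char *recv_q;             /* directly; the NIC maps the active ones*/
    } endpoint_t;

    typedef struct {              /* endpoints that can name each other,   */
        int         id;           /* forming one protection domain         */
        endpoint_t *members[8];
        int         nmembers;
    } vnet_t;

    static endpoint_t *endpoint_create(vnet_t *vn) {
        endpoint_t *ep = calloc(1, sizeof *ep);
        ep->vnet_id = vn->id;
        ep->name    = vn->nmembers;
        ep->send_q  = calloc(QDEPTH, 64);   /* 64-byte message slots */
        ep->recv_q  = calloc(QDEPTH, 64);
        vn->members[vn->nmembers++] = ep;
        return ep;
    }

    int main(void) {
        vnet_t vn = { .id = 7 };
        endpoint_t *a = endpoint_create(&vn);
        endpoint_t *b = endpoint_create(&vn);
        printf("vnet %d: endpoints %d and %d\n", vn.id, a->name, b->name);
        return 0;
    }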

27
How are they managed?
  • How do you get direct hardware access for
    performance with a large space of logical
    resources?
  • Just like virtual memory
  • active portion of large logical space is bound to
    physical resources

[Figure: many process endpoints live in host memory; the active ones are
bound into NIC memory by the processor and network interface, the way
virtual pages are bound to physical frames]
28
Solaris System Abstractions
  • Segment Driver
  • manages portions of an address space
  • Device Driver
  • manages I/O device

Virtual Network Driver
29
Virtualization is not expensive
30
Bursty Communication among many virtual networks
31
Sustain high BW with many VN
32
Perspective on Virtual Networks
  • Networking abstractions are vertical stacks
  • new function => new layer
  • poke through for performance
  • Virtual Networks provide a horizontal abstraction
  • basis for building new, fast services

33
Beyond the Personal Supercomputer
  • Able to timeshare parallel programs
  • with fast, protected communication
  • Mix with sequential and interactive jobs
  • Use fast communication in OS subsystems
  • parallel file system, network virtual memory,
  • Nodes have powerful, local OS scheduler
  • Problem: local schedulers do not know to run
    parallel jobs in parallel

34
Local Scheduling
  • Local Schedulers act independently
  • no global control
  • Program waits while trying to communicate with
    peers that are not running
  • 10 - 100x slowdowns for fine-grain programs!
  • => need coordinated scheduling

35
Traditional Solution: Gang Scheduling
  • Global context switch according to precomputed
    schedule
  • Inflexible, inefficient, fault prone

36
Novel Solution: Implicit Coscheduling
  • Coordinate schedulers using only the
    communication in the program
  • very easy to build
  • potentially very robust to component failures
  • inherently service on-demand
  • scalable
  • Local service component can evolve.

37
Why it works
  • Infer non-local state from local observations
  • React to maintain coordination (two-phase wait
    sketched below)
  • observation => implication => action
  • fast response => partner scheduled => spin
  • delayed response => partner not scheduled => block
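That reaction can be sketched as a two-phase wait; the spin threshold, the stubbed reply check, and sched_yield standing in for a real kernel block are all assumptions, not the NOW scheduler code.

    /* Two-phase wait: spin for about one round-trip, then give up the CPU. */
    #include <sched.h>      /* sched_yield(): stand-in for a real block   */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    #define SPIN_NS 20000L          /* ~ one expected round-trip; assumed  */

    /* Stub for "poll the endpoint's receive queue". */
    static int polls = 0;
    static bool reply_arrived(void) { return ++polls > 500; }

    static void wait_for_reply(void) {
        struct timespec t0, t;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (;;) {
            if (reply_arrived())
                return;                 /* fast response: partner is running */
            clock_gettime(CLOCK_MONOTONIC, &t);
            long waited = (t.tv_sec - t0.tv_sec) * 1000000000L
                        + (t.tv_nsec - t0.tv_nsec);
            if (waited > SPIN_NS)
                sched_yield();          /* delayed response: stop burning CPU */
        }
    }

    int main(void) {
        wait_for_reply();
        printf("reply observed after %d polls\n", polls);
        return 0;
    }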

38
Example: Synthetic Programs
  • Range of granularity and load imbalance
  • spin wait: 10x slowdown

39
Implicit Coordination
  • Surprisingly effective
  • real programs
  • range of workloads
  • simple and robust
  • Opens many new research questions
  • fairness
  • How broadly can implicit coordination be applied
    in the design of cluster subsystems?

40
A look at Serious File I/O
  • Traditional I/O system
  • NOW I/O system
  • Benchmark Problem: sort a large number of 100-byte
    records with 10-byte keys
  • start on disk, end on disk
  • accessible as files (use the file system)
  • Datamation sort: 1 million records
  • Minute sort: as many records as possible in a minute

[Figure: processor-memory (P-M) nodes]
41
World-Record Disk-to-Disk Sort
  • Sustain 500 MB/s disk bandwidth and 1,000 MB/s
    network bandwidth

42
Key Implementation Techniques
  • Performance Isolation: highly tuned local
    disk-to-disk sort
  • manage local memory
  • manage disk striping
  • memory-mapped I/O with madvise, buffering
    (sketched below)
  • manage overlap with threads
  • Efficient Communication
  • completely hidden under disk I/O
  • competes for I/O bus bandwidth
  • Self-tuning Software
  • probe available memory, disk bandwidth, trade-offs
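The memory-mapped I/O bullet can be illustrated with a minimal sketch, assuming a hypothetical input file of 100-byte records and omitting real error handling, buffering, and threading: map the run read-only and advise the VM system that access is sequential so it can read ahead and drop pages behind the scan.

    /* Sketch of the memory-mapped, madvise'd input path (illustrative). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define RECORD 100               /* 100-byte records, 10-byte keys */

    int main(void) {
        int fd = open("run.dat", O_RDONLY);    /* hypothetical input file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* The "m-advise" from the slide: declare sequential access. */
        madvise(base, st.st_size, MADV_SEQUENTIAL);

        long nrec = st.st_size / RECORD;
        printf("%ld records mapped; first key byte = %d\n",
               nrec, nrec ? base[0] : -1);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }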

43
Towards a Cluster File System
  • Remote disk system built on a virtual network

[Figure: the client links against RDlib and talks to RD servers over
active messages]
44
Conclusions
  • Fast, simple Cluster Area Networks are a
    technological breakthrough
  • Complete system on every node makes clusters a
    very powerful architecture.
  • Extend the system globally
  • virtual memory systems,
  • schedulers,
  • file systems, ...
  • Efficient communication enables new solutions to
    classic systems challenges.

45
Millennium Computational Community
[Figure: campus departments (Business, SIMS, BMRC, Chemistry, C.S.,
E.E., Biology, Astro, NERSC, M.E., Physics, N.E., Math, IEOR,
Transport, Economy, C.E., MSME) connected by Gigabit Ethernet]
46
Millennium PC Clumps
  • Inexpensive, easy to manage Cluster
  • Replicated in many departments
  • Prototype for very large PC cluster

47
Proactive Infrastructure
Information appliances
Stationary desktops
Scalable Servers