How Computer Architecture Trends May Affect Future Distributed Systems - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: How Computer Architecture Trends May Affect Future Distributed Systems


1
How Computer Architecture Trends May Affect Future Distributed Systems
  • Mark D. Hill
  • Computer Sciences Department
  • University of Wisconsin--Madison
  • http://www.cs.wisc.edu/~markhill
  • PODC '00 Invited Talk

2
Three Questions
  • What is a System Area Network (SAN) and how will
    it affect clusters?
  • E.g., InfiniBand
  • How fat will multiprocessor servers be and how do
    we build larger ones?
  • E.g., Wisconsin Multifacet's Multicast & Timestamp
    Snooping
  • Future of multiprocessor servers & clusters?
  • A merging of both?

3
Outline
  • Motivation
  • System Area Networks
  • Designing Multiprocessor Servers
  • Server Cluster Trends

4
Technology Push: Moore's Law
  • What do following intervals have in common?
  • Prehistory to 2000
  • 2001 to 2002
  • Answer: Equal progress in absolute processor
    speed (and more doubling 2003-4, 2005-6, etc.)
  • Consider salary doubling
  • Corollary: Cost halves every two years
  • Jim Gray: In a decade you can buy a computer for
    less than its sales tax today
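
A quick check of Gray's corollary in Python (a minimal sketch; the $2,000 price and 8% sales-tax rate are illustrative assumptions, not numbers from the talk):

    # If cost halves every two years, how long until a computer costs
    # less than today's sales tax on it?
    price = 2000.0                 # assumed price today
    tax = 0.08 * price             # assumed 8% sales tax: $160
    years = 0
    while price > tax:
        price /= 2                 # corollary: cost halves ...
        years += 2                 # ... every two years
    print(years)                   # 8 -- within Gray's decade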

5
Application Pull
  • Should use computers in currently wasteful ways
  • Already computers in electric razors & greeting
    cards
  • New business models
  • B2C, B2B, C2B, C2C
  • Mass customization
  • More proactive (beyond interactive) Tennenhouse
  • Today P2C, where P=Person & C=Computer
  • More C2P: mattress adjusts to save your back
  • More C2C: agents surf the web for optimal deal
  • More sensors (physical/logical worlds coupled)
  • More hidden computers (cf. electric motors)
  • Furthermore, I am wrong

6
The Internet Iceberg
  • Internet Components
  • Clients -- mobile, wireless
  • On Ramp -- LANs/DSL/Cable Modems
  • WAN Backbone -- IPv6, massive BW
  • and ...
  • SERVICES
  • Scale Storage
  • Scale Bandwidth
  • Scale Computation
  • High Availability

7
Outline
  • Motivation
  • System Area Networks
  • What is a SAN?
  • InfiniBand
  • Virtualizing I/O with Queue Pairs
  • Predictions
  • Designing Multiprocessor Servers
  • Server Cluster Trends

8
Regarding Storage/Bandwidth
  • Currently resides on I/O Bus (PCI)
  • HW & SW protocol stacks
  • Must add hosts to add storage/bandwidth

[Diagram: host connected through a bridge to an I/O bus with I/O slots 0 through n-1]
9
Want System Area Network (SAN)
  • SAN vs. Local Area Network (LAN)
  • Higher bandwidth (10 Gbps)
  • Lower latency (few microseconds or less)
  • More limited size
  • Other (e.g., single administrative domain, short
    distance)
  • Examples: Tandem ServerNet & Myricom Myrinet
  • Emerging Standard: InfiniBand
  • www.infinibandta.org w/ spec 1.0 Summer 2000
  • Compaq, Dell, HP, IBM, Intel, Microsoft, Sun,
    others
  • 2.5 Gbits/s times 1, 4, or 12 wires

10
InfiniBand Model (from website)
[Diagram: InfiniBand fabric connecting a host to a target (disks) through a Target Channel Adapter (TCA)]
11
InfiniBand Advantages
  • Storage/Network made orthogonal to Computation
  • Reduce hardware stack -- no I/O bridge
  • Reduce software stack: hardware support for
  • Connected Reliable
  • Connected Unreliable
  • Datagram
  • Reliable Datagram
  • Raw Datagram
  • Can eliminate system call for SAN use (next slide)

12
Virtualizing InfiniBand
  • I/O traditionally virtualized with system call
  • System enforces isolation
  • System permits authorized sharing
  • Memory virtualized
  • System trap/call for setup
  • Virtual memory hardware for common-case
    translation
  • InfiniBand exploits queue pairs (QPs) in memory
  • Cf. Intel Virtual Interface Architecture (VIA),
    IEEE Micro, Mar/Apr '98
  • Users issue sends, receives, remote DMA
    reads/writes

13
Queue Pair
  • QP setup: system call
  • Connect with process
  • Connect with remote QP (not shown here)
  • QP placed in pinned virtual memory
  • User directly accesses QP
  • E.g., sends, receives & remote DMA reads/writes

[Diagram: a process and the Host Channel Adapter (HCA) share queue pairs in pinned main memory, holding entries such as receive1/send1, receive2/send2, dma-R3, and dma-W4]
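
To make "eliminate system call" concrete, here is a minimal Python sketch of the queue-pair usage pattern; the QueuePair class and its methods are hypothetical stand-ins for illustration, not the InfiniBand verbs API:

    from collections import deque

    class QueuePair:
        # Hypothetical QP: two work queues in pinned user memory that the
        # HCA polls, so no kernel is involved after setup.
        def __init__(self):
            self.send_q = deque()          # stands in for pinned send queue
            self.recv_q = deque()          # stands in for pinned receive queue
        def post_send(self, buf):          # user level: just a memory write
            self.send_q.append(("SEND", buf))
        def post_recv(self, buf):          # user level: just a memory write
            self.recv_q.append(("RECV", buf))

    def qp_setup():
        # The one privileged step: a system call pins the queues' memory
        # and connects this QP to a remote QP.
        return QueuePair()

    qp = qp_setup()                        # slow path: kernel involved once
    qp.post_recv(bytearray(64))            # fast path: no system calls here
    qp.post_send(b"payload")               # sends, receives, remote DMA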
14
InfiniBand, cont.
  • Roadmap
  • NGIO/FIO merger in '99
  • Spec in '00
  • Products in '03-'10
  • My Assessment
  • PCI needs a successor
  • InfiniBand has the necessary features (but also
    many others)
  • InfiniBand has considerable industry buy-in (but
    it is recent)
  • Gigabit Ethernet will be the only competitor
  • Good name with backing from Cisco et al.
  • But TCP/IP is a killer
  • InfiniBand for storage will be key

15
InfiniBand Research Issues
  • Software: Wide Open
  • Industry will do local optimization (e.g., still
    have device driver virtualized with system calls)
  • But what is the right way to do software?
  • Is there a theoretical model for this software?
  • Other SAN Issues
  • A theoretical model of a service provider's site?
  • How to trade performance and availability?
  • Utility of broadcast or multicast support?
  • Obtaining quasi-real-time performance?

16
Outline
  • Motivation
  • System Area Networks
  • Designing Multiprocessor Servers
  • How Fat?
  • Coherence for Servers
  • E.g., Multicast Snooping
  • E.g., Timestamp Snooping
  • Server Cluster Trends

17
How Fat Should Servers Be?
  • Use
  • PCs -- cheap but small
  • Workgroup servers -- medium cost & medium size
  • Large servers -- premium cost & size
  • One answer: yes

18
How Do We Build the Big Servers?
  • (Industry knows how to build the small ones)
  • A key problem is the memory system
  • Memory Wall: e.g., a 100 ns memory access = 400
    instruction opportunities for a 4-way, 1 GHz
    processor
  • Use per-processor caches to reduce
  • Effective Latency
  • Effective Bandwidth Used
  • But cache coherence problem ...
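
The 400-instruction figure is the slide's parameters multiplied out; a one-line Python check:

    # Issue slots lost while one memory access is outstanding:
    latency_ns, slots_per_ns = 100, 4      # 100 ns miss; 4-way at 1 GHz
    print(latency_ns * slots_per_ns)       # 400 instruction opportunities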

19
Coherence 101
[Diagram: two processors each cache block 100 (value 4) via r0 <- m[100] and r1 <- m[100]; a later write m[100] <- 5 leaves those cached copies stale unless the interconnection network and memories invalidate or update them]
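
The diagram's scenario as a toy Python trace (values from the slide):

    # Both processors cache m[100] = 4; then m[100] is written to 5.
    memory = {100: 4}
    cache_p0 = {100: memory[100]}          # r0 <- m[100]
    cache_p1 = {100: memory[100]}          # r1 <- m[100]
    memory[100] = 5                        # m[100] <- 5
    # Without a coherence protocol both caches still return stale 4s:
    assert cache_p0[100] == 4 and cache_p1[100] == 4 and memory[100] == 5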
20
Broadcast Snooping
[Diagram: P2 broadcasts GETX; every processor and memory snoops it, and the current owner responds with the data]
21
Broadcast Snooping
  • Symmetric Multiprocessor (SMP)
  • Most commercially-successful parallel computer
    architecture
  • Performs well by finding data directly
  • Scales poorly
  • Improvements, e.g., Sun E10000
  • Split address & data transactions
  • Split address & data networks (e.g., bus &
    crossbar)
  • Multiple address buses (e.g., four multiplexed by
    address)
  • Address bus is broadcast tree (not shared wires)
  • But
  • Broadcast all address transactions (expensive)
  • All processors must snoop all transactions

22
Directories
[Diagram: P1's and P2's GETX requests travel point-to-point to the directory at memory, which orders them and forwards or supplies the data to each requestor in turn]
23
Directories
  • Directory-Based Cache Coherence
  • E.g., SGI/Cray Origin2000
  • Allows arbitrary point-to-point interconnection
    network
  • Scales up well
  • But
  • Cache-to-cache transfers common in demanding
    apps (55-62% sharing misses for OLTP, Barroso
    ISCA '98)
  • Many applications can't use 100s of processors
  • Must also scale down well

24
Wisconsin Multifacet Big Picture
  • Build Servers for the Internet economy
  • Moderate multiprocessor sizes 2-8 then 16-64,
    but not 1K
  • Optimize for these workloads (e.g., cache-to-cache
    transfers)
  • Key Tool: Multiprocessor Prediction & Speculation
  • Make a guess... verify it later
  • Uniprocessor predecessors: branch & set
    predictors
  • Recent multiprocessor work: Mukherjee/Hill
    ISCA '98, Kaxiras/Goodman HPCA '99, Lai/Falsafi
    ISCA '99
  • Multicast Snooping
  • Timestamp Snooping

25
Comparison of Coherence Methods
[Table contrasting broadcast snooping (finds data directly but broadcasts every transaction) with directories (scale well but add indirection)]
Use prediction to improve on both?
26
Multicast Snooping
  • On cache miss
  • Predict "multicast mask" (e.g., bit vector of
    processors)
  • Issue transaction on multicast address network
  • Networks
  • Address network that totally-orders address
    multicasts
  • Separate point-to-point data network
  • Processors snoop all incoming transactions
  • If it's your own, it "occurs" now
  • If another's, then invalidate and/or respond
  • Simplified directory (at memory)
  • Purpose: Allows masks to be wrong (explained
    later)
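
A toy Python rendering of the flow above; the retry-with-corrected-mask recovery is a simplification for illustration, not the exact ISCA '99 protocol:

    def multicast_snoop(requestor, block, mask, directory):
        # directory[block]: the set of processors that must see the
        # transaction (owner plus sharers).
        needed = directory[block]
        if needed <= mask:                     # predicted mask sufficed
            for p in mask - {requestor}:
                pass                           # p would snoop here: invalidate/respond
            directory[block] = {requestor}     # e.g., GETX: new exclusive owner
            return mask
        # Mask was wrong: the simplified directory detects this; retry
        # with a corrected mask (illustrative recovery policy).
        return multicast_snoop(requestor, block, mask | needed, directory)

    directory = {0x100: {2, 3}}                # block cached by P2 and P3
    print(multicast_snoop(1, 0x100, {1, 2}, directory))  # {1, 2, 3} after one retry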

27
Predicting Masks
  • Performed at Requesting Processor
  • Include owner (GETS/GETX) & all sharers (GETX
    only)
  • Exclude most other processors
  • Techniques
  • Many straightforward cases (e.g., stack, code,
    space-sharing)
  • Many options (network load, PC, software,
    local/global)

[Diagram: a Mask Predictor takes the block address plus feedback and outputs the predicted mask]
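
A minimal Python sketch of the box above, assuming a table indexed by block-address region and trained by feedback about the true sharers (the granularity and update policy are illustrative choices, two of the "many options"):

    class MaskPredictor:
        def __init__(self, self_id, region_bits=10):
            self.self_id = self_id             # always include the requestor
            self.region_bits = region_bits     # 1 KB regions (illustrative)
            self.table = {}                    # region -> last-seen sharers

        def predict(self, block_addr):
            region = block_addr >> self.region_bits
            return {self.self_id} | self.table.get(region, set())

        def feedback(self, block_addr, true_mask):
            # Train on the owner/sharers the transaction actually needed.
            self.table[block_addr >> self.region_bits] = set(true_mask)

    pred = MaskPredictor(self_id=0)
    print(pred.predict(0x4100))                # cold: {0}, self only
    pred.feedback(0x4100, {2, 3})              # P2 and P3 actually shared it
    print(pred.predict(0x4200))                # same region: {0, 2, 3}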
28
Implementing an Ordered Multicast Network
  • Address Network
  • Must create the illusion of total order of
    multicasts
  • May deliver a multicast to destinations at
    different times
  • Wish List
  • High throughput for multicasts
  • No centralized bottlenecks
  • Low latency and cost (≈ pipelined broadcast tree)
  • ...
  • Sample Solutions
  • Isotach Networks, Reynolds et al., IEEE TPDS
    4/97
  • Indirect Fat Tree, ISCA '99
  • Direct Torus

29
Indirect Fat Tree, ISCA '99
[Diagram: indirect fat-tree network with processors (P), directories (D), and memories (M) at the leaves]
30
Indirect Fat Tree, cont.
  • Basic Idea
  • Processors send transactions up to roots
  • Roots send transactions down with logical
    timestamp
  • Switches stall transactions to keep them in order
  • Null transactions sent to avoid deadlock
  • Assessment
  • Viable & high cross-section bandwidth
  • Many "backplane" ASICs mean higher cost
  • Often stalls transactions
  • Want
  • Lower cost of direct connections
  • Always deliver transactions as soon as possible
    (ASAP)
  • Sacrifice some cross-section bandwidth

31
Direct 2-D Torus (work in progress)
  • Features
  • Each processor is a switch
  • Switches directly connected
  • E.g., network of Compaq 21364
  • Network order?
  • Broadcasts unordered
  • Snooping needs total order
  • Solution
  • Create order with logical timestamps instead of
    network delivery order
  • Called Timestamp Snooping, ASPLOS '00

[Diagram: 16-node 2-D torus with nodes numbered 0 through 15]
32
Timestamp Snooping
  • Timestamp Snooping
  • Snooping with order determined by logical
    timestamps
  • Broadcast (not multicast) in ASPLOS '00
  • Basic Idea
  • Assign timestamps to coherence transactions at the
    sender
  • Broadcast transactions over unordered network
    ASAP
  • Transactions carry timestamps (2 bits)
  • Processors process transactions in timestamp
    order
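
A Python sketch of the per-processor reordering this implies: transactions arrive in arbitrary network order and are drained in timestamp order once the network guarantees nothing earlier can still arrive (the guarantee is the subject of the next slide; plain integers stand in for the 2-bit timestamp encoding):

    import heapq

    class Snooper:
        def __init__(self):
            self.pending = []                  # min-heap of (timestamp, txn)

        def arrive(self, ts, txn):             # unordered network delivery
            heapq.heappush(self.pending, (ts, txn))

        def process(self, bound):
            # Safe once the network asserts every future arrival has
            # timestamp >= bound.
            done = []
            while self.pending and self.pending[0][0] < bound:
                done.append(heapq.heappop(self.pending)[1])
            return done                        # logical-timestamp order

    p = Snooper()
    p.arrive(7, "P2 GETX A")                   # arrives first ...
    p.arrive(5, "P1 GETS A")                   # ... but is logically earlier
    print(p.process(bound=8))                  # ['P1 GETS A', 'P2 GETX A']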

33
Timestamp Snooping Issues
  • More address bandwidth
  • For 16 processors, 4-ary butterfly, 64-byte
    blocks
  • Directory: 38, 372 more, 240 more
  • Timestamp Snooping: 218, 372, 384 (< 60% more)
  • Network must guarantee timestamps
  • Assert future transactions will have greater
    timestamps (so processors can process older
    transactions)
  • Isotach (Reynolds, IEEE TPDS 4/97) does this more
    aggressively
  • Other
  • Priority queue at processor to order transactions
  • Flow control and buffering issues

34
Initial Multifacet Results
  • Multicast Snooping, ISCA '99
  • Ordered multicast of coherence transactions
  • Find data directly from memory or caches
  • Reduce bandwidth to permit some scaling
  • 32-processor results show 2-6 destinations per
    multicast
  • Timestamp Snooping, ASPLOS '00
  • Broadcast snooping with order determined by
    logical timestamps carried by coherence
    transactions
  • No bus: allows arbitrary memory interconnects
  • No directory or directory indirection
  • 16-processor results show 25% faster for 25% more
    traffic

35
Selected Issues
  • Multicast Snooping
  • What program property are mask predictors
    exploiting?
  • Why is there no good model of locality or the
    90-10 rule in general?
  • How does one build multicast networks?
  • What about fault tolerance?
  • Timestamp Snooping
  • What is an optimal network topology?
  • What about buffering, deadlock, etc.?
  • Implementing switches and priority queues?

36
Outline
  • Motivation
  • System Area Networks
  • Designing Multiprocessor Servers
  • Server Cluster Trends
  • Out-of-box and highly-available servers
  • High-performance communication for clusters

37
Multiprocessor Servers
  • High-Performance Communication within box
  • SMPs (e.g., Intel PentiumPro Quads)
  • Directory-based (SGI Origin2000)
  • Trend toward hierarchical out-of-box solutions
  • Build bigger servers from smaller ones
  • Intel Profusion, Sequent NUMA-Q, Sun WildFire
    (pictured)

38
Multiprocessor Servers, cont.
  • Traditionally had poor error isolation
  • Double-bit ECC error crashes everything
  • Kernel error crashes everything
  • Poor match for highly available Internet
    infrastructure
  • Improve error isolation
  • IBM 370 virtual machines
  • Stanford HIVE cells

39
Clusters
  • Traditionally
  • Good error isolation
  • Poor communication performance (especially
    latency)
  • LANs are not optimized for clusters
  • Enter Early SANs
  • Berkeley NOW w/ Myricom Myrinet
  • IBM SP w/ proprietary network
  • What now with InfiniBand SAN (or alternatives)?

40
A Prediction
  • Blurring of cluster & server boundaries
  • Clusters
  • High communication performance
  • Servers
  • Better error isolation
  • Multi-box solutions
  • Use same hardware & configure in the field
  • Issues
  • How do we model these hybrids?
  • Should PODC & SPAA also converge?

41
Three Questions
  • What is a System Area Network (SAN) and how will
    it affect clusters?
  • E.g., InfiniBand
  • Make computation, storage & network orthogonal
  • How fat will multiprocessor servers be and how do
    we build larger ones?
  • Varying sizes for soft & hard state
  • E.g., Multicast Snooping & Timestamp Snooping
  • Future of multiprocessor servers & clusters?
  • Servers will support higher availability &
    extra-box solutions
  • Clusters will get server communication performance