Transcript and Presenter's Notes

Title: Cache Coherence and Interconnection Network Design


1
Cache Coherence and Interconnection Network Design
  • Adapted from UC Berkeley notes

2
Cache-based (Pointer-based) Schemes
  • How they work
  • home only holds a pointer to the rest of the directory info
  • distributed linked list of copies weaves through the caches
  • cache tag has a pointer that points to the next cache with a copy
  • on a read, add yourself to the head of the list (communication needed)
  • on a write, propagate a chain of invalidations down the list (see the sketch below)
  • What if a link fails? => Scalable Coherent Interface (SCI), IEEE standard
  • doubly linked list
  • What to do on replacement?
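To make the pointer-chasing concrete, here is a minimal Python sketch of an SCI-style doubly linked sharer list. The names (CacheNode, SharerList, add_reader, invalidate_all) are illustrative and not part of the SCI standard; the sketch ignores replacement and concurrency.

```python
class CacheNode:
    """One cache holding a copy of the block; links form the sharing list."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.fwd = None   # pointer to the next cache with a copy
        self.back = None  # back pointer (doubly linked, as in SCI)
        self.valid = True

class SharerList:
    """Home node: holds only the head pointer of the distributed list."""
    def __init__(self):
        self.head = None

    def add_reader(self, node):
        # On a read miss, the new sharer inserts itself at the head of the list.
        node.fwd = self.head
        if self.head is not None:
            self.head.back = node
        self.head = node

    def invalidate_all(self):
        # On a write, invalidations propagate down the list one hop at a time;
        # latency is therefore proportional to the number of sharers.
        hops = 0
        cur = self.head
        while cur is not None:
            cur.valid = False
            hops += 1
            cur = cur.fwd
        self.head = None
        return hops

# Example: three sharers => three serialized invalidation hops on a write.
home = SharerList()
for i in range(3):
    home.add_reader(CacheNode(i))
print(home.invalidate_all())  # 3
```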

3
Scaling Properties (Cache-based)
  • Traffic on write proportional to number of
    sharers
  • Latency on write proportional to number of
    sharers!
  • don't know the identity of the next sharer until you reach the current one
  • also assist processing at each node along the way
  • (even reads involve more than one other assist: home and first sharer on the list)
  • Storage overhead: quite good scaling along both axes
  • only one head pointer per memory block
  • rest is all proportional to cache size
  • Very complex!!!

4
Hierarchical Directories
  • Directory is a hierarchical data structure
  • leaves are processing nodes, internal nodes are just directories
  • logical hierarchy, not necessarily physical
  • (can be embedded in a general network)

5
Caching in the Internet (server-side caching, client-side caching, network caching)
6
Network Caching / Proxy Caching. Ref: Jia Wang, "A Survey of Web Caching Schemes for the Internet," ACM Computing Surveys
7
Comparison with P2P
  • Directory is similar to the tracker in BitTorrent (BT) or the root in a P2P network: Plaxton's root has one pointer, while BT's tracker has all the pointers
  • The real object may be somewhere else (call it the home node), as pointed to by the directory
  • The shared caches are extra locations of the object, equivalent to peers
  • Duplicate directories are possible; see the hierarchical directory design (MIND) later
  • Lots of possible designs for the directory. Can we do a similar design in P2P?

8
Plaxton's Scheme
  • Similar to the Distributed Shared Memory (DSM) cache protocol
  • The root is the home node in DSM, where the directory is kept but not the real object. The root points to the nearest home peer, where a shared copy can be found.
  • Similar to the hierarchical directory scheme, where intermediate nodes point to the nearest object nodes (equivalent to shared copies). However, in the hierarchical directory scheme intermediate pointers point to all shared copies.

9
P2P Research
  • How many pointers do we need at the tracker? Cost model? Does it depend on popularity? How many peers to supply to a client? Communication model? How to update the directory?
  • Cache-based (linked-list) approach like SCI?
  • Hierarchical directory (Plaxton's paper): introduce such a method for the BT network?
  • Combine directory tree and DHTs?
  • Is a dirty copy possible in P2P? Then develop a protocol similar to cache coherence
  • Sub-page coherence like DSM?

10
Interconnection Network Design
  • Adapted from UC Berkeley notes

11
Scalable, High Perf. Interconnection Network
  • At Core of Parallel Computer Arch.
  • Requirements and trade-offs at many levels
  • Elegant mathematical structure
  • Deep relationships to algorithm structure
  • Managing many traffic flows
  • Electrical / Optical link properties
  • Little consensus
  • interactions across levels
  • Performance metrics?
  • Cost metrics?
  • Workload?
  • => need a holistic understanding

12
Goals
  • Job of a multiprocessor network is to transfer
    information from source node to destination node
    in support of network transactions that realize
    the programming model
  • latency as small as possible
  • as many concurrent transfers as possible
  • operation bandwidth
  • data bandwidth
  • cost as low as possible

13
Formalism
  • network is a graph V of switches and nodes connected by communication channels C ⊆ V × V
  • channel has width w and signaling rate f = 1/τ
  • channel bandwidth b = w × f (see the sketch below)
  • phit (physical unit): data transferred per cycle
  • flit: basic unit of flow control
  • number of input (output) channels is the switch degree
  • sequence of switches and links followed by a message is a route
  • think streets and intersections
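A tiny sketch of the channel formalism above, showing b = w × f with made-up example numbers (a 16-bit channel and a 2 ns cycle); all values and variable names are assumptions for illustration.

```python
# Channel formalism: width w, cycle time tau, signaling rate f = 1/tau, bandwidth b = w * f.
width_bits = 16                                   # w: phit size in bits (assumed)
cycle_time_ns = 2.0                               # tau: one transfer every 2 ns (assumed)
signaling_rate_hz = 1e9 / cycle_time_ns           # f = 1/tau = 500 MHz
bandwidth_bps = width_bits * signaling_rate_hz    # b = w * f = 8 Gb/s

flit_bits = 64                                    # flow-control unit spanning several phits
phits_per_flit = flit_bits // width_bits
print(bandwidth_bps / 1e9, "Gb/s;", phits_per_flit, "phits per flit")  # 8.0 Gb/s; 4 phits per flit
```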

14
What characterizes a network?
  • Topology (what)
  • physical interconnection structure of the network
    graph
  • direct: node connected to every switch
  • indirect: nodes connected to a specific subset of switches
  • Routing Algorithm (which)
  • restricts the set of paths that msgs may follow
  • many algorithms with different properties
  • deadlock avoidance?
  • Switching Strategy (how)
  • how data in a msg traverses a route
  • circuit switching vs. packet switching
  • Flow Control Mechanism (when)
  • when a msg or portions of it traverse a route
  • what happens when traffic is encountered?

15
Topological Properties
  • Routing distance: number of links on a route
  • Diameter: maximum routing distance between any two nodes in the network
  • Average distance: sum of distances between node pairs / number of node pairs (computed in the sketch after this list)
  • Degree of a node: number of links connected to the node => cost is high if degree is high
  • A network is partitioned by a set of links if their removal disconnects the graph
  • Fault tolerance: number of alternate paths between two nodes in the network
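The properties above can be computed directly for a small topology. Below is an illustrative Python sketch (not from the slides) for a 4 × 4 2D mesh; mesh_neighbors and mesh_metrics are hypothetical helper names.

```python
from itertools import product

def mesh_neighbors(x, y, k):
    """Neighbors of node (x, y) in a k x k 2D mesh (no wraparound links)."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < k and 0 <= ny < k:
            yield nx, ny

def mesh_metrics(k):
    nodes = list(product(range(k), repeat=2))
    # In a mesh, the routing distance between two nodes is the Manhattan distance.
    dists = [abs(ax - bx) + abs(ay - by)
             for (ax, ay) in nodes for (bx, by) in nodes if (ax, ay) != (bx, by)]
    diameter = max(dists)                 # worst-case routing distance
    avg = sum(dists) / len(dists)         # average over all distinct node pairs
    max_degree = max(sum(1 for _ in mesh_neighbors(x, y, k)) for x, y in nodes)
    return diameter, avg, max_degree

print(mesh_metrics(4))  # diameter 6, average distance ~2.67, max degree 4
```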

16
Typical Packet Format
  • Two basic mechanisms for abstraction
  • encapsulation
  • fragmentation

17
Review Performance Metrics
(Figure: message-latency timeline showing Sender Overhead, Transmission Time (size / bandwidth), Time of Flight, Receiver Overhead, Transport Latency, and Total Latency.)
  • Total Latency = Sender Overhead + Time of Flight + Message Size / BW + Receiver Overhead (see the sketch below)
  • Includes header/trailer in BW calculation?
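A small worked sketch of the total-latency formula above; all the numbers (overheads, flight time, message size, bandwidth) are assumptions chosen for illustration.

```python
def total_latency_us(sender_overhead_us, time_of_flight_us,
                     message_bytes, bandwidth_mb_per_s, receiver_overhead_us):
    """Total Latency = Sender Overhead + Time of Flight + Size/BW + Receiver Overhead."""
    transmission_us = message_bytes / bandwidth_mb_per_s  # bytes / (MB/s) = microseconds
    return sender_overhead_us + time_of_flight_us + transmission_us + receiver_overhead_us

# Assumed values: 1 us overhead on each side, 0.5 us time of flight,
# a 100-byte message on a 100 MB/s link (1 us transmission time).
print(total_latency_us(1.0, 0.5, 100, 100, 1.0))  # 3.5 us
```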
18
Switching Techniques
  • Circuit switching: a control message is sent from source to destination and a path is reserved. Communication starts. The path is released when communication is complete.
  • Store-and-forward policy (packet switching): each switch waits for the full packet to arrive before sending it to the next switch (good for WANs)
  • Cut-through routing or wormhole routing: the switch examines the header, decides where to send the message, and then starts forwarding it immediately
  • In wormhole routing, when the head of a message is blocked, the message stays strung out over the network, potentially blocking other messages (needs to buffer only the piece of the packet that is sent between switches). The CM-5 uses it, with each switch buffer being 4 bits per port.
  • Cut-through routing lets the tail continue when the head is blocked, storing the whole message in an intermediate switch. (Requires a buffer large enough to hold the largest packet.)

19
(No Transcript)
20
Store and Forward vs. Cut-Through
  • Advantage
  • Latency reduces from (number of intermediate switches × packet size / interconnect BW) to (time for the 1st part of the packet to negotiate the switches + packet size / interconnect BW)

21
Store-and-Forward vs. Cut-Through Routing
  • store-and-forward: h(n/b + Δ) vs. cut-through: n/b + hΔ, for h hops, an n-bit message, bandwidth b, and per-switch routing delay Δ (see the code sketch below)
  • what if the message is fragmented?
  • wormhole vs. virtual cut-through
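The two expressions above written as functions; h, n, b, and Δ follow the definitions in the bullet, and the example numbers (5 hops, 1 KB packet, 1 Gb/s links, 100 ns per switch) are assumptions.

```python
def store_and_forward_latency(h, n_bits, b_bits_per_s, delta_s):
    """Each of the h switches waits for the whole n-bit packet: h * (n/b + delta)."""
    return h * (n_bits / b_bits_per_s + delta_s)

def cut_through_latency(h, n_bits, b_bits_per_s, delta_s):
    """Header pipelines through the switches; the body streams behind: n/b + h*delta."""
    return n_bits / b_bits_per_s + h * delta_s

# Illustrative numbers (assumed): 5 hops, 1 KB packet, 1 Gb/s links, 100 ns per switch.
h, n, b, delta = 5, 8 * 1024, 1e9, 100e-9
print(store_and_forward_latency(h, n, b, delta) * 1e6, "us")  # ~41.5 us
print(cut_through_latency(h, n, b, delta) * 1e6, "us")        # ~8.7 us
```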

22
(No Transcript)
23
Example
  • Q: Compare the efficiency of store-and-forward (packet switching) vs. wormhole routing for transmission of a 20-byte packet between a source and destination that are d nodes apart. Each node takes 0.25 microseconds and the link transfer rate is 20 MB/sec.
  • Answer: Time to transfer 20 bytes over a link = 20 bytes / 20 MB/sec = 1 microsecond.
  • Packet switching: #nodes × (node delay + transfer time) = d × (0.25 + 1) = 1.25d microseconds
  • Wormhole: (#nodes × node delay) + transfer time = 0.25d + 1 microseconds
  • Book: for d = 7, packet switching takes 8.75 microseconds vs. 2.75 microseconds for wormhole routing
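The same calculation as code, re-deriving the d = 7 numbers quoted above; the function names are just labels for the two formulas.

```python
def packet_switching_us(d, node_delay_us=0.25, transfer_us=1.0):
    # store-and-forward: every one of the d nodes pays node delay + full transfer time
    return d * (node_delay_us + transfer_us)

def wormhole_us(d, node_delay_us=0.25, transfer_us=1.0):
    # wormhole: only the header pays the per-node delay; the body streams through once
    return d * node_delay_us + transfer_us

print(packet_switching_us(7), wormhole_us(7))  # 8.75 2.75
```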

24
Contention
  • Two packets trying to use the same link at the same time
  • limited buffering
  • drop?
  • Most parallel machine networks block in place
  • link-level flow control
  • tree saturation
  • Closed system: offered load depends on delivered load

25
Delay with Queuing
  • Suppose there are L links per node. Each link sends packets to another link at λ packets/sec. The service rate (link + switch) is μ packets per second. What is the delay over a distance D?
  • Ans: There is a queue at each output link to hold extra packets. Model each output link as an M/M/1 queue with input rate L·λ and service rate μ.
  • Delay through each link = queuing time + service time = S
  • Delay over a distance D = S × D
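A small sketch of the M/M/1 model above, using the standard result that the mean time in an M/M/1 queue (waiting plus service) is 1/(μ − λ); the rates and distance are assumed example values.

```python
def mm1_delay_s(arrival_rate, service_rate):
    """Mean time in an M/M/1 queue (waiting + service): S = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)

# Assumed numbers: L = 4 links per node, lambda = 20k packets/s per link,
# mu = 100k packets/s per output link, distance D = 5 hops.
L, lam, mu, D = 4, 20_000, 100_000, 5
S = mm1_delay_s(L * lam, mu)   # delay through one output link
print(S * 1e6, "us per hop;", S * D * 1e6, "us over D hops")  # ~50 us per hop, ~250 us total
```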

26
Congestion Control
  • Packet-switched networks do not reserve bandwidth; this leads to contention (connection-based networks limit input)
  • Solution: prevent packets from entering until contention is reduced (e.g., freeway on-ramp metering lights)
  • Options
  • Packet discarding: if a packet arrives at a switch and there is no room in the buffer, the packet is discarded (e.g., UDP)
  • Flow control: between pairs of receivers and senders; use feedback to tell the sender when it is allowed to send the next packet
  • Back-pressure: separate wires to tell the sender to stop
  • Window: give the original sender the right to send N packets before getting permission to send more; overlaps the latency of the interconnection with the overhead to send and receive a packet (e.g., TCP), adjustable window (see the sketch below)
  • Choke packets (aka rate-based): each packet received by a busy switch in a warning state is sent back to the source via a choke packet. The source reduces traffic to that destination by a fixed fraction (e.g., ATM)
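A minimal sketch of window-based flow control as described above; the class and method names (WindowSender, can_send, ack) are illustrative and not from any particular protocol.

```python
class WindowSender:
    """Sender may have at most `window` unacknowledged packets outstanding."""
    def __init__(self, window):
        self.window = window
        self.outstanding = 0

    def can_send(self):
        return self.outstanding < self.window

    def send(self):
        assert self.can_send(), "window full: wait for an acknowledgement"
        self.outstanding += 1

    def ack(self):
        # Feedback from the receiver frees one slot in the window.
        self.outstanding = max(0, self.outstanding - 1)

# With window N = 3, the sender overlaps network latency with useful work:
s = WindowSender(3)
for _ in range(3):
    s.send()
print(s.can_send())  # False: must wait for an ack before sending packet 4
s.ack()
print(s.can_send())  # True
```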

27
Routing
  • Recall routing algorithm determines
  • which of the possible paths are used as routes
  • how the route is determined
  • R: N × N → C, which at each switch maps the destination node n_d to the next channel on the route
  • Issues
  • Routing mechanism
  • arithmetic
  • source-based port select
  • table driven
  • general computation
  • Properties of the routes
  • Deadlock free

28
Routing Mechanism
  • need to select an output port for each input packet
  • in a few cycles
  • Simple arithmetic in regular topologies
  • e.g., Δx, Δy routing in a grid (see the sketch below)
  • west (−x): Δx < 0
  • east (+x): Δx > 0
  • south (−y): Δx = 0, Δy < 0
  • north (+y): Δx = 0, Δy > 0
  • processor: Δx = 0, Δy = 0
  • Reduce the relative address of each dimension in order
  • Dimension-order routing in k-ary d-cubes
  • e-cube routing in n-cubes
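A small sketch of Δx, Δy dimension-order routing in a 2D grid; the function name and output format are illustrative.

```python
def dimension_order_route(src, dst):
    """Delta-x, delta-y routing in a 2D grid: correct x fully, then y (dimension order)."""
    x, y = src
    dx, dy = dst[0] - x, dst[1] - y
    hops = []
    while dx != 0:                      # x dimension first
        step = 1 if dx > 0 else -1      # east (+x) or west (-x)
        x += step; dx -= step
        hops.append(("east" if step > 0 else "west", (x, y)))
    while dy != 0:                      # then y dimension
        step = 1 if dy > 0 else -1      # north (+y) or south (-y)
        y += step; dy -= step
        hops.append(("north" if step > 0 else "south", (x, y)))
    hops.append(("processor", (x, y)))  # deliver once both offsets reach zero
    return hops

print(dimension_order_route((0, 0), (2, 1)))
# [('east', (1, 0)), ('east', (2, 0)), ('north', (2, 1)), ('processor', (2, 1))]
```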

29
Routing Mechanism (cont)
(Figure: processors P0–P3 connected through switches, illustrating the routing mechanisms below.)
  • Source-based
  • message header carries a series of port selects
  • used and stripped en route
  • CRC? Packet format?
  • CS-2, Myrinet, MIT Artic
  • Table-driven
  • message header carries an index for the next port at the next switch
  • o = R[i]
  • table also gives the index for the following hop
  • o, i' = R[i]
  • ATM, HIPPI

30
Properties of Routing Algorithms
  • Deterministic
  • route determined by (source, dest), not
    intermediate state (i.e. traffic)
  • Adaptive
  • route influenced by traffic along the way
  • Minimal
  • only selects shortest paths
  • Deadlock free
  • no traffic pattern can lead to a situation where no packets move forward

31
Deadlock Freedom
  • How can it arise?
  • necessary conditions
  • shared resource
  • incrementally allocated
  • non-preemptible
  • think of a channel as a shared resource that
    is acquired incrementally
  • source buffer then dest. buffer
  • channels along a route
  • How do you avoid it?
  • constrain how channel resources are allocated
  • e.g., dimension order
  • How do you prove that a routing algorithm is deadlock free?

32
Proof Technique
  • resources are logically associated with channels
  • messages introduce dependencies between resources
    as they move forward
  • need to articulate the possible dependences that
    can arise between channels
  • show that there are no cycles in the Channel Dependence Graph (CDG)
  • find a numbering of channel resources such that every legal route follows a monotonic sequence
  • => no traffic pattern can lead to deadlock
  • the network need not be acyclic, only the channel dependence graph

33
Example k-ary 2D array
  • Theorem: x,y routing is deadlock free
  • Numbering
  • +x channel (i,y) → (i+1,y) gets number i
  • similarly for −x, with 0 as the most positive edge
  • y channel (x,j) → (x,j+1) gets number N+j
  • similarly for −y channels
  • any routing sequence (x direction, turn, y direction) is increasing (checked in the sketch below)
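An illustrative check of the numbering argument: the sketch below numbers the channels of a k-ary 2D array as in the bullets (with N = k as an assumed offset for the y channels) and verifies that every x,y route follows strictly increasing channel numbers. The function names are hypothetical.

```python
def channel_number(src, dst, k):
    """Number channels of a k-ary 2D array so x-then-y routes are increasing.
    +x channel (i,y)->(i+1,y) gets i; -x gets k-1-i (0 is the most positive edge);
    y channels are offset by N = k so they always come after x channels."""
    (x1, y1), (x2, y2) = src, dst
    N = k
    if y1 == y2:                       # x channel
        return x1 if x2 > x1 else k - 1 - x1
    else:                              # y channel
        return N + y1 if y2 > y1 else N + (k - 1 - y1)

def xy_route(src, dst):
    """Dimension-order route: fix x first, then y; returns the list of hops."""
    hops, (x, y) = [], src
    while x != dst[0]:
        nxt = (x + (1 if dst[0] > x else -1), y)
        hops.append(((x, y), nxt)); (x, y) = nxt
    while y != dst[1]:
        nxt = (x, y + (1 if dst[1] > y else -1))
        hops.append(((x, y), nxt)); (x, y) = nxt
    return hops

# Check: every legal x,y route traverses channels with strictly increasing numbers.
k = 4
ok = all(
    all(channel_number(a, b, k) < channel_number(b, c, k)
        for (a, b), (_, c) in zip(route, route[1:]))
    for sx in range(k) for sy in range(k) for dx in range(k) for dy in range(k)
    if (route := xy_route((sx, sy), (dx, dy)))
)
print(ok)  # True => no cycle in the channel dependence graph for x,y routing
```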

34
Deadlock free wormhole networks?
  • Basic dimension-order routing techniques don't work for k-ary d-cubes
  • only for k-ary d-arrays (bi-directional)
  • Idea: add channels!
  • provide multiple virtual channels to break the dependence cycle (see the sketch below)
  • good for BW too!
  • Do not need to add links or xbar, only buffer resources
  • This adds nodes to the CDG; does it remove edges?
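One standard way to add virtual channels on a wraparound dimension is to switch to a second virtual channel when a packet crosses a fixed "dateline" link, which breaks the cyclic channel dependence that the wraparound creates. The sketch below illustrates this on a unidirectional ring; the function name and dateline position are assumptions, not from the slides.

```python
def ring_route_with_vcs(src, dst, k, dateline=0):
    """Route on a k-node unidirectional ring, switching from VC0 to VC1 when the
    packet crosses the dateline link (k-1 -> 0). This breaks the cyclic channel
    dependence that the wraparound link otherwise creates."""
    hops, node, vc = [], src, 0
    while node != dst:
        nxt = (node + 1) % k
        if node == k - 1 and nxt == dateline:   # crossing the wraparound link
            vc = 1                              # move to the higher virtual channel
        hops.append((node, nxt, vc))
        node = nxt
    return hops

print(ring_route_with_vcs(2, 1, 4))
# [(2, 3, 0), (3, 0, 1), (0, 1, 1)] -- the VC number never decreases along a route
```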

35
Breaking deadlock with virtual channels
36
Adaptive Routing
  • R: C × N × S → C
  • Essential for fault tolerance
  • at least multipath
  • Can improve utilization of the network
  • Simple deterministic algorithms easily run into bad permutations
  • fully/partially adaptive, minimal/non-minimal
  • can introduce complexity or anomalies
  • a little adaptation goes a long way!