Transcript and Presenter's Notes

Title: Communication in Distributed Systems


1
Communication in Distributed Systems
  • EECS 750
  • Spring 1999
  • Course Notes Set 2
  • Chapter 2
  • Distributed Operating Systems
  • Andrew Tanenbaum

2
Overview
  • Obvious
  • It can't be a distributed computation if there is
    no communication
  • The communication media and protocols have a
    profound influence on performance
  • Less Obvious
  • Communication media and protocols can take many
    forms
  • Multiprocessor (SMP) and Bus contention
  • Hyper-cube and link contention
  • Ethernet, CSMA/CD, and binary exponential back-off

3
Overview
  • Protocols have two major design influences
  • Communication medium properties
  • Application domain properties
  • Implication
  • Diversity of application domain properties has a
    dumbing down effect
  • Protocols are selected for a compromise between
    competing properties
  • Breadth of application support - the lowest
    common denominator
  • Efficiency of application support

4
Overview
  • Layered Protocols
  • Hierarchy preserves the sanity of humans
  • Standardization for interoperability makes
    distribution more feasible
  • Common denominator phenomena
  • TCP/IP created before international standards
  • ISO-OSI reference Model
  • Standard 7 layers; few if any use it literally
  • Fig 2-1, page 37 Tanenbaum
  • Design principle used in most distributed systems

5
Overview
  • Connection Oriented vs. Connectionless
  • Good for different application domains and use
    patterns
  • Reliable vs. Unreliable message delivery
  • Layering of protocols generally means layering of
    the packaging around the messages
  • New wrappers on messages as they descend
  • Stripping layers as they ascend
  • Fig 2-2 page 38
  • Principles also used for communication at the
    middleware and application layers

6
Protocol Layers
  • Protocol at each layer specialized to the duties
    of the layer
  • Client of the services provided by the layer
    below
  • Provider of the services used by the layer above
  • Continuum of protocols and relationships from the
    lowest physical layer to the highest layer in a
    distributed system
  • Major conceptual boundary between protocols
    provided by parts of the system implemented by
    different groups
  • OS vs. Middleware; Middleware vs. Application

7
Protocol Layers Major Divisions
  • Physical and Data link
  • Physical devices and device drivers
  • Network
  • Component connectivity (IP)
  • Transport
  • Message transfer from end to end (TCP)
  • Session
  • Application relevance and semantics
  • Middleware squeezes in below here but above
    transport

8
Protocol Layers Physical and Data link
  • Manages bit transfer across a physical link
  • Serial
  • Ethernet
  • ATM
  • PCI or VME bus
  • Physical and electronic magic
  • From our (my) point of view
  • Magic bit pipe
  • Services of physical devices for which we may
    write drivers but otherwise electronic wizards
    and catalogs provide these services

9
Protocol Layers Network
  • View of the set of links among several
    communicating hosts (IP)
  • Each host certainly knows with what other hosts
    it can communicate
  • It may or may not know about the connectivity of
    other hosts
  • This provides a view of the network
  • Connection or connectionless becomes meaningful
    only here
  • Connection is a commitment and/or concept which
    spans 2 or more host-host links

10
Protocol Layers Network
  • X.25 is a familiar connection oriented protocol
  • IP is a familiar connectionless protocol
  • Routing refers to the problem of deciding how a
    message is taken from its source to its
    destination
  • Link by Link (Hop by hop)
  • End to End
  • Static or Dynamic routing
  • Entire seminar on IP routing strategies and
    implications (Evans and Whiting)
  • Reliable or Unreliable connections

11
Protocol Layers Transport
  • Provides more complex (and often specialized)
    semantics by using services of the network layer
  • Reliable connections on top of an unreliable
    underlying network
  • TCP on top of IP does this
  • TCP does other things as well
  • Provides the notion of a connection from source
    to destination socket
  • Exercises flow control on the traffic
  • Based on assumptions that are not always true
  • UDP is a connectionless option at this layer (see
    the TCP vs. UDP sketch below)
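
A minimal sketch of the two transport options above, using Python sockets; the host and port values are illustrative assumptions, not part of the course material:

```python
# Sketch: connection-oriented (TCP) vs. connectionless (UDP) transport.
# Host/port values below are placeholders for illustration only.
import socket

def tcp_request(host="127.0.0.1", port=9000):
    # TCP: explicit connection, reliable in-order byte stream on top of IP.
    with socket.create_connection((host, port)) as s:
        s.sendall(b"ping")
        return s.recv(1024)

def udp_request(host="127.0.0.1", port=9001):
    # UDP: no connection; each datagram stands alone and may be lost or reordered.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(1.0)                  # loss must be handled by the application
        s.sendto(b"ping", (host, port))
        try:
            data, _ = s.recvfrom(1024)
            return data
        except socket.timeout:
            return None                    # request or reply was lost
```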

12
Protocol Layers Session
  • This layer and above is usually provided by the
    application code in most systems
  • Middleware (CORBA/DCOM) assuming this role in
    many cases over the last few years
  • Inserts a layer between transport and session
  • OR - creates a lowest sub-layer
  • This layer often provides memory and context
  • Basis of connection restoration for fault
    tolerance or mobility

13
Protocol Layers Session
  • Parallel Virtual Machine (PVM) is plausibly
    middleware providing a session to its users
  • CORBA name service and brokers provide name
    spaces and abstractions above simple connections
  • Host/network speed ratio was large for decades
  • OC-3 is roughly the bandwidth of basic SCSI
  • Network disks suddenly seemed sensible
  • Affects the feasibility of distributed systems
  • Increases application inter-operability
  • Decreases implementation effort

14
High Performance Networks
  • Recent (last 5 years or so) introduction of high
    performance networks has significantly affected
    distributed system design and popularity
  • ATM (155 Mb (OC-3) and 622 Mb (OC-12))
  • 100 Mb and 1Gb Ethernet
  • Orders of magnitude faster networks affect host
    software and hardware design

15
High Performance Networks Asynchronous Transfer
Mode
  • Introduced a significantly faster medium
  • Shifted ratio of host to network
  • Changes economics of distributed systems by
    shifting cost/performance ratio
  • Connection oriented
  • Switched Virtual Circuits (SVCs)
  • Single Network for multiple traffic types
  • Voice, data, video, sound, games
  • Economy of scale and common denominator makes it
    attractive to many without being best for any

16
High Performance Networks Asynchronous Transfer
Mode
  • Simplicity of cells and their processing made HW
    based cell switching possible
  • IP switching at Gb speeds emerging hot topic
  • IP directly on SONET
  • Following and responding to ATM challenge
  • Switching fabrics have multi-Gb bandwidth
  • Sub-microsecond cell switching time
  • (Port, VC) to (Port,VC) mapping in switches
  • VC multiplexing on physical links
  • Details in other classes - not usually relevant
    here

17
High Performance Networks Fast Ethernet
  • 100 Mb switched Ethernet has emerged as a
    significant competitor to ATM
  • Comparable speeds
  • Generally less expensive
  • Different name space for data transfer
  • Host to Host not Virtual Circuits
  • Medium often still shared
  • Among a small number of hosts
  • Compelling price/performance
  • Particularly at first network hierarchy level
  • Gb Ethernet emerging

18
High Performance Networks Implications for DS
  • Implications of ATM and Fast Ethernet for
    Distributed Systems are similar but not identical
  • Vast increase in bandwidth is major effect
  • 10 Mb shared to 155 Mb (622) switched
  • Consider the rough equivalence of OC-3 and
    SCSI-II
  • All disks on the network are created equal
  • With respect to bandwidth not latency
  • Faster disk and faster/wider SCSI restored a two
    tiered hierarchy

19
High Performance Networks Implications for DS
  • Speed vs. Latency
  • Vitally important when considering DS design,
    implementation, and evaluation
  • Different and both important
  • In many DS, message transmission time approaches
    medium latency (L)
  • Messages are small
  • Total Time = L(sender) + transit time + L(receiver)
  • Latency is bounded below by light-speed delay
  • Many people could have avoided a lot of effort if
    they assumed light speed

20
High Performance Networks Implications for DS
  • If it still won't work when you assume light
    speed
  • Try another approach
  • Difference between real and light speed latency
    is determined by the medium and the system
  • System latency often predominates
  • Not always
  • It is the part over which we have the most
    influence
  • New network technologies challenged system
    designers to reduce system latency
  • Increased network speed left OS bottleneck

21
High Performance Networks Implications for DS
  • Design Principle
  • Optimize the most significant component
  • Success shifts the focus and relevant design
    issues
  • Flow Control
  • Also significantly affected by new technology
  • Bandwidth-Delay product gives the required network
    buffering (see the worked example below)
  • Also reveals potential waste in interactive apps
  • Significant increase in buffering a good example
    of the somewhat unexpected stresses on hosts
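
As a worked illustration of the bandwidth-delay product, the sketch below computes how much data is "in flight" and must be buffered; the link rates and round-trip time are assumed values, not measurements from the course:

```python
# Sketch: bandwidth-delay product = data in flight that must be buffered
# for possible retransmission. The rates and 50 ms RTT are assumptions.
def bandwidth_delay_product(bandwidth_bits_per_s, rtt_s):
    return bandwidth_bits_per_s * rtt_s / 8          # bytes in flight

print(bandwidth_delay_product(155e6, 0.050))         # ~969 KB on an OC-3 WAN path
print(bandwidth_delay_product(10e6, 0.050))          # ~62 KB on 10 Mb shared Ethernet
```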

22
High Performance Networks Implications for DS
  • Protocol Adjustment
  • Motivated by new network technologies and new
    classes of applications
  • Active Networks - dynamic customization
  • Cell level pacing
  • Earliest packet discard
  • Differential Service in IP - Diffserv
  • Some believe IP already perfect (fewer and fewer)
  • Adaptations of IP a current area of frenzied
    research

23
Client/Server
  • Currently the most popular and ubiquitous method
    of implementing distributed applications
  • Critics and alternatives emerging
  • Not transparent to location and identity
  • But still useful
  • Fig 2-7 page 51 Tanenbaum
  • Request-Response style application-level protocols
    (sketched below)
  • Note that the client must know the server
  • (Host-IP, Port) is a unique ID in name space
  • INETD generalizes this wrt port number
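
A minimal request-response sketch over a (Host-IP, Port) pair, assuming Python TCP sockets; the address, port, and one-shot protocol are illustrative only:

```python
# Sketch: request-response client/server over a (Host-IP, Port) name.
import socket

ADDR = ("127.0.0.1", 9100)        # the client must know this (host, port) pair

def server_one_request():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(ADDR)
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            request = conn.recv(1024)                  # request
            conn.sendall(b"REPLY:" + request)          # response

def client_request(payload=b"GET time"):
    with socket.create_connection(ADDR) as s:
        s.sendall(payload)
        return s.recv(1024)
```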

24
Client/Server Name Servers
  • Name servers generalize the host requirement
  • CORBA name service a good example
  • Creates a new name space
  • Name servers map between name spaces
  • Client must know the name server ID in the basic
    name space, but no other
  • Fig 2-10 page 57 Tanenbaum
  • Different physical machines can play a given role
    at different times
  • Service oriented names (www.altavista.com) can be
    mapped to different machines as required

25
Client/Server Canonical Problems
  • Name allocation and advertising
  • How does a client know the name of the name
    server and how to locate it
  • Servers can broadcast
  • Scalability of a given name space and server
    approach
  • LAN to WAN and World-WAN
  • Classic problems for distributed systems
  • Many management functions at the OS and
    middleware levels have a strong client-server
    flavor

26
Client/Server Blocking vs. Non-Blocking
  • Client-Server interaction semantics
  • Blocking is synchronous
  • Non-Blocking is asynchronous (both are sketched
    below)
  • Fig 2-11 page 59 Tanenbaum
  • Semantic constraint
  • Sender cannot change a message buffer until it is
    safe
  • Easy to determine for synchronous operations
  • Harder for asynchronous operations
  • Signal for buffer release is a form of
    synchronization
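
A small sketch of the blocking vs. non-blocking distinction using Python sockets (one possible illustration, not the only form these primitives take); note that even the blocking send only guarantees the bytes were handed to the OS, which is exactly the buffer-safety point above:

```python
# Sketch: blocking (synchronous) vs. non-blocking (asynchronous) send.
import socket

def send_blocking(sock: socket.socket, data: bytes):
    sock.setblocking(True)
    sock.sendall(data)            # returns only after the OS has accepted every byte

def send_nonblocking(sock: socket.socket, data: bytes) -> int:
    sock.setblocking(False)
    try:
        return sock.send(data)    # may accept only part of the buffer
    except BlockingIOError:
        return 0                  # kernel buffers full: nothing accepted, retry later
```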

27
Client/Server Blocking vs. Non-Blocking
  • Design Principle
  • Weakening the semantics of an operation generally
    increases its performance by decreasing the
    overhead
  • Separate issues of buffer safety and message
    transmission
  • Transmission vs. Delivery distinction
  • Weakening synchronization to buffer safety
  • Addresses the real constraint
  • Increases available concurrency
  • Potentially increases system performance

28
Client/Server Blocking vs. Non-Blocking
  • Semantic choices interact
  • System call blocking semantic choices
    (strong to weak)
  • Message delivery
  • Message transmitted
  • Message request noted by OS
  • Message buffer safety interacts with return
    semantics
  • Copying the message into the OS address space
    makes the user buffer safe earlier
  • Overhead cost

29
Client/Server Buffered vs. Unbuffered
  • Where is the information buffered
  • Effect on communication semantics
  • Effect on system overhead
  • Interactions with other aspects of user
    computation semantics
  • Signals used to announce message buffer safety or
    message delivery may create critical sections
  • Basic choices
  • Block until delivery or sent
  • Block and copy to OS
  • Non-blocking and use user buffer

30
Client/Server Buffered vs. Unbuffered
  • This generally considers send synchronization
  • Receive synchronization can also matter
  • Block until message arrives (common default)
  • Return if no message available
  • Design tradeoffs
  • Cost vs. Ability
  • Simplicity vs. Ability
  • OS provides receive buffers to avoid interrupting
    the process when a message arrives
  • Pays the price in overhead of space and copy

31
Client/Server Buffered vs. Unbuffered
  • What choice is best depends on required and
    desirable semantics vs. cost
  • Distributed systems are often willing to cope
    with less forgiving and restricted semantics to
    gain performance
  • Both approaches are best in some cases
  • Most systems provide several sets of
    communication primitives with varying semantics
  • Most systems default to buffered approach
  • Important to know what choices are being made for
    you

32
Client/Server Reliable vs. Unreliable
  • Underlying physical networks are all subject to
    different rates of message loss
  • Common unit used: bit error rate
  • A bit error usually means a lost packet at the
    transport (TCP) level
  • High rate of uncorrelated errors would make
    simple error correction encoding attractive
  • Digital radios are an example
  • Note the interaction among abstractly separate
    layers
  • Error rate at low level may affect higher layer

33
Client/Server Reliable vs. Unreliable
  • Design Issue
  • Should the programming interface provide reliable
    or unreliable communication primitives

34
Client/Server Reliable vs. Unreliable
  • Design Issue
  • Should the programming interface provide reliable
    or unreliable communication primitives
  • It Depends
  • Semantics of applications differ
  • Importance of performance vs. loss vs. latency
  • UDP for high performance, low latency, loss
    tolerant
  • TCP for reliable in-order delivery
  • Multi-media stresses low latency over low loss

35
Client/Server Design Choices Summary
  • Competitive relations among
  • Ease of use
  • Bandwidth
  • Latency
  • Efficiency
  • Generality
  • Figure 2-14 page 65 Tanenbaum
  • System Programmers Full Employment Policy
  • No final best solution
  • Tradeoffs always shift with 10s of factors

36
Remote Procedure Call
  • Extends Client-Server by concealing distribution
    and the associated communication at the
    programming interface
  • Everything looks like a procedure call
  • Unifies centralized and distributed computing at
    the programming interface level
  • Approaches the goal of a Uniform and Universal
    Virtual Machine
  • Everyone (almost) shares this goal in general
  • Not everyone shares this goal in detail
  • Cost makes the difference

37
Remote Procedure Call
  • RPC layer shoves the distribution down a layer in
    the programming model
  • Generalizes the procedure call
  • Early effort to support distribution
  • Great but many subtle implications that affect
    cost, utility, applicability
  • Excellent example of how distribution is almost
    always harder than it first appears

38
Remote Procedure Call Advantages
  • Write a program in a normal and conventional way
  • All procedure calls are written the same in the
    bulk of the code
  • Declarations of procedures may or may not
    identify them as remote
  • Promotes portability in the sense that the same
    code can be mapped onto a range of distribution
    scenarios
  • Several choices about how to describe
    distribution and the assignment of servers for
    specific procedures - both compile and run time
    possible

39
Remote Procedure Call Disadvantages
  • Operations can be done in arbitrary places
  • Advantage and Disadvantage
  • Different hardware
  • Different instructions mean complex compilation
    control
  • Different data formats mean format tracking and
    conversion overhead
  • Different Address Spaces
  • How much of the address space do you need

40
Remote Procedure Call Disadvantages
  • Parameters and additional context must be
    transferred from client to server
  • Determining the required context can be difficult
  • Too little: incorrect computation semantics
  • Too much: unnecessary overhead
  • Implications of the programming model
  • Pure Functional Programming which uses only
    passed parameters with no side effects
  • Other paradigms may effectively require the whole
    address space

41
Remote Procedure Call Basic RPC
  • Goal is for RPC to be transparent
  • Everything is a procedure call - local or remote
  • Issues arise from unifying interface model
  • Procedure calls
  • Call stack
  • Regular procedure calls transfer control within a
    single address space
  • System calls use the same model but transfer
    across user-system address space boundary
  • Requires special code in the system call to
    transfer parameters and give OS access to all
    necessary context

42
Remote Procedure Call Basic RPC
  • RPC needs to preserve the procedure call
    semantics
  • On both the client and the server
  • Send the procedure context to the server
  • Return the results back to the client
  • Language semantics have a huge effect on context
    and how hard it is to figure out
  • Call by value vs. Call by reference
  • Multiple copies and resulting coherence issues
  • Implications have implications in a ripple effect

43
Remote Procedure Call Basic RPC
  • Implementation through stubs (sketched below)
  • Figure 2-18 page 71 Tanenbaum
  • Client stub
  • Represents client side duties of the RPC
  • Assembles RPC context (marshalling)
  • Receives results (demarshalling)
  • Server stub
  • Unpacks context
  • Calculates
  • Packs results
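
A minimal stub sketch, assuming JSON marshalling over a request-response socket (not the wire format of any particular RPC system); the procedure name, addresses, and framing are illustrative:

```python
# Sketch: client stub marshals arguments, server stub demarshals, calls, replies.
import json, socket

def client_stub_add(a, b, server=("127.0.0.1", 9200)):
    # Looks like a local call to the caller, but packs the context into a message.
    request = json.dumps({"proc": "add", "args": [a, b]}).encode()
    with socket.create_connection(server) as s:
        s.sendall(request)
        reply = json.loads(s.recv(4096))               # demarshal the result
    return reply["result"]

def server_stub(conn: socket.socket):
    # Unpacks the context, calls the real procedure, packs and returns the result.
    # A single recv() is assumed to hold the whole request in this sketch.
    request = json.loads(conn.recv(4096))
    procedures = {"add": lambda args: sum(args)}
    result = procedures[request["proc"]](request["args"])
    conn.sendall(json.dumps({"result": result}).encode())
```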

44
Remote Procedure Call Basic RPC
  • Consider Client/Server implementations of system
    services using RPC interfaces
  • Generality and portability in support of a wide
    range of distribution scenarios
  • Application and/or analogy to micro-kernel
  • Multiple designs/models/presentations of a given
    service provided by the procedure
  • Analogy to overloading in object-oriented systems
  • Compile time or run time selection of service
    options

45
Remote Procedure Call Parameter Passing
  • Client stub receives parameters in the normal way
  • It is, after all, a procedure
  • It collects context and packs it into a message
  • Sends message
  • Data format conversion may take place on either
    side
  • Data representation standards - canonical format
  • CORBA data interchange format
  • RPC components written by different people
  • Stub generation by compiler
  • Commonly automates marshalling code

46
Remote Procedure Call Parameter Passing
  • Automation and run-time decisions add to
    distribution transparency
  • Pointers!!!!
  • The entire address space is the context!
  • Dynamic binding
  • Clients find servers at run time
  • Clients determine context at run time
  • CORBA addresses a lot of these issues
  • Subsumes and generalizes RPC in the object
    oriented framework

47
Failure Handling
  • Making the distribution of the system or
    application transparent is hard
  • Handling some failures and masking the occurrence
    of others is one reason
  • Failures particular to distributed (RPC)
    situation
  • Cannot find server
  • Lost request client to server link
  • Lost reply server to client link
  • Client crashes after sending request
  • Server crashes after receiving request
  • Good canonical set of failures covering most types

48
Failure Handling Cannot Find Server
  • Maybe the server is down
  • Dynamic server selection attractive
  • Distribution introduces new types of errors
  • Wider range of operations is done remotely
  • Return value of -1 and setting errno not really
    expressive enough any more
  • Signals and exceptions are candidates for
    expanding error handling semantics
  • Motivation for language extension
  • transparency challenge

49
Failure Handling Lost Request
  • Easiest error to understand and handle
  • Client timeout is the common, obvious remedy
    (sketched below)
  • Confusion of message latency and loss
  • Network or slow server latency interpreted as
    loss or server unavailable
  • ITTC AMD times out before ATM support for NFS
    works so AMD decides ATM broken
  • Timeouts are a trap wrt graceful degradation
  • A system slows as it handles more
  • Suddenly timeouts expire and everything is
    declared broken
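
A sketch of the timeout-and-retransmit remedy, assuming a connected datagram (e.g., UDP) socket; the timeout and retry counts are arbitrary, and the comments flag the latency/loss confusion described above:

```python
# Sketch: client timeout with retransmission. A slow server or network is
# indistinguishable from a lost request, so retries may create duplicates.
import socket

def request_with_retries(sock: socket.socket, data: bytes,
                         timeout_s: float = 0.5, retries: int = 3) -> bytes:
    sock.settimeout(timeout_s)
    for _ in range(retries):
        sock.send(data)                      # (re)transmit the request
        try:
            return sock.recv(4096)           # reply arrived in time
        except socket.timeout:
            continue                         # lost? or merely slow?
    raise TimeoutError("declared unavailable (may only have been slow)")
```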

50
Failure Handling Lost Reply
  • Much harder
  • Obvious solution is another timer
  • Could just be a slow server
  • Server must be written to handle duplicate
    requests unless the supporting network gives
    explicit loss indication
  • Client must be written to handle duplicate
    replies
  • Client-Server protocol should distinguish lost
    replies and a slow server
  • Semantics of the operation are involved
  • Some can be repeated, some cannot, as some have
    side effects

51
Failure Handling Lost Reply
  • Idempotent
  • Denotes an operation that can be repeated without
    constraint
  • Distributed system designers thus wish to
    maximize the number of idempotent operations
  • Cannot be universal (e.g., bank transfers)
  • Protocol encapsulating client-server interaction
    handling repeats and race conditions
  • Sequence numbers on messages help (see the sketch
    below)
  • Flag denoting a repeated request
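
A sketch of duplicate-request suppression with per-client sequence numbers, so a non-idempotent operation is executed once and repeated requests are answered from a cached reply; the class structure is hypothetical:

```python
# Sketch: server-side duplicate detection using (client_id, sequence number).
class DedupServer:
    def __init__(self, handler):
        self.handler = handler            # the real (possibly non-idempotent) operation
        self.last = {}                    # client_id -> (seq, cached_reply)

    def handle(self, client_id, seq, request):
        seen = self.last.get(client_id)
        if seen is not None and seen[0] == seq:
            return seen[1]                # repeat: resend the old reply, do not redo work
        reply = self.handler(request)     # first arrival: execute for real
        self.last[client_id] = (seq, reply)
        return reply
```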

52
Failure Handling Server Crash
  • Difficulty here is related to idempotency
  • Sequence numbers are not enough to fix it
  • Consider Fig 2-24 page 82 Tanenbaum
  • Server crashes before fully serving request (b)
  • Crashes after serving but before replying (c)
  • Handled very differently
  • (b) requires the server to redo a partially
    completed (possibly ATOMIC!) transaction
  • (c) merely requires repeating the reply
  • BUT the client cannot tell the difference

53
Failure Handling Server Crash
  • Different Semantic Categories
  • At Least Once
  • Guarantees RPC completes at least once
  • Maybe more than once
  • At Most Once
  • Guarantees RPC done at most once
  • Maybe zero
  • No Guarantees
  • Server provides no help and no promises
  • Luck of the draw mostly still works OK

54
Failure Handling Server Crash
  • Exactly Once
  • Most desirable but generally not possible
  • What can a crashed server know about what it did
    and did not do?
  • Data Base perspective helps
  • Atomic transactions
  • Commit and rollback concepts
  • Raises the cost of RPC
  • Increases the granularity of distribution
  • Pretty emphatically violates transparency of
    distribution

55
Failure Handling Client Crash
  • Client crashes after making a request
  • Creates an orphaned computation on server
  • Orphans waste server time and decrease
    concurrency
  • Further descendants are possible
  • Grand orphans and so on
  • Nested RPC
  • Four solutions proposed by Nelson (1981) are
    instructive wrt complication of designing
    distributed algorithms

56
Failure Handling Client Crash
  • Solution 1: Transaction Perspective
  • Log requests on a medium that survives crashes
  • Synchronous disk writes
  • Delete orphans on recovery
  • Called extermination
  • Huge overhead of logging every RPC
  • Hard to ensure killing grand orphans
  • Finding them
  • Ability to tell their server to kill them
  • Big time delay during reboot

57
Failure Handling Client Crash
  • Solution 2: Epochs
  • Known as reincarnation
  • Rebooting clients increment epoch number thus
    dividing the time line into periods between
    system boots by epoch number
  • Broadcast to servers as they establish relations
  • Servers delete requests associated with previous
    epochs
  • Replies are tagged with epoch number making
    obsolete ones easy to eliminate

58
Failure Handling Client Crash
  • Solution 3: Gentle Reincarnation
  • Epoch number from booting host broadcast
  • Each machine, client or server, sees if it has
    any remote computations
  • Only if the owner of the remote computation
    cannot be found is the computation killed
  • Recycle and reconnect remote computation
    components

59
Failure Handling Client Crash
  • Solution 4: Expiration
  • Each RPC is given a standard amount of time
  • Over-run requires explicit request of another
    quantum T
  • If a client crashes, it waits T before rebooting
  • All orphans are thus assured of having timed out
  • Choosing T is a compromise between throughput and
    latency as well as other forms of overhead

60
Failure Handling Summary
  • In practice none of these is all that attractive
  • They illustrate the complications of doing
    distributed systems
  • Correctly
  • Efficiently
  • Transparently
  • Killing orphans may cause problems
  • Did they hold locks?
  • Orphan side effects - non-idempotent

61
Implementation Performance
  • Details determine the fate of a distributed
    system
  • Implementation cost in time and resources
  • Performance of the system produced relative to
    optimal performance for the applications
  • Note that optimal performance does not
    necessarily mean maximum performance
  • Semantics problems increase overhead
  • Custom vs. Existing Protocols
  • Ease of using existing protocols
  • Lower efficiency due to lack of specialization
  • Specialization is brittle wrt design assumptions

62
Implementation Performance Protocol Latency
  • Tradeoff between positive and negative cases
  • Successful communication delayed by overhead of
    protocol features supporting error recovery
  • Error recovery delayed by lack of support
  • Goal: Minimize both overhead on successful cases
    and delay in recovering from errors
  • Inventing perpetual motion would be good too
  • Most of the time these goals compete
  • Expected Delay is analogous to expected value
  • Important statistical measure
  • So are minimum and maximum

63
Implementation Performance Critical Path
Analysis
  • General design and analysis principles
  • Optimize the subsystem(s) executed most
  • Figure 2-26 page 89 Tanenbaum
  • Illustrates RPC critical path - client to server
  • Optimize the biggest Component
  • Usually data copying is the biggest component
  • It Depends on the load assumed
  • Copying increases with data volume (obvious)

64
Implementation Performance Critical Path
Analysis
  • Scatter Gather
  • Network protocols and/or interface hardware can
    handle the layered wrappers that protocol layers
    add to data packets even when the pieces sit in
    disjoint memory locations (see the sketch below)
  • Major (original) motivation for inventing Mbuf
    data structure
  • Separates management information from data buffer
    - separate pools
  • Offset Counter
  • Double pointer set next and continue
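
A sketch of scatter-gather I/O on POSIX sockets: the header and data buffers stay in disjoint memory and the kernel gathers them into one message, avoiding a user-level copy; the fixed-size framing is an assumption for illustration:

```python
# Sketch: scatter-gather send/receive with separate header and payload buffers.
import socket

def send_with_header(sock: socket.socket, header: bytes, payload: bytes) -> int:
    # sendmsg() takes a list of buffers (an iovec) and transmits them as one message.
    return sock.sendmsg([header, payload])

def recv_into_parts(sock: socket.socket, header_len: int, max_payload: int):
    # recvmsg_into() scatters the incoming message across separate buffers.
    header = bytearray(header_len)
    payload = bytearray(max_payload)
    nbytes, *_ = sock.recvmsg_into([header, payload])
    return bytes(header), bytes(payload[:max(0, nbytes - header_len)])
```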

65
Implementation Performance Critical Path
Analysis
  • Role of Virtual memory
  • Twiddling memory maps and pointers is often much
    faster than copying the data
  • Common way to avoid copying across system-user
    address space boundary
  • Motivation for separating buffer from header
  • Always (almost) a break-even packet size
  • Should be determined by measurement
  • Bottom line: Does copying or administrative magic
    cost more under various operating conditions
  • Faster networks push current system limits

66
Implementation Performance Timers
  • Differences in time scales among critical path
    components can introduce HUGE latencies
  • Delays in the slower component leave unused gaps
    in the utilization of the faster component
  • Timeouts
  • Generally very long by the time scale of the
    fastest associated component
  • TCP timeout commonly 500 ms
  • Assumption: most timeouts do not occur, so setting
    and unsetting them is overhead
  • With TCP over ATM, even one timeout can lower
    effective throughput

67
Implementation Performance Timers
  • Increase in network speeds is changing the time
    scale that matters for system software design and
    performance
  • Common granularity of 10 ms is increasingly
    coarse compared to scale of system components
  • DEC UNIX lowered granularity to 1 ms
  • Heart Beat timing method
  • Periodic timer interrupt (10 ms or 1ms)
  • Each tick executes timer interrupt service
    routine which increments software clock
  • Also checks for events which become active

68
Implementation Performance Timers
  • Heartbeat method has several implications
  • Scheduled event times are quantized to heartbeat
    scale (twice tick size)
  • At least next tick after desired time
  • Measured times are either limited to tick scale
    or must use another method
  • Time stamp counter in Pentium and Alpha offers a
    CPU clock tick counter
  • Some systems use both
  • Provide finer granularity for time of day and
    time stamps
  • Not interrupt driven time scales

69
Implementation Performance Timers
  • Finer time scale for scheduled events also
    desired
  • Each interrupt adds overhead
  • Decreasing tick size from 10 ms to 1 microsecond
    increases overhead by a factor of 10^4
  • Faster networks may need finer grain time scale
  • 100 Mb/s network - 12 MB/s - 120 KB/10 ms
  • 6 MB in 500 ms TCP timeout
  • Tough problem
  • Motivation for finer granularity also urges low
    overhead for setting and unsetting timer events

70
Implementation Performance Managing Scheduled
Events
  • Managing a set of scheduled events is a basic
    system service
  • Two common methods: timeout list and process
    table sweep
  • Timeout list
  • List of events sorted by ticks until event
  • Insertion and deletion require O(N) time on a
    linked list implementation
  • Latest Linux uses a heap (sketched below)
  • Tick ISR decrements front element
  • Event occurs when it reaches zero
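
A sketch of the timeout list kept as a heap rather than a sorted linked list, so insertion costs O(log N); tick() stands in for the periodic timer interrupt service routine:

```python
# Sketch: scheduled-event queue ordered by expiry tick, driven by a periodic tick.
import heapq, itertools

class TimerQueue:
    def __init__(self):
        self.heap = []                    # (expiry_tick, seq, event) entries
        self.now = 0                      # software clock, in ticks
        self.seq = itertools.count()      # tie-breaker for equal expiry ticks

    def add(self, delay_ticks, event):
        heapq.heappush(self.heap, (self.now + delay_ticks, next(self.seq), event))

    def tick(self):
        # Called on every heartbeat: advance the clock and fire expired events.
        self.now += 1
        fired = []
        while self.heap and self.heap[0][0] <= self.now:
            fired.append(heapq.heappop(self.heap)[2])
        return fired
```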

71
Implementation Performance Managing Scheduled
Events
  • Sweep method
  • Events are associated with PCB of process
  • Sweep the PCB table every quantum
  • Assume coarse granularity, single timer per
    process
  • Principle: optimize the most common case
  • Comparison
  • Timeout list minimizes tick ISR work and permits
    finer granularity, at the cost of insertion overhead
  • Sweep minimizes event creation but is more limited

72
Implementation Performance Design Principles
  • Always measure
  • Even the smartest are regularly surprised
  • Always easy to understand after you know it is
    happening
  • All solved problems are trivial - J. Hilliard
  • Optimize the system components in reverse order
    of size in the parameter being optimized
  • Expected Value Technique
  • Use probability of an execution path to guide
    optimization
  • E.G. Optimistic Concurrency Control

73
Implementation Performance Other Issues
  • RPC implemented on client/server sockets is a
    common approach
  • Design Challenges
  • Competition between transparency and acceptable
    overhead
  • Global Variables are still common making
    determination and handling of sending RPC context
    to the server really hard
  • Errno is part of the POSIX standard
  • Full transparency would require errno to always
    be set correctly and in the same ways as on a
    uni-processor system

74
Implementation Performance Other Issues
  • Design Challenge
  • Extending uni-processor behavior to
    multi-processor and distributed multi-computer
  • Hard enough to correctly make the published
    standards transparent
  • True transparency requires reproducing the
    undocumented behavior
  • This is the reason programs can work correctly
    for the wrong reasons
  • Good SWE and portability means you should always
    keep it simple and standard

75
Implementation Performance Compiler and
Language Support
  • Compilers and operating systems implement
    different aspects of the programming model
  • Very Important Issue and Conference
  • Architectural Support for Programming Languages
    and Operating Systems (ASPLOS)
  • Strong typing provides rich information on data
    structures and operations under RPC
  • More information promotes better efficiency and
    transparency when marshalling
  • Language design strongly influences what is
    required for the context of an RPC

76
Group Communication
  • Question
  • How many fundamentally different kinds of
    distributed computing scenarios exist?
  • No one is sure - everyone is pretty sure nobody knows
    the answer
  • Question is also too broad considering several
    categories of things together that can be
    contemplated separately
  • Better Question
  • How many different types of communication would
    the full range of distributed computing scenarios
    require as support

77
Group Communication Types of Communication
  • RPC
  • Illustrates many important principles
  • Limited to two parties with a fairly strict
    protocol limiting exchanges between them
  • Distributed computing includes scenarios with
    many cooperating components
  • Example: a set of cooperating file servers
    providing transparent fault tolerant service
    based on redundancy and active backups
  • File server communication patterns will not map
    onto pair-wise client server or RPC very well

78
Group Communication Types of Communication
  • RPC and client/server are point-to-point
    communication modes
  • One-to-one communication is important
  • One-to-many communication is also important in
    many distributed computing situations
  • Especially the more complicated ones
  • Dynamic communication group membership
  • Implementation of group communication and its
    efficiency depends on properties of the
    underlying network and hardware
  • Hardware properties and support vary a lot

79
Group Communication One to Many Communication
  • Several forms of group communication differing in
    implementation and semantics issues
  • Group membership and naming
  • Closed or Open Group
  • Peer or Hierarchy
  • Group Management
  • Addressing and IPC primitives
  • Atomicity
  • Message ordering
  • Scalability

80
One to Many Communication Multicast
  • Broadcast is multiple copies of a message to all
    machines
  • Multicast is multiple copies of a message to a
    specific set of machines
  • Obvious advantage in network overhead
  • Obvious overhead in managing the group
  • Network layer support is the best way although
    user level libraries exist
  • Hardware support better
  • Network properties have a significant influence
  • Consider Ethernet vs. ATM

81
One to Many Communication Closed vs. Open Group
  • Closed
  • Communication only among group members
  • Open
  • Outside processes may send messages to the group
    as well
  • Choice is controlled by application level issues
    motivating the group communication
  • Set of players in an interactive virtual
    environment is plausibly closed
  • Set of servers providing fault tolerant services
    would not be closed

82
One to Many Communication Peer vs. Hierarchic
Organization
  • Group authority and responsibility relationships
    affect message
  • Destination
  • Processing
  • Overhead
  • Master/Slave structure might have all messages
    flow through the master
  • Single point of control
  • Single point of failure
  • Fault tolerant servers might use peer
  • Control more complex ("After you." "No, after you.")

83
One to Many Communication Group Management
  • Group server is required
  • Track group membership
  • Manage group communication at some level
  • Single server is a single point of failure
  • Multiple servers must exchange information and
    handle distributed data with multiple copies
  • Enter and Leave Semantics
  • Handle crashes and thus incarnations of group
    members
  • Resolve distributed server conflicts

84
One to Many Communication Group Management
  • Synchronization of messages and process
    incarnations
  • Must a process receive messages sent before it
    left?
  • A process certainly must not receive messages
    sent before it joined or after it left
  • VERY IMPORTANT QUESTION (Chapter 3)
  • What clock decides when a member leaves and thus
    what before and after mean?
  • Group recovery with multiple crashes
  • No general solution; it depends on application
    and situation semantics

85
One to Many Communication Group Addressing
  • Group name space
  • Name space of components which form groups
  • Name space of groups to which messages may be
    sent
  • Not all implementations are created equal
  • True Multicast: one send, specific receivers
  • Broadcast: one send, all receive but the OS
    discards unwanted copies
  • Unicast: multiple sends (contrasted with multicast
    in the sketch below)
  • Figure 2-33 page 104 Tanenbaum
  • What kind and amount of overhead and who pays
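
A sketch contrasting true multicast (one send, fanned out by the network) with unicast fan-out (one send per member); the group address and port are assumed values:

```python
# Sketch: multicast vs. unicast fan-out for one-to-many communication.
import socket

GROUP = ("224.1.1.1", 9300)       # an IP multicast group address (assumed)

def send_multicast(message: bytes):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        s.sendto(message, GROUP)                 # one send; the network fans it out

def send_unicast_fanout(message: bytes, members):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        for host, port in members:               # N sends for N members
            s.sendto(message, (host, port))
```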

86
One to Many Communication Group Addressing
  • Predicate Addressing
  • Message contains a predicate evaluated by the
    destination which evaluates to keep or discard
  • Automatically enables dynamic group addressing
  • Increases the importance of the name space
    describing the components
  • Capability and difficulty of forming a predicate
    specifying any set of components
  • Interesting power and flexibility for
    sophisticated distributed computing scenarios
  • Restricted form of active networks

87
One to Many Communication IPC Primitives
  • RPC is too restrictive
  • Various forms of send and receive are common
  • Modifications as required to change or constrain
    the communication semantics
  • Design Decisions
  • What part implements the new semantics
  • What interface changes are implied
  • Another motivation for active networks
  • Transparency
  • Same primitives using sockets
  • Socket attributes choose Point-to-Point or
    Multicast

88
One to Many Communication Atomicity
  • All-or-Nothing property of group communication
  • Either every destination receives the message or
    none do
  • Makes implementing many distributed algorithms
    much easier
  • Otherwise the application layer must deal with
    more common and complex inconsistencies
  • Harder than it seems
  • As usual

89
One to Many Communication Atomicity
  • ACK overhead to ensure reliable delivery
  • Failure tolerance
  • Crashed Senders and Receivers
  • Receive and Forward Strategy (sketched below)
  • Every component receiving a message for the first
    time forwards it to all other group members
  • All surviving members guaranteed to get message
  • Bad News Overhead
  • Roughly N^2 messages for N group members
  • Worse news
  • This is actually Good, or at least not bad
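
A sketch of the receive-and-forward strategy: forwarding on first receipt guarantees all surviving members see the message, at the roughly N^2 message cost noted above; the send callback and identifiers are hypothetical:

```python
# Sketch: every member re-forwards a message the first time it sees it.
class ForwardingMember:
    def __init__(self, my_id, members, send):
        self.my_id = my_id
        self.members = set(members)          # ids of all group members
        self.send = send                     # send(dest_id, payload): point-to-point
        self.seen = set()                    # message ids already handled

    def receive(self, msg_id, msg):
        if msg_id in self.seen:
            return                           # duplicate: already delivered and forwarded
        self.seen.add(msg_id)
        for dest in self.members - {self.my_id}:
            self.send(dest, (msg_id, msg))   # forward on first receipt
        self.deliver(msg)

    def deliver(self, msg):
        pass                                 # hand the message to the application
```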

90
One to Many Communication Message Ordering
  • Two properties make group communication much
    easier for implementing distributed computations
  • Atomic Broadcast - all or none reception
  • Ordered message delivery
  • Problem
  • Messages sent across networks incur delay
  • Messages from various sources to various
    destinations
  • A given set of messages may arrive at a given
    destination in a different order
  • Figure 2-34 page 108 Tanenbaum

91
One to Many Communication Message Ordering
  • Consider a distributed transaction (A Bank)
  • Message (deposit, withdrawal) order is important
  • Best Solution
  • Instantaneous delivery to all destinations
  • Right after perpetual motion
  • Global Time Ordering
  • Messages are delivered in the order they were
    sent according to a global clock
  • Einstein demonstrated that no such absolute
    global time exists

92
One to Many Communication Message Ordering
  • A DS could implement an Absolute time ordering by
    imposing synchronization and a global heartbeat
  • Constrains concurrency
  • Slows things down - in general
  • Consistent Time Ordering
  • Relaxes global time constraint (unneeded)
  • All messages arrive in the same order at all
    group members
  • Arrival order may not match global time
  • Still hard and weaker semantics are still useful

93
One to Many Communication Scalability
  • Many approaches to communication work well for
    small groups but not for large groups
  • They are still good for distributed computations
    with a modest number of components
  • Consider the implications of a cast of thousands
  • Obvious increase in the number of messages
  • Subtle change in network properties and
    assumption validity moving from LAN to WAN
  • Multicast aware gateways and routers required
  • Computational complexity and centralization often
    constrain scalability

94
Group IPC in ISIS
  • Project explored question of what basic
    communication primitives are required to support
    distributed computing
  • ISIS is a toolkit for building distributed
    applications
  • Horus was its successor and both commercialized
  • Ensemble is the successor to Horus and free
  • ISIS is a set of programs, utilities, and library
    code that run on top of common OS platforms
  • Implements several types of synchronization
  • Illustrates tradeoff of semantic strength against
    implementation and execution cost

95
Group IPC in ISIS Synchronization Variations
  • Synchronization variants differ in semantic
    strength and implementation cost
  • Synchronous
  • Every event happens strictly separately,
    sequentially, and instantaneously
  • Impossible to build - weaker semantics required
  • Loosely Synchronous
  • Events take finite time and appear in the same
    order to all group members
  • Possible but expensive to build
  • Weaker semantics are cheaper and often sufficient

96
Group IPC in ISIS Synchronization Variations
  • Virtually Synchronous
  • Relaxes the in order delivery constraint
  • Carefully so that the variation in message
    delivery order at different receivers will not
    matter under specific criteria
  • Causally Related Events
  • Two events are causally related if the nature or
    behavior of the second may have been influenced
    by the first
  • Delivery order of causally related events
    matters, that of unrelated events does not

97
Group IPC in ISIS Virtually Synchronous
  • Unrelated events are concurrent
  • Causally related events are delivered in order
  • Others may or may not be
  • Weakening semantics
  • Decreases overhead
  • Increases concurrency
  • Opens the opportunity, at least, for improved
    performance
  • Caution
  • Increase in concurrency must be used well for
    performance to improve

98
Group IPC in ISIS ISIS Primitives
  • Three variations on group communication
  • ABCAST
  • Loosely synchronous communication for data
    transfer to a group
  • CBCAST
  • Virtually synchronous communication for data
    transfer to a group
  • GBCAST
  • Loosely synchronous and similar to ABCAST
  • Used to manage group membership not data transfer

99
Group IPC in ISIS ISIS Primitives
  • ABCAST
  • Originally used a two-phase commit protocol
    (sketched below)
  • Guarantees in order delivery
  • Complex and Expensive
  • Sender timestamps (virtual time) a message
  • Receiver sends a reply with a timestamp larger than
    any it has seen
  • Sender sends the Commit message with timestamp
    larger than all replies
  • Timestamps used to ensure committed messages are
    delivered to the application in timestamp order
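
A simplified sketch of the ABCAST-style commit exchange described above: receivers propose timestamps, the sender commits with the maximum, and committed messages are handed up in timestamp order. Message transport, and the rule for holding back a committed message while a smaller-timestamped one might still commit, are omitted from this sketch:

```python
# Sketch: two-phase ABCAST-style total ordering with virtual-time stamps.
class AbcastReceiver:
    def __init__(self):
        self.clock = 0            # largest virtual time seen so far
        self.pending = {}         # msg_id -> (proposed_ts, msg), awaiting commit
        self.delivered = []       # (final_ts, msg), kept in timestamp order

    def on_message(self, msg_id, sender_ts, msg):
        self.clock = max(self.clock, sender_ts) + 1
        self.pending[msg_id] = (self.clock, msg)
        return self.clock         # proposed timestamp, sent back to the sender

    def on_commit(self, msg_id, final_ts):
        _, msg = self.pending.pop(msg_id)
        self.clock = max(self.clock, final_ts)
        self.delivered.append((final_ts, msg))
        self.delivered.sort()     # application sees messages in timestamp order

def commit_timestamp(proposals):
    # The sender's commit timestamp is at least as large as every proposal.
    return max(proposals)
```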

100
Group IPC in ISIS ISIS Primitives
  • CBCAST
  • Created because ABCAST is so expensive
  • Controls the delivery order only of the causally
    related messages
  • Each group member maintains vector holding last
    message number from each member
  • Member i increments its number when sending and
    includes the vector in the message
  • Receiver uses the state carried by the vector to
    determine whether a received message causally
    depends on any it has not yet received (delivery
    test sketched below)
  • Figure 2-38 page 113 Tanenbaum
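
A sketch of the CBCAST delivery test with vector timestamps: a message from member j may be delivered only when it is the next message expected from j and every message it causally depends on has already been delivered. The vectors follow the scheme described above; the concrete values are illustrative:

```python
# Sketch: CBCAST causal-delivery check with vector timestamps.
def can_deliver(V, L, j):
    # V: vector carried by the message, L: receiver's local vector, j: sender index.
    if V[j] != L[j] + 1:
        return False                          # not the next message from sender j
    return all(V[k] <= L[k] for k in range(len(V)) if k != j)   # no missing dependencies

def record_delivery(L, V, j):
    L[j] = V[j]                               # one more message from j now delivered
    return L

# Example with three members; the receiver's local vector is L.
L = [2, 1, 0]
print(can_deliver([2, 2, 0], L, 1))   # True: next from member 1, all dependencies seen
print(can_deliver([3, 2, 0], L, 1))   # False: depends on a message from member 0 not yet seen
```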

101
Summary
  • The dominant difference between centralized and
    distributed systems is the importance of
    communication in influencing
  • Semantics
  • Performance
  • Classic layered protocols are fine for relatively
    slow widely dispersed systems
  • Too slow for tightly coupled systems
  • Thinner protocols used in tightly coupled,
    LAN-based systems
  • Communication semantics and primitives are the
    place to start when designing a distributed system

102
Summary
  • Client/Server is classic and useful
  • Limited by the fact that IPC is handled as
    process I/O and thus the application's problem
  • Little distribution transparency
  • RPC moved this into the system programming level
    by using the procedure abstraction
  • Substantial distribution transparency
  • Compiler and library level (middleware)
  • RPC still hard and not fully transparent because
    it still uses point-to-point communication
  • RPC also limited to pair-wise relations

103
Summary
  • Collections of interacting components require
    more powerful communication support from the
    system
  • Middleware, OS, Network as required
  • Systems such as ISIS provide a new abstraction
  • Group Communication
  • Several semantic variations used to provide
    different semantics vs. cost choices
  • Principle: Weaker semantics are usually cheaper and
    usually permit more concurrency
  • ISIS algorithmic approaches are similar to those
    discussed in Chapter 3