Communication in Distributed Systems
  • EECS 750
  • Spring 1999
  • Course Notes Set 2
  • Chapter 2
  • Distributed Operating Systems
  • Andrew Tanenbaum

  • Obvious
  • It can't be a distributed computation if there is
    no communication
  • The communication media and protocols have a
    profound influence on performance
  • Less Obvious
  • Communication media and protocols can take many
    forms
  • Multiprocessor (SMP) and Bus contention
  • Hyper-cube and link contention
  • Ethernet, CSMA/CD, and binary back-off

  • Protocols have two major design influences
  • Communication medium properties
  • Application domain properties
  • Implication
  • Diversity of application domain properties has a
    dumbing down effect
  • Protocols are selected for a compromise between
    competing properties
  • Breadth of application support - the lowest
    common denominator
  • Efficiency of application support

  • Layered Protocols
  • Hierarchy preserves the sanity of humans
  • Standardization for interoperability makes
    distribution more feasible
  • Common denominator phenomena
  • TCP/IP created before international standards
  • ISO-OSI reference Model
  • Standard 7 layers; few if any use it literally
  • Fig 2-1, page 37 Tanenbaum
  • Design principle used in most distributed systems

  • Connection Oriented vs. Connectionless
  • Good for different application domains and uses
  • Reliable vs. Unreliable message delivery
  • Layering of protocols generally means layering of
    the packaging around the messages
  • New wrappers on messages as they descend
  • Stripping layers as they ascend
  • Fig 2-2 page 38
  • Principles also used for communication at the
    middleware and application layers
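The layering just described can be made concrete with a toy sketch in C. Everything here (header layout, layer names) is an invented illustration, not something from the slides: each layer prepends its own wrapper as a message descends and strips one as it ascends.

```c
/* Toy illustration of layered wrapping/stripping (invented names). */
#include <stdio.h>
#include <string.h>

struct hdr { char layer[12]; };          /* assumed fixed-size header */

/* Descend: prepend this layer's header to the payload. */
static size_t wrap(const char *layer, const char *in, size_t n, char *out) {
    struct hdr h;
    memset(&h, 0, sizeof h);
    strncpy(h.layer, layer, sizeof h.layer - 1);
    memcpy(out, &h, sizeof h);           /* new wrapper on the way down */
    memcpy(out + sizeof h, in, n);
    return sizeof h + n;
}

/* Ascend: strip one header, exposing the payload for the layer above. */
static size_t unwrap(char *buf, size_t n) {
    memmove(buf, buf + sizeof(struct hdr), n - sizeof(struct hdr));
    return n - sizeof(struct hdr);
}

int main(void) {
    char a[256], b[256];
    size_t n = wrap("transport", "hello", 5, a);
    n = wrap("network", a, n, b);        /* two wrappers deep */
    n = unwrap(b, n);                    /* strip network header */
    n = unwrap(b, n);                    /* strip transport header */
    printf("%.*s\n", (int)n, b);         /* prints: hello */
    return 0;
}
```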

Protocol Layers
  • Protocol at each layer specialized to the duties
    of the layer
  • Client of the services provided by the layer
  • Provider of the services used by the layer above
  • Continuum of protocols and relationships from the
    lowest physical layer to the highest layer in a
    distributed system
  • Major conceptual boundary between protocols
    provided by parts of the system implemented by
    different groups
  • OS vs. Middleware; Middleware vs. Application

Protocol Layers: Major Divisions
  • Physical and Data link
  • Physical devices and device drivers
  • Network
  • Component connectivity (IP)
  • Transport
  • Message transfer from end to end (TCP)
  • Session
  • Application relevance and semantics
  • Middleware squeezes in below here but above
    transport

Protocol Layers: Physical and Data Link
  • Manages bit transfer across a physical link
  • Serial
  • Ethernet
  • ATM
  • PCI or VME bus
  • Physical and electronic magic
  • From our (my) point of view
  • Magic bit pipe
  • Services of physical devices for which we may
    write drivers but otherwise electronic wizards
    and catalogs provide these services

Protocol Layers: Network
  • View of the set of links among several
    communicating hosts (IP)
  • Each host certainly knows with what other hosts
    it can communicate
  • It may or may not know about the connectivity of
    other hosts
  • This provides a view of the network
  • Connection or connectionless becomes meaningful
    only here
  • Connection is a commitment and/or concept which
    spans 2 or more host-host links

Protocol Layers: Network
  • X.25 is a familiar connection oriented protocol
  • IP is a familiar connectionless protocol
  • Routing refers to the problem of deciding how a
    message is taken from its source to its
    destination
  • End to End
  • Static or Dynamic routing
  • Entire seminar on IP routing strategies and
    implications (Evans and Whiting)
  • Reliable or Unreliable connections

Protocol Layers: Transport
  • Provides more complex (and often specialized)
    semantics by using services of the network layer
  • Reliable connections on top of an unreliable
    underlying network
  • TCP on top of IP does this
  • TCP does other things as well
  • Provides the notion of a connection from source
    to destination socket
  • Exercises flow control on the traffic
  • Based on assumptions that are not always true
  • UDP is a connectionless option at this layer

Protocol Layers: Session
  • This layer and above is usually provided by the
    application code in most systems
  • Middleware (CORBA/DCOM) assuming this role in
    many cases over the last few years
  • Inserts a layer between transport and session
  • OR - creates a lowest sub-layer
  • This layer often provides memory and context
  • Basis of connection restoration for fault
    tolerance or mobility

Protocol Layers: Session
  • Parallel Virtual Machine (PVM) is plausibly
    middleware providing a session to its users
  • CORBA name service and brokers provide name
    spaces and abstractions above simple connections
  • Host/network speed ratio was large for decades
  • OC-3 is roughly the bandwidth of basic SCSI
  • Network disks suddenly seemed sensible
  • Affects the feasibility of distributed systems
  • Increases application inter-operability
  • Decreases implementation effort

High Performance Networks
  • Recent (last 5 years or so) introduction of high
    performance networks has significantly affected
    distributed system design and popularity
  • ATM (155 Mb/s (OC-3) and 622 Mb/s (OC-12))
  • 100 Mb/s and 1 Gb/s Ethernet
  • Orders of magnitude faster networks affect host
    software and hardware design

High Performance Networks: Asynchronous Transfer Mode
  • Introduced a significantly faster medium
  • Shifted the ratio of host to network speed
  • Changes economics of distributed systems by
    shifting cost/performance ratio
  • Connection oriented
  • Switched Virtual Circuits (SVCs)
  • Single Network for multiple traffic types
  • Voice, data, video, sound, games
  • Economy of scale and common denominator makes it
    attractive to many without being best for any

High Performance Networks: Asynchronous Transfer Mode
  • Simplicity of cells and their processing made HW
    based cell switching possible
  • IP switching at Gb speeds emerging hot topic
  • IP directly on SONET
  • Following and responding to ATM challenge
  • Switching fabrics have multi-Gb bandwidth
  • Sub-microsecond cell switching time
  • (Port, VC) to (Port, VC) mapping in switches
  • VC multiplexing on physical links
  • Details in other classes - not usually relevant

High Performance Networks: Fast Ethernet
  • 100 Mb switched Ethernet has emerged as a
    significant competitor to ATM
  • Comparable speeds
  • Generally less expensive
  • Different name space for data transfer
  • Host to Host not Virtual Circuits
  • Medium often still shared
  • Among a small number of hosts
  • Compelling price/performance
  • Particularly at first network hierarchy level
  • Gb Ethernet emerging

High Performance Networks: Implications for DS
  • Implications of ATM and Fast Ethernet for
    Distributed Systems are similar but not identical
  • Vast increase in bandwidth is major effect
  • 10 Mb shared to 155 Mb (622) switched
  • Consider the rough equivalence of OC-3 and SCSI
  • All disks on the network are created equal
  • With respect to bandwidth, not latency
  • Faster disk and faster/wider SCSI restored a two
    tiered hierarchy

High Performance Networks: Implications for DS
  • Speed vs. Latency
  • Vitally important when considering DS design,
    implementation, and evaluation
  • Different and both important
  • In many DS, message transmission time approaches
    medium latency (L)
  • Messages are small
  • Total time = L(sender) + transit time
  • Latency is bounded below by the speed of light
  • Many people could have avoided a lot of effort if
    they assumed light speed

High Performance Networks: Implications for DS
  • If it still won't work when you assume light
    speed latency
  • Try another approach
  • Difference between real and light speed latency
    is determined by the medium and the system
  • System latency often predominates
  • Not always
  • It is the part over which we have the most
    control
  • New network technologies challenged system
    designers to reduce system latency
  • Increased network speed left the OS as the
    bottleneck

High Performance Networks: Implications for DS
  • Design Principle
  • Optimize the most significant component
  • Success shifts the focus and the relevant design
    choices
  • Flow Control
  • Also significantly affected by new technology
  • Bandwidth-Delay product gives required network
    buffering (worked example below)
  • Also reveals potential waste in interactive apps
  • Significant increase in buffering a good example
    of the somewhat unexpected stresses on hosts
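A rough worked example helps here; the link rate and round-trip time below are assumed numbers, not figures from the slides. The bandwidth-delay product is the amount of data in flight, and hence the buffering the hosts must supply:

```c
/* Bandwidth-delay product sketch; numbers are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    double bw_bits = 155e6;   /* OC-3 rate, ~155 Mb/s            */
    double rtt_s   = 0.050;   /* assumed 50 ms round-trip time   */
    double bdp     = bw_bits * rtt_s / 8.0;   /* bytes in flight */
    printf("buffering needed: %.0f bytes (~%.2f MB)\n", bdp, bdp / 1e6);
    /* 155e6 * 0.050 / 8 = 968750 bytes, roughly 1 MB per connection */
    return 0;
}
```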

High Performance Networks: Implications for DS
  • Protocol Adjustment
  • Motivated by new network technologies and new
    classes of applications
  • Active Networks - dynamic customization
  • Cell level pacing
  • Earliest packet discard
  • Differential Service in IP - Diffserv
  • Some believe IP already perfect (fewer and fewer)
  • Adaptations of IP a current area of frenzied
    activity

Client/Server
  • Currently the most popular and ubiquitous method
    of implementing distributed applications
  • Critics and alternatives emerging
  • Not transparent to location and identity
  • But still useful
  • Fig 2-7 page 51 Tanenbaum
  • Request-Response style application level
    protocol
  • Note that the client must know the server
  • (Host-IP, Port) is a unique ID in name space
  • INETD generalizes this wrt port number
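A minimal sketch of this request-response pattern, with the client naming the server by its (Host-IP, Port) pair. The address and port are placeholders, not values from the slides:

```c
/* Minimal request-response client sketch: the (Host-IP, Port) pair is
 * the server's name. Address and port are illustrative assumptions. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv;
    memset(&srv, 0, sizeof srv);
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(7);                       /* assumed service */
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);  /* assumed host    */

    if (connect(fd, (struct sockaddr *)&srv, sizeof srv) < 0) {
        perror("connect");                     /* e.g., server down */
        return 1;
    }
    write(fd, "request", 7);                   /* send request */
    char reply[128];
    ssize_t n = read(fd, reply, sizeof reply); /* block for response */
    printf("got %zd reply bytes\n", n);
    close(fd);
    return 0;
}
```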

Client/Server: Name Servers
  • Name servers generalize the host requirement
  • CORBA name service a good example
  • Creates a new name space
  • Name servers map between name spaces
  • Client must know the name server ID in the basic
    name space, but no other
  • Fig 2-10 page 57 Tanenbaum
  • Different physical machines can play a given role
    at different times
  • Service oriented names can be mapped to
    different machines as required

Client/Server: Canonical Problems
  • Name allocation and advertising
  • How does a client know the name of the name
    server and how to locate it
  • Servers can broadcast
  • Scalability of a given name space and server
  • LAN to WAN and World-WAN
  • Classic problems for distributed systems
  • Many management functions at the OS and
    middleware levels have a strong client-server
    flavor

Client/Server: Blocking vs. Non-Blocking
  • Client-Server interaction semantics
  • Blocking is synchronous
  • Non-Blocking is asynchronous
  • Fig 2-11 page 59 Tanenbaum
  • Semantic constraint
  • Sender cannot change a message buffer until it is
    safe to reuse
  • Harder for asynchronous operations
  • Signal for buffer release is a form of
    asynchronous notification (see the sketch below)
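A sketch of the two send choices, assuming an already-connected POSIX socket. A blocking send returns once the kernel has taken the data (the user buffer is then safe); a non-blocking send may instead report that it would have blocked:

```c
/* Blocking vs. non-blocking send on a socket fd (illustrative sketch). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>

/* Assumed helper: fd is an already-connected socket. */
void demo_send(int fd, const char *buf, size_t len) {
    /* Synchronous: blocks until the kernel accepts the data, so the
     * user buffer may be reused immediately afterward. */
    send(fd, buf, len, 0);

    /* Asynchronous: mark the descriptor non-blocking. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    if (send(fd, buf, len, 0) < 0 && errno == EWOULDBLOCK)
        printf("kernel buffers full; retry later, buffer still in use\n");
}
```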

Client/Server: Blocking vs. Non-Blocking
  • Design Principle
  • Weakening the semantics of an operation generally
    increases its performance by decreasing the
    constraints it must satisfy
  • Separate issues of buffer safety and message
    delivery
  • Transmission vs. Delivery distinction
  • Weakening synchronization to buffer safety
  • Addresses the real constraint
  • Increases available concurrency
  • Potentially increases system performance

Client/Server: Blocking vs. Non-Blocking
  • Semantic choices interact
  • System call blocking semantic choices
    (strong to weak)
  • Message delivery
  • Message transmitted
  • Message request noted by OS
  • Message buffer safety interacts with return
    semantics
  • Message copied into OS address space makes user
    buffer safe earlier than not
  • Overhead cost

Client/Server: Buffered vs. Unbuffered
  • Where is the information buffered
  • Effect on communication semantics
  • Effect on system overhead
  • Interactions with other aspects of user
    computation semantics
  • Signals used to announce message buffer safety or
    message delivery may create critical sections
  • Basic choices
  • Block until delivered or sent
  • Block and copy to OS
  • Non-blocking and use user buffer

Client/Server: Buffered vs. Unbuffered
  • This generally considers send synchronization
  • Receive synchronization can also matter
  • Block until message arrives (common default)
  • Return if no message available
  • Design tradeoffs
  • Cost vs. Ability
  • Simplicity vs. Ability
  • OS provides receive buffers to avoid interrupting
    the process when a message arrives
  • Pays the price in overhead of space and copy
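A companion sketch for the receive side, again assuming a POSIX socket. MSG_DONTWAIT is the per-call non-blocking flag (a common extension; O_NONBLOCK on the descriptor is the portable equivalent):

```c
/* The two receive synchronization choices from the slide:
 * block until a message arrives, or return at once if none is queued. */
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

/* Assumed helper: fd is a connected or bound socket. */
void demo_recv(int fd) {
    char buf[1024];

    /* Common default: block until a message arrives. */
    ssize_t n = recv(fd, buf, sizeof buf, 0);
    printf("blocking recv returned %zd\n", n);

    /* Alternative: poll; return immediately if no message is buffered. */
    n = recv(fd, buf, sizeof buf, MSG_DONTWAIT);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("no message available, not blocking\n");
}
```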

Client/Server: Buffered vs. Unbuffered
  • What choice is best depends on required and
    desirable semantics vs. cost
  • Distributed systems are often willing to cope
    with less forgiving and restricted semantics to
    gain performance
  • Both approaches are best in some cases
  • Most systems provide several sets of
    communication primitives with varying semantics
  • Most systems default to buffered approach
  • Important to know what choices are being made for
    you

Client/Server: Reliable vs. Unreliable
  • Underlying physical networks are all subject to
    different rates of message loss
  • Common units used: bit error rates
  • A bit error usually means a lost packet at the
    transport (TCP) level
  • High rate of uncorrelated errors would make
    simple error correction encoding attractive
  • Digital radios are an example
  • Note the interaction among abstractly separate
    layers
  • Error rate at a low layer may affect higher layers

Client/Server: Reliable vs. Unreliable
  • Design Issue
  • Should the programming interface provide reliable
    or unreliable communication primitives
  • It Depends
  • Semantics of applications differ
  • Importance of performance vs. loss vs. latency
  • UDP for high performance and low latency when
    loss is tolerable
  • TCP for reliable in-order delivery
  • Multi-media stresses low latency over low loss
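A minimal sketch of the unreliable choice: a UDP datagram needs no connection setup and may be lost, duplicated, or reordered. The destination address and port are assumptions for illustration:

```c
/* UDP sketch: connectionless, unreliable, low latency; one sendto()
 * per datagram, no connection setup. Address/port are assumptions. */
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* UDP, not TCP */
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(9999);              /* assumed port */
    inet_pton(AF_INET, "192.0.2.2", &dst.sin_addr);

    /* May be lost, duplicated, or reordered; the application copes. */
    sendto(fd, "sample", 6, 0, (struct sockaddr *)&dst, sizeof dst);
    close(fd);
    return 0;
}
```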

Client/Server: Design Choices Summary
  • Competitive relations among
  • Ease of use
  • Bandwidth
  • Latency
  • Efficiency
  • Generality
  • Figure 2-14 page 65 Tanenbaum
  • System Programmers' Full Employment Policy
  • No final best solution
  • Tradeoffs always shift with tens of factors

Remote Procedure Call
  • Extends Client-Server by concealing distribution
    and the associated communication at the
    programming interface
  • Everything looks like a procedure call
  • Unifies centralized and distributed computing at
    the programming interface level
  • Approaches the goal of a Uniform and Universal
    Virtual Machine
  • Everyone (almost) shares this goal in general
  • Not everyone shares this goal in detail
  • Cost makes the difference

Remote Procedure Call
  • RPC layer shoves the distribution down a layer in
    the programming model
  • Generalizes the procedure call
  • Early effort to support distribution
  • Great but many subtle implications that affect
    cost, utility, applicability
  • Excellent example of how distribution is almost
    always harder than it first appears

Remote Procedure Call: Advantages
  • Write a program in a normal and conventional way
  • All procedure calls are written the same in the
    bulk of the code
  • Declarations of procedures may or may not
    identify them as remote
  • Promotes portability in the sense that the same
    code can be mapped onto a range of distribution
    scenarios
  • Several choices about how to describe
    distribution and the assignment of servers for
    specific procedures - both compile and run time

Remote Procedure Call: Disadvantages
  • Operations can be done in arbitrary places
  • Advantage and Disadvantage
  • Different hardware
  • Different instructions mean complex compilation
  • Different data formats mean format tracking and
    conversion overhead
  • Different Address Spaces
  • How much of the address space do you need

Remote Procedure Call: Disadvantages
  • Parameters and additional context must be
    transferred from client to server
  • Determining the required context can be difficult
  • Too little: incorrect computation semantics
  • Too much: unnecessary overhead
  • Implications of the programming model
  • Pure Functional Programming which uses only
    passed parameters with no side effects
  • Other paradigms may effectively require the whole
    address space

Remote Procedure Call: Basic RPC
  • Goal is for RPC to be transparent
  • Everything is a procedure call - local or remote
  • Issues arise from unifying interface model
  • Procedure calls
  • Call stack
  • Regular procedure calls transfer control within a
    single address space
  • System calls use the same model but transfer
    across user-system address space boundary
  • Requires special code in the system call to
    transfer parameters and give OS access to all
    necessary context

Remote Procedure Call: Basic RPC
  • RPC needs to preserve the procedure call
    semantics
  • On both the client and the server
  • Send the procedure context to the server
  • Return the results back to the client
  • Language semantics have a huge effect on context
    and how hard it is to figure out
  • Call by value vs. Call by reference
  • Multiple copies and resulting coherence issues
  • Implications have implications in a ripple effect

Remote Procedure Call: Basic RPC
  • Implementation through stubs
  • Figure 2-18 page 71 Tanenbaum
  • Client stub
  • Represents client side duties of the RPC
  • Assembles RPC context (marshalling)
  • Receives results (demarshalling)
  • Server stub
  • Unpacks context
  • Calculates
  • Packs results
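To make marshalling concrete, here is a toy client-stub/server-stub pair for a hypothetical add(a, b) procedure. This is only the flavor of what generated stub code does, not any real RPC package:

```c
/* Toy stub marshalling sketch: pack a procedure number and two integer
 * arguments into a message in a fixed byte order, the way a generated
 * stub would. Names and layout are invented for illustration. */
#include <arpa/inet.h>   /* htonl/ntohl: canonical (network) byte order */
#include <stdint.h>
#include <string.h>

/* Client stub side: marshal; returns message length written into out. */
size_t marshal_add(uint32_t proc, int32_t a, int32_t b, unsigned char *out) {
    uint32_t w[3] = { htonl(proc), htonl((uint32_t)a), htonl((uint32_t)b) };
    memcpy(out, w, sizeof w);
    return sizeof w;
}

/* Server stub side: unpack the context before calling the procedure. */
void unmarshal_add(const unsigned char *in, uint32_t *proc,
                   int32_t *a, int32_t *b) {
    uint32_t w[3];
    memcpy(w, in, sizeof w);
    *proc = ntohl(w[0]);
    *a = (int32_t)ntohl(w[1]);
    *b = (int32_t)ntohl(w[2]);
}
```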

Remote Procedure Call: Basic RPC
  • Consider Client/Server implementations of system
    services using RPC interfaces
  • Generality and portability in support of a wide
    range of distribution scenarios
  • Application and/or analogy to micro-kernel
  • Multiple designs/models/presentations of a given
    service provided by the procedure
  • Analogy to overloading in object-oriented systems
  • Compile time or run time selection of service

Remote Procedure Call: Parameter Passing
  • Client stub receives parameters in the normal way
  • It is, after all, a procedure
  • It collects context and packs it into a message
  • Sends message
  • Data format conversion may take place on either
    side
  • Data representation standards - canonical format
  • CORBA data interchange format
  • RPC components written by different people
  • Stub generation by compiler
  • Commonly automates marshalling code

Remote Procedure Call: Parameter Passing
  • Automation and run-time decisions add to
    distribution transparency
  • Pointers!!!!
  • The entire address space is the context!
  • Dynamic binding
  • Clients find servers at run time
  • Clients determine context at run time
  • CORBA addresses a lot of these issues
  • Subsumes and generalizes RPC in the object
    oriented framework

Failure Handling
  • Making the distribution of the system or
    application transparent is hard
  • Handling some failures and masking the occurrence
    of others is one reason
  • Failures particular to distributed systems (RPC)
  • Cannot find server
  • Lost request (client to server link)
  • Lost reply (server to client link)
  • Client crashes after sending request
  • Server crashes after receiving request
  • Good canonical set of failures covering most types

Failure Handling: Cannot Find Server
  • Maybe the server is down
  • Dynamic server selection attractive
  • Distribution introduces new types of errors
  • Wider range of operations is done remotely
  • Return value of -1 and setting errno not really
    expressive enough any more
  • Signals and exceptions are candidates for
    expanding error handling semantics
  • Motivation for language extension
  • Transparency challenge

Failure Handling: Lost Request
  • Easiest error to understand and handle
  • Client timeout is common obvious remedy
  • Confusion of message latency and loss
  • Network or slow server latency interpreted as
    loss or server unavailable
  • ITTC AMD times out before ATM support for NFS
    works so AMD decides ATM broken
  • Timeouts are a trap wrt graceful degradation
  • A system slows as it handles more load
  • Suddenly timeouts expire and everything is
    declared broken

Failure Handling: Lost Reply
  • Much harder
  • Obvious solution is another timer
  • Could just be a slow server
  • Server must be written to handle duplicate
    requests unless the supporting network gives
    explicit loss indication
  • Client must be written to handle duplicate
    replies
  • Client-Server protocol should distinguish lost
    replies and a slow server
  • Semantics of the operation are involved
  • Some can be repeated, some cannot, as some have
    side effects

Failure Handling: Lost Reply
  • Idempotent
  • Denotes an operation that can be repeated without
    changing the result
  • Distributed system designers thus wish to
    maximize the number of idempotent operations
  • Cannot be universal: bank transfers are not
    idempotent
  • Protocol encapsulating client-server interaction
    handling repeats and race conditions
  • Sequence numbers on messages help
  • Flag denoting a repeated request
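A sketch of how sequence numbers and a cached reply let a server suppress duplicates of a non-idempotent request. The table layout and sizes are invented for illustration:

```c
/* Server-side duplicate suppression with per-client sequence numbers:
 * a retransmitted (non-idempotent) request is answered from a cached
 * reply instead of being re-executed. Illustrative sketch only. */
#include <stdio.h>

#define MAX_CLIENTS 16

struct client_state {
    unsigned last_seq;     /* highest sequence number already served */
    int      last_reply;   /* cached reply for that request          */
};

static struct client_state clients[MAX_CLIENTS];

int handle_request(int client, unsigned seq, int (*op)(void)) {
    struct client_state *c = &clients[client];
    if (seq <= c->last_seq) {
        printf("duplicate request %u: resend cached reply\n", seq);
        return c->last_reply;       /* do NOT redo the side effect */
    }
    c->last_reply = op();           /* execute once */
    c->last_seq   = seq;
    return c->last_reply;
}
```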

Failure Handling: Server Crash
  • Difficulty here is related to idempotency
  • Sequence numbers are not enough to fix it
  • Consider Fig 2-24 page 82 Tanenbaum
  • Server crashes before fully serving request (b)
  • Crashes after serving but before replying (c)
  • Handled very differently
  • (b) requires the server to redo a partially
    completed (possibly ATOMIC!) transaction
  • (c) merely requires repeating the reply
  • BUT the client cannot tell the difference

Failure Handling: Server Crash
  • Different Semantic Categories
  • At Least Once
  • Guarantees RPC completes at least once
  • Maybe more than once
  • At Most Once
  • Guarantees RPC done at most once
  • Maybe zero
  • No Guarantees
  • Server provides no help and no promises
  • Luck of the draw mostly still works OK

Failure Handling: Server Crash
  • Exactly Once
  • Most desirable but generally not possible
  • What can a crashed server know about what it did
    and did not do?
  • Data Base perspective helps
  • Atomic transactions
  • Commit and rollback concepts
  • Raises the cost of RPC
  • Increases the granularity of distribution
  • Pretty emphatically violates transparency of
    the RPC abstraction
Failure Handling: Client Crash
  • Client crashes after making a request
  • Creates an orphaned computation on server
  • Orphans waste server time and decrease
    performance
  • Further descendants are possible
  • Grand orphans and so on
  • Nested RPC
  • Four solutions proposed by Nelson (1981) are
    instructive wrt complication of designing
    distributed algorithms

Failure Handling: Client Crash
  • Solution 1: Transaction Perspective
  • Log requests on a medium that survives crashes
  • Synchronous disk writes
  • Delete orphans on recovery
  • Called extermination
  • Huge overhead of logging every RPC
  • Hard to ensure killing grand orphans
  • Finding them
  • Ability to tell their server to kill them
  • Big time delay during reboot

Failure Handling: Client Crash
  • Solution 2: Epochs
  • Known as reincarnation
  • Rebooting clients increment epoch number thus
    dividing the time line into periods between
    system boots by epoch number
  • Broadcast to servers as they establish relations
  • Servers delete requests associated with previous
    epochs
    obsolete ones easy to eliminate

Failure Handling: Client Crash
  • Solution 3: Gentle Reincarnation
  • Epoch number from booting host broadcast
  • Each machine, client or server, sees if it has
    any remote computations
  • Only if the owner of the remote computation
    cannot be found is the computation killed
  • Recycle and reconnect remote computation

Failure Handling: Client Crash
  • Solution 4: Expiration
  • Each RPC is given a standard amount of time T
  • Over-run requires explicit request of another
    quantum T
  • If a server crashes it waits T before rebooting
  • All orphans are thus assured of having timed out
  • Choosing T is a compromise between throughput and
    latency as well as other forms of overhead

Failure Handling: Summary
  • In practice none of these is all that attractive
  • They illustrate the complications of doing
    distributed systems
  • Correctly
  • Efficiently
  • Transparently
  • Killing orphans may cause problems
  • Did they hold locks?
  • Orphan side effects - non-idempotent

Implementation Performance
  • Details determine the fate of a distributed
    system
  • Implementation cost in time and resources
  • Performance of the system produced relative to
    optimal performance for the applications
  • Note that optimal performance does not
    necessarily mean maximum performance
  • Semantic problems increase overhead
  • Custom vs. Existing Protocols
  • Ease of using existing protocols
  • Lower efficiency due to lack of specialization
  • Specialization is brittle wrt design assumptions

Implementation Performance: Protocol Latency
  • Tradeoff between positive and negative cases
  • Successful communication delayed by overhead of
    protocol features supporting error recovery
  • Error recovery delayed by lack of support
  • Goal: Minimize both overhead on successful cases
    and delay in recovering from errors
  • Inventing perpetual motion would be good too
  • Most of the time these goals compete
  • Expected Delay is analogous to expected value
  • Important statistical measure
  • So are minimum and maximum
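One way to make expected delay concrete, under an assumed retry model (symbols are not from the slides): with loss probability p, success delay T_ok, and a timeout T_to before each retransmission,

```latex
% E[T] = (1-p) T_ok + p (T_to + E[T])   (retry after each timeout)
% Solving the recurrence:
E[T] \;=\; \frac{(1-p)\,T_{ok} + p\,T_{to}}{1-p}
      \;=\; T_{ok} + \frac{p}{1-p}\,T_{to}
```

Small p leaves E[T] near T_ok; as p grows, the timeout term dominates, which is why timeout choice matters so much.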

Implementation Performance: Critical Path
  • General design and analysis principles
  • Optimize the subsystem(s) executed most often
  • Figure 2-26 page 89 Tanenbaum
  • Illustrates RPC critical path - client to server
  • Optimize the biggest component
  • Usually data copying is the biggest component
  • It Depends on the load assumed
  • Copying increases with data volume (obvious)

Implementation Performance: Critical Path
  • Scatter Gather
  • Network protocols and/or interface hardware can
    deal with layered wrappers of data packets added
    by protocol layers being in disjoint memory
  • Major (original) motivation for inventing Mbuf
    data structure
  • Separates management information from data buffer
    - separate pools
  • Offset and counter fields
  • Double pointer set: next and continue
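A minimal scatter/gather sketch using POSIX writev(): the header and the user's data stay in disjoint buffers, mbuf-style, and are gathered into one message without an intermediate copy (stdout stands in for a socket):

```c
/* Scatter/gather sketch: protocol header and user data live in
 * disjoint buffers and are sent in one gathered call. */
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
    char header[]  = "HDR:";    /* added by the protocol layer */
    char payload[] = "payload"; /* left where the user put it  */

    struct iovec iov[2];
    iov[0].iov_base = header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = payload;
    iov[1].iov_len  = strlen(payload);

    /* Gather both pieces in one call, no contiguous staging copy. */
    ssize_t n = writev(STDOUT_FILENO, iov, 2);
    fprintf(stderr, "wrote %zd bytes from 2 disjoint buffers\n", n);
    return 0;
}
```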

Implementation Performance: Critical Path
  • Role of Virtual memory
  • Twiddling memory maps and pointers is often much
    faster than copying the data
  • Common way to avoid copying across system-user
    address space boundary
  • Motivation for separating buffer from header
  • Always (almost) a break-even packet size
  • Should be determined by measurement
  • Bottom line: Does copying or administrative magic
    cost more under various operating conditions?
  • Faster networks push current system limits

Implementation Performance: Timers
  • Differences in time scales among critical path
    components can introduce HUGE latencies
  • Delays of the slower component leave unused gaps
    in utilization of the faster component
  • Timeouts
  • Generally very long by the time scale of the
    fastest associated component
  • TCP timeout commonly 500 ms
  • Assumption: most timeouts do not occur, so setting
    and unsetting them is overhead
  • TCP over ATM: even one timeout can lower
    effective throughput

Implementation Performance: Timers
  • Increase in network speeds is changing the time
    scale that matters for system software design and
    implementation
    coarse compared to scale of system components
  • DEC UNIX lowered granularity to 1 ms
  • Heart Beat timing method
  • Periodic timer interrupt (10 ms or 1ms)
  • Each tick executes timer interrupt service
    routine which increments software clock
  • Also checks for events which become active

Implementation Performance: Timers
  • Heartbeat method has several implications
  • Scheduled event times are quantized to heartbeat
    scale (twice tick size)
  • At least next tick after desired time
  • Measured times are either limited to tick scale
    or must use another method
  • Time stamp counter in Pentium and Alpha offers a
    CPU clock tick counter
  • Some systems use both
  • Provide finer granularity for time of day and
    time stamps
  • Not interrupt driven time scales

Implementation Performance: Timers
  • Finer time scale for scheduled events also has
    costs
  • Each interrupt adds overhead
  • Decreasing tick size from 10 ms to 1 microsecond
    increases overhead by a factor of 10^4
  • Faster networks may need finer grain time scale
  • 100 Mb/s network - 12 MB/s - 120 KB/10 ms
  • 6 MB in 500 ms TCP timeout
  • Tough problem
  • Motivation for finer granularity also urges low
    overhead for setting and unsetting timer events

Implementation Performance: Managing Scheduled Events
  • Managing a set of scheduled events is a basic
    system service
  • Two common methods: timeout list and process
    table sweep
  • Timeout list
  • List of events sorted by ticks until event
  • Insertion and deletion requires O(N) on linked
    list implementation
  • Latest Linux uses a heap
  • Tick ISR decrements front element
  • Event occurs when it reaches zero
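A sketch of a delta-encoded timeout list in the spirit of this slide; it is an illustration, not any particular kernel's implementation. Each node stores ticks relative to its predecessor, so the tick ISR only decrements the head, while insertion pays the O(N) walk:

```c
/* Delta-encoded timeout list: each node's ticks are relative to its
 * predecessor, so the tick ISR decrements only the front element. */
#include <stdlib.h>

struct timeout {
    int ticks;                  /* delta from previous node */
    void (*fire)(void);
    struct timeout *next;
};

static struct timeout *head;

void add_timeout(int ticks, void (*fire)(void)) {   /* O(N) insertion */
    struct timeout **p = &head;
    while (*p && (*p)->ticks <= ticks) {  /* walk, consuming deltas */
        ticks -= (*p)->ticks;
        p = &(*p)->next;
    }
    struct timeout *t = malloc(sizeof *t);
    t->ticks = ticks;
    t->fire  = fire;
    t->next  = *p;
    if (t->next) t->next->ticks -= ticks; /* keep successor's delta right */
    *p = t;
}

void tick_isr(void) {                     /* run once per heartbeat tick */
    if (head && --head->ticks <= 0) {
        while (head && head->ticks <= 0) {/* fire everything now due */
            struct timeout *t = head;
            head = t->next;
            t->fire();
            free(t);
        }
    }
}
```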

Implementation Performance: Managing Scheduled Events
  • Sweep method
  • Events are associated with PCB of process
  • Sweep the PCB table every quantum
  • Assume coarse granularity, single timer per
    process
  • Principle: optimize the most common case
  • Comparison
  • Timeout list minimizes tick ISR work, permitting
    finer granularity at the cost of more insertion
    overhead
  • Sweep minimizes event creation but is more limited

Implementation Performance: Design Principles
  • Always measure
  • Even the smartest are regularly surprised
  • Always easy to understand after you know it is
    solved
  • All solved problems are trivial - J. Hilliard
  • Optimize the system components in reverse order
    of size in the parameter being optimized
  • Expected Value Technique
  • Use probability of an execution path to guide
    optimization
  • E.g., Optimistic Concurrency Control

Implementation Performance: Other Issues
  • RPC implemented on client/server sockets is a
    common approach
  • Design Challenges
  • Competition between transparency and acceptable
    performance
  • Global Variables are still common making
    determination and handling of sending RPC context
    to the server really hard
  • Errno is part of the POSIX standard
  • Full transparency would require errno to always
    be set correctly and in the same ways as on a
    uni-processor system

Implementation Performance: Other Issues
  • Design Challenge
  • Extending uni-processor behavior to
    multi-processor and distributed multi-computer
  • Hard enough to correctly make the published
    standards transparent
  • True transparency requires reproducing the
    undocumented behavior
  • This is the reason programs can work correctly
    for the wrong reasons
  • Good SWE and portability means you should always
    keep it simple and standard

Implementation Performance: Compiler and Language Support
  • Compilers and operating systems implement
    different aspects of the programming model
  • Very Important Issue and Conference
  • Architectural Support for Programming Languages
    and Operating Systems (ASPLOS)
  • Strong typing provides rich information on data
    structures and operations under RPC
  • More information promotes better efficiency and
    transparency when marshalling
  • Language design strongly influences what is
    required for the context of an RPC

Group Communication
  • Question
  • How many fundamentally different kinds of
    distributed computing scenarios exist?
  • No one is sure - everyone is pretty sure nobody
    knows the answer
  • Question is also too broad considering several
    categories of things together that can be
    contemplated separately
  • Better Question
  • How many different types of communication would
    the full range of distributed computing scenarios
    require as support?

Group Communication: Types of Communication
  • RPC
  • Illustrates many important principles
  • Limited to two parties with a fairly strict
    protocol limiting exchanges between them
  • Distributed computing includes scenarios with
    many cooperating components
  • Example: a set of cooperating file servers
    providing transparent fault tolerant service
    based on redundancy and active backups
  • File server communication patterns will not map
    onto pair-wise client server or RPC very well

Group Communication: Types of Communication
  • RPC and client/server are point-to-point
    communication modes
  • One-to-one communication is important
  • One-to-many communication is also important in
    many distributed computing situations
  • Especially the more complicated ones
  • Dynamic communication group membership
  • Implementation of group communication and its
    efficiency depends on properties of the
    underlying network and hardware
  • Hardware properties and support vary a lot

Group Communication: One to Many Communication
  • Several forms of group communication differing in
    implementation and semantics issues
  • Group membership and naming
  • Closed or Open Group
  • Peer or Hierarchy
  • Group Management
  • Addressing and IPC primitives
  • Atomicity
  • Message ordering
  • Scalability

One to Many Communication: Multicast
  • Broadcast is multiple copies of a message to all
    machines
  • Multicast is multiple copies of a message to a
    specific set of machines
  • Obvious advantage in network overhead
  • Obvious overhead in managing the group
  • Network layer support is the best way although
    user level libraries exist
  • Hardware support is better
  • Network properties have a significant influence
  • Consider Ethernet vs. ATM
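A receiver-side sketch of network-layer multicast support, using the standard IP_ADD_MEMBERSHIP socket option; the group address and port are assumptions:

```c
/* Joining an IP multicast group, so one send by any source reaches
 * every member. Group address and port are illustrative assumptions. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int join_group(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in local;
    memset(&local, 0, sizeof local);
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(5000);                /* assumed port */
    bind(fd, (struct sockaddr *)&local, sizeof local);

    struct ip_mreq mreq;
    inet_pton(AF_INET, "239.1.2.3", &mreq.imr_multiaddr); /* assumed group */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);
    return fd;   /* recvfrom() now sees the group's traffic */
}
```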

One to Many Communication: Closed vs. Open Group
  • Closed
  • Communication only among group members
  • Open
  • Outside processes may send messages to the group
    as well
  • Choice is controlled by application level issues
    motivating the group communication
  • Set of players in an interactive virtual
    environment is plausibly closed
  • Set of servers providing fault tolerant services
    would not be closed

One to Many Communication: Peer vs. Hierarchic
  • Group authority and responsibility relationships
    affect message:
  • Destination
  • Processing
  • Overhead
  • Master/Slave structure might have all messages
    flow through the master
  • Single point of control
  • Single point of failure
  • Fault tolerant servers might use a peer structure
  • Control more complex: "After you." "No, after you."

One to Many Communication: Group Management
  • Group server is required
  • Track group membership
  • Manage group communication at some level
  • Single server is a single point of failure
  • Multiple servers must exchange information and
    handle distributed data with multiple copies
  • Enter and Leave Semantics
  • Handle crashes and thus incarnations of the group
  • Resolve distributed server conflicts

One to Many Communication: Group Management
  • Synchronization of messages and process membership
  • Must a process receive messages sent before it
    left?
  • A process certainly must not receive messages
    sent before it joined or after it left
  • What clock decides when a member leaves and thus
    what before and after mean?
  • Group recovery with multiple crashes
  • No general solution; it depends on application
    and situation semantics

One to Many Communication: Group Addressing
  • Group name space
  • Name space of components which form groups
  • Name space of groups to which messages may be
    sent
  • True Multicast: one send, specific receivers
  • Broadcast: one send, all receive but OS discards
  • Unicast: multiple sends
  • Figure 2-33 page 104 Tanenbaum
  • What kind and amount of overhead and who pays

One to Many Communication: Group Addressing
  • Predicate Addressing
  • Message contains a predicate evaluated by the
    destination which evaluates to keep or discard
  • Automatically enables dynamic group addressing
  • Increases the importance of the name space
    describing the components
  • Capability and difficulty of forming a predicate
    specifying any set of components
  • Interesting power and flexibility for
    sophisticated distributed computing scenarios
  • Restricted form of active networks

One to Many Communication: IPC Primitives
  • RPC is too restrictive
  • Various forms of send and receive are common
  • Modifications as required to change or constrain
    the communication semantics
  • Design Decisions
  • What part implements the new semantics
  • What interface changes are implied
  • Another motivation for active networks
  • Transparency
  • Same primitives using sockets
  • Socket attributes choose Point-to-Point or
    group communication

One to Many Communication: Atomicity
  • All-or-Nothing property of group communication
  • Either every destination receives the message or
    none do
  • Makes implementing many distributed algorithms
    much easier
  • Otherwise the application layer must deal with
    more common and complex inconsistencies
  • Harder than it seems
  • As usual

One to Many Communication: Atomicity
  • ACK overhead to ensure reliable delivery
  • Failure tolerance
  • Crashed Senders and Receivers
  • Receive and Forward Strategy
  • Every component receiving a message for the first
    time forwards it to all other group members
  • All surviving members guaranteed to get message
  • Bad News: Overhead
  • N^2 messages for N group members
  • Worse news
  • This is actually good, or at least not bad

One to Many Communication: Message Ordering
  • Two properties make group communication much
    easier for implementing distributed computations
  • Atomic Broadcast - all or none reception
  • Ordered message delivery
  • Problem
  • Messages sent across networks incur delay
  • Messages from various sources to various
    destinations
  • A given set of messages may arrive at a given
    destination in a different order
  • Figure 2-34 page 108 Tanenbaum

One to Many Communication: Message Ordering
  • Consider a distributed transaction (A Bank)
  • Message (deposit, withdrawal) order is important
  • Best Solution
  • Instantaneous delivery to all destinations
  • Right after perpetual motion
  • Global Time Ordering
  • Messages are delivered in the order they were
    sent according to a global clock
  • Einstein demonstrated that no such absolute
    global time exists

One to Many Communication: Message Ordering
  • A DS could implement an Absolute time ordering by
    imposing synchronization and a global heartbeat
  • Constrains concurrency
  • Slows things down - in general
  • Consistent Time Ordering
  • Relaxes global time constraint (unneeded)
  • All messages arrive in the same order at all
    group members
  • Arrival order may not match global time
  • Still hard and weaker semantics are still useful

One to Many Communication: Scalability
  • Many approaches to communication work well for
    small groups but not for large groups
  • They are still good for distributed computations
    with a modest number of components
  • Consider the implications of a cast of thousands
  • Obvious increase in the number of messages
  • Subtle change in network properties and
    assumption validity moving from LAN to WAN
  • Multicast aware gateways and routers required
  • Computational complexity and centralization often
    constrain scalability

Group IPC in ISIS
  • Project explored question of what basic
    communication primitives are required to support
    distributed computing
  • ISIS is a toolkit for building distributed
    applications
  • Horus was its successor; both were commercialized
  • Ensemble is the successor to Horus and is free
  • ISIS is a set of programs, utilities, and library
    code that run on top of common OS platforms
  • Implements several types of synchronization
  • Illustrates tradeoff of semantic strength against
    implementation and execution cost

Group IPC in ISIS: Synchronization Variations
  • Synchronization variants differ in semantic
    strength and implementation cost
  • Synchronous
  • Every event happens strictly separately,
    sequentially, and instantaneously
  • Impossible to build - weaker semantics required
  • Loosely Synchronous
  • Events take finite time and appear in the same
    order to all group members
  • Possible but expensive to build
  • Weaker semantics are cheaper and often sufficient

Group IPC in ISIS: Synchronization Variations
  • Virtually Synchronous
  • Relaxes the in order delivery constraint
  • Carefully so that the variation in message
    delivery order at different receivers will not
    matter under specific criteria
  • Causally Related Events
  • Two events are causally related if the nature or
    behavior of the second may have been influenced
    by the first
  • Delivery order of causally related events
    matters, that of unrelated events does not

Group IPC in ISIS: Virtually Synchronous
  • Unrelated events are concurrent
  • Causally related events are delivered in order
  • Others may or may not be
  • Weakening semantics
  • Decreases overhead
  • Increases concurrency
  • Opens the opportunity, at least, for improved
    performance
  • Caution
  • Increase in concurrency must be used well for
    performance to improve

Group IPC in ISIS: ISIS Primitives
  • Three variations on group communication
  • ABCAST: loosely synchronous communication for
    data transfer to a group
  • CBCAST: virtually synchronous communication for
    data transfer to a group
  • GBCAST: loosely synchronous and similar to ABCAST
  • Used to manage group membership not data transfer

Group IPC in ISIS: ISIS Primitives
  • ABCAST originally used a 2-phase commit protocol
  • Guarantees in order delivery
  • Complex and Expensive
  • Sender timestamps (virtual time) a message
  • Receiver sends a reply with timestamp larger than
    any it has seen
  • Sender sends the Commit message with timestamp
    larger than all replies
  • Timestamps used to ensure committed messages are
    delivered to the application in timestamp order

Group IPC in ISIS: ISIS Primitives
  • CBCAST was created because ABCAST is so expensive
  • Controls the delivery order only of the causally
    related messages
  • Each group member maintains vector holding last
    message number from each member
  • Member i increments its number when sending and
    includes the vector in the message
  • Receiver uses global state given by vector to
    determine causal relation of received message to
    any it has not yet received
  • Figure 2-38 page 113 Tanenbaum
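A sketch of the delivery test this slide describes, in vector-clock style. The group size and names are assumptions, and this follows the causal-delivery rule rather than ISIS source code: a message from sender j is deliverable only if it is the next message from j and the sender had seen nothing we have not:

```c
/* CBCAST-style delivery test: deliver a message from sender j only if
 * it is the next one from j (V_m[j] == V_local[j] + 1) and we have
 * already seen everything the sender had seen from everyone else
 * (V_m[k] <= V_local[k] for all k != j). Buffer it otherwise. */
#include <stdbool.h>

#define N 4                       /* assumed group size */

bool can_deliver(const int v_local[N], const int v_msg[N], int sender) {
    if (v_msg[sender] != v_local[sender] + 1)
        return false;             /* a message from the sender is missing */
    for (int k = 0; k < N; k++)
        if (k != sender && v_msg[k] > v_local[k])
            return false;         /* sender saw a message we have not */
    return true;                  /* causally safe: deliver now */
}

void deliver(int v_local[N], const int v_msg[N], int sender) {
    v_local[sender] = v_msg[sender];   /* record the delivery */
}
```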

Summary
  • The dominant difference between centralized and
    distributed systems is the importance of
    communication in influencing:
  • Semantics
  • Performance
  • Classic layered protocols are fine for relatively
    slow widely dispersed systems
  • Too slow for tightly coupled systems
  • Thinner protocols used in tightly coupled LAN
    based systems
  • Communication semantics and primitives are the
    place to start when designing a distributed system

  • Client/Server is classic and useful
  • Limited by the fact that IPC is handled as
    process I/O and thus the application's problem
  • Little distribution transparency
  • RPC moved this into the system programming level
    by using the procedure abstraction
  • Substantial distribution transparency
  • Compiler and library level (middleware)
  • RPC still hard and not fully transparent because
    it still uses point-to-point communication
  • RPC also limited to pair-wise relations

  • Collections of interacting components require
    more powerful communication support from the
    system
  • Middleware, OS, Network as required
  • Systems such as ISIS provide a new abstraction
  • Group Communication
  • Several semantic variations used to provide
    different semantics vs. cost choices
  • Principle: Weaker semantics are usually cheaper
    and usually permit more concurrency
  • ISIS algorithmic approaches are similar to those
    discussed in Chapter 3