1
CS-556 Distributed Systems
Case Studies of Scalable Systems
  • Manolis Marazakis
  • maraz@csd.uoc.gr

2
SABRE
  • Reservations, inventory tracking
  • Used by airlines, travel agencies, ...
  • Descendant of CRS (Computerized Reservation
    System)
  • Hosted in a number of secure data centers
  • Connectivity with major reservation systems:
    Amadeus, Apollo, Galileo, WorldSpan, ...
  • Management of PNRs
  • Passenger Name Records
  • IATA/ARC standards for ticketing
  • As of 2001:
  • 6 K employees, in 45 countries
  • 59 K travel agencies
  • 450 airlines, 53 K hotels
  • 54 car rental companies, 8 cruise lines, 33
    railroads
  • 228 tour operators

3
History (I)
  • 1964
  • Location: New York
  • Hosted on 2 IBM 7090 mainframes
  • 84 K requests/day
  • Development cost: 400 man-years, 40 M (USD)
  • 1972
  • Location: Tulsa, Oklahoma
  • Hosted on IBM 360s
  • The switch caused 15 minutes of service
    interruption
  • 1976
  • 130 travel agencies have terminals
  • 1978
  • Storage of 1 M fares
  • 1984
  • Bargain Finder service

4
History (II)
  • 1985
  • easySabre: PCs can connect to the system as
    terminals
  • 1986
  • Automated yield management system (dynamic
    pricing)
  • 1988
  • Storage of 36 M fares
  • Can be combined into > 1 B fare options
  • 1995
  • Initiation of Y2K code inspection
  • 200 M lines of code
  • Interfaces with > 600 suppliers
  • New software for > 40 K travel agents
  • 1,200 H/W & S/W systems
  • 1996
  • Travelocity.com
  • 1998
  • Joint venture with ABACUS International
  • 7,300 travel agencies, in 16 countries (Asia)
  • 2000

5
Legacy connectivity (I)
  • Connection-oriented comm. protocol (sessions)
  • ALC: Airline Link Control protocol
  • Packet-switching
  • but not TCP/IP; usually X.25
  • Requires special H/W (network card)
  • Gradual upgrades to Frame Relay connectivity
  • Structured message interfaces
  • Emulation of 3270 terminals
  • Pre-defined form fields
  • Integration with other systems ->
    screen-scraping code
  • Message Processors: gateways that offer
    connectivity to clients that do not use supported
    terminals
  • Encapsulation of ALC over TCP/IP

6
Legacy connectivity (II)
There is a large market for gateway/connectivity
products, e.g. www.datalex.com
7
Porcupine
  • A highly available cluster-based mail service
  • Built out of commodity H/W
  • Mail is important, hard, and easy:
  • Real demand
  • AOL, HotMail: > 100 M messages/day
  • Write-intensive, I/O bound with low locality
  • Simple API, inherent parallelism, weak
    consistency
  • Simpler than a DDBMS or DFS
  • Scalability goals
  • Performance: scale linearly with cluster size
  • Manageability: automatic reconfiguration
  • Availability: gracefully survive multiple failures

8
Conventional mail services
Static partitioning of mailboxes - on top of
a FS or DBMS
Performance problems - no dynamic load
balancing
Manageability problems -static, manual
partitioning decisions
Availability problems -If a server goes down,
part of the user population cannot access
their mailboxes
9
Functional Homogeneity
  • Any node can perform any task
  • Interaction with mail clients
  • Mail storage
  • Any piece of data can be managed at any node
  • Techniques
  • Replication & reconfiguration
  • Gracefully survive failures
  • Load balancing
  • Masking of skews in workload & cluster
    configuration
  • Dynamic task & data placement decisions
  • Messages for a single user can be scattered
    across multiple nodes & collected only upon request

10
Architecture
(Architecture figure: protocol handling, user lookup,
load balancing, message store access)
11
Operation
  • Incoming request: send msg to userX
  • DNS/RR selection of node (A)
  • Who manages userX? (B)
  • A issues request to B for user verification
  • B knows where userX's messages are kept (C, D)
  • A picks best node for the new msg (D)
  • D stores the new msg

Each user is managed by a node, and all nodes
must agree on who is managed where
Partitioning of the user population uses a hash
function
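A minimal sketch of such a hash-partitioned user map (the bucket count, hash choice, and node names here are illustrative assumptions, not Porcupine's actual parameters):

```python
import hashlib

# Illustrative user map: fixed hash buckets, each assigned to a node.
# Porcupine's real user map is soft state rebuilt by the membership protocol;
# the bucket count and node names below are made up for the example.
NUM_BUCKETS = 256
nodes = ["node-A", "node-B", "node-C", "node-D"]
user_map = {b: nodes[b % len(nodes)] for b in range(NUM_BUCKETS)}

def manager_of(user: str) -> str:
    """Return the node that manages this user's profile and mail map."""
    bucket = int(hashlib.md5(user.encode()).hexdigest(), 16) % NUM_BUCKETS
    return user_map[bucket]

print(manager_of("userX"))   # every node computes the same answer
```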
12
Strategy for scalable performance
  • Avoid creating hot spots
  • Partition data uniformly among nodes
  • Fine-grain data partition
  • Experimental results
  • 30-node cluster (PCs with Linux)
  • Synthetic load
  • derived from University mail server logs
  • Comparison with sendmail+popd
  • Sustains 800 msgs/sec (68 M msgs/day)
  • As compared to 250 msgs/sec (25 M msgs/day)

13
Strategy for reconfiguration (I)
  • Hard state: messages, user profiles
  • Fine-grain optimistic replication
  • Soft state: user map, mail map
  • Reconstructed after reconfiguration
  • Membership protocol
  • initiated when a crash is detected
  • Update of user map data structures
  • Broadcast of updated user map
  • Distributed disk scan
  • Each node scans its local FS for msgs owned by
    moved users

Total amount of mail map info. that needs to be
recovered from disks is equal to that stored on
the crashed node -> independent of cluster size
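A sketch of the reconfiguration idea, continuing the illustrative user map from the previous sketch (the function names and reporting format are assumptions, not Porcupine's actual protocol):

```python
def reassign_buckets(user_map, nodes, dead_node):
    """Membership change: hand the dead node's hash buckets to survivors."""
    survivors = [n for n in nodes if n != dead_node]
    for bucket, owner in user_map.items():
        if owner == dead_node:
            user_map[bucket] = survivors[bucket % len(survivors)]
    return user_map

def scan_local_spool(local_messages, manager_of):
    """One node's share of the distributed disk scan: group the spool
    fragments it holds by each user's (possibly new) manager, so managers
    can rebuild their mail maps."""
    report = {}
    for user, fragment in local_messages:   # e.g. [("userX", "/spool/frag17"), ...]
        report.setdefault(manager_of(user), {}).setdefault(user, []).append(fragment)
    return report                           # sent to each manager listed in the report
```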
14
Strategy for reconfiguration (II)
15
Hard-state replication (I)
  • Internet semantics
  • Optimistic, eventually consistent replication
  • Per-message, per-user profile replication
  • Small window of inconsistency
  • A user may see stale data
  • Efficient during normal operation
  • For each request, a coordinator node pushes
    updates to other nodes
  • If another node crashes, the coordinator simply
    waits for its recovery to complete the update
  • But does not block
  • Coordinator crash
  • A replica node will detect it & take over
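The non-blocking update push described above might look roughly like this sketch (the retry queue and the send_rpc helper are assumptions; the real protocol also carries per-object timestamps and update logs):

```python
import queue
import time

retry_queue = queue.Queue()   # updates owed to currently unreachable replicas

def push_update(replicas, update, send_rpc):
    """Coordinator pushes an update to every replica of the object; it does
    not block on a crashed peer, it just remembers to retry later."""
    for node in replicas:
        try:
            send_rpc(node, update)
        except ConnectionError:
            retry_queue.put((node, update))   # completed once the node recovers

def retry_pending(send_rpc):
    """Background loop: keep re-trying updates owed to recovering replicas."""
    while True:
        node, update = retry_queue.get()
        try:
            send_rpc(node, update)
        except ConnectionError:
            retry_queue.put((node, update))   # still down
            time.sleep(1.0)                   # back off before the next attempt
```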

16
Hard-state replication (II)
Less than linear degradation -> disk logging
overhead can be reduced by using a separate
disk (or NVRAM)
17
Strategy for load balancing
  • Deciding where to store msgs
  • Spread: soft limit on the number of nodes per mailbox
  • This limit is violated when nodes crash
  • Select node from spread candidates
  • Small spread -> better data affinity
  • Smaller mail map data structure
  • More streamlined disk head movement
  • High spread -> better load balancing
  • More choices for selection
  • Load measure: pending I/O operations (see the
    sketch after this list)
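A sketch of spread-based placement (the spread value, node names, and load table are illustrative; the load measure is the pending I/O count, as noted above):

```python
def pick_storage_node(mail_map_nodes, all_nodes, pending_io, spread=4):
    """Choose a node to store a new message for one user.
    Prefer nodes already holding the user's mail (affinity); only if fewer
    than `spread` such nodes exist may new nodes join the candidate set."""
    candidates = list(mail_map_nodes)
    if len(candidates) < spread:
        extras = [n for n in all_nodes if n not in candidates]
        extras.sort(key=lambda n: pending_io[n])
        candidates += extras[: spread - len(candidates)]
    return min(candidates, key=lambda n: pending_io[n])   # least-loaded wins

# Example: userX's mail currently lives on B and D; A is idle but outside the spread.
load = {"A": 0, "B": 3, "C": 5, "D": 1}
print(pick_storage_node(["B", "D"], list(load), load, spread=2))   # -> "D"
```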

18
Handling heterogeneity
Better utilization of fast disks (x3 speed)
19
Inktomi
  • Derivative of the NOW project at UCB
  • Led to commercial services
  • TranSend, HotBot
  • The case against distributed systems ?
  • BASE semantics, instead of ACID
  • Basically available
  • Tolerate (occasionally) stale data
  • Soft state
  • Reconstructed during recovery
  • Eventually consistent
  • Responses can be approximate
  • Centralized work queues
  • Scalable !

20
Why not ACID ?
  • Much of the data in a network service can
    tolerate guarantees weaker than ACID
  • ACID makes no guarantees about availability
  • Indeed, it is preferable for an ACID service to
    be unavailable than to relax the ACID
    constraints
  • ACID is well suited for
  • Commerce transactions, billing users, maintenance
    of user profiles, ...
  • For most Internet information services, the users
    value availability more than strong consistency
    or durability
  • Web servers, search/aggregation servers,
    caching/transformation proxies, ...

21
Cluster architecture (I)
  • Front-ends
  • Supervision of incoming requests
  • Matching requests with profiles (customization
    DB)
  • Enqueue requests for service by one or more
    workers
  • Worker pool
  • Caches service-specific modules
  • Customization DB
  • Manager
  • Balancing load across workers, spawning more
    workers as load fluctuates or faults occur
  • System Area Network (SAN)
  • Graphical Monitor
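The request flow through this architecture can be sketched roughly as below: a queue-based hand-off between front-ends and workers (the names, data shapes, and threading model are illustrative assumptions, not the actual TACC implementation):

```python
import queue

work_queue = queue.Queue()          # centralized work queue, visible to the Manager

def front_end(request, profile_db):
    """Front-end: match the request with the user's profile and enqueue it."""
    task = {"request": request, "profile": profile_db.get(request["user"], {})}
    work_queue.put(task)

def worker(worker_id, transform):
    """Worker: pull tasks, run a service-specific module, loop forever."""
    while True:
        task = work_queue.get()
        result = transform(task["request"], task["profile"])
        print(f"worker {worker_id} handled {task['request']['url']} -> {len(result)} bytes")
        work_queue.task_done()

# A Manager could watch work_queue.qsize() and spawn more worker threads when
# the backlog grows (the overflow pool described on the following slides).
```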

22
Cluster architecture (II)
(Architecture figure: caches, worker stubs)
23
Networked, commodity workstations
  • Incremental growth
  • Automated, centralized administration and
    monitoring.
  • Boot image management
  • Software load management
  • Firewall settings
  • Console access
  • Application configuration
  • Convenient for supporting a system component:
    consider cost of decomposition versus
    efficiencies of SMP
  • Partial failure
  • Shared state
  • Distributed shared memory abstraction

24
3-layer decomposition
  • Service
  • User interface to control service
  • Device-specific presentation
  • Allow workers to remain stateless
  • TACC API
  • Transformation
  • Filtering, transcoding, re-rendering, encryption,
    compression
  • Aggregation
  • Composition (pipeline chaining) of stateless
    modules
  • Caching
  • Original, post-aggregation, post-transformation
    data
  • Customization
  • SNS: Scalable Network Service support
  • Worker load balancing, overflow management
  • Fault-tolerance
  • System monitoring & logging
  • Incremental scalability

25
Load management
  • Load balancing hints
  • Computed by Manager, based on load measurements
    from workers
  • Periodically transmitted to front-ends
  • Overflow pool
  • Absorb load bursts
  • Relatively rare, but prolonged
  • e.g. Pathfinder landing on Mars -> 220 M hits in
    4 days
  • Spare machines on which the Manager can spawn
    workers on demand
  • Workers are interchangeable

26
Fault tolerance & availability
  • Construct robust entities by relying on cached
    soft state, refreshed by periodic messages
  • Transient component failures are a fact of life
  • Process peer fault tolerance
  • When a component fails, one of its peers restarts
    it (possibly on a different node)
  • In the meantime, cached state (possibly stale) is
    still available to the surviving components
  • A restarted component gradually reconstructs its
    soft state
  • Typically by listening to multicasts from others
  • Not the same as process pairs
  • Process pairs require hard state
  • Timeouts to infer failures
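A minimal sketch of timeout-based failure detection over periodic beacons (the interval and timeout values are arbitrary; a real system tunes these and uses IP multicast rather than in-memory dictionaries):

```python
import time

BEACON_INTERVAL = 1.0      # seconds between beacons (illustrative value)
FAILURE_TIMEOUT = 3.0      # silence longer than this -> presume the peer failed

last_heard = {}            # peer -> timestamp of its last beacon
soft_state_cache = {}      # peer -> last announced state (possibly stale)

def on_beacon(peer, state):
    """Handle a periodic multicast: refresh liveness info and cached soft state.
    A restarted node rebuilds soft_state_cache simply by listening to these."""
    last_heard[peer] = time.time()
    soft_state_cache[peer] = state

def suspected_failures():
    """Peers not heard from within the timeout are presumed failed."""
    now = time.time()
    return [p for p, t in last_heard.items() if now - t > FAILURE_TIMEOUT]
```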

27
The HotBot search engine
  • SPARCstations, interconnected via Myrinet
  • Front-ends
  • 50-80 threads per node
  • dynamic HTML (TCL script tags)
  • Load balancing
  • Static partition of search DB
  • Every query goes to all workers in parallel
  • Workers are not 100% interchangeable
  • Each worker has a local disk
  • Version 1: DB fragments are cross-mounted
  • So that other nodes can reach the data, with
    graceful performance degradation
  • Version 2: RAID
  • With 26 nodes, loss of 1 node resulted in the
    available DB dropping from 54 M documents to 51
    M
  • Informix DB for user profiles & ad revenue
    tracking
  • Primary/backup failover

28
TranSend
  • Caching transformation proxy
  • SPARCstations interconnected via 10 Mb/s switched
    Ethernet; dialup pool
  • Thread per TCP connection
  • Single front-end, with a total of 400 threads
  • Pipelining of distillers
  • Lossy-compression workers
  • Centralized Manager
  • Periodic IP multicast to announce its presence
  • No static binding is required for workers
  • Workers periodically report a load metric
  • distiller's queue length, weighted by a factor
    reflecting the expected execution cost (sketched
    below)
  • Version 1: process-pair recovery
  • Version 2: soft state, with a watcher process &
    periodic beacon of state updates
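The distiller load metric could be computed roughly as follows (the cost table is a made-up placeholder for the "expected execution cost" factor):

```python
# Illustrative weights standing in for expected execution cost per task type.
EXPECTED_COST = {"gif": 1.0, "jpeg": 1.5, "html": 0.3}

def load_metric(queued_tasks):
    """Distiller load = queue length weighted by expected execution cost."""
    return sum(EXPECTED_COST.get(task_type, 1.0) for task_type in queued_tasks)

print(load_metric(["jpeg", "jpeg", "html"]))   # reported periodically to front-ends
```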

29
Internet service workloads
  • Yahoo: 625 M page views/day
  • HTML: 7 KB, images: 10 KB
  • AOL's proxy: 5.2 B requests/day
  • Response size: 5.5 KB
  • Services often take 100s of millisecs
  • Responses take several seconds to flow back
  • High task throughput & non-negligible latency
  • A service may have to sustain 1000s of
    simultaneous tasks
  • C10K problem
  • Human users: ~4 parallel HTTP GET requests
    spawned per page view
  • A large fraction of service tasks are independent
    of each other

30
DDS (S.D. Gribble)
  • Self-managing, cluster-based data repository
  • Seen by services as a conventional data structure
  • Log, tree, hash table
  • High performance
  • 60 K reads/sec, over 1.28 TB of data
  • 128-node cluster
  • The CAP principle
  • A system can have at most two of the following
    properties
  • Consistency
  • Availability
  • Tolerance to network Partitions
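A rough sketch of how a service might see a cluster-wide hash table partitioned into replica groups (the node names, partition map, and routing details are simplified assumptions, not the actual DDS API):

```python
import hashlib

# Partition map: key-space partition -> replica group (a list of nodes).
partitions = {
    0: ["n0", "n1"],
    1: ["n2", "n3"],
    2: ["n4", "n5"],
    3: ["n6", "n7"],
}

def partition_of(key: bytes) -> int:
    return int(hashlib.sha1(key).hexdigest(), 16) % len(partitions)

def dds_get(key: bytes, rpc_get):
    """Read from a replica in the key's group (any replica could serve it)."""
    group = partitions[partition_of(key)]
    return rpc_get(group[0], key)

def dds_put(key: bytes, value: bytes, rpc_put):
    """Write reaches every replica in the group (two-phase in the real DDS)."""
    for node in partitions[partition_of(key)]:
        rpc_put(node, key, value)
```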

31
CAP trade-offs
32
SEDA (M. Welsh)
  • Staged, event-driven architecture
  • Service = stages linked via queues
  • Thread pool per stage
  • Massive concurrency
  • Admission & priority control on each individual
    queue
  • Adaptive load balancing
  • Feedback loop
  • No a-priori resource limits
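A toy SEDA-style stage might look like the following (the thread count, queue bound, and admission check are illustrative; the real SEDA controllers adapt these dynamically):

```python
import queue
import threading

class Stage:
    """An event queue plus a small thread pool running one event handler."""
    def __init__(self, name, handler, threads=4, max_queue=1000):
        self.name, self.handler = name, handler
        self.events = queue.Queue(maxsize=max_queue)   # bounded: admission control
        for _ in range(threads):
            threading.Thread(target=self._loop, daemon=True).start()

    def enqueue(self, event, next_stage=None):
        try:
            self.events.put_nowait((event, next_stage))
            return True
        except queue.Full:                             # shed or degrade under overload
            return False

    def _loop(self):
        while True:
            event, next_stage = self.events.get()
            result = self.handler(event)
            if next_stage is not None:
                next_stage.enqueue(result)             # stages linked via queues
```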

33
Overhead of concurrency (I)
34
Overhead of concurrency (II)
35
SEDA architecture (I)
36
SEDA architecture (II)
37
SEDA architecture (III)
38
Clusters for Internet Services
  • Previous observation (TACC, Inktomi, NOW)
  • Clusters of workstations are a natural platform
    for constructing Internet services
  • Internet service properties
  • support large, rapidly growing user populations
  • must remain highly available, and cost-effective
  • Clusters offer a tantalizing solution
  • incremental scalability: cluster grows with the
    service
  • natural parallelism: high-performance platform
  • software and hardware redundancy: fault-tolerance

39
Software troubles
  • Internet service construction on clusters is hard
  • load balancing, process management,
    communications abstractions, I/O balancing,
    fail-over and restart, ...
  • toolkits proposed to help (TACC, AS1, River, ...)
  • Even harder if shared, persistent state is
    involved
  • data partitioning, replication, and consistency,
    interacting with the storage subsystem, ...
  • solutions not geared to clustered services
  • use a (distributed) RDBMS: expensive, powerful
    semantic guarantees, generality at the cost of
    performance
  • use a network/distributed FS: overly general, high
    overhead (e.g. double buffering penalties).
    Fault-tolerance?
  • Roll your own: custom solution, not reusable,
    complex

40
Idea / Hypothesis
  • It is possible to
  • isolate clustered services from vagaries of state
    mgmt.,
  • to do so with adequately general abstractions,
  • to build those abstractions in a layered fashion
    (reuse),
  • and to exploit clusters for performance, and
    simplicity.
  • Scalable Distributed Data Structures (SDDS)
  • take a conventional data structure
  • hash table, tree, log, ...
  • partition it across nodes in a cluster
  • parallel access, scalability, ...
  • replicate partitions within replica groups in the
    cluster
  • availability in the face of failures, further
    parallelism
  • store replicas on disk
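The "conventional data structure" view the service author gets could be as thin as a dict-like facade that hides partitioning and replication behind a client library (a hypothetical interface for illustration, not the actual SDDS API):

```python
class SDDSHashTable:
    """Dict-like facade over a partitioned, replicated cluster hash table.
    The client library (not the service code) decides which replica group
    holds each key and how writes reach every replica in that group."""
    def __init__(self, client):
        self.client = client              # hypothetical cluster client library

    def __getitem__(self, key):
        return self.client.get(key)       # routed to the key's replica group

    def __setitem__(self, key, value):
        self.client.put(key, value)       # applied to all replicas in the group

# Service code then reads like single-node code:
#   sessions = SDDSHashTable(client)
#   sessions["userX"] = profile_bytes
```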

41
Why SDDS?
  • Fundamental software engineering principle
  • Separation of concerns
  • decouple persistency/consistency logic from rest
    of service
  • simpler (and cleaner!) service implementations
  • Service authors understand data structures
  • familiar behavior and interfaces from single-node
    case
  • should enable rapid development of new services
  • Structure access patterns are self-evident
  • access granularity manifestly a structure element
  • coincidence of logical and physical data units
  • cf. file systems, SQL in RDBMS, VM pages in DSM

42
SDDS Challenges
  • Overcoming complexities of distributed systems
  • data consistency, data distribution, request load
    balancing, hiding network latency and OS
    overhead, ...
  • ace up the sleeve: cluster ≠ wide area
  • single, controlled administrative domain
  • engineer to (probabilistically) avoid network
    partitions
  • use a low-latency, high-throughput SAN (5 µs,
    40-120 MB/s)
  • predictable behavior, controlled heterogeneity
  • I/O is still a problem
  • Plenty of work on fast network I/O
  • some on fast disk I/O
  • Less work bridging network and disk I/O in a
    cluster environment

Segment-based cluster I/O layer: filtered
streams between disks, network, memory
43
Segment layer (motivation)
  • It's all about disk bandwidth & avoiding seeks
  • 8 ms random seek, 25-80 MB/s throughput
  • must read 320 KB per seek to break even (see the
    worked example after this list)
  • Build a disk abstraction layer based on segments
  • 1-2 MB regions on disk, read and written in their
    entirety
  • force upper layers to design with this in mind
  • small reads/writes treated as an uncommon failure
    case
  • SAN throughput is comparable to disk throughput
  • Stream from disk to network: saturate both
    channels
  • stream through service-specific filter functions
  • selection, transformation, ...
  • Apply lessons from high-performance networks
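The break-even figure follows directly from the numbers above; a one-line check, using the mid-range 40 MB/s value:

```python
seek_time_s = 0.008          # 8 ms random seek
bandwidth_bps = 40e6         # 40 MB/s, mid-range of the 25-80 MB/s quoted above

# Data that could have been streamed during one seek: reads smaller than this
# spend more time seeking than transferring.
break_even_bytes = seek_time_s * bandwidth_bps
print(break_even_bytes)      # 320_000 bytes = 320 KB
```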

44
Taxonomy of Clustered Services
Service classes and their state mgmt. requirements:
  • Stateless: little or none
  • Soft state: high availability, perhaps
    consistency; persistence is an optimization
  • Persistent state: high availability and
    completeness, perhaps consistency; persistence
    necessary
Examples across the spectrum:
  • TACC distillers, TACC aggregators
  • Inktomi search engine
  • River modules, AS1 servents or RMX
  • Scalable PIM apps
  • Video Gateway, Squid web cache, HINDE mint
45
Clustering
  • Goal
  • Take a cluster of commodity workstations & make
    them look like a supercomputer
  • Problems
  • Application structure
  • Partial failure management
  • Interconnect technology
  • System administration

46
Cluster Prehistory Tandem NonStop
  • Early (1974) foray into transparent fault
    tolerance through redundancy
  • Mirror everything (CPU, storage, power supplies)
  • can tolerate any single fault (later processor
    duplexing)
  • Hot standby process pair approach
  • What's the difference between high availability &
    fault tolerance?
  • Noteworthy
  • Shared nothing -- why?
  • Performance and efficiency costs?
  • Later evolved into Tandem Himalaya
  • used clustering for both higher performance &
    higher availability

47
Pre-NOW Clustering in the 90s
  • IBM Parallel Sysplex and DEC OpenVMS
  • Targeted at conservative (read mainframe)
    customers
  • Shared disks allowed under both (why?)
  • All devices have cluster-wide names (shared
    everything?)
  • 1500 installations of Sysplex, 25,000 of OpenVMS
    Cluster
  • Programming the clusters
  • All System/390 and VAX VMS subsystems were
    rewritten to be cluster-aware
  • OpenVMS cluster support exists even in
    single-node OS!
  • An advantage of locking into proprietary
    interfaces
  • What about fault tolerance?

48
The Case for NOW: MPPs a Near Miss
  • Uniprocessor performance improves by 50% / yr
    (about 4% / month)
  • 1-year lag: WS = 1.50x MPP node perf.
  • 2-year lag: WS = 2.25x MPP node perf.
  • No economy of scale in 100s =>
  • Software incompatibility (OS & apps) =>
  • More efficient utilization of compute resources
  • statistical multiplexing
  • Scale makes availability affordable (Pfister)
  • Which of these do commodity clusters actually
    solve?
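The lag figures are just compound growth at 50% per year; as a quick check:

```latex
% Workstation vs. MPP node performance, assuming 50%/yr uniprocessor
% improvement and an n-year lag between commodity parts and MPP integration:
\[
  \frac{\text{WS perf.}}{\text{MPP node perf.}} = 1.5^{\,n}
  \qquad\Rightarrow\qquad 1.5^{1} = 1.50,\quad 1.5^{2} = 2.25 .
\]
```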

49
Philosophy Systems of Systems
  • Higher Order systems research
  • aggressively use off-the-shelf hardware & OS
    software
  • Advantages
  • easier to track technological advances
  • less development time
  • easier to transfer technology (reduce lag)
  • New challenges (the case against NOW)
  • maintaining performance goals
  • system is changing underneath you
  • underlying system has other people's bugs
  • underlying system is poorly documented

50
Clusters: Enhanced Standard Litany
  • Software engineering
  • Partial failure management
  • Incremental scalability
  • System administration
  • Heterogeneity
  • Hardware redundancy
  • Aggregate capacity
  • Incremental scalability
  • Absolute scalability
  • Price/performance sweet spot

51
Clustering Internet Services
  • Aggregate capacity
  • TBs of disk storage, THz of compute power
  • If only we could harness it in parallel!
  • Redundancy
  • Partial failure behavior: only small fractional
    degradation from loss of one node
  • Availability: industry average across large
    sites during the 1998 holiday season was 97.2%
    availability (source: CyberAtlas)
  • Compare: mission-critical systems have four
    nines (99.99%)

52
Spike Absorption
  • Internet traffic is self-similar
  • Bursty at all granularities less than about 24
    hours
  • What's bad about burstiness?
  • Spike Absorption
  • Diurnal variation
  • Peak vs. average demand typically a factor of 3
    or more
  • Starr Report: CNN peaked at 20 M hits/hour
    (compared to a usual peak of 12 M hits/hour, a
    66% increase)
  • The real holy grail: capacity on demand
  • Is this realistic?

53
Diurnal Cycle (UCB dialups, Jan. 1997)
  • 750 modems at UC Berkeley
  • Instrumented early 1997

54
Clustering Internet Workloads
  • Internet vs. traditional workloads
  • e.g. Database workloads (TPC benchmarks)
  • e.g. traditional scientific codes (matrix
    multiply, simulated annealing and related
    simulations, etc.)
  • Some characteristic differences
  • Read mostly
  • Quality of service (best-effort vs. guarantees)
  • Task granularity
  • Embarrassingly parallel
  • but are they balanced?

55
Meeting the Cluster Challenges
  • Software programming models
  • Partial failure and application semantics
  • System administration

56
Software Challenges (I)
  • Message passing: Active Messages
  • Shared memory: Network RAM
  • CC-NUMA, Software DSM
  • MP vs. SM: a long-standing religious debate
  • Arbitrary object migration (network
    transparency)
  • What are the problems with this?
  • Hints: RPC, checkpointing, residual state

57
Software Challenges (II)
  • Real issue: we have to think differently about
    programming
  • to harness clusters?
  • to get decent failure semantics?
  • to really exploit software modularity?
  • Traditional uniprocessor programming
    idioms/models don't seem to scale up to clusters
  • Question: is there a natural-to-use cluster
    model that scales down to uniprocessors?
  • If so, is it general or application-specific?
  • What would be the obstacles to adopting such a
    model?

58
Partial Failure Management
  • What does partial failure mean for
  • a transactional database?
  • A read-only database striped across cluster
    nodes?
  • A compute-intensive shared service?
  • What are appropriate partial failure
    abstractions?
  • Incomplete/imprecise results?
  • Longer latency?
  • What current programming idioms make partial
    failure hard?

59
Cluster System Administration (I)
  • Total cost of ownership (TCO) is way too high for
    clusters
  • Median sysadmin cost per machine per year (1996):
    $700
  • Cost of a headless workstation today: $1,500
  • Previous solutions
  • Pay someone to watch
  • Ignore, or wait for someone to complain
  • Shell Scripts From Hell
  • not general -> vast repeated work
  • Need an extensible and scalable way to automate
    the gathering, analysis, and presentation of data

60
Cluster System Administration (II)
  • Extensible, Scalable Monitoring for Clusters of
    Computers (Anderson & Patterson, UC Berkeley)
  • Relational tables allow the properties & queries
    of interest to evolve as the cluster evolves
  • Extensive visualization support allows humans to
    make sense of masses of data
  • Multiple levels of caching decouple data
    collection from aggregation
  • Data updates can be pulled on demand or
    triggered by push

61
References
  • Y. Saito, B.N. Bershad, and H.M. Levy,
    "Manageability, availability and performance in
    Porcupine: a highly scalable, cluster-based mail
    service", Proc. 17th ACM SOSP, 1999.
  • S.D. Gribble, E.A. Brewer, J.M. Hellerstein, and
    D. Culler, "Scalable, distributed data
    structures for Internet service construction",
    Proc. 4th OSDI, 2000.
  • A. Fox, S.D. Gribble, Y. Chawathe, E.A. Brewer,
    and P. Gauthier, "Cluster-based scalable network
    services", Proc. 16th ACM SOSP, 1997.