1
CS-556 Distributed Systems
Case Studies of Scalable Systems: Porcupine,
Inktomi (& a taste of SABRE)
  • Manolis Marazakis
  • maraz@csd.uoc.gr

2
Porcupine
  • A highly available cluster-based mail service
  • Built out of commodity H/W
  • Mail is important, hard, and easy
  • Real demand
  • AOL, HotMail: > 100 M messages/day
  • Write-intensive, I/O bound with low locality
  • Simple API, inherent parallelism, weak
    consistency
  • Simpler than a DDBMS or DFS
  • Scalability goals
  • Performance: scales linearly with cluster size
  • Manageability: automatic reconfiguration
  • Availability: gracefully survive multiple failures

3
Conventional mail services
Static partitioning of mailboxes - on top of
a FS or DBMS
Performance problems - no dynamic load
balancing
Manageability problems - static, manual
partitioning decisions
Availability problems - if a server goes down,
part of the user population cannot access
their mailboxes
4
Functional Homogeneity
  • Any node can perform any task
  • Interaction with mail clients
  • Mail storage
  • Any piece of data can be managed at any node
  • Techniques
  • Replication & reconfiguration
  • Gracefully survive failures
  • Load balancing
  • Masking of skews in workload & cluster
    configuration
  • Dynamic task & data placement decisions
  • Messages for a single user can be scattered on
    multiple nodes & collected only upon request

5
Architecture
(figure: each node runs protocol handling, user
lookup, load balancing, and message store access
components)
6
Operation
  • Incoming request: send msg to userX
  • DNS/RR selection of node (A)
  • Who manages userX ? (B)
  • A issues request to B for user verification
  • B knows where userX's messages are kept (C, D)
  • A picks best node for the new msg (D)
  • D stores the new msg

Each user is managed by exactly one node, and all
nodes must agree on who is managed where.
Partitioning of the user population is done with a
hash function.
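
The following is a minimal sketch, in Python, of the hash-partitioned user
map idea described above; it is not Porcupine's code, and the bucket count
and node names are assumed for illustration.

    import hashlib

    USER_MAP_SIZE = 256                                  # number of buckets (assumed value)
    live_nodes = ["node-A", "node-B", "node-C", "node-D"]

    # Every node holds the same user map: bucket -> managing node. It is soft
    # state, rebuilt whenever cluster membership changes.
    user_map = [live_nodes[b % len(live_nodes)] for b in range(USER_MAP_SIZE)]

    def bucket_of(user: str) -> int:
        return int(hashlib.md5(user.encode()).hexdigest(), 16) % USER_MAP_SIZE

    def manager_of(user: str) -> str:
        """Node that manages this user's profile and mail map."""
        return user_map[bucket_of(user)]

    print(manager_of("userX"))                           # all nodes agree on the answer
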
7
Strategy for scalable performance
  • Avoid creating hot spots
  • Partition data uniformly among nodes
  • Fine-grain data partition
  • Experimental results
  • 30-node cluster (PCs with Linux)
  • Synthetic load
  • derived from University server logs
  • Comparison with sendmail+popd
  • Sustains 800 msgs/sec (68 M msgs/day)
  • As compared to 250 msgs/sec (25 M msgs/day)

8
Strategy for reconfiguration (I)
  • Hard state: messages, user profiles
  • Fine-grain optimistic replication
  • Soft state: user map, mail map
  • Reconstructed after reconfiguration
  • Membership protocol
  • initiated when a crash is detected
  • Update of user map data structures
  • Broadcast of updated user map
  • Distributed disk scan
  • Each node scans local FS for msgs owned by
    moved users

Total amount of mail map info that needs to be
recovered from disks is equal to that stored on
the crashed node -> independent of cluster size
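
A hedged sketch of this reconfiguration step, reusing the user_map/bucket_of
names from the earlier sketch: only the crashed node's buckets are reassigned,
and each surviving node scans its local store for messages of the moved users.

    def reassign_buckets(user_map, crashed, survivors):
        """Reassign only the buckets owned by the crashed node."""
        return [survivors[b % len(survivors)] if owner == crashed else owner
                for b, owner in enumerate(user_map)]

    def rebuild_mail_map(local_messages, moved_buckets, bucket_of):
        """Distributed disk scan, one node's share: report locally stored
        messages whose owners hashed into a reassigned bucket."""
        report = {}
        for user, msg_id in local_messages:              # (user, message id) pairs on local disk
            if bucket_of(user) in moved_buckets:
                report.setdefault(user, []).append(msg_id)
        return report                                    # shipped to the users' new managers
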
9
Strategy for reconfiguration (II)
10
Hard-state replication (I)
  • Internet semantics
  • Optimistic, eventually consistent replication
  • Per-message, per-user profile replication
  • Small window of inconsistency
  • A user may see stale data
  • Efficient during normal operation
  • For each request, a coordinator node pushes
    updates to other nodes
  • If another node crashes, the coordinator simply
    waits for its recovery to complete the update
  • But does not block
  • Coordinator crash
  • A replica node will detect it & take over
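
A minimal sketch of the optimistic push behaviour described above (not the
actual Porcupine update protocol); apply_locally, send, and is_alive are
assumed hooks supplied by the caller.

    from collections import defaultdict

    pending = defaultdict(list)                  # replica -> updates awaiting its recovery

    def replicate(update, replicas, apply_locally, send, is_alive):
        apply_locally(update)                    # coordinator applies the update first
        for r in replicas:
            if is_alive(r):
                send(r, update)                  # push to live replicas right away
            else:
                pending[r].append(update)        # do not block; finish after recovery

    def on_recovery(replica, send):
        for update in pending.pop(replica, []):
            send(replica, update)                # eventual consistency: replay missed updates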

11
Hard-state replication (II)
Less than linear degradation -> disk logging
overhead - can be reduced by using a separate
disk (or NVRAM)
12
Strategy for load balancing
  • Deciding where to store msgs
  • Spread: soft limit on the number of nodes per
    mailbox
  • This limit is violated when nodes crash
  • Select node from spread candidates
  • Small spread -> better data affinity
  • Smaller mail map data structure
  • More streamlined disk head movement
  • High spread -> better load balancing
  • More choices for selection
  • Load measure: pending I/O operations
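
A small sketch of the resulting placement decision, assuming the spread of
candidate nodes and a per-node count of pending I/O operations are already
known.

    def pick_store(spread_candidates, pending_io):
        """spread_candidates: nodes allowed to hold this user's mail (bounded
        by the spread limit); pending_io: dict node -> outstanding I/O ops."""
        return min(spread_candidates, key=lambda node: pending_io.get(node, 0))

    # Example: with a spread of two nodes, the less I/O-loaded one gets the message.
    print(pick_store(["node-C", "node-D"], {"node-C": 12, "node-D": 3}))     # node-D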

13
Handling heterogeneity
Better utilization of fast disks (x3 speed)
14
Inktomi
  • Derivative of the NOW project at UCB
  • Led to commercial services
  • TranSend, HotBot
  • The case against distributed systems?
  • BASE semantics, instead of ACID
  • Basically available
  • Tolerate (occasionally) stale data
  • Soft state
  • Reconstructed during recovery
  • Eventually consistent
  • Responses can be approximate
  • Centralized work queues
  • Scalable!

15
Why not ACID ?
  • Much of the data in a network service can
    tolerate guarantees weaker than ACID
  • ACID makes no guarantees about availability
  • Indeed, it is preferable for an ACID service to
    be unavailable than to relax the ACID
    constraints
  • ACID is well suited for
  • Commerce Txs, billing users, maintenance of
    user profiles, …
  • For most Internet information services, the users
    value availability more than strong consistency
    or durability
  • Web servers, search/aggregation servers,
    caching/transformation proxies, …

16
Cluster architecture (I)
  • Front-ends
  • Supervision of incoming requests
  • Matching requests with profiles (customization
    DB)
  • Enqueue requests for service by one or more
    workers
  • Worker pool
  • Caches service-specific modules
  • Customization DB
  • Manager
  • Balancing load across workers, spawning more
    workers as load fluctuates or faults occur
  • System Area Network (SAN)
  • Graphical Monitor

17
Cluster architecture (II)
(figure: front-ends, worker stubs, and per-node caches)
18
Networked, commodity workstations
  • Incremental growth
  • Automated, centralized administration and
    monitoring.
  • Boot image management
  • Software load management
  • Firewall settings
  • Console access
  • Application configuration
  • Convenient for supporting a system component:
    consider the cost of decomposition versus the
    efficiencies of an SMP
  • Partial failure
  • Shared state
  • Distributed shared memory abstraction

19
3-layer decomposition
  • Service
  • User interface to control service
  • Device-specific presentation
  • Allow workers to remain stateless
  • TACC API
  • Transformation
  • Filtering, transcoding, re-rendering, encryption,
    compression
  • Aggregation
  • Composition (pipeline chaining) of stateless
    modules (see the sketch after this list)
  • Caching
  • Original, post-aggregation, post-transformation
    data
  • Customization
  • SNS (Scalable Network Service) support
  • Worker load balancing, overflow management
  • Fault-tolerance
  • System monitoring & logging
  • Incremental scalability
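
An illustrative sketch, not the TACC API itself, of pipeline chaining of
stateless modules as mentioned under Aggregation; the two worker functions
are invented stand-ins.

    import zlib

    def lowercase(data: bytes) -> bytes:         # stand-in transformation worker
        return data.lower()

    def compress(data: bytes) -> bytes:          # stand-in compression worker
        return zlib.compress(data)

    def pipeline(*workers):
        """Chain stateless workers: the output of one feeds the next."""
        def run(data):
            for worker in workers:
                data = worker(data)
            return data
        return run

    transform = pipeline(lowercase, compress)
    print(len(transform(b"Cluster-Based Scalable Network Services")))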

20
Load management
  • Load balancing hints
  • Computed by Manager, based on load measurements
    from workers
  • Periodically transmitted to front-ends
  • Overflow pool
  • Absorb load bursts
  • Relatively rare, but prolonged
  • E.g. Pathfinder landing on Mars -> 220 M hits
    in 4 days
  • Spare machines on which the Manager can spawn
    workers on demand
  • Workers are interchangeable
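
A sketch of the manager's role under stated assumptions: worker load reports
are turned into hints for the front-ends, and a spare machine is drafted when
the whole pool is overloaded. The threshold and weighting below are
illustrative, not taken from the paper.

    OVERFLOW_THRESHOLD = 50                      # load level that triggers the overflow pool (assumed)

    def compute_hints(load_reports):
        """load_reports: dict worker -> reported load. Lower load => higher weight."""
        return {worker: 1.0 / (1 + load) for worker, load in load_reports.items()}

    def maybe_spawn_overflow(load_reports, spare_nodes, spawn_worker):
        """Spawn a worker on a spare node if every worker is overloaded."""
        if load_reports and min(load_reports.values()) > OVERFLOW_THRESHOLD:
            if spare_nodes:
                spawn_worker(spare_nodes.pop())  # workers are interchangeable, so any spare will do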

21
Fault tolerance & availability
  • Construct robust entities by relying on cached
    soft state, refreshed by periodic messages
  • Transient component failures are a fact of life
  • Process peer fault tolerance
  • When a component fails, one of its peers restarts
    it (possibly on a different node)
  • In the meantime, cached state (possibly stale) is
    still available to the surviving components
  • A restarted component gradually reconstructs its
    soft state
  • Typically by listening to multicasts from others
  • Not the same as process pair
  • Requires hard state
  • Timeouts to infer failures
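
A minimal sketch of process-peer fault tolerance as described above, assuming
components announce themselves with periodic heartbeats; the timeout value and
the restart hook are assumptions.

    import time

    HEARTBEAT_TIMEOUT = 5.0                      # seconds without a beacon => presumed dead (assumed)
    last_beat = {}                               # component -> time of last heartbeat

    def record_beat(component):
        last_beat[component] = time.time()

    def check_peers(restart):
        """Run periodically by each peer; restart any component whose beacon timed out."""
        now = time.time()
        for component, seen in list(last_beat.items()):
            if now - seen > HEARTBEAT_TIMEOUT:   # failure inferred from the timeout only
                restart(component)               # possibly on a different node
                last_beat[component] = now       # its soft state refills from later multicasts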

22
The HotBot search engine
  • SPARCstations, interconnected via Myrinet
  • Front-ends
  • 50-80 threads per node
  • dynamic HTML (TCL script tags)
  • Load balancing
  • Static partition of the search DB
  • Every query goes to all workers in parallel
    (see the sketch after this list)
  • Workers are not 100% interchangeable
  • Each worker has a local disk
  • Version 1: DB fragments are cross-mounted
  • So that other nodes can reach the data, with
    graceful performance degradation
  • Version 2: RAID
  • 26 nodes: loss of 1 node resulted in the
    available DB dropping from 54 M documents to
    51 M
  • Informix DB for user profiles & ad revenue
    tracking
  • Primary/backup failover
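
A hedged sketch of the parallel query path noted above: every query is
scattered to all DB fragments and the partial hit lists are merged, so a
fragment that cannot answer only shrinks the result set. The per-fragment
callables and the "score" field are assumptions.

    from concurrent.futures import ThreadPoolExecutor

    def query_fragment(fragment, query):
        try:
            return fragment(query)               # hits from one statically assigned DB fragment
        except Exception:
            return []                            # a lost fragment degrades results, not the service

    def search(query, fragments):
        """fragments: one callable per DB partition, each returning a list of hits."""
        with ThreadPoolExecutor(max_workers=max(1, len(fragments))) as pool:
            partials = list(pool.map(lambda frag: query_fragment(frag, query), fragments))
        return sorted((hit for part in partials for hit in part),
                      key=lambda hit: hit["score"], reverse=True)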

23
TranSend
  • Caching transformation proxy
  • SPARCstations interconnected via 10 Mb/s switched
    Ethernet & dialup pool
  • Thread per TCP connection
  • Single front-end, with a total of 400 threads
  • Pipelining of distillers
  • Lossy-compression workers
  • Centralized Manager
  • Periodic IP multicast to announce its presence
  • No static binding is required for workers
  • Workers periodically report load metric
  • distiller's queue length, weighted by a factor
    reflecting the expected execution cost (see the
    sketch after this list)
  • Version 1: process-pair recovery
  • Version 2: soft state, with a watcher process &
    periodic beacon of state updates
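
A small sketch of the reported load metric, assuming each queued task carries
a type with a known expected-cost weight; the weights below are invented for
illustration.

    EXPECTED_COST = {"image": 3.0, "html": 1.0, "text": 0.5}     # assumed cost weights

    def load_metric(queued_tasks):
        """Queue length weighted by the expected execution cost of each queued task."""
        return sum(EXPECTED_COST.get(task_type, 1.0) for task_type in queued_tasks)

    print(load_metric(["image", "image", "html"]))               # 7.0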

24
Internet service workloads
  • Yahoo: 625 M page views/day
  • HTML: 7 KB, images: 10 KB
  • AOL's proxy: 5.2 B requests/day
  • Response size: 5.5 KB
  • Services often take 100s of milliseconds
  • Responses take several seconds to flow back
  • High task throughput & non-negligible latency
  • A service may have to sustain 1000s of
    simultaneous tasks
  • C10K problem
  • Human users: 4 parallel HTTP GET requests
    spawned per page view
  • A large fraction of service tasks are independent
    of each other

25
DDS (S.D. Gribble)
  • Self-managing, cluster-based data repository
  • Seen by services as a conventional data structure
  • Log, tree, hash table (the hash-table case is
    sketched after this list)
  • High performance
  • 60 K reads/sec, over 1.28 TB of data
  • 128-node cluster
  • The CAP principle
  • A system can have at most two of the following
    properties
  • Consistency
  • Availability
  • Tolerance to network Partitions
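
An illustrative sketch, not Gribble's DDS implementation, of the idea that a
service sees a plain hash table while keys are hash-partitioned across storage
bricks; the class and the in-memory bricks are assumptions, and replication is
omitted.

    import hashlib

    class DistributedHashTable:
        def __init__(self, bricks):
            self.bricks = bricks                 # in-memory dicts standing in for storage bricks

        def _brick(self, key):
            digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
            return self.bricks[digest % len(self.bricks)]

        def put(self, key, value):
            self._brick(key)[key] = value        # a real DDS would also replicate the write

        def get(self, key):
            return self._brick(key).get(key)

    dht = DistributedHashTable([{} for _ in range(4)])
    dht.put("user:42", b"profile")
    print(dht.get("user:42"))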

26
CAP trade-offs
27
SEDA (M. Welsh)
  • Staged, event-driven architecture
  • Service decomposed into stages, linked via
    queues (sketched below)
  • Thread pool per stage
  • Massive concurrency
  • Admission priority control on each individual
    queue
  • Adaptive load balancing
  • Feedback loop
  • No a-priori resource limits
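
A minimal SEDA-style sketch: one stage couples an event queue with a small
thread pool, and stages are linked by enqueueing onto the next stage.
Admission control and the adaptive feedback loop are omitted, and all names
are illustrative.

    import queue
    import threading

    class Stage:
        def __init__(self, name, handler, threads=2):
            self.name, self.handler = name, handler
            self.events = queue.Queue()          # per-stage event queue
            for _ in range(threads):             # per-stage thread pool
                threading.Thread(target=self._loop, daemon=True).start()

        def enqueue(self, event):
            self.events.put(event)

        def _loop(self):
            while True:
                event = self.events.get()
                result = self.handler(event)
                if result is not None:           # handler may forward work to another stage
                    next_stage, next_event = result
                    next_stage.enqueue(next_event)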

28
Overhead of concurrency (I)
29
Overhead of concurrency (II)
30
SEDA architecture (I)
31
SEDA architecture (II)
32
SEDA architecture (III)
33
SABRE
  • Reservations, inventory tracking
  • Airlines, travel agencies, …
  • Descendant of CRS (Computerized Reservation
    System)
  • Hosted in a number of secure data centers
  • Connectivity with major reservation systems
  • Amadeus, Apollo, Galileo, WorldSpan, …
  • Management of PNRs
  • Passenger Name Records
  • IATA/ARC standards for ticketing
  • 2001
  • 6 K employees, in 45 countries
  • 59 K travel agencies
  • 450 airlines, 53 K hotels
  • 54 car rental companies, 8 cruise lines, 33
    railroads
  • 228 tour operators

34
History (I)
  • 1964
  • Location: New York
  • Hosted on 2 IBM 7090 mainframes
  • 84 K requests/day
  • Development cost: 400 man-years, 40 M (USD)
  • 1972
  • Location: Tulsa, Oklahoma
  • Hosted on IBM 360s
  • The switch caused 15 minutes of service
    interruption
  • 1976
  • 130 travel agencies have terminals
  • 1978
  • Storage of 1 M fares
  • 1984
  • Bargain Finder service

35
History (II)
  • 1985
  • easySabre: PCs can connect to the system as
    terminals
  • 1986
  • Automated yield management system (dynamic
    pricing)
  • 1988
  • Storage of 36 M fares
  • Can be combined: > 1 B fare options
  • 1995
  • Initiation of Y2K code inspection
  • 200 M lines of code
  • Interfaces with > 600 suppliers
  • New software for > 40 K travel agents
  • 1,200 H/W & S/W systems
  • 1996
  • Travelocity.com
  • 1998
  • Joint venture with ABACUS International
  • 7,300 travel agencies, in 16 countries (Asia)
  • 2000

36
Legacy connectivity (I)
  • Connection-oriented comm. protocol (sessions)
  • ALC: Airline Link Control protocol
  • Packet-switching
  • but not TCP/IP; usually X.25
  • Requires special H/W (network card)
  • Gradual upgrades to Frame-Relay connectivity
  • Structured message interfaces
  • Emulation of 3270 terminal
  • Pre-defined form fields
  • Integration with other systems?
  • screen scraping code
  • Message Processors: gateways that offer
    connectivity to clients that do not use supported
    terminals
  • Encapsulation of ALC over TCP/IP (sketched below)
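
A hedged sketch of the encapsulation idea only: a gateway wraps a legacy ALC
message in a length-prefixed frame and forwards it over a TCP connection. The
framing is invented for illustration; real message processors follow
vendor-specific formats.

    import socket
    import struct

    def forward_alc(alc_message: bytes, gateway_host: str, gateway_port: int) -> None:
        """Send one ALC message over TCP/IP with a 4-byte length prefix (assumed framing)."""
        with socket.create_connection((gateway_host, gateway_port)) as sock:
            sock.sendall(struct.pack("!I", len(alc_message)) + alc_message)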

37
Legacy connectivity (II)
There is a large market for gateway/connectivity
products - e.g. www.datalex.com
38
References
  • Y. Saito, B.N. Bershad, and H.M. Levy, "Manageability,
    availability and performance in Porcupine: a highly
    scalable, cluster-based mail service", Proc. 17th ACM
    SOSP, 1999.
  • S.D. Gribble, E.A. Brewer, J.M. Hellerstein, and
    D. Culler, "Scalable, distributed data structures for
    Internet service construction", Proc. 4th OSDI, 2000.
  • A. Fox, S.D. Gribble, Y. Chawathe, E.A. Brewer, and
    P. Gauthier, "Cluster-based scalable network services",
    Proc. 16th ACM SOSP, 1997.
  • http://www.sabre.com