Transcript and Presenter's Notes

Title: CS556: Distributed Systems


1
CS-556 Distributed Systems
Clusters for Internet Services
  • Manolis Marazakis
  • maraz@csd.uoc.gr

2
The HotBot search engine
  • SPARCstations, interconnected via Myrinet
  • Front-ends
  • 50-80 threads per node
  • dynamic HTML (TCL script tags)
  • Load balancing
  • Static partition of search DB
  • Every query goes to all workers in parallel
    (see the scatter/gather sketch below)
  • Workers are not 100% interchangeable
  • Each worker has a local disk
  • Version 1: DB fragments are cross-mounted
  • So that other nodes can reach the data, with
    graceful performance degradation
  • Version 2: RAID
  • 26 nodes: loss of 1 node resulted in the
    available DB dropping from 54 M documents to 51 M
  • Informix DB for user profiles & ad revenue
    tracking
  • Primary/backup failover
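A minimal sketch of the scatter/gather pattern described above, assuming a hypothetical Worker stub per DB partition (this is not HotBot's actual code): the front-end fans each query out to every partition in parallel and merges whatever partial results come back, so a slow or failed worker only drops its fragment of the DB.

import java.util.*;
import java.util.concurrent.*;

/** Illustrative scatter/gather front-end: every query goes to all workers in
 *  parallel, and partial results from the static DB partitions are merged.
 *  Worker is a hypothetical stand-in for an RPC stub. */
class ScatterGatherFrontEnd {
    interface Worker { List<String> search(String query); }

    private final List<Worker> workers;                     // one per DB partition
    private final ExecutorService pool = Executors.newCachedThreadPool();

    ScatterGatherFrontEnd(List<Worker> workers) { this.workers = workers; }

    List<String> query(String q) throws InterruptedException {
        List<Callable<List<String>>> calls = new ArrayList<>();
        for (Worker w : workers) calls.add(() -> w.search(q));      // scatter

        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : pool.invokeAll(calls, 2, TimeUnit.SECONDS)) {
            try {
                merged.addAll(f.get());                              // gather
            } catch (CancellationException | ExecutionException e) {
                // a slow or failed worker only removes its DB fragment from
                // the result: graceful performance degradation
            }
        }
        return merged;
    }
}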

3
Internet service workloads
  • Yahoo!: 625 M page views / day
  • HTML: 7 KB, Images: 10 KB
  • AOL's proxy: 5.2 B requests / day
  • Response size: 5.5 KB
  • Services often take 100s of millisecs
  • Responses take several seconds to flow back
  • High task throughput & non-negligible latency
  • A service may have to sustain 1000s of
    simultaneous tasks
  • C10K problem
  • Human users: 4 parallel HTTP GET requests
    spawned per page view
  • A large fraction of service tasks are independent
    of each other

4
Clustering Holy Grail
  • Goal
  • Take a cluster of commodity workstations and make
    them look like a supercomputer.
  • Problems
  • Application structure
  • Partial failure management
  • Interconnect technology
  • System administration

5
Cluster Prehistory: Tandem NonStop
  • Early (1974) foray into transparent fault
    tolerance through redundancy
  • Mirror everything (CPU, storage, power supplies)
  • can tolerate any single fault (later processor
    duplexing)
  • Hot standby process pair approach
  • What's the difference between high availability
    and fault tolerance?
  • Noteworthy
  • Shared nothing -- why?
  • Performance and efficiency costs?
  • Later evolved into Tandem Himalaya
  • used clustering for both higher performance and
    higher availability

6
Pre-NOW Clustering in the '90s
  • IBM Parallel Sysplex and DEC OpenVMS
  • Targeted at conservative (read mainframe)
    customers
  • Shared disks allowed under both (why?)
  • All devices have cluster-wide names (shared
    everything?)
  • 1500 installations of Sysplex, 25,000 of OpenVMS
    Cluster
  • Programming the clusters
  • All System/390 and/or VAX VMS subsystems were
    rewritten to be cluster-aware
  • OpenVMS cluster support exists even in
    single-node OS!
  • An advantage of locking into proprietary
    interfaces
  • What about fault tolerance?

7
The Case For NOW: MPPs a Near Miss
  • Uniprocessor performance improves by 50% / yr
    (4% / month)
  • 1 year lag ⇒ WS at 1.50x MPP node perf.
  • 2 year lag ⇒ WS at 2.25x MPP node perf. (= 1.5²)
  • No economy of scale in 100s ⇒
  • Software incompatibility (OS & apps) ⇒
  • More efficient utilization of compute resources
  • statistical multiplexing
  • Scale makes availability affordable (Pfister)
  • Which of these do commodity clusters actually
    solve?

8
Philosophy: Systems of Systems
  • Higher-order systems research
  • aggressively use off-the-shelf hardware, OS &
    software
  • Advantages
  • easier to track technological advances
  • less development time
  • easier to transfer technology (reduce lag)
  • New challenges (the case against NOW)
  • maintaining performance goals
  • system is changing underneath you
  • underlying system has other people's bugs
  • underlying system is poorly documented

9
Clusters: Enhanced Standard Litany
  • Software engineering
  • Partial failure management
  • Incremental scalability
  • System administration
  • Heterogeneity
  • Hardware redundancy
  • Aggregate capacity
  • Incremental scalability
  • Absolute scalability
  • Price/performance sweet spot

10
Clustering Internet Services
  • Aggregate capacity
  • TB of disk storage, THz of compute power
  • If only we could harness it in parallel!
  • Redundancy
  • Partial failure behavior: only small fractional
    degradation from loss of one node
  • Availability: industry average across large
    sites during the 1998 holiday season was 97.2%
    availability (source: CyberAtlas)
  • Compare: mission-critical systems have four
    nines (99.99%)

11
Spike Absorption
  • Internet traffic is self-similar
  • Bursty at all granularities less than about 24
    hours
  • What's bad about burstiness?
  • Spike Absorption
  • Diurnal variation
  • Peak vs. average demand typically a factor of 3
    or more
  • Starr Report: CNN peaked at 20 M hits/hour
    (compared to the usual peak of 12 M hits/hour,
    i.e. a 66% increase)
  • Really the holy grail: capacity on demand
  • Is this realistic?

12
Diurnal Cycle (UCB dialups, Jan. 1997)
  • 750 modems at UC Berkeley
  • Instrumented early 1997

13
Clustering Internet Workloads
  • Internet vs. traditional workloads
  • e.g. Database workloads (TPC benchmarks)
  • e.g. traditional scientific codes (matrix
    multiply, simulated annealing and related
    simulations, etc.)
  • Some characteristic differences
  • Read mostly
  • Quality of service (best-effort vs. guarantees)
  • Task granularity
  • Embarrassingly parallel
  • but are they balanced? (we'll return to this
    later)

14
Meeting the Cluster Challenges
  • Software programming models
  • Partial failure and application semantics
  • System administration

15
Software Challenges (I)
  • Message-passing: Active Messages
  • Shared memory: Network RAM
  • CC-NUMA, Software DSM
  • MP vs. SM: a long-standing religious debate
  • Arbitrary object migration (network
    transparency)
  • What are the problems with this?
  • Hints: RPC, checkpointing, residual state

16
Partial Failure Management
  • What does partial failure mean for
  • a transactional database?
  • A read-only database striped across cluster
    nodes?
  • A compute-intensive shared service?
  • What are appropriate partial failure
    abstractions?
  • Incomplete/imprecise results?
  • Longer latency?
  • What current programming idioms make partial
    failure hard?

17
Software Challenges (II)
  • Real issue: we have to think differently about
    programming
  • to harness clusters?
  • to get decent failure semantics?
  • to really exploit software modularity?
  • Traditional uniprocessor programming
    idioms/models don't seem to scale up to clusters
  • Question: Is there a natural-to-use cluster
    model that scales down to uniprocessors?
  • If so, is it general or application-specific?
  • What would be the obstacles to adopting such a
    model?

18
Cluster System Administration (I)
  • Total cost of ownership (TCO) is very high for
    clusters
  • Median sysadmin cost per machine per year (1996):
    $700
  • Cost of a headless workstation today: $1500
  • Previous Solutions
  • Pay someone to watch
  • Ignore or wait for someone to complain
  • Shell Scripts From Hell
  • not general ⇒ vast repeated work
  • Need an extensible and scalable way to automate
    the gathering, analysis, and presentation of data

19
Cluster System Administration (II)
  • Extensible, Scalable Monitoring For Clusters of
    Computers (Anderson & Patterson, UC Berkeley)
  • Relational tables allow properties & queries of
    interest to evolve as the cluster evolves
  • Extensive visualization support allows humans to
    make sense of masses of data
  • Multiple levels of caching decouple data
    collection from aggregation
  • Data updates can be pulled on demand or
    triggered by push

20
Visualizing Data: Example
  • Display aggregates of various interesting machine
    properties on the NOWs
  • Note use of aggregation & color

21
SDDS (S.D. Gribble)
  • Self-managing, cluster-based data repository
  • Seen by services as a conventional data structure
  • Log, tree, hash table
  • High performance
  • 60 K reads/sec, over 1.28 TB of data
  • 128-node cluster
  • The CAP principle
  • A system can have at most two of the following
    properties
  • Consistency
  • Availability
  • Tolerance to network Partitions

22
CAP trade-offs
23
Clusters for Internet Services
  • Previous observation (TACC, Inktomi, NOW)
  • Clusters of workstations are a natural platform
    for constructing Internet services
  • Internet service properties
  • support large, rapidly growing user populations
  • must remain highly available, and cost-effective
  • Clusters offer a tantalizing solution
  • incremental scalability: cluster grows with
    service
  • natural parallelism: high performance platform
  • software and hardware redundancy: fault-tolerance

24
Software troubles
  • Internet service construction on clusters is hard
  • load balancing, process management,
    communications abstractions, I/O balancing,
    fail-over and restart, …
  • toolkits proposed to help (TACC, AS1, River, …)
  • Even harder if shared, persistent state is
    involved
  • data partitioning, replication, and consistency,
    interacting with the storage subsystem, …
  • solutions not geared to clustered services
  • use a (distributed) RDBMS: expensive; powerful
    semantic guarantees; generality at the cost of
    performance
  • use a network/distributed FS: overly general, high
    overhead (e.g. double buffering penalties).
    Fault-tolerance?
  • Roll your own custom solution: not reusable,
    complex

25
Idea / Hypothesis
  • It is possible to
  • isolate clustered services from vagaries of state
    mgmt.,
  • to do so with adequately general abstractions,
  • to build those abstractions in a layered fashion
    (reuse),
  • and to exploit clusters for performance, and
    simplicity.
  • Scalable Distributed Data Structures (SDDS)
  • take a conventional data structure
  • hash table, tree, log, …
  • partition it across nodes in a cluster
  • parallel access, scalability, …
  • replicate partitions within replica groups in the
    cluster
  • availability in the face of failures, further
    parallelism
  • store replicas on disk

26
Why SDDS?
  • Fundamental software engineering principle
  • Separation of concerns
  • decouple persistency/consistency logic from rest
    of service
  • simpler (and cleaner!) service implementations
  • Service authors understand data structures
  • familiar behavior and interfaces from single-node
    case
  • should enable rapid development of new services
  • Structure access patterns are self-evident
  • access granularity manifestly a structure element
  • coincidence of logical and physical data units
  • cf. file systems, SQL in RDBMS, VM pages in DSM

27
SDDS Challenges
  • Overcoming complexities of distributed systems
  • data consistency, data distribution, request load
    balancing, hiding network latency and OS
    overhead, …
  • ace up the sleeve: cluster ≠ wide area
  • single, controlled administrative domain
  • engineer to (probabilistically) avoid network
    partitions
  • use a low-latency, high-throughput SAN (5 µs,
    40-120 MB/s)
  • predictable behavior, controlled heterogeneity
  • I/O is still a problem
  • Plenty of work on fast network I/O
  • some on fast disk I/O
  • Less work bridging network and disk I/O in a
    cluster environment

Segment-based cluster I/O layer: filtered streams
between disks, network, and memory
28
Prototype hash table
  • Storage bricks provide local, network-accessible
    hash tables
  • Interaction with distrib. hash table through
    abstraction libraries
  • C, Java APIs available
  • partitioning & mirrored replication logic in each
    library (see the client-side sketch below)
  • Distributed table semantics
  • handles node failures
  • no consistency
  • or transactions, on-line recovery, etc.
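A minimal sketch of what such a client-side abstraction library might look like in Java; the StorageBrick interface and all names are hypothetical stand-ins for the RPC stubs, not the actual Ninja/SDDS API. As on the slide, there are no transactions, so a crash between the mirrored writes can leave replicas inconsistent.

import java.util.*;

/** Hypothetical sketch of the client-side library: a client-side hash picks the
 *  partition, and each write is mirrored to both bricks in that partition's
 *  replica group. */
class DistributedHashTableClient {
    interface StorageBrick {                       // stand-in for the RPC stub to one brick
        byte[] get(byte[] key);
        void put(byte[] key, byte[] value);
    }

    private final List<StorageBrick[]> partitions; // each entry = a mirrored replica group

    DistributedHashTableClient(List<StorageBrick[]> partitions) { this.partitions = partitions; }

    private StorageBrick[] replicaGroup(byte[] key) {
        int h = Arrays.hashCode(key) & 0x7fffffff;         // client-side hash picks the partition
        return partitions.get(h % partitions.size());
    }

    void put(byte[] key, byte[] value) {
        for (StorageBrick brick : replicaGroup(key)) {     // mirror the write to every replica;
            brick.put(key, value);                         // no transactions: a crash mid-loop
        }                                                  // can leave the mirrors inconsistent
    }

    byte[] get(byte[] key) {
        for (StorageBrick brick : replicaGroup(key)) {     // read from any live replica
            try { return brick.get(key); } catch (RuntimeException nodeDown) { /* try mirror */ }
        }
        throw new IllegalStateException("all replicas of this partition are unavailable");
    }
}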

29
Storage bricks
[Storage brick layers: argument marshalling; worker pool
(one thread dispatched per request); local hash table
implementations; messaging & event queue; MMAP region
management with alloc() / free(); transport-specific
communication and naming]
  • Individual nodes are storage bricks
  • consistent, atomic, network accessible
    operations on a local hash table
  • uses MMAP to handle data persistence
  • no transaction support
  • Clients communicate with a set of storage bricks
    using an RPC marshalling layer

[Service frontend layers: service application logic;
virtual-to-physical node names, inter-node hashing]
30
Parallelisms service
  • Provides relevant site information given a URL
  • an inversion of Yahoo! directory
  • Parallelisms builds index of all URLs, returns
    other URLs in same topics
  • read-mostly traffic, nearly no consistency
    requirements
  • large database of URLs
  • 1 GB of space for 1.5 million URLs and 80000
    topics
  • Service FE itself is very simple
  • 400 semicolons of C
  • 130 for app-specific logic
  • 270 for threads, HTTP munging, …
  • hash table code: 4 K semicolons of C

http://ninja.cs.berkeley.edu/demos/paralllelisms/parallelisms.html
31
Lessons Learned (I)
  • mmap() simplified the implementation, but at a price
  • service working sets naturally apply
  • No pointers: breaks the usual linked list and hash
    table libraries (see the offset-based sketch below)
  • Little control over the order of writes, so
    cannot guarantee consistency if crashes occur
  • If a node goes down, it may incur a lengthy sync
    before restart
  • Same for abstraction libraries: simplicity with a
    cost
  • Each storage brick could be totally independent
  • because policy is embedded in the abstraction
    libraries
  • Bad for administration & monitoring
  • No place to hook in to get a view of the complete
    table
  • Each client makes isolated decisions
  • load balancing and failure detection
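The "no pointers" point can be illustrated with a small, self-contained sketch, here in Java over a mapped ByteBuffer rather than the prototype's C mmap() code: records inside the mapped region chain to each other by offset, never by in-memory pointer, because the base address may differ after a restart. All names and the record layout are illustrative assumptions.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Sketch: chain records inside an mmap'ed region by offset, not by pointer. */
public class OffsetChainedBucket {
    static final int NIL = -1;

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("bucket.dat", "rw");
             FileChannel ch = file.getChannel()) {
            MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20);

            // Record layout: [int nextOffset][int payloadLen][payload bytes]
            int first = put(region, 0, NIL, "value-A");       // head of the chain
            int second = put(region, 64, first, "value-B");   // links back to the head by offset

            // Walk the chain by offsets; this works even if the base address changed.
            for (int off = second; off != NIL; off = region.getInt(off)) {
                int len = region.getInt(off + 4);
                byte[] payload = new byte[len];
                region.position(off + 8);
                region.get(payload);
                System.out.println(new String(payload));
            }
        }
    }

    static int put(MappedByteBuffer region, int offset, int next, String payload) {
        region.position(offset);
        region.putInt(next);                       // "pointer" stored as a region offset
        region.putInt(payload.length());
        region.put(payload.getBytes());
        return offset;
    }
}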

32
Lessons Learned (II)
  • Service simplicity premise seems valid
  • Parallelisms service code devoid of persistence
    logic
  • Parallelisms front-ends contain only session
    state
  • No recovery necessary if they fail
  • Interface selection is critical
  • Originally, just supported put(), get(), remove()
  • Wanted to support a java.util.Hashtable subclass
  • Needed enumerations, containsKey(),
    containsObject()
  • Significant re-plumbing required to efficiently
    support these!
  • Thread subsystem was troublesome
  • JDK has its own, and it conflicted. Had to
    remove threads from client-side abstraction
    library.

33
SDDS goal: simplicity
  • Hypothesis: simplify construction of services
  • evidence: Parallelisms
  • distributed hash table prototype: 3000 lines of
    C code
  • service: 400 lines of C code, 1/3 of which is
    service-specific
  • evidence: Keiretsu service
  • instant messaging service between heterogeneous
    devices
  • crux of the service is the sharing of
    binding/routing state
  • original: 131 lines of Java; SDDS version: 80
    lines of Java
  • Management/operational aspects
  • To be successful, authors must want to adopt
    SDDSs
  • simple to incorporate and understand
  • operational management must be nearly transparent
  • node fail-over and recovery, logging, etc. behind
    the scenes
  • plug-n-play extensibility to add capacity

34
SDDS goal: generality
  • Potential criticism of SDDSs
  • No matter which structures you provide, some
    services simply can't be built with only those
    primitives
  • response: pick a basis to enable many interesting
    services
  • Log, Hash Table, and Tree: our guess at a good
    basis
  • Layered model will allow people to develop other
    SDDSs
  • allow GiST-style specialization hooks?

35
SDDS Ideas on Consistency (I)
  • Consistency / performance tradeoffs
  • stricter consistency requirements imply worse
    performance
  • we know some intended services have weaker
    requirements
  • Rejected alternatives
  • build strict consistency, and force people to use it
  • investigate extended transaction models
  • SDDS choice
  • Pick a small set of consistency guarantees
  • level 0 (atomic but not isolated operations)
  • level 3 (ACID)

36
SDDS Ideas on Consistency (II)
  • Replica management
  • what mechanism will we use between replicas?
  • 2-phase commit for distributed atomicity (see the
    sketch below)
  • log-based on-line recovery
  • Exploiting cluster properties
  • Low network latency ⇒ fast 2-phase commit
  • especially relative to WAN latency for Internet
    services
  • Given a good UPS, node failures are independent
  • commit to the memory of a peer in the group, not
    to disk
  • (probabilistically) engineer away network
    partitions
  • unavailable ≡ failure
  • therefore a consensus algorithm is not needed
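A minimal sketch of the two-phase commit idea above, with the cluster twist that a replica's prepare only needs to stage the update in UPS-backed memory rather than force it to disk. The Replica interface and all names are illustrative assumptions, not the SDDS implementation.

import java.util.List;

/** Two-phase commit across one replica group. */
class ReplicaGroupCoordinator {
    interface Replica {
        boolean prepare(long txId, byte[] key, byte[] value); // stage in memory, vote yes/no
        void commit(long txId);                                // apply the staged update
        void abort(long txId);                                 // discard the staged update
    }

    private final List<Replica> replicas;

    ReplicaGroupCoordinator(List<Replica> replicas) { this.replicas = replicas; }

    /** Returns true iff the write is atomically applied on every replica. */
    boolean write(long txId, byte[] key, byte[] value) {
        // Phase 1: every replica must stage the update and vote to commit.
        for (Replica r : replicas) {
            if (!r.prepare(txId, key, value)) {
                for (Replica rr : replicas) rr.abort(txId);
                return false;
            }
        }
        // Phase 2: cheap on a low-latency SAN, relative to WAN latencies seen by clients.
        for (Replica r : replicas) r.commit(txId);
        return true;
    }
}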

37
SDDS Ideas on load management
  • Data distribution affects request distribution
  • Start simple: static data distribution
  • Given a request, lookup or hash to determine the
    partition
  • Optimizations
  • locality-aware request distribution (LARD) within
    replicas (see the sketch below)
  • if no failures, replicas further partition data
    in memory
  • front ends often colocated with storage nodes
  • front end selection based on data distribution
    knowledge
  • smart clients (Ninja redirector stubs..?)
  • Issues
  • graceful degradation: RED/LRP techniques to drop
    requests
  • given many simultaneous requests, what should be
    the service ordering policy?
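A rough illustration of the LARD-style optimization mentioned above, as a sketch under assumed names rather than the actual policy: while all replicas in a group are healthy, each one keeps a disjoint sub-partition in memory, so the front end routes a key to the replica most likely to have it paged in.

/** Illustrative locality-aware replica selection within one replica group. */
class ReplicaSelector {
    /** Picks a replica index for a key, given the number of currently live replicas. */
    int select(int keyHash, int liveReplicas) {
        int h = keyHash & 0x7fffffff;
        // Replica (h mod n) "owns" the key's in-memory sub-partition; when a replica
        // fails, n shrinks and requests simply spread over the survivors, all of
        // which still hold the full partition on disk.
        return h % liveReplicas;
    }
}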

38
Incremental Scalability (I)
  • Logs and trees have a natural solution
  • pointers are ingrained in these structures
  • use the pointers to (re)direct structures onto
    new nodes

39
Incremental Scalability (II)
  • Hash table is the tricky one!
  • Why? Mapping is done by client-side hash
    functions
  • Unless the table is chained, there are no pointers
    inside the hash structure
  • Need to change client-side functions to scale the
    structure
  • Litwin's linear hashing? (see the sketch below)
  • client-side hash function evolves over time
  • clients independently discover when to evolve
    their functions
  • Directory-based map?
  • move hashing into infrastructure (inefficient)
  • or, have infrastructure inform clients when to
    change function
  • AFS-style registration and callbacks?
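A compact sketch of how a Litwin-style evolving client-side hash function could work; it is illustrative only, and a real LH*-like client would also need one of the split-notification schemes listed above (directory, callbacks, or independent discovery).

/** Linear-hashing address computation: the hash function "grows" one bucket split
 *  at a time instead of rehashing the whole table. */
class LinearHashAddress {
    private int level;      // number of completed doubling rounds
    private int splitPtr;   // next bucket to split in the current round

    /** Maps a key to one of (2^level + splitPtr) buckets. */
    int bucketFor(int keyHash) {
        int h = keyHash & 0x7fffffff;
        int bucket = h % (1 << level);          // try the old hash function first
        if (bucket < splitPtr) {
            bucket = h % (1 << (level + 1));    // bucket already split: use the new function
        }
        return bucket;
    }

    /** Called when the client learns that one more bucket has been split. */
    void noteSplit() {
        splitPtr++;
        if (splitPtr == (1 << level)) {         // round complete: table size has doubled
            level++;
            splitPtr = 0;
        }
    }
}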

40
Getting the Interfaces Right
  • Upper interfaces: sufficient generality
  • setting the bar for functionality (e.g.
    java.util.Hashtable)
  • opportunity: reuse of existing software (e.g.
    Berkeley DB)
  • Lower interfaces: use a segment-based I/O layer?
  • Log, tree: natural sequentiality, segments make
    sense
  • Hash table is much more challenging
  • Aggregating small, random accesses into large,
    sequential ones
  • Rely on commits to other nodes' memory
  • periodically dump deltas to disk, LFS-style (see
    the sketch below)
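One way to read the last two bullets, as a hedged sketch: small random hash-table updates are acknowledged once they reach a peer's memory, and only batched deltas ever hit the disk, as whole segments. The segment size and writeSegment() call are assumptions standing in for the segment-based I/O layer, not its real interface.

import java.util.ArrayList;
import java.util.List;

/** LFS-style delta log: turn many small writes into a few large sequential ones. */
class DeltaLog {
    static final int SEGMENT_BYTES = 2 * 1024 * 1024;   // 1-2 MB segments, written whole

    private final List<byte[]> pendingDeltas = new ArrayList<>();
    private int pendingBytes = 0;

    synchronized void append(byte[] delta) {
        pendingDeltas.add(delta);
        pendingBytes += delta.length;
        if (pendingBytes >= SEGMENT_BYTES) flush();      // one large sequential write, no seeks
    }

    private void flush() {
        writeSegment(pendingDeltas);                     // hypothetical segment-layer call
        pendingDeltas.clear();
        pendingBytes = 0;
    }

    private void writeSegment(List<byte[]> deltas) {
        // placeholder: hand the batched deltas to the segment-based I/O layer
    }
}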

41
Evaluation: use real services
  • Metrics for success
  • 1) measurable reduction in complexity to author
    Internet svcs.
  • 2) widespread adoption of SDDS by Ninja
    researchers
  • 1) Port/reimplement existing Ninja services
  • Keiretsu, Ninja Jukebox, the multispace Log
    service
  • explicitly demonstrate code reduction &
    performance boon
  • 2) Convince people to use SDDS for new services
  • NinjaMail, Service Discovery Service, ICEBERG
    services
  • Challenge: operational aspects of SDDS
  • goal: as simple to use SDDS as the single-node,
    non-persistent case

42
Segment layer (motivation)
  • It's all about disk bandwidth & avoiding seeks
  • 8 ms random seek, 25-80 MB/s throughput
  • must read 320 KB per seek to break even (e.g. at
    40 MB/s, the 8 ms spent seeking equals the time to
    transfer 320 KB)
  • Build a disk abstraction layer based on segments
  • 1-2 MB regions on disk, read and written in their
    entirety
  • force upper layers to design with this in mind
  • small reads/writes treated as an uncommon failure
    case
  • SAN throughput is comparable to disk throughput
  • Stream from disk to network: saturate both
    channels
  • stream through service-specific filter functions
  • selection, transformation, …
  • Apply lessons from high-performance networks

43
Segment layer challenges
  • Thread & event model
  • The lowest-level model dictates the entire
    application stack
  • dependency on a particular thread subsystem is
    undesirable
  • Asynchronous interfaces are essential
  • especially for Internet services with thousands of
    connections
  • Potential model: VIA completion queues
  • Reusability for many components
  • toughest customer: Telegraph DB
  • dictate write ordering, be able to roll back
    mods for aborts
  • if content is paged, make sure we don't overwrite
    it on disk
  • no mmap()!

44
Segment Implementation Plan
  • Two versions planned
  • One version using POSIX syscalls and a vanilla
    filesystem
  • definitely won't perform well (copies to handle
    shadowing)
  • portable to many platforms
  • good for prototyping and getting the API right
  • One version on Linux with kernel modules for
    specialization
  • I/O-lite style buffer unification
  • use VIA or AM for network I/O
  • modify VM subsystem for copy-on-write segments,
    and/or paging dirty data to separate region

45
Related work (I)
  • (S)DSM
  • a structural element is a better atomic unit than
    a page
  • fault tolerance as a goal
  • Distributed/networked FS: NFS, AFS, xFS, LFS, …
  • FS is more general, has less chance to exploit
    structure
  • often not in a clustered environment (except xFS,
    Frangipani)
  • Litwin SDDS: LH, LH*, RP, RP*
  • significant overlap in goals
  • but little implementation experience
  • little exploitation of cluster characteristics
  • consistency model not clear

46
Related Work (II)
  • Distributed & Parallel Databases: R*, Mariposa,
    Gamma, …
  • different goal (generality in structure/queries,
    xacts)
  • stronger and richer semantics, but at a cost
  • both in $ and in performance
  • Fast I/O research: U-Net, AM, VIA, IO-lite,
    fbufs, x-kernel, …
  • network and disk subsystems
  • main results: get the OS out of the way, avoid
    (unnecessary) copies
  • use these results in our fast I/O layer
  • Cluster platforms: TACC, AS1, River, Beowulf,
    GLUnix, …
  • harvesting idle resources, process migration,
    single-system view
  • harvesting idle resources, process migration,
    single-system view

47
Taxonomy of Clustered Services
  • State Mgmt. Requirements
  • Stateless: little or none
  • Soft-state: high availability; perhaps consistency;
    persistence is an optimization
  • Persistent state: high availability and
    completeness; perhaps consistency; persistence
    necessary
  • Examples (across the three categories): TACC
    distillers, TACC aggregators, Inktomi search engine,
    River modules, AS1 servents or RMX, Parallelisms,
    Scalable PIM apps, Video Gateway, Squid web cache,
    HINDE mint
48
Performance
  • Bulk-loading of the database is dominated by disk
    access time
  • Can achieve 1500 inserts per second per node on a
    100 Mb/s Ethernet cluster, if the hash table fits in
    memory (dominant cost is the messaging layer)
  • Otherwise, degrades to about 30 inserts per
    second (dominant cost is disk write time)
  • In steady state, all nodes operate primarily out
    of memory, as the working set is fully paged in
  • similar principle to the research Inktomi cluster
  • handles hundreds of queries per second on a 4-node
    cluster with 2 front-ends

49
SEDA (M. Welsh)
  • Staged, event-driven architecture
  • Service stages, linked via queues (see the sketch
    below)
  • Thread pool per stage
  • Massive concurrency
  • Admission & priority control on each individual
    queue
  • Adaptive load balancing
  • Feedback loop
  • No a-priori resource limits
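A minimal sketch of one such stage in Java; it is illustrative only, not the actual SEDA/Sandstorm API. A bounded event queue gives per-stage admission control, a private thread pool drains it, and stages are composed by having one handler enqueue onto the next stage.

import java.util.concurrent.*;
import java.util.function.Consumer;

/** One SEDA-style stage: bounded event queue + dedicated thread pool + handler. */
class Stage<E> {
    private final BlockingQueue<E> queue;
    private final ExecutorService pool;

    Stage(int queueCapacity, int threads, Consumer<E> handler) {
        this.queue = new ArrayBlockingQueue<>(queueCapacity);  // bounded: basis for admission control
        this.pool = Executors.newFixedThreadPool(threads);     // thread pool per stage
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        handler.accept(queue.take());           // dequeue and handle one event
                    } catch (InterruptedException e) {
                        return;                                 // stage shut down
                    }
                }
            });
        }
    }

    /** Admission control: reject (return false) instead of blocking when overloaded. */
    boolean enqueue(E event) { return queue.offer(event); }

    void shutdown() { pool.shutdownNow(); }
}

For example, an HTTP-parse stage's handler would end by calling enqueue() on a request-handling stage, and a controller could watch each queue's length to grow or shrink the stage's thread pool, which is the adaptive feedback loop on the slide.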

50
Overhead of concurrency (I)
51
Overhead of concurrency (II)
52
SEDA architecture (I)
53
SEDA architecture (II)
54
SEDA architecture (III)
55
References
  • S.D. Gribble, E.A. Brewer, J.M. Hellerstein, and
    D. Culler, "Scalable, Distributed Data Structures
    for Internet Service Construction", Proc. 4th OSDI,
    2000.
  • M. Welsh, D. Culler, and E. Brewer, "SEDA: An
    Architecture for Well-Conditioned, Scalable
    Internet Services", Proc. 18th SOSP, 2001.