1
Lessons from Giant-Scale Services
IEEE Internet Computing, Vol. 5, No. 4, July/August 2001
  • Eric A. Brewer
  • University of California, Berkeley,
  • and Inktomi Corporation

2
Examples of Giant-scale services
  • AOL
  • Microsoft Network
  • Yahoo
  • eBay
  • CNN
  • Instant messaging
  • Napster
  • Many more

The demand: they must be always available, despite
their scale, growth rate, and rapid evolution of
content and features
3
Article Characteristics
  • Characteristics
  • Experience article
  • No literature references
  • Principles and approaches
  • No quantitative evaluation
  • The reasons
  • Focus on high-level design
  • New area
  • Proprietary nature of the information

4
Article scope
  • Look at the basic model of giant-scale services
  • Focus on the challenges of
  • High availability
  • Evolution
  • Growth
  • Principles for the above

Goal: simplify the design of large systems
5
Basic Model (general)
  • Infrastructure services: Internet-based systems
    that provide instant messaging, wireless
    services, and so on

6
Basic Model (general)
  • We discuss
  • Single-site
  • Single-owner
  • Well-connected cluster
  • Perhaps a part of a larger service
  • We do not discuss
  • Wide-area issues
  • Network partitioning
  • Low or discontinuous bandwidth
  • Multiple administrative domains
  • Service monitoring
  • Network QoS
  • Security
  • Logging and log analysis
  • DBMS

7
Basic Model (general)
  • We focus on
  • High availability
  • Replication
  • Degradation
  • Disaster tolerance
  • Online evolution

The scope is bridging the gap between the basic
building blocks of giant-scale services and the
real-world scalability and availability they
require
8
Basic Model (Advantages)
  • Access anywhere, anytime
  • Availability via multiple devices
  • Groupware support
  • Lower overall cost
  • Simplified service updates

9
Basic Model (Advantages)
  • Access anywhere, anytime
  • The infrastructure is ubiquitous
  • You can access the service from home, work, an
    airport, and so on

10
Basic Model (Advantages)
  • Availability via multiple devices
  • The infrastructure handles the processing (most
    of it, at least)
  • Users access the services via set-top boxes,
    network computers, smart phones, and so on
  • In that way we can offer more functionality for
    a given cost and battery life

11
Basic Model (Advantages)
  • Groupware support
  • Centralizing data from many users allows
    groupware applications such as
  • Calendars
  • Teleconferencing systems, and so on

12
Basic Model (Advantages)
  • Lower overall cost
  • Overall cost is hard to measure, but
  • Infrastructure services have an advantage over
    designs based on stand-alone devices
  • High utilization
  • Centralized administration reduces the cost,
    but this is harder to quantify

13
Basic Model (Advantages)
  • Simplified service updates
  • Updates without physical distribution
  • The most powerful long-term advantage

14
Basic Model (Components)
15
Basic Model (Assumptions)
  • The service provider has limited control over
    the clients and the IP network
  • Queries drive the service
  • Read-only queries greatly outnumber update
    queries
  • Giant-scale services use CLUSTERS

16
Basic Model (Components)
  • Clients, such as Web browsers, initiate the
    queries to the services
  • IP network, the public Internet or a private
    network, provides access to the service
  • Load manager provides indirection between the
    service's external name and the servers'
    physical names (IP addresses), and balances the
    load. Proxies or firewalls may sit in front of
    the load manager.
  • Servers combine CPU, memory, and disks into an
    easy-to-replicate unit
  • Persistent data store, a replicated or
    partitioned database spread across the servers,
    optionally with external DBMSs or RAID storage
  • Backplane, optional, handles inter-server
    traffic

17
Basic Model (Load Management)
  • Round-robin DNS (a minimal sketch follows this
    slide)
  • Layer-4 switches
  • understand TCP and port numbers
  • Layer-7 switches
  • parse URLs
  • Custom front-end nodes
  • They act like service-specific layer-7 routers
  • Include the clients in the load balancing
  • Ex.: an alternative DNS or name server

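As a minimal sketch of the simplest option above, round-robin
selection over a replica pool in Python (the server addresses
and function names are illustrative, not from the article):

  from itertools import cycle

  # Hypothetical pool of replica servers behind one external name.
  SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
  _rotation = cycle(SERVERS)

  def pick_server():
      """Return the next server in strict rotation. Like round-robin
      DNS, this spreads queries evenly but is oblivious to actual
      server load and to failed nodes."""
      return next(_rotation)

  # Eight queries rotate through the three servers.
  for _ in range(8):
      print(pick_server())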
18
Basic Model (Load Management)
  • Two opposite approaches
  • Simple Web Farm
  • Search engine cluster

19
Basic Model (Load Management): Simple Web Farm
20
Basic Model (Load Management): Search Engine Cluster
21
High Availability (general)
  • Like telephone, rail or water systems
  • Features
  • Extreme symmetry
  • No people
  • Few cables
  • No external disks
  • No monitors
  • Inktomi, in addition
  • Manages the cluster offline
  • Limits temperature and power variations

22
High Availability (metrics)
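From the source article, the availability metrics behind this
slide are:

  uptime  = (MTBF - MTTR) / MTBF
  yield   = queries completed / queries offered
  harvest = data available / complete data

Yield is usually more telling than uptime: a second of downtime
at peak load loses far more queries than a second at an idle hour.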
23
High Availability (DQ principle)
  • The system's overall capacity has a particular
    physical bottleneck
  • Ex.: total I/O bandwidth, total seeks per
    second, total amount of data to be moved per
    second
  • Measurable and tunable
  • Ex.: adding nodes or optimizing software raises
    it; faults reduce it (see the formula after
    this slide)

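The principle can be written as a simple product; the D and Q
notation is the article's, but the worked numbers are mine:

  DQ value = D × Q
    D = (average) data moved per query
    Q = queries per second

At the physical bottleneck this product is roughly constant: if a
node can move 100 MB/s and each query touches 1 MB, it sustains
Q = 100 queries/s; halving D to 0.5 MB lets Q rise toward 200.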
24
High Availability (DQ principle)
  • Focus on the relative DQ value, not on the
    absolute
  • Define the DQ value of your system
  • Normally the DQ value scales linearly with the
    number of nodes

25
High Availability (DQ principle)
  • Analyzing the impact of faults
  • Focus on how the DQ reduction influences the
    three metrics
  • This applies mainly to data-intensive sites

26
High Availability (Replication vs. Partitioning)
Example: a 2-node cluster with one node down (see
the sketch after this slide)
  • Replication
  • 100% harvest
  • 50% yield
  • DQ down 50%
  • Maintains D
  • Reduces Q
  • Partitioning
  • 50% harvest
  • 100% yield
  • DQ down 50%
  • Reduces D
  • Maintains Q

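The two-node arithmetic above, as a minimal Python sketch (the
helper names are mine; the fractions are the slide's):

  def replicated(nodes_up, nodes_total):
      """Full copy on every node: harvest stays 100%,
      yield falls with the lost capacity."""
      return 1.0, nodes_up / nodes_total          # (harvest, yield)

  def partitioned(nodes_up, nodes_total):
      """Data split across nodes: the survivors still answer
      every query, so yield stays 100% but harvest falls."""
      return nodes_up / nodes_total, 1.0          # (harvest, yield)

  # One node of two fails: both strategies lose half their DQ
  # value, they just lose it in different dimensions.
  print(replicated(1, 2))   # (1.0, 0.5)
  print(partitioned(1, 2))  # (0.5, 1.0)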
27
High Availability (Replication)
28
High Availability (Replication vs. Partitioning)
  • Replication wins if the bandwidth is the same
  • The extra cost is on the bandwidth, not on the
    disks
  • Easier recovery
  • We might also use partial replication and
    randomization

29
High Availability (Graceful Degradation)
  • We cannot avoid saturation, because
  • The peak-to-average load ratio runs from 1.6:1
    to 6:1, and it is expensive to build capacity
    above the normal peak
  • Single-event bursts (ex. online ticket sales
    for special events)
  • Faults like power failures or natural disasters
    substantially reduce the overall DQ, and the
    remaining nodes become saturated

So, we MUST have mechanisms for degradation
30
High Availability (Graceful Degradation)
  • The DQ principle gives us the options
  • Limit Q (capacity) to maintain D
  • Reduce D and increase Q
  • Focus on harvest by admission control (AC),
    which reduces Q
  • Reduce D on dynamic databases
  • Both
  • Ex.: cut the effective database in half (a
    newer approach)

31
High Availability (Graceful Degradation)
  • More sophisticated techniques (a sketch follows
    this slide)
  • Cost-based AC
  • Estimate the query cost
  • Reduce the data per query
  • Increase Q
  • Priority- (or value-) based AC
  • Drop low-valued queries
  • Ex.: execute a stock trade within 60 s or the
    user pays no commission
  • Reduced data freshness
  • Reducing the freshness reduces the work per
    query
  • Increases yield at the expense of harvest

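A minimal sketch of these admission-control ideas in Python; all
names and thresholds are invented for illustration:

  def admit(estimated_cost, value, load, capacity):
      """Decide whether to admit a query under possible saturation.
      Combines cost-based AC (reject expensive queries) with
      priority-based AC (always keep high-valued queries)."""
      if load <= capacity:
          return True               # not saturated: admit everything
      if value == "high":
          return True               # priority-based AC, e.g. a paying stock trade
      return estimated_cost <= 1.0  # cost-based AC: only cheap queries

  # Under overload (load 120 against capacity 100):
  print(admit(5.0, "low", 120, 100))   # False: expensive and low-valued
  print(admit(0.5, "low", 120, 100))   # True: cheap enough to keep yield up
  print(admit(5.0, "high", 120, 100))  # True: high-valued work is kept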
32
High Availability (Disaster Tolerance)
  • Combination of managing replicas and graceful
    degradation
  • How many locations?
  • How many replicas on each location?
  • Load management
  • Layer-4 switches do not help with the loss of
    a whole cluster
  • Smart clients are the solution

33
Online Evolution and Growth
  • We must plan for continuous growth and frequent
    functionality updates
  • Maintenance and upgrades are controlled failures
  • Total loss of DQ value is
  • ΔDQ = n × u × (average DQ/node) = DQ × u
  • where n is the number of nodes and u is the
    amount of time each node needs offline for the
    online evolution (a worked instance follows
    this slide)

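A worked instance of the formula (the numbers are mine, not the
article's): upgrading a 4-node cluster where each node needs
u = 10 minutes offline loses

  ΔDQ = 4 × 10 min × (DQ/4 per node) = DQ × 10 min

so the total DQ lost is fixed by n and u; the upgrade schedule
only decides how the loss shows up.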
34
Online Evolution and Growth: Three Approaches
An example for a 4-node cluster
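From the source article, the three approaches are: fast reboot
(take the whole cluster down and upgrade every node at once),
rolling upgrade (upgrade one node at a time while the rest keep
serving), and the big flip (upgrade half the cluster at a time).
All three lose the same total DQ value, DQ × u; they differ in
whether the loss appears as a brief complete outage, temporarily
reduced capacity, or reduced harvest during the window.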
35
Conclusions: The Basic Lessons Learned
  • Get the basics right
  • Professional data center, layer-7 switch,
    symmetry
  • Decide on your availability metrics
  • Everyone must agree on the goals
  • Harvest and yield > uptime
  • Focus on MTTR at least as much as on MTBF
  • Improving MTTR is easier and has the same
    impact
  • Understand load redirection during faults
  • Data replication alone is insufficient; you
    need excess DQ

36
Conclusions: The Basic Lessons Learned
  • Graceful degradation is a critical part
  • Intelligent admission control and dynamic
    database reduction
  • Use DQ analysis on all upgrades
  • Capacity planning
  • Automate upgrades as much as possible
  • Have a fast, simple way to return to the older
    version

37
Final Statement
  • Smart clients could simplify all of the above