Transcript and Presenter's Notes

1
Functionally Homogeneous Clustering: A New Architecture for Scalable Data-intensive Internet Services
  • Yasushi Saito
  • yasushi@cs.washington.edu

University of Washington Department of Computer
Science and Engineering, Seattle, WA
2
Goals
  • Use cheap, unreliable hardware components to build scalable data-intensive Internet services.
  • Data-intensive Internet services: email, BBS, calendar, etc.
  • Three facets of scalability:
  • Performance: linear increase with system size.
  • Manageability: react to changes automatically.
  • Availability: survive failures gracefully.

3
Contributions
  • Functionally homogeneous clustering:
  • Dynamic data and function distribution.
  • Exploitation of application semantics.
  • Three techniques:
  • Naming and automatic recovery.
  • High-throughput optimistic replication.
  • Load balancing.
  • Email as the first target application.
  • Evaluation of the architecture using Porcupine.

4
Presentation Outline
  • Introduction.
  • What are Internet data-intensive services?
  • Existing solutions and their problems.
  • Functional homogeneous clustering.
  • Challenges and solutions.
  • Performance scaling.
  • Reacting to failures and recoveries.
  • Deciding on data placement.
  • Conclusion.

5
Data-intensive Internet services
  • Examples: email, Usenet, BBS, calendar, Internet collaboration (photobook, equill.com, crit.org).
  • Growing rapidly as demand for personal services grows.
  • High update frequency.
  • Low access locality.
  • Web techniques (caching, stateless data transformation) are not effective.
  • Weak data consistency requirements.
  • Well-defined, structured data access path.
  • Embarrassingly parallel.
  • → An RDBMS is overkill.

6
Rationale for Email
  • Email as the first target application:
  • Most important among data-intensive services.
  • Service concentration (Hotmail, AOL, ...).
  • → Practical demands.
  • The most update-intensive.
  • No access locality.
  • → A challenging application.
  • Prototype implementation:
  • Porcupine email server.

7
Conventional Solutions: Big Iron
  • Just buy a big machine
  • + Easy deployment
  • + Easy management
  • - Limited scalability
  • - Single failure domain
  • - Really expensive

8
Conventional Solutions: Clustering
  • Connect many small machines
  • + Cheap
  • + Incremental scalability
  • + Natural failure boundary

- Software and managerial complexity.
9
Existing Cluster Solutions
  • Static partitioning: assign data and functions to nodes statically.
  • Management problems:
  • Manual data partitioning.
  • Performance problems:
  • No dynamic load balancing.
  • Availability problems:
  • Limited fault tolerance.

10
Presentation Outline
  • Introduction
  • Functionally homogeneous clustering
  • Key concepts.
  • Key techniques: recovery, replication, load balancing.
  • Basic operations and data structures.
  • Challenges and solutions
  • Evaluation
  • Conclusion

11
Functionally Homogeneous Clustering
  • Clustering is the way to go.
  • Static function and data partitioning leads to the problems just described.
  • So, make everything dynamic:
  • Any node can handle any task (client interaction, user management, etc.).
  • Any node can store any piece of data (email messages, user profiles).

12
Advantages
  • Advantages:
  • Better load balance and hot-spot dispersion.
  • Support for heterogeneous clusters.
  • Automatic reconfiguration and task redistribution upon node failure/recovery; easy node addition/retirement.
  • Results:
  • Better performance.
  • Better manageability.
  • Better availability.

13
Challenges
  • Dynamic function distribution:
  • Solution: run every function on every node.
  • Dynamic data distribution:
  • How are data named and located?
  • How are data placed?
  • How do data survive failures?

14
Key Techniques and Relationships
[Diagram: the functional-homogeneity framework supports three techniques (name DB with reconfiguration, replication, load balancing), which in turn serve the goals of manageability, performance, and availability.]
15
Overview: Porcupine
[Diagram: Porcupine node components, including the replication manager, the mail map, email messages, and the user profile.]
16
Receiving Email in Porcupine
[Diagram: the four functional stages (protocol handling, user lookup, load balancing, data store with replication) spread across nodes A, B, C, D; the sending client reaches one node via DNS-RR selection.]
1. Send mail to bob.
2. Who manages bob? → A.
3. Verify bob.
4. OK, bob has msgs on C, D, E.
5. Pick the best nodes to store the new msg → C, D.
6. Store msg.
7. Store msg.
A sketch of this delivery path follows.
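The following is a minimal C sketch of the delivery path on the node that accepts the SMTP connection. Porcupine itself is written in C, but every name here (the rpc_* helpers, the constants, the table layout) is an illustrative assumption, not the real API.

/* Sketch of the delivery path on the accepting node.
 * All helpers are hypothetical stand-ins. */
#include <stddef.h>

#define NUM_BUCKETS 256                      /* user-map size (assumed) */

typedef int node_id;

extern node_id  user_map[NUM_BUCKETS];       /* bucket -> manager node */
extern unsigned hash_user(const char *user);

extern int     rpc_verify_user(node_id mgr, const char *user);
extern int     rpc_get_mail_map(node_id mgr, const char *user,
                                node_id *nodes, int max);
extern node_id rpc_pick_storage_node(const node_id *mail_map, int n);
extern int     rpc_store_msg(node_id target, const char *user,
                             const void *msg, size_t len);
extern int     rpc_add_to_mail_map(node_id mgr, const char *user,
                                   node_id target);

int deliver(const char *user, const void *msg, size_t len)
{
    /* Step 2: find the user's manager node via the user map. */
    node_id mgr = user_map[hash_user(user) % NUM_BUCKETS];

    /* Step 3: verify that the user exists. */
    if (!rpc_verify_user(mgr, user))
        return -1;

    /* Step 4: fetch the mail map (nodes already holding fragments). */
    node_id nodes[16];
    int n = rpc_get_mail_map(mgr, user, nodes, 16);

    /* Step 5: let the load balancer pick the best storage node,
     * biased toward nodes already in the mail map (see slide 45). */
    node_id target = rpc_pick_storage_node(nodes, n);

    /* Steps 6-7: store the message and record its location. */
    if (rpc_store_msg(target, user, msg, len) != 0)
        return -1;
    return rpc_add_to_mail_map(mgr, user, target);
}

The diagram stores the message on two nodes (C and D); a single target is used here only to keep the sketch short.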
17
Basic Data Structures
[Diagram: the basic data structures on a 3-node cluster {A, B, C}.]
User map (small bucket table, replicated on every node): hash(bob) = 2, and bucket 2 names bob's manager node; every bucket is assigned to one of A, B, C.
Mail map / user profile (kept by the user's manager node): bob → {A, C}, suzy → {A, C}, joe → {B}, ann → {B}.
Mailbox storage: bob's and suzy's messages are stored on A and C; joe's and ann's messages on B.
A sketch of these structures follows.
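A minimal C rendering of these structures, with illustrative sizes, field names, and hash function (assumptions, not Porcupine's actual definitions):

#define NUM_BUCKETS 16
#define MAX_FRAGS   8

typedef int node_id;                  /* A = 0, B = 1, C = 2, ... */

/* User map: small bucket table, replicated on every node;
 * recomputed whenever cluster membership changes (soft state). */
node_id user_map[NUM_BUCKETS];

/* Mail map: kept by the user's manager node; lists the nodes that
 * hold fragments of the user's mailbox (soft state, rebuilt by a
 * disk scan after failures). */
struct mail_map_entry {
    char    user[64];
    node_id fragments[MAX_FRAGS];     /* e.g., bob -> {A, C} */
    int     nfrags;
};

static unsigned hash_user(const char *user)
{
    unsigned h = 5381;                /* djb2; any stable hash works */
    while (*user)
        h = h * 33 + (unsigned char)*user++;
    return h;
}

/* Any node can answer "who manages this user?" with a local lookup. */
node_id manager_of(const char *user)
{
    return user_map[hash_user(user) % NUM_BUCKETS];
}

Because the user map is replicated everywhere, any node can route a request for bob without consulting a central directory.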
18
Presentation Outline
  • Overview
  • Functionally homogeneous clustering
  • Challenges and solutions
  • Scaling performance
  • Reacting to failures and recoveries
  • Recovering the name space
  • Replicating on-disk data
  • Load balancing
  • Evaluation
  • Conclusion

19
Scaling Performance
  • User map distributes user management
    responsibility evenly to nodes.
  • Load balancing distributes data storage
    responsibility evenly to nodes.
  • Workload is very parallel.
  • → Scalable performance.

20
Measurement Environment
  • Porcupine email server:
  • Linux 2.2.7, glibc 2.1.1, ext2.
  • 50,000 lines of C code.
  • 30-node cluster of not-quite-all-identical PCs.
  • 100 Mb/s Ethernet, 1 Gb/s hubs.
  • Performance is disk-bound.
  • Homogeneous configuration.
  • Synthetic load:
  • Modeled after the UW CSE server.
  • Mixture of SMTP and POP sessions.

21
Porcupine Performance
[Graph: POP performance with no email replication; annotated throughputs of 68m/day and 25m/day.]
22
Presentation Outline
  • Overview
  • Functionally homogeneous clustering
  • Challenges and solutions
  • Scaling performance
  • Reacting to failures and recoveries
  • Recovering the name space
  • Replicating on-disk data
  • Load balancing
  • Evaluation
  • Conclusion

23
How Do Computers Fail?
  • Large clusters are unreliable.
  • Assumption: live nodes respond correctly within bounded time, most of the time.
  • The network can partition.
  • Nodes can become very slow temporarily.
  • Nodes can fail (and may never recover).
  • Byzantine failures are excluded.

24
Recovery Goals and Strategies
  • Goals:
  • Maintain function after unusual failures.
  • React to changes quickly.
  • Graceful performance degradation / improvement.
  • Strategy: two complementary mechanisms.
  • Make data as soft as possible.
  • Hard state: email messages, user profile.
  • → Optimistic, fine-grained replication.
  • Soft state: user map, mail map.
  • → Reconstruction after a configuration change.

25
Soft-state Recovery Overview
[Diagram: recovery timeline across nodes A, B, C. 1. The membership protocol runs and the user map is recomputed, reassigning the affected hash buckets to the remaining nodes. 2. A distributed disk scan rebuilds the mail-map entries for the affected users (bob, suzy, joe, ann in the example) on their new manager nodes.]
A sketch of this recovery path follows.
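A rough C sketch of the two steps above. The membership and scan helpers (is_alive, pick_replacement, local_mailbox_users, rpc_report_fragment) are hypothetical, and the real system limits the disk scan to the buckets whose assignment actually changed.

#include <stddef.h>

#define NUM_BUCKETS 256
typedef int node_id;

extern node_id user_map[NUM_BUCKETS];
extern int     is_alive(node_id n);                 /* membership protocol */
extern node_id pick_replacement(int bucket);        /* deterministic choice */
extern size_t  local_mailbox_users(const char **users, size_t max);
extern node_id manager_of(const char *user);        /* user-map lookup */
extern void    rpc_report_fragment(node_id mgr, const char *user,
                                   node_id holder);

void on_membership_change(node_id self)
{
    /* Step 1: recompute the user map; buckets owned by nodes that are
     * no longer alive are reassigned to surviving nodes. */
    for (int b = 0; b < NUM_BUCKETS; b++)
        if (!is_alive(user_map[b]))
            user_map[b] = pick_replacement(b);

    /* Step 2: distributed disk scan.  Each node walks the mailbox
     * fragments it stores and tells every user's (possibly new)
     * manager which fragments it holds, rebuilding the mail map. */
    const char *users[1024];
    size_t n = local_mailbox_users(users, 1024);
    for (size_t i = 0; i < n; i++)
        rpc_report_fragment(manager_of(users[i]), users[i], self);
}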
26
Cost of Soft State Recovery
  • Data bucketing allows fast discovery.
  • Cost of a full bucket scan → O(U).
  • Fraction of buckets rescanned per change → O(1/N).
  • Frequency of changes → O(N/MTBF).
  • Total cost → O(U/MTBF) (restated in one line below).

[Graph: U = 5 million per node.]
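Restating the derivation above in one line (U = number of users, N = number of nodes):

\[
\underbrace{O(U)}_{\text{full bucket scan}}
\times \underbrace{O(1/N)}_{\text{fraction rescanned per change}}
\times \underbrace{O(N/\mathrm{MTBF})}_{\text{rate of membership changes}}
= O\!\left(\frac{U}{\mathrm{MTBF}}\right)
\]

For a fixed user population, the total recovery work per unit time therefore does not grow with the number of nodes, even though failures become more frequent as the cluster grows.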
27
How does Porcupine React to Configuration Changes?
[Graph: reaction to configuration changes; see breakdown.]
28
Soft-State Recovery: Summary
  • Scalable, reliable recovery.
  • Quick, constant-cost recovery.
  • Recovers soft state after any type or number of failures.
  • No residual references to dead nodes.
  • Proven correct:
  • Soft state will eventually and correctly reflect the contents on disk.

29
Replicating Hard State
  • Goals:
  • Keep serving hard state (email msgs, user profiles) after unusual failures.
  • Per-object replica-site selection.
  • Space and computational efficiency.
  • Dynamic addition/removal of replicas.
  • Strategy: exploit application semantics.
  • Be optimistic.
  • Whole-state transfer + Thomas' write rule.
A sketch of the per-object replication state follows.
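A minimal sketch of the per-object replication state this strategy implies; field and type names are illustrative, not Porcupine's real ones.

#include <stddef.h>
#include <time.h>

#define MAX_REPLICAS 8
typedef int node_id;

/* Update record: exists only while an update is being propagated and
 * is deleted ("retired") once every target has acknowledged it. */
struct update_record {
    time_t  timestamp;                 /* ordering for Thomas' write rule */
    node_id targets[MAX_REPLICAS];     /* replicas that must apply it     */
    int     ntargets;
    node_id acks[MAX_REPLICAS];        /* replicas that have acknowledged */
    int     nacks;
};

/* A replicated object: an email message or a user-profile entry. */
struct replicated_object {
    node_id replica_set[MAX_REPLICAS]; /* chosen per object */
    int     nreplicas;
    void   *contents;                  /* transferred whole on each update */
    size_t  len;
    struct update_record *pending;     /* NULL when no update is in flight */
};

Because the whole object is shipped on each update, a replica can always be brought up to date by a single transfer, which is part of why the simple newest-wins rule suffices.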

30
Example: Update Propagation (animation, slides 30-37)
[Animation: one replica (A in the example) accepts an update to an object replicated on {A, B, C} and timestamps it 3:10pm, creating an update record with the object contents, the timestamp, a target set, and an ack set (initially just A). A pushes the contents and timestamp to B and C; each target applies the update and replies "Ack 3:10pm". Once acks from every target have arrived, "Retire 3:10pm" messages are sent and all replicas discard the update record, leaving only the new contents.]
38
Replica Addition and Removal
[Diagram: A issues an update to delete replica C. The update carries the new replica set {A, B}, a timestamp (3:10pm), a target set, and an ack set, and is propagated like any other update.]
  • Unified treatment of updates to contents and to the replica set.
39
What If Updates Conflict?
  • Apply Thomas' write rule (sketched below):
  • The newest update always wins.
  • An older update is canceled by being overwritten by the newer one.
  • The same rule applies to replica addition/deletion.
  • But there are some subtleties...
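A minimal sketch of the rule, assuming the object/contents layout sketched earlier; a real implementation would also need a tie-breaker (e.g., a node ID) for equal timestamps.

#include <stdlib.h>
#include <string.h>
#include <time.h>

struct object_version {
    time_t timestamp;
    void  *contents;
    size_t len;
};

/* Apply an incoming update to the local copy; returns 1 if applied. */
int apply_update(struct object_version *local,
                 time_t ts, const void *contents, size_t len)
{
    if (ts <= local->timestamp)
        return 0;                     /* older (or duplicate): discard */

    void *copy = malloc(len);         /* whole-state transfer */
    if (!copy)
        return 0;
    memcpy(copy, contents, len);

    free(local->contents);
    local->contents  = copy;
    local->len       = len;
    local->timestamp = ts;
    return 1;
}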

40
Node Discovery Protocol
[Diagram: node discovery example with nodes A, B, C, D and updates timestamped 3:10pm and 3:20pm. When a node is discovered, it is added to the relevant target sets ("Add targets C") and pending updates are pushed to it and applied ("Apply 3:20 update"), after which the new replica set, target set, and ack set reflect the discovered node.]
41
Replication Space Overhead
[Graph: spool size = 2 GB, average email message = 4.7 KB.]
42
How Efficient is Replication?
43
Replication Summary
  • Flexibility:
  • Any object can be stored on any node.
  • Dynamic replica-set changes are supported.
  • Simplicity and efficiency:
  • Two-phase propagation/retirement.
  • Unified contents and replica-set updates.
  • Proven correct:
  • All live replicas agree on the newest contents, regardless of the number of concurrent updates and failures.
  • Assuming the network does not partition for a long period.

44
Presentation Outline
  • Overview
  • Functionally homogeneous clustering
  • Challenges and solutions
  • Reacting to failures and recoveries
  • Soft state namespace recovery
  • Replication
  • Load balancing
  • Conclusion

45
Distributing Incoming Workload
  • Goals:
  • Minimize voodoo parameter tuning.
  • Handle skewed configurations.
  • Handle skewed workloads.
  • Lightweight.
  • Reconcile affinity with load balance.
  • Strategy: local, spread-based load balancing.
  • Spread: soft limit on the size of a user's mail map.
  • Load: number of pending disk I/O requests.

[Diagram: given the user's mail map and cached per-node load figures: 1. add candidate nodes if |mail map| < spread; 2. pick the least loaded node(s) from the candidate set.]
A sketch of this selection follows.
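A minimal sketch of the selection step, assuming hypothetical helpers cluster_nodes() and cached_load() and a fixed spread of 2:

#include <stddef.h>

#define SPREAD 2                       /* soft limit on mail-map size */

typedef int node_id;

extern size_t cluster_nodes(node_id *out, size_t max); /* all live nodes   */
extern int    cached_load(node_id n);  /* pending disk I/Os, cached value  */

/* mailmap: nodes already holding fragments of this user's mailbox. */
node_id pick_storage_node(const node_id *mailmap, size_t nmap)
{
    node_id candidates[64];
    size_t  ncand = 0;

    /* Start from the user's current mail map (affinity). */
    for (size_t i = 0; i < nmap && ncand < 64; i++)
        candidates[ncand++] = mailmap[i];

    /* If the mail map is smaller than the spread limit, widen the
     * candidate set with other nodes so load balancing has a choice. */
    if (nmap < SPREAD) {
        node_id all[64];
        size_t  nall = cluster_nodes(all, 64);
        for (size_t i = 0; i < nall && ncand < 64; i++)
            candidates[ncand++] = all[i];
    }
    if (ncand == 0)
        return -1;

    /* Pick the candidate with the least pending disk I/O. */
    node_id best = candidates[0];
    for (size_t i = 1; i < ncand; i++)
        if (cached_load(candidates[i]) < cached_load(best))
            best = candidates[i];
    return best;
}

Because the load figures are cached rather than freshly polled, the decision stays cheap; slide 46 discusses how the spread value trades load balance against mailbox fragmentation.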
46
Choosing the Optimal Spread Limit
  • Trade-off:
  • Larger spread → more nodes to choose from → better load balance.
  • Smaller spread → fewer files to access → better overall throughput.
  • Spread = 2 is optimal for a uniform configuration.
  • Spread > 2 (e.g., 4) for a heterogeneous configuration.

47
How Well does Porcupine Support Heterogeneous
Clusters?
[Graph: throughput improvements of 16.8m/day (25%) vs. 0.5m/day (0.8%).]
48
Presentation Outline
  • Overview
  • Functionally homogeneous clustering
  • Challenges and solutions
  • Evaluation
  • Conclusion
  • Summary
  • Future directions

49
Conclusions
  • Cheap, fast, available, and manageable clusters
    can be built for data-intensive Internet
    services.
  • Key ideas can be extended beyond mail.
  • Dynamic data and function distribution.
  • Automatic reconfiguration.
  • High-throughput, optimistic replication.
  • Load balancing.
  • Exploiting application semantics.
  • Use of soft state.
  • Optimism.

50
Future Directions
  • Geographical distribution.
  • Running multiple services.
  • Software reuse.

51
Example: Replica Removal (animation, slides 51-54)
[Animation: removing replica C from an object replicated on {A, B, C}. The removal is issued as an ordinary update timestamped 3:10pm that carries the new replica set {A, B}; it propagates to its targets, is acknowledged, and is retired just like a contents update, after which only A and B hold the object.]
55
Example Updating Contents
[Diagram: initial state of a contents update on an object replicated on {A, B, C}. Node A holds the new contents with timestamp 3:10pm in an update record (timestamp, ack set = {A}); the update record exists only during update propagation.]
56
Example Update Propagation
[Diagram: the new contents and the 3:10pm timestamp are pushed to the remaining replicas, which apply the update.]
57
Update Retirement
[Diagram: once every target has acknowledged the 3:10pm update, "Retire 3:10pm" messages are sent and each replica discards its update record, keeping only the new contents.]
A sketch of the retirement check follows.
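A minimal sketch of the retirement check on the node coordinating the update; rpc_send_retire and the record layout are illustrative assumptions.

#include <stdlib.h>
#include <time.h>

#define MAX_REPLICAS 8
typedef int node_id;

struct update_record {
    time_t  timestamp;
    node_id targets[MAX_REPLICAS];
    int     ntargets;
    int     acked[MAX_REPLICAS];     /* acked[i] != 0 once targets[i] acks */
};

extern void rpc_send_retire(node_id target, time_t timestamp);

/* Called whenever an ack arrives; returns 1 if the update was retired. */
int maybe_retire(struct update_record **rec_p)
{
    struct update_record *rec = *rec_p;
    for (int i = 0; i < rec->ntargets; i++)
        if (!rec->acked[i])
            return 0;                /* still waiting for some target */

    for (int i = 0; i < rec->ntargets; i++)
        rpc_send_retire(rec->targets[i], rec->timestamp);

    free(rec);                       /* record exists only during propagation */
    *rec_p = NULL;
    return 1;
}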
58
Example Final State
  • The algorithm is quiescent after update retirement.
  • The new contents no longer live in an update record.
  • Contents are read directly from the replicas.
  • Updates are stored only during propagation.
  • → Computational and space efficiency.

[Diagram: final state; replicas A, B, and C each hold the new contents and no update record remains.]
59
Handling Long-term Failures
  • The algorithm maintains consistency of the remaining replicas.
  • But updates will get stuck and clog nodes' disks.
  • Solution: erase dead nodes' names from replica sets and update records after a grace period.
A sketch of this pruning step follows.
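A minimal sketch of that pruning step; last_seen() and the one-week grace period are assumptions for illustration.

#include <time.h>

#define MAX_REPLICAS  8
#define GRACE_PERIOD  (7 * 24 * 3600)     /* e.g., one week, in seconds */

typedef int node_id;

struct update_record {
    node_id targets[MAX_REPLICAS];
    int     ntargets;
};

struct replicated_object {
    node_id replica_set[MAX_REPLICAS];
    int     nreplicas;
    struct update_record *pending;        /* NULL if no update in flight */
};

extern time_t last_seen(node_id n);       /* from the membership service */

static int expired(node_id n, time_t now)
{
    return now - last_seen(n) > GRACE_PERIOD;
}

static void drop_expired(node_id *set, int *n, time_t now)
{
    int w = 0;
    for (int r = 0; r < *n; r++)
        if (!expired(set[r], now))
            set[w++] = set[r];
    *n = w;
}

void prune_dead_nodes(struct replicated_object *obj)
{
    time_t now = time(NULL);
    drop_expired(obj->replica_set, &obj->nreplicas, now);
    if (obj->pending)
        drop_expired(obj->pending->targets, &obj->pending->ntargets, now);
}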

60
Replication Space Overhead
→ 6-17 MB for replica sets and update records.
→ 2000 MB for email messages.
61
Scaling to Large User Population
  • A large user population increases the memory requirement.
  • Recovery cost grows linearly with the per-node user population.

62
Rebalancing
  • Load balancing may cause suboptimal data distribution after node addition/retirement.
  • Resources are wasted at night (traffic drops to between 1/2 and 1/5 of daytime levels).
  • Rebalancer:
  • Runs around midnight.
  • Adds replicas for under-replicated objects.
  • Removes replicas for over-replicated objects.
  • Deletes objects without owners.
A sketch of the rebalancing pass follows.
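A minimal sketch of such a nightly pass; the iteration and repair helpers and the fixed replication factor are assumptions rather than Porcupine's actual interfaces.

#include <stddef.h>

#define REPLICATION_FACTOR 2          /* desired replicas per object (assumed) */

typedef int object_id;

extern size_t  all_objects(object_id *out, size_t max);
extern int     replica_count(object_id o);
extern int     has_owner(object_id o);        /* some live entry references it */
extern void    add_replica(object_id o);
extern void    remove_replica(object_id o);
extern void    delete_object(object_id o);

void rebalance_once(void)
{
    object_id objs[4096];
    size_t n = all_objects(objs, 4096);

    for (size_t i = 0; i < n; i++) {
        object_id o = objs[i];

        if (!has_owner(o)) {          /* orphaned object: discard it */
            delete_object(o);
            continue;
        }
        while (replica_count(o) < REPLICATION_FACTOR)
            add_replica(o);           /* under-replicated: add copies */
        while (replica_count(o) > REPLICATION_FACTOR)
            remove_replica(o);        /* over-replicated: trim copies */
    }
}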