ACMS: The Akamai Configuration Management System
1
ACMS: The Akamai Configuration Management System
Alex Sherman, Philip A. Lisiecki, Andy Berkheimer, and Joel Wein. Akamai Technologies, Inc.; Columbia University; Polytechnic University. Presented at NSDI 2005.
Most slides are reproduced from the authors' original presentation.
2
The Akamai Platform
  • Akamai operates a Content Delivery Network of
    15,000 servers distributed across 1,200 ISPs in
    60 countries
  • Web properties (Akamai's customers) use these servers to bring their web content and applications closer to end users

3
Problem: configuration and control
  • Even with this widely distributed platform, customers need to maintain control of how their content is served
  • Customers need to configure their service options with the same ease and flexibility as if it were a centralized, locally hosted system

4
Why difficult?
  • 15,000 servers must synchronize to the latest
    configurations within a few minutes
  • Some servers may be down or partitioned off at the time of reconfiguration
  • A server that comes up after some downtime must
    re-synchronize quickly
  • Configuration may be initiated from anywhere on
    the network and must reach all other servers

5
Proposed Architecture
  • Front-end: a small collection of Storage Points (SPs) responsible for accepting, storing, and synchronizing configuration files
  • Back-end: reliable and efficient delivery of configuration files to all of the edge servers; leverages the Akamai CDN

[Figure: a publisher submits configuration files to a small set of Storage Points (SPs), which deliver them to 15,000 edge servers]
6
Agreement
  • A publisher contacts an accepting SP
  • The accepting SP replicates a temporary file to a majority of SPs
  • If replication succeeds, the accepting SP initiates an agreement algorithm called Vector Exchange
  • Upon success, the accepting SP accepts the submission and all SPs upload the new file

[Figure: the accepting SP replicates the publisher's submission to the other Storage Points]
7
Vector Exchange
  • For each agreement, SPs exchange a bit vector
  • Each bit corresponds to the commitment status of a corresponding SP
  • Once a majority of the bits are set, we say that agreement takes place
  • When any SP learns of an agreement it can upload the submission (a minimal sketch follows below)
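
Vector Exchange can be pictured as each SP OR-merging commit bits from its peers until a majority of bits are set. Below is a minimal Python sketch of that idea; the class and method names (VectorExchange, agreement_reached) are illustrative rather than taken from the paper, and real SPs would exchange these vectors over the network.

    # Simplified sketch of Vector Exchange: each Storage Point (SP) sets its
    # own bit once it has stored the submission, OR-merges vectors received
    # from peers, and declares agreement once a majority of bits are set.
    class VectorExchange:
        def __init__(self, sp_id, all_sp_ids):
            self.sp_id = sp_id
            self.all_sp_ids = list(all_sp_ids)
            self.vector = {sp: False for sp in self.all_sp_ids}

        def commit_locally(self):
            """Called after this SP has durably stored the temporary file."""
            self.vector[self.sp_id] = True

        def merge(self, peer_vector):
            """OR-merge a bit vector received from a peer SP."""
            for sp, bit in peer_vector.items():
                self.vector[sp] = self.vector[sp] or bit

        def agreement_reached(self):
            """Agreement holds once a majority of SPs have set their bit."""
            return sum(self.vector.values()) > len(self.all_sp_ids) // 2

    # Example: 5 SPs, SP "A" initiates; bits from A, B, and C form a majority.
    ve = VectorExchange("A", ["A", "B", "C", "D", "E"])
    ve.commit_locally()
    ve.merge({"A": True, "B": True, "C": True, "D": False, "E": False})
    print(ve.agreement_reached())  # True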

8
Vector Exchange Guarantees
  • If a submission is accepted, at least a majority of SPs have stored and agreed on the submission
  • The agreement is never lost by a future quorum. Why?
  • Any future quorum contains at least one SP that saw the initiated agreement

9
Recovery Routine
  • Each SP continuously runs a recovery routine that queries other SPs for missed agreements (see the sketch below)
  • If an SP finds that it missed an agreement, it downloads the corresponding configuration file
  • Over time, each SP maintains a snapshot: a list of the latest versions of all accepted files
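
A minimal sketch of such a recovery loop, assuming hypothetical helpers fetch_index and download_file for talking to peer SPs:

    import time

    def recovery_loop(local_index, peer_sps, fetch_index, download_file,
                      poll_interval=30):
        """Continuously query peer SPs for agreements this SP may have missed.

        local_index: dict mapping filename -> latest version known locally
        fetch_index(sp): returns that SP's snapshot (filename -> version)
        download_file(sp, filename, version): fetches a missed file
        """
        while True:
            for sp in peer_sps:
                for filename, version in fetch_index(sp).items():
                    if local_index.get(filename, -1) < version:
                        download_file(sp, filename, version)
                        local_index[filename] = version
            time.sleep(poll_interval)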

10
Back-end Delivery
  • Processes on edge servers subscribe to specific configurations via their local Receiver process
  • Receivers periodically query the snapshots on the SPs to learn of any updates
  • If the updates match any subscriptions, the Receivers download the files via HTTP IMS (If-Modified-Since) requests (see the sketch below)
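
A rough sketch of one Receiver polling cycle; the URL layout and the apply_config callback are illustrative assumptions, not the actual Akamai protocol:

    import urllib.error
    import urllib.request

    def poll_subscriptions(sp_host, subscriptions, last_modified, apply_config):
        """Fetch subscribed files from a Storage Point via HTTP IMS requests.

        last_modified: dict filename -> Last-Modified value of the last fetch
        apply_config: callback(filename, bytes) invoked for each updated file
        """
        for filename in subscriptions:
            url = f"http://{sp_host}/configs/{filename}"  # illustrative layout
            req = urllib.request.Request(url)
            if filename in last_modified:
                req.add_header("If-Modified-Since", last_modified[filename])
            try:
                with urllib.request.urlopen(req) as resp:
                    last_modified[filename] = resp.headers.get("Last-Modified", "")
                    apply_config(filename, resp.read())
            except urllib.error.HTTPError as err:
                if err.code != 304:  # 304 Not Modified: nothing to do
                    raise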

11
Evaluation
  • Evaluated on top of the real Akamai network
  • 48 hours in the middle of a week
  • 14,276 file submissions with five SPs
  • Most (40%) were small, less than 1 KB
  • Some (3%) were larger, between 10 MB and 100 MB

12
Submission and propagation
  • Randomly sampled 250 edge servers to measure propagation time
  • 55 seconds on average
  • Dominated by cache TTLs and polling intervals

13
Propagation vs. File Sizes
  • Mean and 95th-percentile propagation time vs. file size
  • 99.95% of updates arrived within 3 minutes
  • The rest were delayed due to temporary connectivity issues

14
Discussion
  • Availability of a majority of SPs guarantees agreement
  • Could an SP availability metric be used to construct a smaller quorum?
  • How many SPs should there be for 15,000 edge servers?
  • The experiments use five. Is that too small?
  • How are these SPs selected?

15
MapReduce: Simplified Data Processing on Large Clusters
  • Jeffrey Dean and Sanjay Ghemawat
  • Google, Inc.

16
What is MapReduce?
  • A programming model or design framework or design pattern
  • Allows distributed back-end processing with parallelization, fault tolerance, data distribution, and load balancing
  • Several projects are built on MapReduce, such as Hadoop and Skynet

17
What is MapReduce?
  • Terms are borrowed from functional languages (e.g., Lisp); a Python equivalent follows below
  • (map square '(1 2 3 4))
  • (1 4 9 16)
  • (reduce + '(1 4 9 16))
  • (+ 16 (+ 9 (+ 4 1)))
  • 30
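
For comparison, the same two steps in Python:

    from functools import reduce

    squared = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
    total = reduce(lambda a, b: a + b, squared)          # ((1 + 4) + 9) + 16 = 30
    print(squared, total)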

18
Map
  • map()
  • Processes a key/value pair to generate intermediate key/value pairs

Input: "Welcome Everyone Hello Everyone"
Map output: (Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1)
19
Reduce
  • reduce()
  • Merges all intermediate values associated with the same key (a Python sketch follows below)

Map output: (Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1)
Reduce output: (Everyone, 2) (Hello, 1) (Welcome, 1)
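
A self-contained Python sketch of this word-count example (the function names map_phase and reduce_phase are illustrative):

    from collections import defaultdict

    def map_phase(text):
        """Map: emit an intermediate (word, 1) pair for every word."""
        return [(word, 1) for word in text.split()]

    def reduce_phase(pairs):
        """Reduce: merge all intermediate values that share the same key."""
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    pairs = map_phase("Welcome Everyone Hello Everyone")
    print(pairs)                # [('Welcome', 1), ('Everyone', 1), ('Hello', 1), ('Everyone', 1)]
    print(reduce_phase(pairs))  # {'Welcome': 1, 'Everyone': 2, 'Hello': 1}
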
20
Some Applications
  • Distributed Grep
  • Map: emits a line if it matches the supplied pattern
  • Reduce: copies the intermediate data to the output
  • Count of URL Access Frequency
  • Map: processes web request logs and outputs (URL, 1)
  • Reduce: adds the counts and emits (URL, total count)
  • Reverse Web-Link Graph
  • Map: processes web pages and outputs (target, source) for each link (see the sketch below)
  • Reduce: emits (target, list(source))
  • Used by Google News, Google Search
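
A similar sketch for the Reverse Web-Link Graph job (function names are illustrative):

    from collections import defaultdict

    def map_links(source_url, links_on_page):
        """Map: emit (target, source) for every link found on the source page."""
        return [(target, source_url) for target in links_on_page]

    def reduce_links(pairs):
        """Reduce: group the sources pointing at each target."""
        graph = defaultdict(list)
        for target, source in pairs:
            graph[target].append(source)
        return dict(graph)

    pairs = map_links("a.com", ["x.com", "y.com"]) + map_links("b.com", ["x.com"])
    print(reduce_links(pairs))  # {'x.com': ['a.com', 'b.com'], 'y.com': ['a.com']}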

21
What is behind MapReduce?
  • Make it distributed
  • Partition input key/value pairs into chunks, run
    map() tasks in parallel.
  • After all map()s are complete, partition the
    stored values.
  • Run reduce() in parallel

22
How MapReduce Works
  • User's to-do list:
  • Specify input/output files
  • M: number of map tasks
  • R: number of reduce tasks
  • W: number of machines
  • Write the map and reduce functions
  • Submit the job
  • This requires no knowledge of parallel/distributed
    systems!!!
  • What about everything else?

23
(No Transcript)
24
How MapReduce Works
  • Input slices are typically 16 MB to 64 MB
  • Map workers use a partitioning function to store intermediate key/value pairs on local disk
  • e.g., hash(key) mod R (see the sketch below)
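
A minimal sketch of such a partitioning function; MD5 stands in here for whatever hash the real implementation uses, so that the mapping is stable across processes (unlike Python's built-in hash()):

    import hashlib

    def partition(key, R):
        """Decide which of the R reduce tasks receives this intermediate key."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % R

    print(partition("Everyone", 4))  # the same key always goes to the same reducer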

[Figure: map workers partition intermediate data on local disk; reduce workers read their partitions and write the output files]
25
Fault Tolerance
  • Worker failure
  • The master keeps one of 3 states for each task: idle, in-progress, completed
  • The master pings each worker periodically
  • If a worker fails while a task is in progress, the task is reset to idle
  • If a map worker fails after its tasks have completed, they are also reset to idle, since their output lived on the failed worker's local disk (see the sketch after this list)
  • Reduce tasks are notified of the map worker failure
  • Master failure
  • Checkpointing
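
A minimal sketch of that per-task bookkeeping, using an illustrative Master class (not Google's implementation):

    # Tasks cycle through idle -> in-progress -> completed.  A worker failure
    # resets its in-progress tasks, and its completed *map* tasks (whose
    # output lived on the failed worker's local disk), back to idle.
    IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

    class Master:
        def __init__(self, map_tasks, reduce_tasks):
            self.state = {t: IDLE for t in map_tasks + reduce_tasks}
            self.map_tasks = set(map_tasks)
            self.assigned_to = {}  # task -> worker

        def assign(self, task, worker):
            self.state[task] = IN_PROGRESS
            self.assigned_to[task] = worker

        def complete(self, task):
            self.state[task] = COMPLETED

        def worker_failed(self, worker):
            for task, w in self.assigned_to.items():
                if w != worker:
                    continue
                if self.state[task] == IN_PROGRESS:
                    self.state[task] = IDLE
                elif self.state[task] == COMPLETED and task in self.map_tasks:
                    self.state[task] = IDLE  # map output died with the worker

    m = Master(["m0", "m1"], ["r0"])
    m.assign("m0", "worker-7")
    m.complete("m0")
    m.worker_failed("worker-7")
    print(m.state["m0"])  # idle: the map output must be regenerated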

26
Locality and Backup tasks
  • Locality
  • GFS stores 3 replicas of each 64 MB chunk
  • Attempt to schedule a map task on a machine that contains a replica of the corresponding input data
  • Stragglers
  • Due to a bad disk, or contention for network bandwidth, CPU, or memory
  • Perform backup execution

27
Refinements and Extensions
  • Combiner function
  • User defined
  • Runs within the map task
  • Saves network bandwidth
  • Skipping bad records
  • Best solution is to debug and fix
  • Not always possible: third-party source libraries
  • On a segmentation fault:
  • Send a UDP packet to the master from the signal handler
  • Include the sequence number of the record being processed
  • If the master sees two failures for the same record:
  • The next worker is told to skip the record (see the sketch below)
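
A sketch of the master-side skip logic; the data structures are illustrative and the UDP transport is omitted:

    from collections import Counter

    failures = Counter()  # (task_id, record_seq) -> number of reported crashes
    skip_set = set()

    def report_failure(task_id, record_seq):
        """Called when a worker's signal handler reports a crashing record."""
        failures[(task_id, record_seq)] += 1
        if failures[(task_id, record_seq)] >= 2:
            skip_set.add((task_id, record_seq))

    def should_skip(task_id, record_seq):
        """The next worker asks this before processing the record."""
        return (task_id, record_seq) in skip_set

    report_failure("map-12", 1423)
    report_failure("map-12", 1423)
    print(should_skip("map-12", 1423))  # True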

28
Refinements and Extensions
  • Local execution
  • For debugging purposes
  • Users have control over specific map tasks
  • Status information
  • The master runs an HTTP server
  • A status page shows the status of the computation
  • Links to output files
  • Standard error list

29
Performance
  • Tests run on a cluster of 1,800 machines
  • 4 GB of memory
  • Dual-processor 2 GHz Xeons with Hyper-Threading
  • Dual 160 GB IDE disks
  • Gigabit Ethernet per machine
  • Bandwidth approximately 100 Gbps
  • Two benchmarks:
  • Grep: 10^10 100-byte records, extracting records that match a rare pattern (92K matching records)
  • Sort: 10^10 100-byte records (modeled after the TeraSort benchmark)

30
Grep
  • Locality optimization helps
  • 1,800 machines read 1 TB of data at a peak of 31 GB/s
  • Without it, rack switches would limit throughput to 10 GB/s
  • Startup overhead is significant for short jobs

31
Sort
M = 15,000, R = 4,000
  • Three runs compared: normal, with no backup tasks, and with 200 processes killed
  • Backup tasks reduce job completion time significantly
  • The system deals well with failures

32
Discussion
  • Single point of failure: the master
  • Limits on M and R: the master makes O(M + R) scheduling decisions and keeps O(M * R) states in memory
  • Restricted programming model
  • MapReduce vs. River

33
Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber
Google, Inc.
34
  • Lots of data
  • Copies of the web (crawls), satellite images, user data, email, and USENET
  • No commercial system is big enough
  • Couldn't afford it if there were one
  • Might not have made appropriate design choices
  • 450,000 machines (NYTimes estimate, June 14th, 2006)

35
  • Scheduler (Google WorkQueue)
  • Google File System (GFS)
  • Chubby lock service
  • Other tools
  • Sawzall: scripting language
  • MapReduce: parallel processing
  • Bigtable is built using these tools

36
[Figure: a Bigtable cell. A Bigtable client uses the client library (Open()) to reach the cell. The master server performs metadata operations and load balancing; tablet servers serve the data; the cluster scheduling master handles failover and monitoring; GFS holds tablet data and logs; the lock service holds metadata and handles master election.]
37
  • More than sixty products and projects use Bigtable
  • It deals with enormous amounts of data
  • Crawl, 800 TB
  • Google Analytics, 200 TB
  • Google Maps, 0.5 TB
  • Google Earth, 200 TB
  • Millions of requests per second

38
[Chart: sizes of Google data sets, e.g. crawl data and email accounts]
39
  • Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map
  • Not a relational database table, just a table!
  • No integrity constraints or relational semantics
  • No multi-row transactions
  • All transactions are on a single row
  • It's not a database; it's a storage system!

40
  • A (row key, column key, timestamp) triple locates a cell in the table (toy model below)
  • Each cell contains several timestamped versions of its contents
  • A column key is written as family:qualifier
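
A toy Python model of that map, using the paper's com.cnn.www example; the put/get helpers are illustrative, not Bigtable's API:

    from collections import defaultdict

    # (row key, "family:qualifier", timestamp) -> value
    table = defaultdict(dict)

    def put(row, family, qualifier, timestamp, value):
        table[row][(f"{family}:{qualifier}", timestamp)] = value

    def get(row, family, qualifier, timestamp):
        return table[row].get((f"{family}:{qualifier}", timestamp))

    put("com.cnn.www", "contents", "", 3, "<html>...</html>")
    put("com.cnn.www", "anchor", "cnnsi.com", 9, "CNN")
    print(get("com.cnn.www", "anchor", "cnnsi.com", 9))  # CNN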

41
[Figure: an example web table with columns such as contents and language (e.g. cnn.com has language EN) and rows such as aaa.com, cnn.com, cnn.com/sports.html, ..., Yahoo.com/kids.html, ..., Zuppa.com/menu.html; the sorted rows are split into TABLETS, e.g. one tablet spanning aaa.com through bbc.uk.]
42
  • Contains some range of rows of the table
  • Built out of multiple SSTables

[Figure: a tablet (start: aaa.com, end: bbc.uk) built from multiple SSTables, each consisting of 64K blocks plus an index.]
43
  • Immutable, sorted file of key-value pairs
  • No simultaneous read/write
  • Chunks of data plus an index
  • The index is of block ranges, not values
  • Each block is 64K in size (lookup sketch below)
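
A toy sketch of an SSTable lookup: binary-search the block index to pick a block, then scan only that block. The SSTable class below is illustrative, not the real on-disk format, and uses tiny blocks instead of 64K:

    import bisect

    class SSTable:
        def __init__(self, sorted_items, block_size=2):
            self.blocks, self.index = [], []
            for i in range(0, len(sorted_items), block_size):
                block = sorted_items[i:i + block_size]
                self.index.append(block[0][0])  # first key of the block
                self.blocks.append(block)

        def get(self, key):
            pos = bisect.bisect_right(self.index, key) - 1
            if pos < 0:
                return None
            for k, v in self.blocks[pos]:  # scan a single block only
                if k == key:
                    return v
            return None

    sst = SSTable([("aaa.com", 1), ("bbc.uk", 2), ("cnn.com", 3),
                   ("mlp.org", 4), ("ppp.net", 5)])
    print(sst.get("cnn.com"))  # 3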

44
[Figure: a Bigtable split into tablets by row ranges (e.g. aaa..bbc, cbc..cnn, mlp..ppp), each tablet backed by several SSTables.]
45
[Figure: the same tablets assigned to tablet servers; each tablet is served by one tablet server, and a server can hold several tablets.]
46
  • Three-level hierarchy
  • Find which row range belongs to which tablet
  • A sort of binary search (sketch below)
  • Similar to a Unix file block lookup via an inode
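
A toy sketch of one level of that lookup; Bigtable chains three of these (a Chubby file pointing at the root tablet, the root tablet indexing METADATA tablets, and METADATA tablets indexing user tablets). The index layout here is an illustrative simplification:

    import bisect

    def locate(index, row_key):
        """index: sorted (end_row, location) pairs; a row belongs to the first
        tablet whose end row is >= the row key (a sort of binary search)."""
        end_rows = [end for end, _ in index]
        pos = bisect.bisect_left(end_rows, row_key)
        return index[pos][1] if pos < len(index) else None

    user_tablets = [("bbc", "tablet-1"), ("mlp", "tablet-2"), ("zzz", "tablet-3")]
    print(locate(user_tablets, "cnn"))  # tablet-2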

47
  • Tablet servers manage tablets, multiple tablets per server; each tablet is 100-200 MB
  • Each tablet lives at only one server
  • A tablet server splits tablets that get too big
  • The master is responsible for load balancing and fault tolerance
  • Chubby is used to monitor the health of tablet servers and restart failed servers
  • GFS replicates the data
48
  • The tablet is first located, then accessed
  • In the case of a write, mutations are logged
  • Chubby authorization is checked
  • The write is applied to an in-memory version (the memtable)
  • The log file is stored in GFS (sketch below)
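
A simplified sketch of that write path; the chubby, commit_log, and memtable objects are illustrative stand-ins supplied by the caller:

    def write(row, column, value, timestamp, *, chubby, commit_log, memtable):
        if not chubby.is_authorized():  # Chubby authorization check
            raise PermissionError("writer is not authorized")
        commit_log.append((row, column, timestamp, value))  # log lives in GFS
        memtable[(row, column, timestamp)] = value          # in-memory version

    class AllowAll:
        def is_authorized(self):
            return True

    log, mem = [], {}
    write("com.cnn.www", "contents:", "<html>...</html>", 7,
          chubby=AllowAll(), commit_log=log, memtable=mem)
    print(len(log), len(mem))  # 1 1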

49
  • Minor compaction
  • Converts the memtable into a new SSTable
  • Merging compaction
  • Reads the contents of a few SSTables and the memtable
  • A good place to apply policy: keep only N versions (see the sketch below)
  • Major compaction
  • A merging compaction that results in only one SSTable
  • No deleted records, only live data
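
A toy sketch of the first two compaction flavors over a memtable keyed by (row, column, timestamp); the data layout is illustrative:

    from collections import defaultdict

    def minor_compaction(memtable):
        """Freeze the memtable into a new sorted, immutable SSTable (item list)."""
        return sorted(memtable.items())

    def merging_compaction(sstables, memtable, keep_versions=3):
        """Fold several SSTables plus the memtable into one SSTable,
        keeping only the newest N versions of each (row, column)."""
        merged = defaultdict(list)  # (row, column) -> [(timestamp, value)]
        for source in list(sstables) + [sorted(memtable.items())]:
            for (row, column, ts), value in source:
                merged[(row, column)].append((ts, value))
        result = []
        for (row, column), versions in merged.items():
            versions.sort(reverse=True)  # newest timestamps first
            for ts, value in versions[:keep_versions]:
                result.append(((row, column, ts), value))
        return sorted(result)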

50
  • Locality groups
  • Multiple column families are grouped into a locality group
  • A separate SSTable is created to store each group
  • Compression and caching
  • Bloom filters
  • Filter whether an SSTable contains a particular row/column (sketch below)
  • Exploiting immutability
  • SSTables are immutable, so no concurrency control is needed for reading them
  • The only mutable structure is the memtable
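
A toy Bloom filter over (row, column) pairs, illustrating how a read can skip an SSTable that definitely does not contain the pair; sizes and hash choices here are illustrative:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits, self.num_hashes = num_bits, num_hashes
            self.bits = bytearray(num_bits)

        def _positions(self, key):
            for i in range(self.num_hashes):
                digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = 1

        def might_contain(self, key):
            # False means "definitely not here"; True may be a false positive.
            return all(self.bits[pos] for pos in self._positions(key))

    bf = BloomFilter()
    bf.add(("com.cnn.www", "anchor:cnnsi.com"))
    print(bf.might_contain(("com.cnn.www", "anchor:cnnsi.com")))  # True
    print(bf.might_contain(("aaa.com", "contents:")))             # almost surely False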

51
  • A large number of tablet servers
  • Each with
  • 1 GB RAM, 2 GB disk, and dual 2 GHz Opterons
  • 100-200 Gbps backbone bandwidth
  • Tested with millions of read/write hits

52
53
  • Real industrial systems are simple: they run a complex, large system with a very simple design. How?
  • How does immutability preserve consistency?
  • Immutable SSTables may waste storage