ACMS: The Akamai Configuration Management System
1
ACMS: The Akamai Configuration Management System
Alex Sherman, Philip A. Lisiecki, Andy Berkheimer, and Joel Wein. Akamai Technologies, Inc.; Columbia University; Polytechnic University. Presented at NSDI 2005.
Most slides are reproduced from the authors' original presentation.
2
The Akamai Platform
  • Akamai operates a Content Delivery Network of
    15,000 servers distributed across 1,200 ISPs in
    60 countries
  • Web properties (Akamai's customers) use these servers to bring their web content and applications closer to end users

3
Problem: configuration and control
  • Even with this widely distributed platform, customers need to maintain control of how their content is served
  • Customers need to configure their service options with the same ease and flexibility as if it were a centralized, locally hosted system

4
Why difficult?
  • 15,000 servers must synchronize to the latest
    configurations within a few minutes
  • Some servers may be down or partitioned off at the time of reconfiguration
  • A server that comes up after some downtime must
    re-synchronize quickly
  • Configuration may be initiated from anywhere on
    the network and must reach all other servers

5
Proposed Architecture
  • Front-end: a small collection of Storage Points (SPs) responsible for accepting, storing, and synchronizing configuration files
  • Back-end: reliable and efficient delivery of configuration files to all of the edge servers; leverages the Akamai CDN

[Figure: a publisher submits configuration files to a small set of Storage Points (SPs), which deliver them to 15,000 edge servers]
6
Agreement
  • A publisher contacts an accepting SP
  • The accepting SP replicates a temporary file to a majority of SPs
  • If replication succeeds, the accepting SP initiates an agreement algorithm called Vector Exchange
  • Upon success, the accepting SP accepts the submission and all SPs upload the new file

[Figure: the accepting SP replicates the publisher's submission to the other Storage Points]
7
Vector Exchange
  • For each agreement, SPs exchange a bit vector
  • Each bit corresponds to the commitment status of a corresponding SP
  • Once a majority of the bits are set, we say that agreement takes place
  • When any SP learns of an agreement it can upload the submission (a minimal sketch follows below)
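
Vector Exchange can be pictured as each SP OR-merging commit bits from its peers until a majority of bits are set. Below is a minimal Python sketch of that idea; the class and method names (VectorExchange, agreement_reached) are illustrative rather than taken from the paper, and real SPs would exchange these vectors over the network.

    # Simplified sketch of Vector Exchange: each Storage Point (SP) sets its
    # own bit once it has stored the submission, OR-merges vectors received
    # from peers, and declares agreement once a majority of bits are set.
    class VectorExchange:
        def __init__(self, sp_id, all_sp_ids):
            self.sp_id = sp_id
            self.all_sp_ids = list(all_sp_ids)
            self.vector = {sp: False for sp in self.all_sp_ids}

        def commit_locally(self):
            """Called after this SP has durably stored the temporary file."""
            self.vector[self.sp_id] = True

        def merge(self, peer_vector):
            """OR-merge a bit vector received from a peer SP."""
            for sp, bit in peer_vector.items():
                self.vector[sp] = self.vector[sp] or bit

        def agreement_reached(self):
            """Agreement holds once a majority of SPs have set their bit."""
            return sum(self.vector.values()) > len(self.all_sp_ids) // 2

    # Example: 5 SPs, SP "A" initiates; bits from A, B, and C form a majority.
    ve = VectorExchange("A", ["A", "B", "C", "D", "E"])
    ve.commit_locally()
    ve.merge({"A": True, "B": True, "C": True, "D": False, "E": False})
    print(ve.agreement_reached())  # True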

8
Vector Exchange Guarantees
  • If a submission is accepted, at least a majority of SPs have stored and agreed on the submission
  • The agreement is never lost by a future quorum. Why?
  • Any future quorum contains at least one SP that saw the initiated agreement

9
Recovery Routine
  • Each SP continuously runs a recovery routine that queries other SPs for missed agreements (see the sketch below)
  • If an SP finds that it missed an agreement, it downloads the corresponding configuration file
  • Over time, each SP maintains a snapshot: a list of the latest versions of all accepted files
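
A minimal sketch of such a recovery loop, assuming hypothetical helpers fetch_index and download_file for talking to peer SPs:

    import time

    def recovery_loop(local_index, peer_sps, fetch_index, download_file,
                      poll_interval=30):
        """Continuously query peer SPs for agreements this SP may have missed.

        local_index: dict mapping filename -> latest version known locally
        fetch_index(sp): returns that SP's snapshot (filename -> version)
        download_file(sp, filename, version): fetches a missed file
        """
        while True:
            for sp in peer_sps:
                for filename, version in fetch_index(sp).items():
                    if local_index.get(filename, -1) < version:
                        download_file(sp, filename, version)
                        local_index[filename] = version
            time.sleep(poll_interval)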

10
Back-end Delivery
  • Processes on edge servers subscribe to specific configurations via their local Receiver process
  • Receivers periodically query the snapshots on the SPs to learn of any updates
  • If the updates match any subscriptions, the Receivers download the files via HTTP IMS (If-Modified-Since) requests (see the sketch below)
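
A rough sketch of one Receiver polling cycle; the URL layout and the apply_config callback are illustrative assumptions, not the actual Akamai protocol:

    import urllib.error
    import urllib.request

    def poll_subscriptions(sp_host, subscriptions, last_modified, apply_config):
        """Fetch subscribed files from a Storage Point via HTTP IMS requests.

        last_modified: dict filename -> Last-Modified value of the last fetch
        apply_config: callback(filename, bytes) invoked for each updated file
        """
        for filename in subscriptions:
            url = f"http://{sp_host}/configs/{filename}"  # illustrative layout
            req = urllib.request.Request(url)
            if filename in last_modified:
                req.add_header("If-Modified-Since", last_modified[filename])
            try:
                with urllib.request.urlopen(req) as resp:
                    last_modified[filename] = resp.headers.get("Last-Modified", "")
                    apply_config(filename, resp.read())
            except urllib.error.HTTPError as err:
                if err.code != 304:  # 304 Not Modified: nothing to do
                    raise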

11
Evaluation
  • Evaluated on top of the real Akamai network
  • 48 hours in the middle of a week
  • 14,276 file submissions with five SPs
  • Most (40%) were small, less than 1 KB
  • Some (3%) were larger, between 10 MB and 100 MB

12
Submission and propagation
  • Randomly sampled 250 edge servers to measure propagation time
  • 55 seconds on average
  • Dominated by cache TTLs and polling intervals

13
Propagation vs. File Sizes
  • Mean and 95th-percentile propagation time vs. file size
  • 99.95% of updates arrived within 3 minutes
  • The rest were delayed due to temporary connectivity issues

14
Discussion
  • Availability of a majority of SPs guarantees agreement
  • Could an SP availability metric be used to construct a smaller quorum?
  • How many SPs should there be for 15,000 edge servers?
  • The experiments use five. Is that too small?
  • How are these SPs selected?

15
MapReduce: Simplified Data Processing on Large Clusters
  • Jeffrey Dean and Sanjay Ghemawat
  • Google, Inc.

16
What is MapReduce?
  • A programming model or design framework or design pattern
  • Allows distributed back-end processing with parallelization, fault tolerance, data distribution, and load balancing
  • Several projects are built on MapReduce, such as Hadoop and Skynet

17
What is MapReduce?
  • Terms are borrowed from functional languages (e.g., Lisp); a Python equivalent follows below
  • (map square '(1 2 3 4))
  • (1 4 9 16)
  • (reduce + '(1 4 9 16))
  • (+ 16 (+ 9 (+ 4 1)))
  • 30
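
For comparison, the same two steps in Python:

    from functools import reduce

    squared = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
    total = reduce(lambda a, b: a + b, squared)          # ((1 + 4) + 9) + 16 = 30
    print(squared, total)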

18
Map
  • map()
  • Processes a key/value pair to generate intermediate key/value pairs

Input: "Welcome Everyone Hello Everyone"
Map output: (Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1)
19
Reduce
  • reduce()
  • Merges all intermediate values associated with the same key (a Python sketch follows below)

Map output: (Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1)
Reduce output: (Everyone, 2) (Hello, 1) (Welcome, 1)
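
A self-contained Python sketch of this word-count example (the function names map_phase and reduce_phase are illustrative):

    from collections import defaultdict

    def map_phase(text):
        """Map: emit an intermediate (word, 1) pair for every word."""
        return [(word, 1) for word in text.split()]

    def reduce_phase(pairs):
        """Reduce: merge all intermediate values that share the same key."""
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    pairs = map_phase("Welcome Everyone Hello Everyone")
    print(pairs)                # [('Welcome', 1), ('Everyone', 1), ('Hello', 1), ('Everyone', 1)]
    print(reduce_phase(pairs))  # {'Welcome': 1, 'Everyone': 2, 'Hello': 1}
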
20
Some Applications
  • Distributed Grep
  • Map: emits a line if it matches the supplied pattern
  • Reduce: copies the intermediate data to the output
  • Count of URL Access Frequency
  • Map: processes web request logs and outputs (URL, 1)
  • Reduce: adds the counts and emits (URL, total count)
  • Reverse Web-Link Graph
  • Map: processes web pages and outputs (target, source) for each link (see the sketch below)
  • Reduce: emits (target, list(source))
  • Used by Google News, Google Search
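
A similar sketch for the Reverse Web-Link Graph job (function names are illustrative):

    from collections import defaultdict

    def map_links(source_url, links_on_page):
        """Map: emit (target, source) for every link found on the source page."""
        return [(target, source_url) for target in links_on_page]

    def reduce_links(pairs):
        """Reduce: group the sources pointing at each target."""
        graph = defaultdict(list)
        for target, source in pairs:
            graph[target].append(source)
        return dict(graph)

    pairs = map_links("a.com", ["x.com", "y.com"]) + map_links("b.com", ["x.com"])
    print(reduce_links(pairs))  # {'x.com': ['a.com', 'b.com'], 'y.com': ['a.com']}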

21
What is behind MapReduce?
  • Make it distributed
  • Partition input key/value pairs into chunks, run
    map() tasks in parallel.
  • After all map()s are complete, partition the
    stored values.
  • Run reduce() in parallel

22
How MapReduce Works
  • User's to-do list:
  • Specify input/output files
  • M: number of map tasks
  • R: number of reduce tasks
  • W: number of machines
  • Write the map and reduce functions
  • Submit the job
  • This requires no knowledge of parallel/distributed
    systems!!!
  • What about everything else?

23
(No Transcript)
24
How MapReduce Works
  • Input slices are typically 16 MB to 64 MB
  • Map workers use a partitioning function to store intermediate key/value pairs on local disk
  • e.g., hash(key) mod R (see the sketch below)
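
A minimal sketch of such a partitioning function; MD5 stands in here for whatever hash the real implementation uses, so that the mapping is stable across processes (unlike Python's built-in hash()):

    import hashlib

    def partition(key, R):
        """Decide which of the R reduce tasks receives this intermediate key."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % R

    print(partition("Everyone", 4))  # the same key always goes to the same reducer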

[Figure: map workers partition intermediate data on local disk; reduce workers read their partitions and write the output files]
25
Fault Tolerance
  • Worker failure
  • The master keeps one of 3 states for each task: idle, in-progress, completed
  • The master pings each worker periodically
  • If a worker fails while a task is in progress, the task is reset to idle
  • If a map worker fails after its tasks have completed, they are also reset to idle, since their output lived on the failed worker's local disk (see the sketch after this list)
  • Reduce tasks are notified of the map worker failure
  • Master failure
  • Checkpointing
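
A minimal sketch of that per-task bookkeeping, using an illustrative Master class (not Google's implementation):

    # Tasks cycle through idle -> in-progress -> completed.  A worker failure
    # resets its in-progress tasks, and its completed *map* tasks (whose
    # output lived on the failed worker's local disk), back to idle.
    IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

    class Master:
        def __init__(self, map_tasks, reduce_tasks):
            self.state = {t: IDLE for t in map_tasks + reduce_tasks}
            self.map_tasks = set(map_tasks)
            self.assigned_to = {}  # task -> worker

        def assign(self, task, worker):
            self.state[task] = IN_PROGRESS
            self.assigned_to[task] = worker

        def complete(self, task):
            self.state[task] = COMPLETED

        def worker_failed(self, worker):
            for task, w in self.assigned_to.items():
                if w != worker:
                    continue
                if self.state[task] == IN_PROGRESS:
                    self.state[task] = IDLE
                elif self.state[task] == COMPLETED and task in self.map_tasks:
                    self.state[task] = IDLE  # map output died with the worker

    m = Master(["m0", "m1"], ["r0"])
    m.assign("m0", "worker-7")
    m.complete("m0")
    m.worker_failed("worker-7")
    print(m.state["m0"])  # idle: the map output must be regenerated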

26
Locality and Backup tasks
  • Locality
  • GFS stores 3 replicas of each 64 MB chunk
  • Attempt to schedule a map task on a machine that contains a replica of the corresponding input data
  • Stragglers
  • Due to a bad disk, or contention for network bandwidth, CPU, or memory
  • Perform backup execution

27
Refinements and Extensions
  • Combiner function
  • User defined
  • Runs within the map task
  • Saves network bandwidth
  • Skipping bad records
  • Best solution is to debug and fix
  • Not always possible: third-party source libraries
  • On a segmentation fault:
  • Send a UDP packet to the master from the signal handler
  • Include the sequence number of the record being processed
  • If the master sees two failures for the same record:
  • The next worker is told to skip the record (see the sketch below)
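
A sketch of the master-side skip logic; the data structures are illustrative and the UDP transport is omitted:

    from collections import Counter

    failures = Counter()  # (task_id, record_seq) -> number of reported crashes
    skip_set = set()

    def report_failure(task_id, record_seq):
        """Called when a worker's signal handler reports a crashing record."""
        failures[(task_id, record_seq)] += 1
        if failures[(task_id, record_seq)] >= 2:
            skip_set.add((task_id, record_seq))

    def should_skip(task_id, record_seq):
        """The next worker asks this before processing the record."""
        return (task_id, record_seq) in skip_set

    report_failure("map-12", 1423)
    report_failure("map-12", 1423)
    print(should_skip("map-12", 1423))  # True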

28
Refinements and Extensions
  • Local execution
  • For debugging purposes
  • Users have control over specific map tasks
  • Status information
  • The master runs an HTTP server
  • A status page shows the status of the computation
  • Links to output files
  • Standard error list

29
Performance
  • Tests run on a cluster of 1,800 machines
  • 4 GB of memory
  • Dual-processor 2 GHz Xeons with Hyper-Threading
  • Dual 160 GB IDE disks
  • Gigabit Ethernet per machine
  • Bandwidth approximately 100 Gbps
  • Two benchmarks:
  • Grep: 10^10 100-byte records, extracting records that match a rare pattern (92K matching records)
  • Sort: 10^10 100-byte records (modeled after the TeraSort benchmark)

30
Grep
  • Locality optimization helps
  • 1,800 machines read 1 TB of data at a peak of 31 GB/s
  • Without it, rack switches would limit throughput to 10 GB/s
  • Startup overhead is significant for short jobs

31
Sort
M = 15,000, R = 4,000
  • Three runs compared: normal, with no backup tasks, and with 200 processes killed
  • Backup tasks reduce job completion time significantly
  • The system deals well with failures

32
Discussion
  • Single point of failure: the master
  • Limits on M and R: the master makes O(M + R) scheduling decisions and keeps O(M * R) states in memory
  • Restricted programming model
  • MapReduce vs. River

33
Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber
Google, Inc.
34
  • Lots of data
  • Copies of the web (crawls), satellite images, user data, email, and USENET
  • No commercial system is big enough
  • Couldn't afford it if there were one
  • Might not have made appropriate design choices
  • 450,000 machines (NYTimes estimate, June 14th, 2006)

35
  • Scheduler (Google WorkQueue)
  • Google File System (GFS)
  • Chubby lock service
  • Other tools
  • Sawzall: scripting language
  • MapReduce: parallel processing
  • Bigtable is built using these tools

36
[Figure: a Bigtable cell. A Bigtable client uses the client library (Open()) to reach the cell. The master server performs metadata operations and load balancing; tablet servers serve the data; the cluster scheduling master handles failover and monitoring; GFS holds tablet data and logs; the lock service holds metadata and handles master election.]
37
  • More than sixty products and projects use Bigtable
  • It deals with enormous amounts of data
  • Crawl, 800 TB
  • Google Analytics, 200 TB
  • Google Maps, 0.5 TB
  • Google Earth, 200 TB
  • Millions of requests per second

38
[Chart: sizes of Google data sets, e.g. crawl data and email accounts]
39
  • Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map
  • Not a relational database table, just a table!
  • No integrity constraints or relational semantics
  • No multi-row transactions
  • All transactions are on a single row
  • It's not a database; it's a storage system!

40
  • A (row key, column key, timestamp) triple locates a cell in the table (toy model below)
  • Each cell contains several timestamped versions of its contents
  • A column key is written as family:qualifier
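
A toy Python model of that map, using the paper's com.cnn.www example; the put/get helpers are illustrative, not Bigtable's API:

    from collections import defaultdict

    # (row key, "family:qualifier", timestamp) -> value
    table = defaultdict(dict)

    def put(row, family, qualifier, timestamp, value):
        table[row][(f"{family}:{qualifier}", timestamp)] = value

    def get(row, family, qualifier, timestamp):
        return table[row].get((f"{family}:{qualifier}", timestamp))

    put("com.cnn.www", "contents", "", 3, "<html>...</html>")
    put("com.cnn.www", "anchor", "cnnsi.com", 9, "CNN")
    print(get("com.cnn.www", "anchor", "cnnsi.com", 9))  # CNN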

41
[Figure: an example web table with columns such as contents and language (e.g. cnn.com has language EN) and rows such as aaa.com, cnn.com, cnn.com/sports.html, ..., Yahoo.com/kids.html, ..., Zuppa.com/menu.html; the sorted rows are split into TABLETS, e.g. one tablet spanning aaa.com through bbc.uk.]
42
  • Contains some range of rows of the table
  • Built out of multiple SSTables

[Figure: a tablet (start: aaa.com, end: bbc.uk) built from multiple SSTables, each consisting of 64K blocks plus an index.]
43
  • Immutable, sorted file of key-value pairs
  • No simultaneous read/write
  • Chunks of data plus an index
  • The index is of block ranges, not values
  • Each block is 64K in size (lookup sketch below)
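
A toy sketch of an SSTable lookup: binary-search the block index to pick a block, then scan only that block. The SSTable class below is illustrative, not the real on-disk format, and uses tiny blocks instead of 64K:

    import bisect

    class SSTable:
        def __init__(self, sorted_items, block_size=2):
            self.blocks, self.index = [], []
            for i in range(0, len(sorted_items), block_size):
                block = sorted_items[i:i + block_size]
                self.index.append(block[0][0])  # first key of the block
                self.blocks.append(block)

        def get(self, key):
            pos = bisect.bisect_right(self.index, key) - 1
            if pos < 0:
                return None
            for k, v in self.blocks[pos]:  # scan a single block only
                if k == key:
                    return v
            return None

    sst = SSTable([("aaa.com", 1), ("bbc.uk", 2), ("cnn.com", 3),
                   ("mlp.org", 4), ("ppp.net", 5)])
    print(sst.get("cnn.com"))  # 3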

44
[Figure: a Bigtable split into tablets by row ranges (e.g. aaa..bbc, cbc..cnn, mlp..ppp), each tablet backed by several SSTables.]
45
[Figure: the same tablets assigned to tablet servers; each tablet is served by one tablet server, and a server can hold several tablets.]
46
  • Three-level hierarchy
  • Find which row range belongs to which tablet
  • A sort of binary search (sketch below)
  • Similar to a Unix file block lookup via an inode
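
A toy sketch of one level of that lookup; Bigtable chains three of these (a Chubby file pointing at the root tablet, the root tablet indexing METADATA tablets, and METADATA tablets indexing user tablets). The index layout here is an illustrative simplification:

    import bisect

    def locate(index, row_key):
        """index: sorted (end_row, location) pairs; a row belongs to the first
        tablet whose end row is >= the row key (a sort of binary search)."""
        end_rows = [end for end, _ in index]
        pos = bisect.bisect_left(end_rows, row_key)
        return index[pos][1] if pos < len(index) else None

    user_tablets = [("bbc", "tablet-1"), ("mlp", "tablet-2"), ("zzz", "tablet-3")]
    print(locate(user_tablets, "cnn"))  # tablet-2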

47
  • Tablet servers manage tablets, multiple tablets per server; each tablet is 100-200 MB
  • Each tablet lives at only one server
  • A tablet server splits tablets that get too big
  • The master is responsible for load balancing and fault tolerance
  • Chubby is used to monitor the health of tablet servers and restart failed servers
  • GFS replicates the data
48
  • The tablet is first located, then accessed
  • In the case of a write, mutations are logged
  • Chubby authorization is checked
  • The write is applied to an in-memory version (the memtable)
  • The log file is stored in GFS (sketch below)
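
A simplified sketch of that write path; the chubby, commit_log, and memtable objects are illustrative stand-ins supplied by the caller:

    def write(row, column, value, timestamp, *, chubby, commit_log, memtable):
        if not chubby.is_authorized():  # Chubby authorization check
            raise PermissionError("writer is not authorized")
        commit_log.append((row, column, timestamp, value))  # log lives in GFS
        memtable[(row, column, timestamp)] = value          # in-memory version

    class AllowAll:
        def is_authorized(self):
            return True

    log, mem = [], {}
    write("com.cnn.www", "contents:", "<html>...</html>", 7,
          chubby=AllowAll(), commit_log=log, memtable=mem)
    print(len(log), len(mem))  # 1 1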

49
  • Minor compaction
  • Converts the memtable into a new SSTable
  • Merging compaction
  • Reads the contents of a few SSTables and the memtable
  • A good place to apply policy: keep only N versions (see the sketch below)
  • Major compaction
  • A merging compaction that results in only one SSTable
  • No deleted records, only live data
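
A toy sketch of the first two compaction flavors over a memtable keyed by (row, column, timestamp); the data layout is illustrative:

    from collections import defaultdict

    def minor_compaction(memtable):
        """Freeze the memtable into a new sorted, immutable SSTable (item list)."""
        return sorted(memtable.items())

    def merging_compaction(sstables, memtable, keep_versions=3):
        """Fold several SSTables plus the memtable into one SSTable,
        keeping only the newest N versions of each (row, column)."""
        merged = defaultdict(list)  # (row, column) -> [(timestamp, value)]
        for source in list(sstables) + [sorted(memtable.items())]:
            for (row, column, ts), value in source:
                merged[(row, column)].append((ts, value))
        result = []
        for (row, column), versions in merged.items():
            versions.sort(reverse=True)  # newest timestamps first
            for ts, value in versions[:keep_versions]:
                result.append(((row, column, ts), value))
        return sorted(result)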

50
  • Locality groups
  • Multiple column families are grouped into a locality group
  • A separate SSTable is created to store each group
  • Compression and caching
  • Bloom filters
  • Filter whether an SSTable contains a particular row/column (sketch below)
  • Exploiting immutability
  • SSTables are immutable, so no concurrency control is needed for reading them
  • The only mutable structure is the memtable
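
A toy Bloom filter over (row, column) pairs, illustrating how a read can skip an SSTable that definitely does not contain the pair; sizes and hash choices here are illustrative:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits, self.num_hashes = num_bits, num_hashes
            self.bits = bytearray(num_bits)

        def _positions(self, key):
            for i in range(self.num_hashes):
                digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = 1

        def might_contain(self, key):
            # False means "definitely not here"; True may be a false positive.
            return all(self.bits[pos] for pos in self._positions(key))

    bf = BloomFilter()
    bf.add(("com.cnn.www", "anchor:cnnsi.com"))
    print(bf.might_contain(("com.cnn.www", "anchor:cnnsi.com")))  # True
    print(bf.might_contain(("aaa.com", "contents:")))             # almost surely False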

51
  • A large number of tablet servers
  • Each with
  • 1 GB RAM, 2 GB disk, and dual 2 GHz Opterons
  • 100-200 Gbps backbone bandwidth
  • Tested with millions of read/write hits

52
53
  • Real industrial systems are simple: they run a complex, large system with a very simple design. How?
  • How does immutability preserve consistency?
  • Immutable SSTables may waste storage