1
Cloud Computing - I
  • Presenters: Abhishek Verma, Nicolas Zea

2
Cloud Computing
  • MapReduce
  • Clean abstraction
  • Extremely rigid two-stage group-by-aggregation
  • Code reuse and maintenance are difficult
  • Google → MapReduce, Sawzall
  • Yahoo → Hadoop, Pig Latin
  • Microsoft → Dryad, DryadLINQ
  • Improving MapReduce in heterogeneous environments

3
MapReduce: a group-by-aggregate
[Figure: MapReduce data flow — input records are split across map tasks, locally sorted (quicksort), shuffled by key, and merged by reduce tasks into the output records]
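
To make the rigid two-stage flow concrete, here is a minimal Python sketch of the group-by-aggregate pattern; the per-URL visit counting and all function names are illustrative assumptions, not code from the slides.

    from itertools import groupby
    from operator import itemgetter

    # Map phase: emit (key, value) pairs from each input record.
    def map_fn(record):
        user, url, time = record
        yield (url, 1)

    # Reduce phase: aggregate all values sharing a key.
    def reduce_fn(key, values):
        return (key, sum(values))

    def map_reduce(records):
        pairs = [kv for r in records for kv in map_fn(r)]
        # "Shuffle": sort so equal keys are adjacent, standing in for
        # the distributed sort/shuffle between the two stages.
        pairs.sort(key=itemgetter(0))
        return [reduce_fn(k, [v for _, v in grp])
                for k, grp in groupby(pairs, key=itemgetter(0))]

    visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
              ("Fred", "cnn.com", "12:00")]
    print(map_reduce(visits))  # [('bbc.com', 1), ('cnn.com', 2)]
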
4
Shortcomings
  • Extremely rigid data flow
  • Other flows (extra stages, joins, splits) must be hacked in
  • Common operations must be coded by hand
  • Join, filter, projection, aggregates, sorting, distinct
  • Semantics hidden inside map-reduce functions
  • Difficult to maintain, extend, and optimize

5
Pig Latin A Not-So-Foreign Language for
Data Processing
  • Christopher Olston, Benjamin Reed, Utkarsh
    Srivastava, Ravi Kumar, Andrew Tomkins

Yahoo! Research
6
Pig Philosophy
  • Pigs Eat Anything
  • Can operate on data with or without metadata: relational, nested, or unstructured
  • Pigs Live Anywhere
  • Not tied to one particular parallel framework
  • Pigs Are Domestic Animals
  • Designed to be easily controlled and modified by its users
  • UDFs: transformation functions, aggregates, grouping functions, and conditionals
  • Pigs Fly
  • Processes data quickly (?)

7
Features
  • Dataflow language
  • Procedural, in contrast to SQL's declarative style
  • Quick Start and Interoperability
  • Nested Data Model
  • UDFs as First-Class Citizens
  • Parallelism Required
  • Debugging Environment

8
Pig Latin
  • Data Model
  • Atom: 'cs'
  • Tuple: ('cs', 'ece', 'ee')
  • Bag: {('cs', 'ece'), ('cs')}
  • Map: ['courses' → ('523', '525', '599')]
  • Expressions
  • Fields by position: $0
  • Fields by name: f1
  • Map lookup (see the sketch below)
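
A rough Python analogue of this nested data model (purely illustrative; Pig's own types are not Python types):

    # Atom: a simple value; Tuple: ordered fields; Bag: a collection of
    # tuples; Map: keys mapped to (possibly nested) values.
    atom = 'cs'
    tup  = ('cs', 'ece', 'ee')
    bag  = [('cs', 'ece'), ('cs',)]
    mp   = {'courses': ('523', '525', '599')}

    print(tup[0])          # field by position ($0)
    print(mp['courses'])   # map lookup by key
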

9
Example Data Analysis Task
  • Find the top 10 most visited pages in each
    category

Visits (User, URL, Time):
  Amy    cnn.com      8:00
  Amy    bbc.com     10:00
  Amy    flickr.com  10:05
  Fred   cnn.com     12:00
URL Info (URL, Category, PageRank), e.g.:
  espn.com   Sports   0.9
10
Data Flow
Load Visits → Group by url → Foreach url generate count ─┐
Load Url Info ───────────────────────────────────────────┤
                                                         └→ Join on url → Group by category → Foreach category generate top10 urls
11
In Pig Latin
    visits      = load '/data/visits' as (user, url, time);
    gVisits     = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo     = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls     = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';

12
Quick Start and Interoperability
(same Pig Latin script as on the previous slide)

Operates directly over files
13
Optional Schemas
(same Pig Latin script as above)

Schemas are optional and can be assigned dynamically
14
UDFs as First-Class Citizens
(same Pig Latin script as above)

UDFs can be used in every construct
15
Operators
  • LOAD: specifying input data
  • FOREACH: per-tuple processing
  • FLATTEN: eliminate nesting
  • FILTER: discarding unwanted data
  • COGROUP: getting related data together
  • GROUP, JOIN
  • STORE: asking for output
  • Other: UNION, CROSS, ORDER, DISTINCT

16
COGROUP vs. JOIN
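The original slide's figure is lost; the following Python sketch (hypothetical data and helper functions) illustrates the distinction: COGROUP keeps the matching tuples from each input as separate nested bags per key, while JOIN flattens them into their per-key cross product.

    from collections import defaultdict
    from itertools import product

    visits  = [("cnn.com", "Amy"), ("cnn.com", "Fred"), ("bbc.com", "Amy")]
    urlinfo = [("cnn.com", "News"), ("bbc.com", "News")]

    def cogroup(a, b):
        # For each key, keep the two groups as separate nested bags.
        out = defaultdict(lambda: ([], []))
        for t in a:
            out[t[0]][0].append(t)
        for t in b:
            out[t[0]][1].append(t)
        return dict(out)

    def join(a, b):
        # JOIN = COGROUP followed by flattening each key's cross product.
        return [(x, y) for ga, gb in cogroup(a, b).values()
                for x, y in product(ga, gb)]

    print(cogroup(visits, urlinfo)["cnn.com"])  # two nested bags
    print(join(visits, urlinfo))                # flattened pairs
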
17
Compilation into MapReduce
Every group or join operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases (see the sketch after the stage list):
  Map1/Reduce1: Load Visits; Group by url; Foreach url generate count
  Map2/Reduce2: Load Url Info; Join on url
  Map3/Reduce3: Group by category; Foreach category generate top10 urls
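
A hypothetical sketch of this compilation rule (the plan representation and operator names are assumptions): every GROUP or JOIN becomes the shuffle of a new map-reduce job, and surrounding operators are pipelined into the adjacent map or reduce phase.

    BOUNDARIES = {"group", "join"}

    def compile_to_mapreduce(plan):
        jobs, map_ops = [], []
        for op in plan:
            kind = op.split()[0]
            if kind in BOUNDARIES:
                # A group/join is a map-reduce boundary: it becomes the
                # shuffle of a new job; preceding ops pipeline into its map.
                jobs.append({"map": map_ops, "shuffle": op, "reduce": []})
                map_ops = []
            elif kind == "load" or map_ops or not jobs:
                map_ops.append(op)             # pipeline into the next map
            else:
                jobs[-1]["reduce"].append(op)  # pipeline into the last reduce
        return jobs

    plan = ["load visits", "group by url", "foreach url generate count",
            "load urlInfo", "join on url", "group by category",
            "foreach category generate top10 urls"]
    for i, job in enumerate(compile_to_mapreduce(plan), 1):
        print(f"MR{i}:", job)
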
18
Debugging Environment
  • Write-run-debug cycle
  • Sandbox dataset
  • Objectives
  • Realism
  • Conciseness
  • Completeness
  • Problems
  • UDFs

19
Future Work
  • Optional safe query optimizer
  • Performs only high-confidence rewrites
  • User interface
  • Boxes and arrows UI
  • Promote collaboration, sharing code fragments and
    UDFs
  • Tight integration with a scripting language
  • Use loops, conditionals of host language

20
DryadLINQ A System for General Purpose
Distributed Data-Parallel Computing Using a
High-Level Language
  • Yuan Yu, Michael Isard, Dennis Fetterly, Mihai
    Budiu,
  • Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey

21
Dryad System Architecture
[Figure: Dryad system architecture — the job manager (JM) on the control plane consults a name server (NS) and schedules vertices (V) onto cluster machines through per-machine process daemons (PD); the data plane moves data between vertices via files, TCP, FIFOs, and the network]
22
LINQ
    Collection<T> collection;
    bool IsLegal(Key k);
    string Hash(Key k);

    var results = from c in collection
                  where IsLegal(c.key)
                  select new { hash = Hash(c.key), c.value };
23
DryadLINQ Constructs
  • Partitioning: Hash, Range, RoundRobin
  • Apply, Fork
  • Hints

24
Dryad + LINQ = DryadLINQ

    Collection<T> collection;
    bool IsLegal(Key k);
    string Hash(Key k);

    var results = from c in collection
                  where IsLegal(c.key)
                  select new { hash = Hash(c.key), c.value };

[Figure: the query compiles into vertex code and a query plan (a Dryad job) that runs over the partitioned data collection to produce the distributed results]
25
DryadLINQ Execution Overview
[Figure: DryadLINQ execution overview — a .NET program on the client machine invokes a query expression; DryadLINQ compiles it into a distributed query plan and submits it to the Dryad job manager (JM) in the data center; Dryad executes the job over the input tables and writes output tables, which are returned to the client as a DryadTable of .NET objects that can be enumerated with foreach]
26
System Implementation
  • LINQ expressions are converted to an execution plan graph (EPG)
  • similar to a database query plan
  • a DAG annotated with metadata properties
  • The EPG is the skeleton of the Dryad dataflow graph
  • as long as native operations are used, properties can be propagated to aid optimization (see the sketch below)
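
A minimal sketch of the idea (the node structure and property names are assumptions, not the DryadLINQ implementation):

    from dataclasses import dataclass, field

    @dataclass
    class EpgNode:
        op: str                                     # e.g. "scan", "where"
        inputs: list = field(default_factory=list)  # upstream DAG nodes
        meta: dict = field(default_factory=dict)    # metadata annotations

    scan = EpgNode("scan", meta={"partitioning": "hash(url)"})
    filt = EpgNode("where", inputs=[scan])

    # A native operator such as `where` does not disturb partitioning, so
    # the property propagates downstream and a repartition can be avoided.
    filt.meta["partitioning"] = scan.meta["partitioning"]
    print(filt.meta)
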

27
Static Optimizations
  • Pipelining
  • Multiple operations in a single process
  • Removing redundancy
  • Eager aggregation (see the sketch below)
  • Move aggregations in front of partitionings
  • I/O reduction
  • Use TCP and in-memory FIFOs instead of writing to disk where possible
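
A sketch of eager aggregation in Python (the counting example is an assumption): partial aggregation runs on each node before the data is repartitioned, so far less data crosses the network.

    from collections import Counter

    def eager_aggregate(partitions):
        # Local (eager) aggregation on every input partition first...
        partials = [Counter(part) for part in partitions]
        # ...so the merge after repartitioning moves one count per key
        # per partition instead of one record per occurrence.
        total = Counter()
        for p in partials:
            total.update(p)
        return total

    parts = [["cnn.com", "bbc.com", "cnn.com"], ["cnn.com", "flickr.com"]]
    print(eager_aggregate(parts))  # Counter({'cnn.com': 3, ...})
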

28
Dynamic Optimizations
  • As information from the running job becomes available, the execution graph is mutated
  • Decisions based on dataset sizes
  • Intelligent partitioning of data

29
Dynamic Optimizations
  • Aggregation can be turned into a tree to improve I/O, based on locality
  • Example: partial results are computed locally, then aggregated before being sent across the network, as sketched below
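
A tiny sketch of the tree idea under an assumed two-level (machine → rack → global) hierarchy:

    def tree_aggregate(rack_partials, combine):
        # First combine within each rack (local, cheap), then combine the
        # per-rack results across the network (one value per rack).
        rack_totals = [combine(partials) for partials in rack_partials]
        return combine(rack_totals)

    racks = [[3, 5, 2], [7, 1]]        # per-machine partial counts, by rack
    print(tree_aggregate(racks, sum))  # 18
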

30
Evaluation
  • TeraSort: scalability
  • 240-computer cluster of 2.6 GHz dual-core AMD Opterons
  • Sort 10 billion 100-byte records on a 10-byte key
  • Each computer stores 3.87 GB

31
Evaluation
  • DryadLINQ vs. Dryad: SkyServer query
  • The Dryad version is hand-optimized
  • No dynamic optimization overhead
  • DryadLINQ is within ~10% of the native code

32
Main Benefits
  • High-level and data-type transparent
  • Friendly to automatic optimization
  • Manual optimizations possible via the Apply operator
  • Leverages any system running the LINQ framework
  • Support for interacting with SQL databases
  • Single-computer debugging made easy
  • Strong typing, narrow interface
  • Deterministic replay of execution

33
Discussion
  • Dynamic optimizations appear data intensive
  • What kind of overhead do they add?
  • EPG analysis overhead → high latency
  • No real comparison with other systems
  • Progress tracking is difficult
  • No speculation
  • Will solid-state drives diminish the advantages of MapReduce?
  • Why not use parallel databases?
  • MapReduce vs. Dryad
  • How different from Sawzall and Pig?

34
Comparison

                    Sawzall              Pig Latin                     DryadLINQ
Built by            Google               Yahoo                         Microsoft
Programming         Imperative           Imperative                    Imperative/declarative hybrid
Resemblance to SQL  Least                Moderate                      Most
Execution engine    Google MapReduce     Hadoop                        Dryad
Performance         Very efficient       5-10 times slower             1.3-2 times slower
Implementation      Internal to Google   Open source (Apache License)  Internal to Microsoft
Model               Operates per record  Sequence of MapReduce jobs    DAGs
Usage               Log analysis         Machine learning              Iterative computations
35
Improving MapReduce Performance in Heterogeneous
Environments
  • Matei Zaharia, Andy Konwinski, Anthony Joseph,
  • Randy Katz, Ion Stoica
  • University of California at Berkeley

36
Hadoop Speculative Execution Overview
  • Speculative tasks are executed only if no failed or waiting tasks are available
  • Notion of progress
  • 3 phases of (reduce) execution
  • Copy phase
  • Sort phase
  • Reduce phase
  • Each phase is weighted by the data it processes
  • Progress determines whether a task failed or is a straggler available for speculation (see the sketch below)
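
A sketch of the weighted progress score for a reduce task, as the paper describes Hadoop's behavior: each of the three phases counts for 1/3 of the score, scaled by the fraction of that phase's data processed so far.

    def reduce_progress(phase, fraction_done):
        # Each finished phase contributes 1/3; the current phase
        # contributes its fraction of data processed, scaled by 1/3.
        finished = ["copy", "sort", "reduce"].index(phase)
        return (finished + fraction_done) / 3.0

    print(reduce_progress("copy", 1.0))    # ~0.33: copy phase just done
    print(reduce_progress("reduce", 0.5))  # ~0.83: halfway through reduce
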

37
Hadoop's Assumptions
  1. Nodes can perform work at exactly the same rate
  2. Tasks progress at a constant rate throughout time
  3. There is no cost to launching a speculative task
    on an idle node
  4. The three phases of execution take approximately the same time
  5. Tasks with a low progress score are stragglers
  6. Maps and Reduces require roughly the same amount
    of work

38
Breaking Down the Assumptions
  • Virtualization breaks down homogeneity
  • Amazon EC2: multiple VMs on the same physical host
  • VMs compete for memory and network bandwidth
  • E.g., two map tasks can compete for disk bandwidth, causing one to become a straggler

39
Breaking Down the Assumptions
  • The progress threshold in Hadoop is fixed and assumes that low progress implies a faulty node
  • Too many speculative tasks get executed
  • Speculative execution can harm running tasks

40
Breaking Down the Assumptions
  • Task phases are not equal
  • The copy phase is typically the most expensive, due to network communication cost
  • Causes many tasks to jump rapidly from 1/3 progress to 1, creating fake stragglers
  • Real stragglers get usurped
  • Unnecessary copying due to fake stragglers
  • The fixed progress-score threshold means anything above 80% progress is never speculatively executed

41
LATE Scheduler
  • Longest Approximate Time to End
  • Primary assumption: the best task to speculate is the one expected to finish furthest into the future
  • Secondary assumption: tasks make progress at an approximately constant rate
  • ProgressRate = ProgressScore / T, where T is the time the task has been running
  • Estimated time to completion = (1 - ProgressScore) / ProgressRate (see the sketch below)
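
A sketch of LATE's estimate (formulas from the paper; the task records and values are made up):

    def time_left(progress_score, elapsed):
        rate = progress_score / elapsed       # ProgressRate = score / T
        return (1.0 - progress_score) / rate  # estimated time to completion

    # Speculate on the task expected to finish furthest into the future.
    tasks = {"t1": (0.9, 90.0),   # (ProgressScore, seconds running)
             "t2": (0.2, 80.0)}   # low score for its runtime: a straggler
    estimates = {t: time_left(p, e) for t, (p, e) in tasks.items()}
    print(max(estimates, key=estimates.get))  # -> t2
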

42
LATE Scheduler
  • Launch speculative tasks on fast nodes
  • Best chance to overcome the straggler, versus using the first available node
  • Cap on the total number of speculative tasks
  • Minimum slowness threshold before a task may be speculated
  • Does not take data locality into account

43
Performance Comparison Without Stragglers
  • EC2 test cluster
  • 1.0-1.2 GHz Opteron/Xeon with 1.7 GB memory

[Figure: Sort running times]
44
Performance Comparison With Stragglers
  • Manually slowed down 8 VMs with background
    processes

[Figure: Sort running times with stragglers]
45
Performance Comparison With Stragglers
[Figures: WordCount and Grep running times with stragglers]
46
Sensitivity
47
Sensitivity
48
Takeaways
  1. Make decisions early
  2. Use finishing times
  3. Nodes are not equal
  4. Resources are precious

49
Further questions
  • Is focusing the work on small VMs fair?
  • Would it be better to pay for a large VM and implement the system with more customized control?
  • Could this be used in other systems?
  • Progress tracking is key
  • Is this a fundamental contribution, or just an optimization?
  • Good research?