Cloud Computing - PowerPoint PPT Presentation


Title: Cloud Computing


1
Cloud Computing: Languages and Architectures
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 650 Implementing Data Management Systems
  • November 2, 2008

Slides 15-21 by Chris Olston, used with permission
2
The Core of Distributed Programming in the Cloud:
MapReduce
  • In many circles, considered the key building
    block for much of Google's data analysis
  • A programming language built on it:
    Sawzall, http://labs.google.com/papers/sawzall.html
  • "Sawzall has become one of the most widely used
    programming languages at Google. On one
    dedicated Workqueue cluster with 1500 Xeon CPUs,
    there were 32,580 Sawzall jobs launched, using an
    average of 220 machines each. While running those
    jobs, 18,636 failures occurred (application
    failure, network outage, system crash, etc.) that
    triggered rerunning some portion of the job. The
    jobs read a total of 3.2×10^15 bytes of data
    (2.8PB) and wrote 9.9×10^12 bytes (9.3TB)."
  • Other similar languages: Yahoo's Pig Latin and
    Pig; Microsoft's Dryad
  • Cloned in open source: Hadoop,
    http://hadoop.apache.org/core/

3
MapReduce: Simple Distributed Functional
Programming Primitives
  • Modeled after Lisp primitives:
    • map (apply function to all items in a
      collection) and reduce (apply function to set of
      items with a common key)
  • We start with:
    • A user-defined function to be applied to all
      data, map: (key, value) → (key, value)
    • Another user-specified operation, reduce: (key,
      set of values) → result
    • A set of n nodes, each with data
  • All nodes run map on all of their data, producing
    new data with keys
  • This data is collected by key, then reduced
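
The map, collect-by-key, and reduce steps on this slide can be sketched in a few lines of Python. This is only an in-memory illustration, not how MapReduce or Hadoop is implemented: the name `map_reduce`, the sample partitions, and the word-count functions are assumptions made for the example, and each inner list stands in for the data held by one node.

```python
from collections import defaultdict


def map_reduce(partitions, map_fn, reduce_fn):
    """Apply map_fn to every record on every 'node', group by key, then reduce."""
    groups = defaultdict(list)
    # Map phase: every partition (stand-in for a node) maps its own records.
    for partition in partitions:
        for key, value in partition:
            for out_key, out_value in map_fn(key, value):
                # Collection step: values are gathered under their output key.
                groups[out_key].append(out_value)
    # Reduce phase: one call per distinct key, over the set of its values.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(doc_id, text):
    # map: (key, value) -> (key, value); emit each word with a count of 1.
    for word in text.split():
        yield word, 1


def word_count_reduce(word, counts):
    # reduce: (key, set of values) -> result; sum the counts.
    return sum(counts)


# Hypothetical input: two "nodes", each holding (document id, text) pairs.
partitions = [
    [("d1", "the cat sat")],
    [("d2", "the dog sat"), ("d3", "the end")],
]
counts = map_reduce(partitions, word_count_map, word_count_reduce)
print(counts)
```

In a real deployment the grouping step is a network redistribution between machines rather than a single dictionary, but the contract between the two user functions is the same.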

4
Some Example Tasks
  • Count word occurrences
    • Map: output word with count 1
    • Reduce: sum the counts
  • Distributed grep: all lines matching a pattern
    • Map: filter by pattern
    • Reduce: output set
  • Count URL access frequency
    • Map: output each URL as key, with count 1
    • Reduce: sum the counts
  • For each IP address, get the document with the
    most in-links
  • Number of queries by IP address (requires
    multiple steps)
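
Two of the tasks above, distributed grep and URL access frequency, might look as follows in Python. This is a single-process sketch under assumptions: the `map_reduce` helper, the log format (`ip method url`), and the sample lines are invented for illustration and are not taken from the slides.

```python
import re
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Tiny single-process stand-in for a MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical access log, one request per line: "<ip> <method> <url>".
log_lines = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 GET /index.html",
    "10.0.0.3 POST /login",
]

# Distributed grep: map filters by pattern, reduce outputs the set of matches.
pattern = re.compile(r"GET")


def grep_map(line):
    if pattern.search(line):
        yield "match", line


def grep_reduce(_key, lines):
    return sorted(set(lines))


matches = map_reduce(log_lines, grep_map, grep_reduce).get("match", [])


# URL access frequency: map emits (url, 1), reduce sums the counts.
def url_map(line):
    yield line.split()[2], 1


def url_reduce(_url, ones):
    return sum(ones)


url_counts = map_reduce(log_lines, url_map, url_reduce)
print(matches)
print(url_counts)
```

Both jobs reuse the same skeleton and differ only in the two user functions, which is the point the slide makes about map and reduce as primitives.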

5
MapReduce Dataflow Diagram (Default MapReduce
Uses Filesystem)
[Diagram: a coordinator oversees data partitions (by key), map computation partitions, redistribution by output's key, and reduce computation partitions.]
6
MapReduce Is Too Low-Level
  • It represents a single, two-level aggregation
    computation
  • It requires all of the logic to be encoded into
    two external functions, map and reduce, even when
    the operations are generic, such as selection
  • Can we do something compositional?

7
A First Take: Sawzall
  • Single Map-Reduce operation
  • Based on aggregators that take tables, produce
    tables
  • count: table sum of int;
  • total: table sum of float;
  • sum_of_squares: table sum of float;
  • x: float = input;
  • emit count <- 1;
  • emit total <- x;
  • emit sum_of_squares <- x * x;
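
To make the aggregator idea concrete, here is a rough Python analogue of the Sawzall fragment above. It is a sketch under assumptions, not Sawzall's implementation: `SumTable`, `process_record`, and `run` are hypothetical names, and the input partitions are made up. Each partition fills its own tables, and the partial tables are then merged, which works because summation is associative and commutative.

```python
class SumTable:
    """Hypothetical stand-in for a Sawzall 'table sum of T' aggregator."""

    def __init__(self):
        self.value = 0

    def emit(self, x):
        self.value += x

    def merge(self, other):
        # Partial sums from different machines can be combined in any order.
        self.value += other.value


TABLE_NAMES = ("count", "total", "sum_of_squares")


def process_record(x, tables):
    # Mirrors the three emit statements, executed once per input record.
    tables["count"].emit(1)
    tables["total"].emit(x)
    tables["sum_of_squares"].emit(x * x)


def run(partitions):
    final = {name: SumTable() for name in TABLE_NAMES}
    for partition in partitions:
        # Each partition aggregates locally (the "map" side).
        local = {name: SumTable() for name in TABLE_NAMES}
        for x in partition:
            process_record(x, local)
        # Local tables are merged into the global tables (the "reduce" side).
        for name in TABLE_NAMES:
            final[name].merge(local[name])
    return {name: table.value for name, table in final.items()}


result = run([[1.0, 2.0], [3.0]])
mean = result["total"] / result["count"]
variance = result["sum_of_squares"] / result["count"] - mean ** 2
print(result, mean, variance)
```

The three sums are enough to derive the mean and variance afterwards, which is why the example emits exactly those quantities.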

8
Could We Map SQL to MapReduce?
  • Select
  • Project
  • Join
  • Group-by
  • Having
  • Pros and cons of this?
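
One way to explore the question is to hand-translate small queries. The sketch below, which assumes two made-up tables `orders(customer, amount)` and `customers(customer, city)` and a hypothetical `map_reduce` helper, folds select, project, group-by, and having into a single job and expresses an equi-join as a second job that tags each row with its source table.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Single-process stand-in for one MapReduce job; reduce may emit 0..n rows."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output


# Hypothetical tables.
orders = [("ann", 30), ("bob", 5), ("ann", 20), ("cat", 50), ("bob", 2)]
customers = [("ann", "Paris"), ("bob", "Oslo"), ("cat", "Lima")]


# SELECT customer, SUM(amount) FROM orders WHERE amount > 3
# GROUP BY customer HAVING SUM(amount) >= 50
def agg_map(row):
    customer, amount = row
    if amount > 3:               # Select happens in map
        yield customer, amount   # Project to the needed columns; key = group-by column


def agg_reduce(customer, amounts):
    total = sum(amounts)         # Group-by aggregate happens in reduce
    if total >= 50:              # Having filters on the aggregate
        yield customer, total


big_spenders = sorted(map_reduce(orders, agg_map, agg_reduce))


# SELECT c.customer, c.city, o.amount FROM customers c JOIN orders o USING (customer)
def join_map(tagged_row):
    tag, (customer, payload) = tagged_row
    yield customer, (tag, payload)   # key = join column, value remembers its table


def join_reduce(customer, tagged):
    cities = [p for tag, p in tagged if tag == "C"]
    amounts = [p for tag, p in tagged if tag == "O"]
    for city in cities:              # cross product within one join key
        for amount in amounts:
            yield customer, city, amount


tagged_input = [("O", row) for row in orders] + [("C", row) for row in customers]
joined = sorted(map_reduce(tagged_input, join_map, join_reduce))
print(big_spenders)
print(joined)
```

The translation works, but a query that both joins and aggregates needs two chained jobs, which hints at the cons side of the question.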

9
Pig Latin and Pig
  • Pig Latin: a compositional, collections-oriented
    dataflow language
    • Oriented towards parallel data processing and
      analysis
    • Think of it as a series of query operators,
      without the declarative language aspects
    • Emphasizes user-defined functions, esp. those
      that have nice algebraic properties
    • Supports non-first-normal (nested) data
    • Supports external data from files
  • Pig: the runtime system

10
A Simple Example: Face Detection
  • Each expression creates a named collection:
    • load collections from files
    • process them (e.g., per tuple) using a UDF
    • store the results into files
  • I = load '/mydata/images' using ImageParser() as
    (id, image);
  • F = foreach I generate id, detectFaces(image);
  • store F into '/mydata/faces';

11
Another Example: Sessions Ending in Best Page
According to PageRank
  • Suppose we have two tables we'd like to join,
    then compare the final rank in the sequence vs.
    other ranks

Pages
URL               PageRank
www.cnn.com       0.9
www.flickr.com    0.9
www.social.com    0.7
www.digg.com      0.2
. . .

Visits
User    URL                      Time
Alice   www.cnn.com              700
Alice   www.digg.com             720
Alice   www.social.com           1000
Alice   www.flickr.com           1005
Joe     www.cnn.com/index.htm    1200
. . .
12
Parallel Evaluation
[Diagram: visit lists (filesystem) and rank lists (filesystem) feed parallel joins, followed by a parallel group-by session / choose best step.]
13
The Computation in Pig Latin
  • Visits = load '/data/visits' as (user, url,
    time);
  • Visits = foreach Visits generate user,
    Canonicalize(url), time;
  • Pages = load '/data/pages' as (url, pagerank);
  • VP = join Visits by url, Pages by url;
  • UserVisits = group VP by user;
  • Sessions = foreach UserVisits generate
    flatten(FindSessions());
  • HappyEndings = filter Sessions by BestIsLast();
  • store HappyEndings into '/data/happy_endings';
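
For readers who want to trace the dataflow by hand, the same steps can be mimicked in plain Python over the sample Pages and Visits rows from the earlier slide. The slides do not define the UDFs, so `canonicalize`, `find_sessions`, `best_is_last`, and the `SESSION_GAP` threshold below are hypothetical stand-ins chosen for illustration; the time values are treated as plain sortable numbers.

```python
from collections import defaultdict

# Sample rows from the Visits and Pages tables.
visits = [
    ("Alice", "www.cnn.com", 700),
    ("Alice", "www.digg.com", 720),
    ("Alice", "www.social.com", 1000),
    ("Alice", "www.flickr.com", 1005),
    ("Joe", "www.cnn.com/index.htm", 1200),
]
pages = [
    ("www.cnn.com", 0.9),
    ("www.flickr.com", 0.9),
    ("www.social.com", 0.7),
    ("www.digg.com", 0.2),
]

SESSION_GAP = 100  # assumption: a larger gap between visits starts a new session


def canonicalize(url):
    # Hypothetical UDF: drop a trailing "/index.htm".
    suffix = "/index.htm"
    return url[: -len(suffix)] if url.endswith(suffix) else url


def find_sessions(rows):
    # Hypothetical UDF: split one user's time-ordered visits at large gaps.
    sessions, current = [], []
    for row in sorted(rows, key=lambda r: r[2]):
        if current and row[2] - current[-1][2] > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(row)
    if current:
        sessions.append(current)
    return sessions


def best_is_last(session):
    # Hypothetical UDF: the final page has the highest PageRank in the session.
    return session[-1][3] >= max(row[3] for row in session)


# Visits = foreach Visits generate user, Canonicalize(url), time
visits = [(user, canonicalize(url), time) for user, url, time in visits]
# VP = join Visits by url, Pages by url (hash join; assumes one row per URL in Pages)
rank_by_url = dict(pages)
vp = [(u, url, t, rank_by_url[url]) for u, url, t in visits if url in rank_by_url]
# UserVisits = group VP by user
user_visits = defaultdict(list)
for row in vp:
    user_visits[row[0]].append(row)
# Sessions = foreach UserVisits generate flatten(FindSessions())
sessions = [s for rows in user_visits.values() for s in find_sessions(rows)]
# HappyEndings = filter Sessions by BestIsLast()
happy_endings = [s for s in sessions if best_is_last(s)]
print(happy_endings)
```

Under these assumptions Alice's second session (social, then flickr) and Joe's single-page session qualify, while Alice's first session ends on the lower-ranked digg page and is filtered out.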

14
Pig Latin Features
  • Record-oriented transformations
  • Can work over nested collections
  • Basic operators expose parallelism; user-defined
    operators may not
  • Operations are explicit, not declarative
  • Unary operators:
    • FILTER
    • FOREACH GENERATE
    • GROUP
  • Binary operators:
    • JOIN
    • COGROUP
    • UNION

15
Pig Latin vs. Map-Reduce
  • Map-reduce combines 3 primitives:
    • process records → create groups → process groups
  • In Pig, these primitives are:
    • explicit
    • independent
    • fully composable
  • Pig adds primitives for:
    • filtering tables
    • projecting tables
    • combining 2 or more tables

optimization opportunities
16
Pig System
[Diagram labels: user, Pig Latin program, cross-job optimizer]
17
Key Issue: Redundant Work
  • Popular tables:
    • web crawl
    • search log
  • Popular transformations:
    • eliminate spam pages
    • group pages by host
    • join web crawl with search log
  • Goal: minimize redundant work

18
Work-Sharing Techniques
[Diagram; the only recoverable label is "Join A B"]
19
Executing Similar Jobs Together
[Diagram labels: jobs, queue (job groups), execution engine]
  • Optimal queue ordering policy?
  • New sharable jobs arrive with frequency λ1, λ2
  • Which schedule is best:
    • If λ1 >> λ2
    • If λ1 ≈ λ2

20
Caching Data Transformations
  • Options:
    • Cache Op2 output
    • Cache Op3 output
    • Cache both
  • Considerations:
    • Space
    • Utility
    • Cost to generate
  • → Difficult to estimate a priori
  • → Can materialize fragments, and learn
21
Caching Data Moves
[Diagram label: Join A B]
22
Pig Summarized
  • Somewhere between a programming language and a
    DBMS
  • Allows distributed programming with explicit
    parallel dataflow operators
  • Runtime system does caching and batching

23
Another Option: EC2
  • Amazon Elastic Compute Cloud: part of Amazon's
    range of services
  • User gets many Linux VMs
  • VMs get temporary IP addresses for communication
  • Optionally have access to disk storage, or to
    Simple Storage Service, S3 (key-value pairs)
  • How does this compare to Pig?