1
Big Data Analysis and Mining
  • Weixiong Rao
  • Tongji University
  • 2015 Fall
  • wxrao@tongji.edu.cn

Some of the slides are from Dr. Jure Leskovec's and Prof. Zachary G. Ives's slides.
2
DAM is here!
Product Recommendation
3
Web Search Ranking
4
Spam e-Mail Detection
5
Traditional DAM
Oracle DB
IBM DW product on very powerful servers
SAP ERP
Salesforce CRM
Flat Files from Legacy Systems
DAM tools
6
Big Data
  • Typical large enterprise:
  • 5,000-50,000 servers, terabytes of data, millions
    of transactions per day
  • In contrast, many Internet companies:
  • Millions of servers, petabytes of data
  • Google:
  • Lots and lots of Web pages
  • Billions of Google queries per day
  • Facebook:
  • A billion Facebook users
  • Billions of Facebook pages
  • Twitter:
  • Hundreds of millions of Twitter accounts
  • Hundreds of millions of Tweets per day

7
Today's DAM solutions
  • Google, Facebook, LinkedIn, eBay, Amazon...
    did not use the traditional data warehouse
    products for DAM.
  • Why? The CAP theorem
  • Different assumptions lead to different solutions
  • What?
  • Massive parallelism
  • Hadoop MapReduce paradigm
  • UC Berkeley Shark/Spark

8
What's DAM?
  • Analysis of data is a process of inspecting,
    cleaning, transforming, and modeling data with
    the goal of discovering useful information,
    suggesting conclusions, and supporting decision
    making. 
  • Data mining is a particular data analysis
    technique that focuses on modeling and knowledge
    discovery for predictive rather than purely
    descriptive purposes.

9
What's big DAM?
  • Big data is the term for a collection of data
    sets so large and complex that it becomes
    difficult to process using on-hand database
    management tools or traditional data processing
    applications.
  • The challenges include capture, curation,
    storage, search, sharing, transfer, analysis, and
    visualization.
  • Our course: how to do DAM in the Big Data context
  • Data Mining ≈ Predictive Analytics ≈ Data Science
    ≈ Business Intelligence
  • Big data mining ≈ Massive data analysis

10
Let's focus on big DAM - what matters when
dealing with data?
11
Let's focus on big DAM - cultures of data
mining
  • Data mining overlaps with:
  • Databases: large-scale data, simple queries
  • Machine learning: small data, complex models
  • CS theory: (randomized) algorithms
  • Different cultures:
  • To a DB person, data mining is an extreme form of
    analytic processing: queries that examine large
    amounts of data
  • Result is the query answer
  • To an ML person, data mining is the inference of
    models
  • Result is the parameters of the model

12
Let's focus on big data mining
  • This class overlaps with machine learning,
    statistics, artificial intelligence, and databases,
    but with more stress on:
  • Scalability (big data)
  • Algorithms
  • Computing architectures
  • Automation for handling real big data
  • The required background:
  • Data structures and algorithm design
  • Probability and linear algebra
  • Operating systems
  • Java programming

13
What will we learn?
  • We will learn to mine different types of data:
  • Data is high-dimensional
  • Data is a graph
  • Data is infinite/never-ending
  • Data is labeled
  • We will learn to use different models of
    computation:
  • MATLAB, Hadoop, Spark
  • Streams and online algorithms
  • Single machine in-memory

14
What will we learn?
  • We will learn to solve real-world problems:
  • Recommender systems
  • Market basket analysis
  • Spam detection
  • Duplicate document detection
  • We will learn various tools:
  • Optimization (stochastic gradient descent)
  • Dynamic programming (frequent itemsets)
  • Hashing (LSH, Bloom filters)

From Dr. Jure Leskovec's slides.
15
The course landscape
[Diagram: the course stack - applications on top of ML algorithms on top of data (graph data, high-dimensional data, infinite data), running on MATLAB, Hadoop, and Apache Spark.]
16
About the course
  • Teaching Assistants (TAs):
  • ?
  • Office hours:
  • Weixiong: every Tuesday, 13:00-15:00 (SSE building,
    room 422)
  • TAs: ?
  • Course website:
  • soon
  • Textbook

17
Workload for the course
  • 4 homeworks: 20%
  • 3 quizzes: 30%
  • Final exam: 25%
  • Project: 25%

Not finalized!
18
Platforms for Big Data Mining
  • Parallel DBMS technologies
  • Proposed in the late eighties
  • Matured over the last two decades
  • Multi-billion dollar industry: proprietary DBMS
    engines
  • Intended as data warehousing solutions for very
    large enterprises
  • Hadoop
  • Spark
  • UC Berkeley

19
Parallel DBMS (PDBMS) technologies
  • Popularly used for more than two decades
  • Research projects: Gamma, Grace, ...
  • Commercial: a multi-billion dollar industry, but
    access to only a privileged few
  • Relational data model
  • Indexing
  • Familiar SQL interface
  • Advanced query optimization
  • Well understood and studied
  • Very reliable!

20
MapReduce
  • Overview:
  • Data-parallel programming model
  • An associated parallel and distributed
    implementation for commodity clusters
  • Pioneered by Google
  • Processes 20 PB of data per day (circa 2008)
  • Popularized by the open-source Hadoop project
  • Used by Yahoo!, Facebook, Amazon, and the list is
    growing

21
Open discussion between the PDBMS and MR communities
  • PDBMS community:
  1. MapReduce: A Major Step Backwards
  2. A Comparison of Approaches to Large-Scale Data
    Analysis
  3. MapReduce and Parallel DBMSs: Friends or Foes?
  • MR community:
  1. MapReduce: A Flexible Data Processing Tool

22
PDBMS vs. MR
  • Schema support: not out of the box in MR
  • Indexing
  • Programming model: declarative (SQL) in PDBMS;
    imperative (C/C++, Java, ...) in MR, with extensions
    through Pig and Hive
  • Query optimization
  • Flexibility
  • Fault tolerance: coarse-grained techniques
23
Single Node Architecture
24
Motivation: Google example
  • 20 billion web pages × 20 KB = 400 TB
  • 1 computer reads 30-35 MB/sec from disk
  • About 4 months just to read the web (see the rough
    check below)
  • Takes even more to do something useful with the
    data!
  • Recently a standard architecture for such problems
    emerged:
  • Cluster of commodity Linux nodes
  • Commodity network (Ethernet) to connect them
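
A rough back-of-the-envelope check of the numbers on this slide, assuming a sustained read rate of about 30 MB/s:

$$20 \times 10^{9}\ \text{pages} \times 20\,\text{KB/page} = 4 \times 10^{14}\,\text{B} \approx 400\,\text{TB}$$
$$\frac{4 \times 10^{14}\,\text{B}}{30 \times 10^{6}\,\text{B/s}} \approx 1.3 \times 10^{7}\,\text{s} \approx 150\ \text{days} \approx 4\text{-}5\ \text{months}$$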

25
Cluster Architecture
26
Google server room in Council Bluffs, Iowa
Data centers consume up to 1.5 percent of all the
world's electricity. The huge fans sound like jet
engines jacked through Marshall amps.
27
A central cooling plant in Google's Douglas
County, Georgia, data center
http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/
28
Large-scale Computing
  • Large-scale computing for data mining problems on
    commodity hardware
  • Challenges:
  • How do you distribute computation?
  • How can we make it easy to write distributed
    programs?
  • Machines fail (fault tolerance):
  • One server may stay up 3 years (1,000 days)
  • If you have 1,000 servers, expect to lose 1/day
  • With 1M machines, 1,000 machines fail every day!

29
Basic Idea
  • Issue: copying data over a network takes time
  • Idea:
  • Bring computation to the data
  • Store files multiple times for reliability
  • MapReduce addresses these problems
  • Storage infrastructure: file system
  • Google: GFS
  • Hadoop: HDFS
  • Programming model:
  • MapReduce

30
Storage Infrastructure
  • Problem:
  • If nodes fail, how to store data persistently?
  • Answer:
  • Distributed file system:
  • Provides a global file namespace
  • Typical usage pattern (the key assumption):
  • Huge files (100s of GB to TB)
  • Data is rarely updated in place
  • Reads and appends are common
31
Distributed File System
  • Chunk servers:
  • File is split into contiguous chunks
  • Typically each chunk is 16-64 MB
  • Each chunk is replicated (usually 2x or 3x)
  • Try to keep replicas in different racks
  • Master node:
  • a.k.a. Name Node in Hadoop's HDFS
  • Stores metadata about where files are stored
  • Might be replicated
  • Client library for file access:
  • Talks to the master to find chunk servers
  • Connects directly to chunk servers to access data

32
Distributed File System
  • Reliable distributed file system
  • Data kept in chunks spread across machines
  • Each chunk replicated on different machines
  • Seamless recovery from disk or machine failure

33
Basic Idea
  • Issue: copying data over a network takes time
  • Idea:
  • Bring computation to the data
  • Store files multiple times for reliability
  • MapReduce addresses these problems
  • Storage infrastructure: file system
  • Google: GFS
  • Hadoop: HDFS
  • Programming model:
  • MapReduce

34
What is HDFS (Hadoop Distributed File System)?
  • HDFS is a distributed file system
  • Makes some unique tradeoffs that are good for
    MapReduce
  • What HDFS does well:
  • Very large read-only or append-only files
    (individual files may contain gigabytes/terabytes
    of data)
  • Sequential access patterns
  • What HDFS does not do well:
  • Storing lots of small files
  • Low-latency access
  • Multiple writers
  • Writing to arbitrary offsets in the file

35
HDFS versus NFS
Network File System (NFS):
  • A single machine makes part of its file system
    available to other machines
  • Sequential or random access
  • PRO: simplicity, generality, transparency
  • CON: storage capacity and throughput limited by a
    single server
Hadoop Distributed File System (HDFS):
  • A single virtual file system spread over many
    machines
  • Optimized for sequential reads and local accesses
  • PRO: high throughput, high capacity
  • CON: specialized for particular types of
    applications
36
How data is stored in HDFS
[Diagram: the name node maps foo.txt to blocks 3, 9, 6 and bar.data to blocks 2, 4. A client asks the name node for block 9 of foo.txt, learns which data nodes hold it, and reads the block directly from one of those data nodes; blocks are replicated across the data nodes.]
  • Files are stored as sets of (large) blocks
  • Default block size: 64 MB (the ext4 default is 4 KB!);
    a quick worked example follows below
  • Blocks are replicated for durability and
    availability
  • What are the advantages of this design?
  • The namespace is managed by a single name node
  • Actual data transfer is directly between the client
    and the data node
  • Pros and cons of this decision?
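
A quick worked example of what these defaults mean; the 1 GB file size and 3x replication factor are illustrative assumptions, not from the slide:

$$\lceil 1\,\text{GB} / 64\,\text{MB} \rceil = 16\ \text{blocks}, \qquad 16\ \text{blocks} \times 3\ \text{replicas} = 48\ \text{stored block copies}$$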

37
The Namenode
[Diagram: the name node keeps fsimage (e.g., foo.txt: blocks 3, 9, 6; bar.data: blocks 2, 4; blah.txt: blocks 17, 18, 19, 20; xyz.img: blocks 8, 5, 1, 11) and edits (e.g., "created abc.txt", "appended block 21 to blah.txt", "deleted foo.txt", "appended block 22 to blah.txt", "appended block 23 to xyz.img", ...).]
  • State is stored in two files: fsimage and edits
  • fsimage: a snapshot of the file system metadata
  • edits: changes since the last snapshot
  • Normal operation:
  • When the namenode starts, it reads fsimage and then
    applies all the changes from edits sequentially
  • Pros and cons of this design?

38
The Secondary Namenode
  • What if the state of the namenode is lost?
  • Data in the file system can no longer be read!
  • Solution 1: metadata backups
  • The namenode can write its metadata to a local disk
    and/or to a remote NFS mount
  • Solution 2: Secondary Namenode
  • Purpose: periodically merge the edit log with the
    fsimage to prevent the log from growing too large
  • Has a copy of the metadata, which can be used to
    reconstruct the state of the namenode
  • But: its state lags behind somewhat, so some data
    loss is likely if the namenode fails

39
Accessing data in HDFS
ahae@carbon:~$ ls -la /tmp/hadoop-ahae/dfs/data/current/
total 209588
drwxrwxr-x 2 ahae ahae     4096 2013-10-08 15:46 .
drwxrwxr-x 5 ahae ahae     4096 2013-10-08 15:39 ..
-rw-rw-r-- 1 ahae ahae 11568995 2013-10-08 15:44 blk_-3562426239750716067
-rw-rw-r-- 1 ahae ahae    90391 2013-10-08 15:44 blk_-3562426239750716067_1020.meta
-rw-rw-r-- 1 ahae ahae        4 2013-10-08 15:40 blk_5467088600876920840
-rw-rw-r-- 1 ahae ahae       11 2013-10-08 15:40 blk_5467088600876920840_1019.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_7080460240917416109
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_7080460240917416109_1020.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_-8388309644856805769
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_-8388309644856805769_1020.meta
-rw-rw-r-- 1 ahae ahae 67108864 2013-10-08 15:44 blk_-9220415087134372383
-rw-rw-r-- 1 ahae ahae   524295 2013-10-08 15:44 blk_-9220415087134372383_1020.meta
-rw-rw-r-- 1 ahae ahae      158 2013-10-08 15:40 VERSION
ahae@carbon:~$
  • HDFS implements a separate namespace
  • Files in HDFS are not visible in the normal file
    system
  • Only the blocks and the block metadata are
    visible
  • HDFS cannot be (easily) mounted
  • Some FUSE drivers have been implemented for it

40
Accessing data in HDFS
ahae@carbon:~$ /usr/local/hadoop/bin/hadoop fs -ls /user/ahae
Found 4 items
-rw-r--r-- 1 ahae supergroup      1366 2013-10-08 15:46 /user/ahae/README.txt
-rw-r--r-- 1 ahae supergroup         0 2013-10-08 15:35 /user/ahae/input
-rw-r--r-- 1 ahae supergroup         0 2013-10-08 15:39 /user/ahae/input2
-rw-r--r-- 1 ahae supergroup 212895587 2013-10-08 15:44 /user/ahae/input3
ahae@carbon:~$
  • File access is through the hadoop command
  • Examples:
  • hadoop fs -put <file> <hdfsPath>: stores a file in
    HDFS
  • hadoop fs -ls <hdfsPath>: lists a directory
  • hadoop fs -get <hdfsPath> <file>: retrieves a
    file from HDFS
  • hadoop fs -rm <hdfsPath>: deletes a file in HDFS
  • hadoop fs -mkdir <hdfsPath>: makes a directory in
    HDFS

41
Alternatives to the command line
  • Getting data in and out of HDFS through the
    command-line interface is a bit cumbersome
  • Alternatives have been developed:
  • FUSE file system: allows HDFS to be mounted under
    Unix
  • WebDAV share: can be mounted as a filesystem on
    many OSes
  • HTTP: read access through the namenode's embedded
    web server
  • FTP: standard FTP interface
  • ...

42
Accessing HDFS directly from Java
  • Programs can read/write HDFS files directly
  • Not needed in MapReduce:
  • I/O is handled by the framework
  • Files are represented as URIs
  • Example: hdfs://localhost/user/ahae/example.txt
  • Access is via the FileSystem API (a short sketch
    follows below)
  • To get access to the file system: FileSystem.get()
  • For reading, call open() -- returns InputStream
  • For writing, call create() -- returns OutputStream
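
A minimal sketch of reading one HDFS file through the FileSystem API; the path, the buffer size, and a running local HDFS are illustrative assumptions, not from the slides:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://localhost/user/ahae/example.txt";   // assumed example path
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(java.net.URI.create(uri), conf);
    // open() returns an InputStream (an FSDataInputStream)
    try (InputStream in = fs.open(new Path(uri))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) > 0) {
        System.out.write(buf, 0, n);   // copy the file contents to stdout
      }
    }
  }
}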

43
What about permissions?
  • Since 0.16.1, Hadoop has had rudimentary support for
    POSIX-style permissions
  • rwx for user, group, 'other' -- just like in
    Unix
  • 'hadoop fs' has support for chmod, chgrp, chown
  • But the POSIX model is not a very good fit:
  • Many combinations are meaningless: files cannot
    be executed, and existing files cannot really be
    written to
  • Permissions were not really enforced:
  • Hadoop does not verify whether the user's identity is
    genuine
  • Useful more to prevent accidental data corruption
    or casual misuse of information

44
Where are things today?
  • Since v0.20.20x, Hadoop has some security:
  • Kerberos RPC (SASL/GSSAPI)
  • HTTP SPNEGO authentication for web consoles
  • HDFS file permissions actually enforced
  • Various kinds of delegation tokens
  • Network encryption
  • For more details, see
    https://issues.apache.org/jira/secure/attachment/12428537/security-design.pdf
  • Big changes are coming:
  • Project Rhino (e.g., encrypted data at rest)

45
Recap HDFS
  • HDFS: a specialized distributed file system
  • Good for large amounts of data, sequential reads
  • Bad for lots of small files, random access,
    non-append writes
  • Architecture: blocks, namenode, datanodes
  • File data is broken into large blocks (64 MB
    default)
  • Blocks are stored and replicated by datanodes
  • A single namenode manages all the metadata
  • Secondary namenode: housekeeping, (some)
    redundancy
  • Usage: special command-line interface
  • Example: hadoop fs -ls /path/in/hdfs

46
Basic Idea
  • Issue: copying data over a network takes time
  • Idea:
  • Bring computation to the data
  • Store files multiple times for reliability
  • MapReduce addresses these problems
  • Storage infrastructure: file system
  • Google: GFS
  • Hadoop: HDFS
  • Programming model:
  • MapReduce

47
Recall: HashTable
A hash function maps input keys to buckets.
48
From HashTable to Distributed Hash Table (DHT)
[Diagram: keys are hashed across Node-1, Node-2, ..., Node-n.]
A distributed hash function maps input keys to
physical nodes.
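
A minimal sketch of this placement idea, under the simplifying assumption that a key is mapped to one of n nodes by hashing modulo n (real DHTs use consistent hashing so nodes can join and leave cheaply); all names here are illustrative:

import java.util.List;

public class SimpleDht {
  private final List<String> nodes;   // e.g., ["Node-1", "Node-2", ..., "Node-n"]

  public SimpleDht(List<String> nodes) { this.nodes = nodes; }

  // Map a key to the node responsible for it.
  public String nodeFor(String key) {
    int bucket = Math.floorMod(key.hashCode(), nodes.size());
    return nodes.get(bucket);
  }
}

// Usage: new SimpleDht(List.of("Node-1", "Node-2", "Node-3")).nodeFor("apple")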
49
From DHT to MapReduce
[Diagram: Map() and Reduce() functions run across Node-1, Node-2, ..., Node-n.]
50
The MapReduce programming model
  • MapReduce is a distributed programming model
  • In many circles, considered the key building
    block for much of Google's data analysis
  • A programming language built on it:
    Sawzall, http://labs.google.com/papers/sawzall.html
  • "Sawzall has become one of the most widely used
    programming languages at Google. On one
    dedicated Workqueue cluster with 1500 Xeon CPUs,
    there were 32,580 Sawzall jobs launched, using an
    average of 220 machines each. While running those
    jobs, 18,636 failures occurred (application
    failure, network outage, system crash, etc.) that
    triggered rerunning some portion of the job. The
    jobs read a total of 3.2x10^15 bytes of data
    (2.8 PB) and wrote 9.9x10^12 bytes (9.3 TB)."
  • Other similar languages: Yahoo's Pig Latin and
    Pig; Microsoft's Dryad
  • Cloned in open source: Hadoop,
    http://hadoop.apache.org/

51
The MapReduce programming model
  • Simple distributed functional programming
    primitives
  • Modeled after Lisp primitives:
  • map (apply a function to all items in a collection)
    and
  • reduce (apply a function to a set of items with a
    common key)
  • We start with:
  • A user-defined function to be applied to all
    data: map (key, value) → (key', value')
  • Another user-specified operation: reduce (key',
    set of values) → result
  • A set of n nodes, each with data
  • All nodes run map on all of their data, producing
    new data with keys
  • This data is collected by key, then shuffled, and
    finally reduced
  • Dataflow is through temp files on GFS

52
Simple example: Word count

map(String key, String value):
  // key: document name, line no
  // value: contents of line
  for each word w in value:
    emit(w, "1")

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0
  for each v in values:
    result += ParseInt(v)
  emit(key, result)
  • Goal: given a set of documents, count how often
    each word occurs
  • Input: key-value pairs (document + lineNumber,
    text)
  • Output: key-value pairs (word, occurrences)
  • What should be the intermediate key-value pairs?
    (A Hadoop version of this word count is sketched
    below.)
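
A minimal Hadoop version of this word count, using the org.apache.hadoop.mapreduce API shown later in these slides; the class names and the whitespace tokenization are our own illustrative choices, and the driver wiring would follow the FooDriver example below:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every word on the input line.
      for (String w : value.toString().split("\\s+")) {
        if (!w.isEmpty()) { word.set(w); context.write(word, ONE); }
      }
    }
  }

  public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts for this word.
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
}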

53
Simple example: Word count
[Diagram: word count over the input lines (1, "the apple"), (2, "is an apple"), (3, "not an orange"), (4, "because the"), (5, "orange"), (6, "unlike the apple"), (7, "is orange"), (8, "not green"). Mappers (1-2), (3-4), (5-6), (7-8) emit pairs such as (apple, 1) and (the, 1); after the shuffle, each reducer is responsible for a key range - Reducer (A-G), (H-N), (O-U), (V-Z) - receives groups such as (apple, 1, 1, 1), and outputs (apple, 3), (an, 2), (because, 1), (green, 1), (is, 2), (not, 2), (orange, 3), (the, 3), (unlike, 1).]
  • Each mapper receives some of the KV-pairs as input
  • The mappers process the KV-pairs one by one
  • Each KV-pair output by the mapper is sent to the
    reducer that is responsible for it
  • The reducers sort their input by key and group it
  • The reducers process their input one group at a
    time
54
MapReduce dataflow
[Diagram: input data flows into several Mappers, which produce intermediate (key, value) pairs; "the shuffle" redistributes those pairs to Reducers, which produce the output data.]
What is meant by a 'dataflow'? What makes this so
scalable?
55
Steps of MapReduce
  • Three steps of MapReduce:
  • Sequentially read a lot of data
  • Map: extract something you care about
  • Group by key: sort and shuffle
  • Reduce: aggregate, summarize, filter or transform
  • Output the result

56
The Map Step
57
The Reduce Step
58
More Details
  • Input: a set of key-value pairs
  • Programmer specifies two methods:
  • Map(k, v) → <k', v'>*
  • Takes a key-value pair and outputs a set of
    key-value pairs
  • E.g., key is the filename, value is a single line
    in the file
  • There is one Map call for every (k, v) pair
  • Reduce(k', <v'>*) → <k', v''>*
  • All values v' with the same key k' are reduced
    together and processed in v' order
  • There is one Reduce function call per unique key
    k'

59
MapReduce: A Diagram
60
MapReduce: In Parallel
61
More details on the MapReduce data flow
[Diagram: a coordinator manages the flow (by default, MapReduce uses the file system). Data partitions (by key) feed map computation partitions; their outputs are redistributed by output key ("the shuffle") to reduce computation partitions.]
62
More examples
  • Distributed grep (all lines matching a pattern):
  • Map: filter by pattern
  • Reduce: output set
  • Count URL access frequency:
  • Map: output each URL as key, with count 1
  • Reduce: sum the counts
  • Reverse web-link graph:
  • Map: output (target, source) pairs when a link to
    target is found in source
  • Reduce: concatenate values and emit
    (target, list(source))
  • Inverted index (sketched below):
  • Map: emit (word, documentID)
  • Reduce: combine these into (word, list(documentID))
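
A minimal sketch of the inverted-index example in Hadoop form; the input format (documentID as key, document text as value, e.g. via KeyValueTextInputFormat) and all class names are illustrative assumptions, not from the slides:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    public void map(Text docId, Text contents, Context context)
        throws IOException, InterruptedException {
      // Emit (word, documentID) for every word in the document.
      for (String w : contents.toString().split("\\s+")) {
        if (!w.isEmpty()) context.write(new Text(w), docId);
      }
    }
  }

  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      // Combine into (word, list(documentID)); de-duplicate repeated IDs.
      Set<String> ids = new HashSet<>();
      for (Text id : docIds) ids.add(id.toString());   // copy: Hadoop reuses the Text object
      context.write(word, new Text(String.join(",", ids)));
    }
  }
}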

63
What do we need to write a MR program?
  • A mapper:
  • Accepts (key, value) pairs from the input
  • Produces intermediate (key, value) pairs, which
    are then shuffled
  • A reducer:
  • Accepts intermediate (key, value) pairs
  • Produces final (key, value) pairs for the output
  • A driver:
  • Specifies which inputs to use, where to put the
    outputs
  • Chooses the mapper and the reducer to use
  • Hadoop takes care of the rest!!
  • Default behaviors can be customized by the driver
64
The Mapper
Intermediate format can be freely chosen
Input format: (file offset, line)

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;

public class FooMapper extends Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value, Context context)
      throws java.io.IOException, InterruptedException {
    context.write(new Text("foo"), value);
  }
}
  • Extends the abstract 'Mapper' class
  • Input/output types are specified as type
    parameters
  • Implements a 'map' function
  • Accepts a (key, value) pair of the specified type
  • Writes output pairs by calling the 'write' method
    on the context
  • Mixing up the types will cause problems at
    runtime (!)

65
The Reducer
Intermediate format (same as the mapper output)
Output format

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;

public class FooReducer extends Reducer<Text, Text, IntWritable, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws java.io.IOException, InterruptedException {
    for (Text value : values)
      context.write(new IntWritable(4711), value);
  }
}

Note: we may get multiple values for the same key!
  • Extends the abstract 'Reducer' class
  • Must specify types again (must be compatible with
    the mapper!)
  • Implements a 'reduce' function
  • Values are passed in as an 'Iterable'
  • Caution: these are NOT normal Java classes. Do
    not store them in collections - their content can
    change between iterations!

66
The Driver
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FooDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(FooDriver.class);             // Mapper and Reducer are in the same JAR as FooDriver
    FileInputFormat.addInputPath(job, new Path("in"));       // input path
    FileOutputFormat.setOutputPath(job, new Path("out"));    // output path
    job.setMapperClass(FooMapper.class);
    job.setReducerClass(FooReducer.class);
    job.setOutputKeyClass(Text.class);              // format of the (key,value) pairs output by the reducer
    job.setOutputValueClass(Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
  • Specifies how the job is to be executed:
  • Input and output directories; mapper and reducer
    classes

67
Manual compilation
  • Goal: produce a JAR file that contains the
    classes for mapper, reducer, and driver
  • This can be submitted to the Job Tracker, or run
    directly through Hadoop
  • Step 1: put hadoop-core-1.0.3.jar into the
    classpath:
    export CLASSPATH=$CLASSPATH:/path/to/hadoop/hadoop-core-1.0.3.jar
  • Step 2: compile mapper, reducer, driver:
    javac FooMapper.java FooReducer.java FooDriver.java
  • Step 3: package into a JAR file:
    jar cvf Foo.jar *.class
  • Alternative: "Export..."/"Java JAR file" in
    Eclipse

68
Standalone mode installation
  • What is standalone mode?
  • Installation on a single node
  • No daemons running (no Task Tracker, Job Tracker)
  • Hadoop runs as an 'ordinary' Java program
  • Used for debugging
  • How to install Hadoop in standalone mode?
  • See Textbook Appendix A
  • Already done in your VM image

69
Running a job in standalone mode
  • Step 1: create and populate the input directory
  • Configured in the driver via addInputPath()
  • Put input file(s) into this directory (ok to have
    more than one)
  • The output directory must not exist yet
  • Step 2: run Hadoop
  • As simple as this: hadoop jar <jarName>
    <driverClassName>
  • Example: hadoop jar foo.jar upenn.nets212.FooDriver
  • In verbose mode, Hadoop will print statistics
    while running
  • Step 3: collect the output files

70
Recap: Writing simple jobs for Hadoop
  • Write a mapper, reducer, driver
  • Custom serialization: must use special data
    types (Writable)
  • Explicitly declare all three (key, value) types
  • Package into a JAR file
  • Must contain class files for mapper, reducer,
    driver
  • Create manually (javac/jar) or automatically
    (ant)
  • Running in standalone mode:
  • hadoop jar foo.jar FooDriver
  • Input and output directories are in the local file
    system

71
Common mistakes to avoid
  • Mapper and reducer should be stateless
  • Don't use static variables - after map and reduce
    return, they should remember nothing about the
    processed data!
  • Reason: no guarantees about which key-value
    pairs will be processed by which workers!
    (A stateless alternative to the first snippet
    below is sketched after it.)
  • Don't try to do your own I/O!
  • Don't try to read from, or write to, files in
    the file system
  • The MapReduce framework does all the I/O for
    you:
  • All the incoming data will be fed as arguments to
    map and reduce
  • Any data your functions produce should be output
    via emit

HashMap h = new HashMap();
map(key, value) {
  if (h.contains(key)) {
    h.add(key, value);
    emit(key, "X");
  }
}
Wrong!

map(key, value) {
  File foo = new File("xyz.txt");
  while (true) {
    s = foo.readLine();
    ...
  }
}
Wrong!
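
For contrast, a stateless way to get what the first (wrong) snippet appears to aim for - flagging keys that occur more than once - is to emit unconditionally in map and let the framework do the grouping. This is our own illustrative Hadoop-style sketch, not from the slides, and all names in it are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DuplicateKeyFlagger {
  public static class FlagMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // No state kept across calls: just emit (key, 1) and let the shuffle group by key.
      context.write(new Text(line.toString()), new IntWritable(1));
    }
  }

  public static class FlagReducer extends Reducer<Text, IntWritable, Text, Text> {
    public void reduce(Text key, Iterable<IntWritable> ones, Context context)
        throws IOException, InterruptedException {
      int n = 0;
      for (IntWritable one : ones) n++;
      if (n > 1) context.write(key, new Text("X"));   // key occurred more than once
    }
  }
}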
72
More common mistakes to avoid
map(key, value) {
  emit("FOO", key + " " + value);
}

reduce(key, value) {
  /* do some computation on all the values */
}

Wrong!
  • The mapper must not map too much data to the same
    key
  • In particular, don't map everything to the same
    key!!
  • Otherwise the reduce worker will be overwhelmed!
  • It's okay if some reduce workers have more work
    than others
  • Example: in WordCount, the reduce worker that
    works on the key 'and' has a lot more work than
    the reduce worker that works on 'syzygy'.

73
Designing MapReduce algorithms
  • Key decision: what should be done by map, and
    what by reduce?
  • map can do something to each individual key-value
    pair, but it can't look at other key-value pairs
  • Example: filtering out key-value pairs we don't
    need
  • map can emit more than one intermediate key-value
    pair for each incoming key-value pair
  • Example: incoming data is text, map produces
    (word, 1) for each word
  • reduce can aggregate data; it can look at
    multiple values, as long as map has mapped them
    to the same (intermediate) key
  • Example: count the number of words, add up the
    total cost, ... (see the sketch below)
  • Need to get the intermediate format right!
  • If reduce needs to look at several values
    together, map must emit them using the same key!
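
A minimal sketch of the "add up the total cost" pattern mentioned above, assuming input lines of the form category,amount; the input format and all names are illustrative assumptions, not from the slides:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TotalCost {
  public static class CostMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // map: parse one record and emit (category, amount); filtering/parsing only, no aggregation.
      String[] parts = line.toString().split(",");
      if (parts.length == 2) {
        context.write(new Text(parts[0]), new DoubleWritable(Double.parseDouble(parts[1])));
      }
    }
  }

  public static class CostReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text category, Iterable<DoubleWritable> amounts, Context context)
        throws IOException, InterruptedException {
      // reduce: aggregate all amounts that map sent to the same category key.
      double total = 0.0;
      for (DoubleWritable a : amounts) total += a.get();
      context.write(category, new DoubleWritable(total));
    }
  }
}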

74
Some additional details
  • To make this work, we need a few more parts:
  • The file system (distributed across all nodes):
  • Stores the inputs, outputs, and temporary results
  • The driver program (executes on one node):
  • Specifies where to find the inputs and the outputs
  • Specifies what mapper and reducer to use
  • Can customize the behavior of the execution
  • The runtime system (controls nodes):
  • Supervises the execution of tasks
  • Esp. the JobTracker

75
Some details
  • Fewer computation partitions than data partitions
  • All data is accessible via a distributed
    filesystem with replication
  • Worker nodes produce data in key order (makes it
    easy to merge)
  • The master is responsible for scheduling, keeping
    all nodes busy
  • The master knows how many data partitions there
    are and which have completed; atomic commits to
    disk
  • Locality: the master tries to do work on nodes that
    have replicas of the data
  • The master can deal with stragglers (slow machines)
    by re-executing their tasks somewhere else

76
What if a worker crashes?
  • We rely on the file system being shared across
    all the nodes
  • Two types of (crash) faults:
  • The node wrote its output and then crashed
  • Here, the file system is likely to have a copy of
    the complete output
  • The node crashed before finishing its output
  • The JobTracker sees that the job isn't making
    progress, and restarts the job elsewhere on the
    system
  • (Of course, we have fewer nodes to do work)
  • But what if the master crashes?

77
Other challenges
  • Locality:
  • Try to schedule a map task on a machine that
    already has the data
  • Task granularity:
  • How many map tasks? How many reduce tasks?
  • Dealing with stragglers:
  • Schedule some backup tasks
  • Saving bandwidth:
  • E.g., with combiners
  • Handling bad records:
  • "Last gasp" packet with the current sequence number

78
Scale and MapReduce
  • From a particular Google paper on a language
    built over MapReduce:
  • "Sawzall has become one of the most widely used
    programming languages at Google. On one
    dedicated Workqueue cluster with 1500 Xeon CPUs,
    there were 32,580 Sawzall jobs launched, using an
    average of 220 machines each. While running
    those jobs, 18,636 failures occurred (application
    failure, network outage, system crash, etc.) that
    triggered rerunning some portion of the job. The
    jobs read a total of 3.2x10^15 bytes of data
    (2.8 PB) and wrote 9.9x10^12 bytes (9.3 TB)."

79
MapReduce: Simplified Data Processing on Large
Clusters
Appeared in OSDI'04: Sixth Symposium on
Operating System Design and Implementation, San
Francisco, CA, December 2004.
  • The slides are from:
  • Jeff Dean and Sanjay Ghemawat, Google, Inc.

80
Motivation: Large-Scale Data Processing
  • Many tasks: process lots of data to produce other
    data
  • Want to use hundreds or thousands of CPUs
  • ... but this needs to be easy
  • MapReduce provides:
  • Automatic parallelization and distribution
  • Fault tolerance
  • I/O scheduling
  • Status and monitoring

81
Programming model
  • Input & Output: each a set of key/value pairs
  • Programmer specifies two functions:
  • map (in_key, in_value) → list(out_key,
    intermediate_value)
  • Processes an input key/value pair
  • Produces a set of intermediate pairs
  • reduce (out_key, list(intermediate_value)) →
    list(out_value)
  • Combines all intermediate values for a particular
    key
  • Produces a set of merged output values (usually
    just one)
  • Inspired by similar primitives in LISP and other
    languages

82
Example: Count word occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

83
Model is Widely Applicable
  • MapReduce Programs In Google Source Tree

Example uses
distributed grep   distributed sort   web link-graph reversal
term-vector per host web access log stats inverted index construction
document clustering machine learning statistical machine translation
... ... ...
84
Implementation Overview
  • Typical cluster:
  • 100s/1000s of 2-CPU x86 machines, 2-4 GB of
    memory
  • Limited bisection bandwidth
  • Storage is on local IDE disks
  • GFS: a distributed file system manages the data
    (SOSP'03)
  • Job scheduling system: jobs made up of tasks,
    scheduler assigns tasks to machines
  • Implementation is a C++ library linked into user
    programs

85
Execution
86
Parallel Execution
87
Task Granularity And Pipelining
  • Fine-granularity tasks: many more map tasks than
    machines
  • Minimizes time for fault recovery
  • Can pipeline shuffling with map execution
  • Better dynamic load balancing
  • Often use 200,000 map / 5,000 reduce tasks with
    2,000 machines

88
(No Transcript)
89
(No Transcript)
90
(No Transcript)
91
(No Transcript)
92
(No Transcript)
93
(No Transcript)
94
(No Transcript)
95
(No Transcript)
96
(No Transcript)
97
(No Transcript)
98
(No Transcript)
99
Fault tolerance: handled via re-execution
  • On worker failure:
  • Detect failure via periodic heartbeats
  • Re-execute completed and in-progress map tasks
  • Re-execute in-progress reduce tasks
  • Task completion committed through the master
  • Master failure:
  • Could handle, but don't yet (master failure
    unlikely)
  • Robust: lost 1600 of 1800 machines once, but
    finished fine
  • Semantics in the presence of failures: see the paper

100
Refinement: Redundant Execution
  • Slow workers significantly lengthen completion
    time:
  • Other jobs consuming resources on the machine
  • Bad disks with soft errors transfer data very
    slowly
  • Weird things: processor caches disabled (!!)
  • Solution: near the end of a phase, spawn backup
    copies of tasks
  • Whichever one finishes first "wins"
  • Effect: dramatically shortens job completion time

101
Refinement: Locality Optimization
  • Master scheduling policy:
  • Asks GFS for locations of replicas of input file
    blocks
  • Map tasks typically split into 64 MB (= the GFS
    block size)
  • Map tasks scheduled so a GFS input block replica
    is on the same machine or the same rack
  • Effect: thousands of machines read input at local
    disk speed
  • Without this, rack switches limit the read rate

102
Refinement: Skipping Bad Records
  • Map/Reduce functions sometimes fail for
    particular inputs
  • Best solution is to debug and fix, but that is not
    always possible
  • On a seg fault:
  • Send a UDP packet to the master from the signal
    handler
  • Include the sequence number of the record being
    processed
  • If the master sees two failures for the same record:
  • The next worker is told to skip the record
  • Effect: can work around bugs in third-party
    libraries

103
Other Refinements (see paper)
  • Sorting guarantees within each reduce partition
  • Compression of intermediate data
  • Combiner useful for saving network bandwidth
  • Local execution for debugging/testing
  • User-defined counters

104
Performance
  • Tests run on a cluster of 1800 machines:
  • 4 GB of memory
  • Dual-processor 2 GHz Xeons with Hyperthreading
  • Dual 160 GB IDE disks
  • Gigabit Ethernet per machine
  • Bisection bandwidth approximately 100 Gbps

Two benchmarks:
MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
105
MR_Grep
  • Locality optimization helps:
  • 1800 machines read 1 TB of data at a peak of 31
    GB/s
  • Without this, rack switches would limit it to 10
    GB/s
  • Startup overhead is significant for short jobs

106
MR_Sort
  • Backup tasks reduce job completion time
    significantly
  • The system deals well with failures

[Graphs: normal execution, no backup tasks, 200 processes killed]
107
Experience: Rewrite of the Production Indexing System
  • Rewrote Google's production indexing system using
    MapReduce
  • Set of 10, 14, 17, 21, 24 MapReduce operations
  • New code is simpler, easier to understand
  • MapReduce takes care of failures, slow machines
  • Easy to make indexing faster by adding more
    machines

108
Usage: MapReduce jobs run in August 2004
  • Number of jobs: 29,423
  • Average job completion time: 634 secs
  • Machine days used: 79,186 days
  • Input data read: 3,288 TB
  • Intermediate data produced: 758 TB
  • Output data written: 193 TB
  • Average worker machines per job: 157
  • Average worker deaths per job: 1.2
  • Average map tasks per job: 3,351
  • Average reduce tasks per job: 55
  • Unique map implementations: 395
  • Unique reduce implementations: 269
  • Unique map/reduce combinations: 426

109
Related Work
  • Programming model inspired by functional language
    primitives
  • Partitioning/shuffling similar to many
    large-scale sorting systems:
  • NOW-Sort '97
  • Re-execution for fault tolerance:
  • BAD-FS '04 and TACC '97
  • Locality optimization has parallels with Active
    Disks/Diamond work:
  • Active Disks '01, Diamond '04
  • Backup tasks similar to Eager Scheduling in the
    Charlotte system:
  • Charlotte '96
  • Dynamic load balancing solves a similar problem as
    River's distributed queues:
  • River '99

110
Conclusions
  • MapReduce has proven to be a useful abstraction
  • Greatly simplifies large-scale computations at
    Google
  • Fun to use: focus on the problem, let the library
    deal with the messy details