Hadoop -- MapReduce

PowerPoint presentation, 2009/08/05
Slides: 80
Provided by: Johnw159



Hadoop -- MapReduce
  • 2009/08/05

  • What is large data?
  • From the point of view of the infrastructure
    required to do analytics, data comes in three
    sizes:
  • Small data
  • Medium data
  • Large data

Source: http://blog.rgrossman.com/
  • Small data
  • Small data fits into the memory of a single
    machine.
  • Example: a small dataset is the dataset for the
    Netflix Prize. (The Netflix Prize seeks to
    substantially improve the accuracy of predictions
    about how much someone is going to love a movie
    based on their movie preferences.)
  • The Netflix Prize dataset consists of over 100
    million movie ratings by 480 thousand
    randomly-chosen, anonymous Netflix customers who
    rated over 17 thousand movie titles.
  • This dataset is just 2 GB of data and fits into
    the memory of a laptop.

Source: http://blog.rgrossman.com/
  • Medium data
  • Medium data fits into a single disk or disk array
    and can be managed by a database.
  • It is becoming common today for companies to
    create 1 to 10 TB or larger data warehouses.

Source: http://blog.rgrossman.com/
  • Large data
  • Large data is so large that it is challenging to
    manage it in a database and instead specialized
    systems are used.
  • Scientific experiments, such as the Large Hadron
    Collider (LHC, the world's largest and
    highest-energy particle accelerator), produce
    large datasets.
  • Log files produced by Google, Yahoo, Microsoft
    and similar companies are also examples of large
    data.

Source: http://blog.rgrossman.com/
  • Large data sources
  • Most large datasets were produced by the
    scientific and defense communities.
  • Two things have changed
  • Large datasets are now being produced by a third
    community: companies that provide internet
    services, such as search, on-line advertising and
    social media.
  • The ability to analyze these datasets is critical
    for advertising systems that produce the bulk of
    the revenue for these companies.

Source: http://blog.rgrossman.com/
  • Large data sources
  • Two things have changed
  • The advertising revenue provides a metric by
    which to measure the effectiveness of analytic
    infrastructure and analytic models.
  • Using this metric, Google settled upon analytic
    infrastructure that was quite different than the
    grid-based infrastructure that is generally used
    by the scientific community.

Source: http://blog.rgrossman.com/
  • What is a large data cloud?
  • A good working definition is that a large data
    cloud provides
  • storage services and
  • compute services that are layered over the
    storage services, that scale to a data center,
    and that have the reliability associated with a
    data center.

Source: http://blog.rgrossman.com/
  • What are some of the options for working with
    large data?
  • The most mature large data cloud application is
    the open source Hadoop system, which consists of
    the Hadoop Distributed File System (HDFS) and
    Hadoop's implementation of MapReduce.
  • An important advantage of Hadoop is that it has a
    very robust community supporting it and there are
    a large number of Hadoop projects, including Pig,
    which provides simple database-like operations
    over data managed by HDFS.

Source: http://blog.rgrossman.com/
  • (Quoted remarks on cloud and grid computing from
    the Academia Sinica Grid Computing Centre (ASGC);
    the original Chinese text is garbled in this
    copy.)
  • Source: http://www.ithome.com.tw/itadm/article.p

Cloud computing vs. grid computing (comparison
table; only some cells are recoverable):
  • Proponents: cloud computing -- internet companies
    such as Google, Yahoo, IBM and Amazon; grid
    computing -- research institutions such as CERN.
  • Software: cloud computing -- the open-source
    Hadoop stack, modeled on Google's GFS and
    BigTable.
  • Hardware: cloud computing -- large numbers of
    commodity machines (e.g. x86 servers with about
    4GB of RAM running Linux); grid computing --
    specialized high-performance hardware.
  • Source: http://www.ithome.com.tw/itadm/article.p

  • Cloud Computing: the term associated with
    Google's approach of computing across large
    clusters of commodity machines.
  • Grid Computing: the earlier term for coordinated
    computing across distributed, often specialized,
    resources.
  • In-the-Cloud / Cloud Service: services delivered
    to users over the internet.

  • MapReduce: the distributed data-processing model
    that Google proposed for computing over very
    large datasets.
  • Hadoop: an open-source implementation, written in
    Java, of Google's distributed computing approach.
  • Source: http://www.ithome.com.tw/itadm/article

  • Higher-level languages have been layered on top
    of these frameworks to make common data
    operations easier to express.
  • Examples:
  • Google's Sawzall runs on top of MapReduce;
    Yahoo's Pig plays the same role on top of Hadoop.

Hadoop Why?
  • Need to process 100TB datasets
  • On 1 node:
  • Scanning at 50 MB/s: 23 days
  • On a 1000-node cluster:
  • Scanning at 50 MB/s: 33 min
  • Need framework for distribution
  • Efficient, reliable, usable
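The scan-time figures above are straightforward arithmetic; a quick sketch to check them (numbers taken from the slide, decimal units assumed):

```python
# Time to scan a 100 TB dataset at 50 MB/s per node.
dataset_mb = 100 * 1000 * 1000     # 100 TB in MB (decimal units assumed)
rate_mb_per_s = 50                 # per-node scan rate from the slide

one_node_s = dataset_mb / rate_mb_per_s      # seconds on a single node
print(one_node_s / 86400)                    # ~23 days

cluster_s = one_node_s / 1000                # 1000 nodes scanning in parallel
print(cluster_s / 60)                        # ~33 minutes
```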

Hadoop Where?
  • Batch data processing, not real-time/user facing
  • Log Processing
  • Document Analysis and Indexing
  • Web Graphs and Crawling
  • Highly parallel, data intensive, distributed
  • Bandwidth to data is a constraint
  • Number of CPUs is a constraint
  • Very large production deployments (GRID)
  • Several clusters of 1000s of nodes
  • LOTS of data (Trillions of records, 100 TB of
    data)

What is Hadoop?
  • The Apache Hadoop project develops open-source
    software for reliable, scalable, distributed
    computing.
  • The project includes
  • Core provides the Hadoop Distributed Filesystem
    (HDFS) and support for the MapReduce distributed
    computing framework.
  • MapReduce A distributed data processing model
    and execution environment that runs on large
    clusters of commodity machines.
  • Chukwa a data collection system for managing
    large distributed systems. Chukwa is built on top
    of the HDFS and MapReduce framework and inherits
    Hadoop's scalability and robustness.

What is Hadoop?
  • HBase builds on Hadoop Core to provide a
    scalable, distributed database.
  • Hive a data warehouse infrastructure built on
    Hadoop Core that provides data summarization,
    ad-hoc querying and analysis of datasets.
  • Pig a high-level data-flow language and
    execution framework for parallel computation. It
    is built on top of Hadoop Core.
  • ZooKeeper a highly available and reliable
    coordination service. Distributed applications
    use ZooKeeper to store and mediate updates for
    critical shared state.

Hadoop History
  • 2004 - Initial versions of what is now the
    Hadoop Distributed File System and Map-Reduce
    implemented by Doug Cutting and Mike Cafarella
  • December 2005 - Nutch ported to the new
    framework. Hadoop runs reliably on 20 nodes.
  • January 2006 - Doug Cutting joins Yahoo!
  • February 2006 - Apache Hadoop project officially
    started to support the standalone development of
    Map-Reduce and HDFS.
  • March 2006 - Formation of the Yahoo! Hadoop team
  • April 2006 - Sort benchmark run on 188 nodes in
    47.9 hours

Hadoop History
  • May 2006 - Yahoo sets up a Hadoop research
    cluster - 300 nodes
  • May 2006 - Sort benchmark run on 500 nodes in 42
    hours (better hardware than April benchmark)
  • October 2006 - Research cluster reaches 600 Nodes
  • December 2006 - Sort times 20 nodes in 1.8 hrs,
    100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900
    nodes in 7.8 hrs
  • April 2007 - Research clusters - 2 clusters of
    1000 nodes

Source: http://hadoop.openfoundry.org/slides/Hadoo
Hadoop Components
  • Hadoop Distributed Filesystem (HDFS)
  • is a distributed file system designed to run on
    commodity hardware.
  • is highly fault-tolerant and is designed to be
    deployed on low-cost hardware.
  • provides high throughput access to application
    data and is suitable for applications that have
    large data sets.
  • relaxes a few POSIX requirements to enable
    streaming access to file system data. (POSIX:
    Portable Operating System Interface)
  • was originally built as infrastructure for the
    Apache Nutch web search engine project.
  • is part of the Apache Hadoop Core project.

Hadoop Components
  • Hadoop Distributed Filesystem (HDFS)

Source: http://hadoop.apache.org/core/
Hadoop Components
  • HDFS Assumptions and Goals
  • Hardware failure
  • Hardware failure is the norm rather than the
    exception.
  • An HDFS instance may consist of hundreds or
    thousands of server machines, each storing part
    of the file system's data.
  • Because there are a huge number of components and
    each component has a non-trivial probability of
    failure, some component of HDFS is always
    non-functional.
  • Therefore, detection of faults and quick,
    automatic recovery from them is a core
    architectural goal of HDFS.

Hadoop Components
  • HDFS Assumptions and Goals
  • Streaming Data Access
  • Applications that run on HDFS need streaming
    access to their data sets.
  • They are not general purpose applications that
    typically run on general purpose file systems.
  • HDFS is designed more for batch processing rather
    than interactive use by users.
  • The emphasis is on high throughput of data access
    rather than low latency of data access. POSIX
    imposes many hard requirements that are not
    needed for applications that are targeted for
    HDFS.

Hadoop Components
  • HDFS Assumptions and Goals
  • Large Data Sets
  • Applications that run on HDFS have large data
    sets.
  • A typical file in HDFS is gigabytes to terabytes
    in size. Thus, HDFS is tuned to support large
    files.
  • It should provide high aggregate data bandwidth
    and scale to hundreds of nodes in a single
    cluster. It should support tens of millions of
    files in a single instance.

Hadoop Components
  • HDFS Assumptions and Goals
  • Simple Coherency Model
  • HDFS applications need a write-once-read-many
    access model for files.
  • A file once created, written, and closed need not
    be changed. This assumption simplifies data
    coherency issues and enables high throughput data
    access.
  • A Map/Reduce application or a web crawler
    application fits perfectly with this model. There
    is a plan to support appending-writes to files in
    the future.


Hadoop Components
  • HDFS Assumptions and Goals
  • "Moving Computation is Cheaper than Moving Data"
  • A computation requested by an application is much
    more efficient if it is executed near the data it
    operates on. This is especially true when the
    size of the data set is huge.
  • This minimizes network congestion and increases
    the overall throughput of the system.
  • It is often better to migrate the computation
    closer to where the data is located rather than
    moving the data to where the application is
    running. HDFS provides interfaces for
    applications to move themselves closer to where
    the data is located.

Hadoop Components
  • HDFS Assumptions and Goals
  • Portability Across Heterogeneous Hardware and
    Software Platforms
  • HDFS has been designed to be easily portable from
    one platform to another.
  • This facilitates widespread adoption of HDFS as a
    platform of choice for a large set of
    applications.

Hadoop Components
  • HDFS Namenode and Datanode
  • HDFS has a master/slave architecture
  • An HDFS cluster consists of a single NameNode, a
    master server that manages the file system
    namespace and regulates access to files by
    clients.
  • In addition, there are a number of DataNodes,
    usually one per node in the cluster, which manage
    storage attached to the nodes that they run on.
  • HDFS exposes a file system namespace and allows
    user data to be stored in files. Internally, a
    file is split into one or more blocks and these
    blocks are stored in a set of DataNodes.

Hadoop Components
  • HDFS Namenode and Datanode
  • HDFS has a master/slave architecture
  • The NameNode executes file system namespace
    operations like opening, closing, and renaming
    files and directories. It also determines the
    mapping of blocks to DataNodes.
  • The DataNodes are responsible for serving read
    and write requests from the file system's
    clients. The DataNodes also perform block
    creation, deletion, and replication upon
    instruction from the NameNode.
  • The existence of a single NameNode in a cluster
    greatly simplifies the architecture of the
    system. The NameNode is the arbitrator and
    repository for all HDFS metadata. The system is
    designed in such a way that user data never flows
    through the NameNode.
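The NameNode/DataNode split described above can be pictured as two small lookup tables: file name → block list (the namespace) and block → replica locations. A toy sketch, with invented file, block and node names:

```python
# Toy model of NameNode metadata: the NameNode stores only this mapping;
# the block *contents* live on the DataNodes and never pass through it.
namespace = {
    "/logs/part-0": ["blk_1", "blk_2"],   # file -> ordered block ids
}
block_locations = {
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],  # 3 replicas each
    "blk_2": ["datanode-b", "datanode-c", "datanode-d"],
}

def locate(path):
    # A client asks the NameNode where a file's blocks live, then
    # reads the block data directly from those DataNodes.
    return [(b, block_locations[b]) for b in namespace[path]]

print(locate("/logs/part-0")[0][0])   # prints blk_1
```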

Hadoop Components
  • Hadoop Distributed Filesystem (HDFS)

Source: http://hadoop.apache.org/common/docs/r0.20
Hadoop Components
  • HDFS The File System Namespace
  • HDFS supports a traditional hierarchical file
    organization. A user or an application can create
    directories and store files inside these
    directories.
  • The file system namespace hierarchy is similar to
    most other existing file systems.
  • one can create and remove files, move a file from
    one directory to another, or rename a file.
  • The NameNode maintains the file system namespace.
    Any change to the file system namespace or its
    properties is recorded by the NameNode.
  • An application can specify the number of replicas
    of a file that should be maintained by HDFS. The
    number of copies of a file is called the
    replication factor of that file. This information
    is stored by the NameNode.

Hadoop Components
  • Hadoop Distributed Processing Framework
  • Using MapReduce Metaphor
  • Map/Reduce is a software framework for easily
    writing applications which process vast amounts
    of data in-parallel on large clusters of
    commodity hardware.
  • A simple programming model that applies to many
    large-scale computing problems
  • Hide messy details in MapReduce runtime library
  • Automatic parallelization
  • Load balancing
  • Network and disk transfer optimization
  • Handling of machine failures
  • Robustness

Hadoop Components
  • A Map/Reduce job usually splits the input
    data-set into independent chunks which are
    processed by the map tasks in a completely
    parallel manner.
  • The framework sorts the outputs of the maps,
    which are then input to the reduce tasks. The
    framework takes care of scheduling tasks,
    monitoring them and re-executes the failed tasks.
  • The Map/Reduce framework consists of a single
    master JobTracker and one slave TaskTracker per
    cluster node.
  • The master is responsible for scheduling the
    jobs' component tasks on the slaves, monitoring
    them and re-executing the failed tasks.
  • The slaves execute the tasks as directed by the
    master.
Hadoop Components
  • Although the Hadoop framework is implemented in
    Java™, Map/Reduce applications need not be
    written in Java.
  • Hadoop Streaming is a utility which allows users
    to create and run jobs with any executables (e.g.
    shell utilities) as the mapper and/or the
    reducer.
  • Hadoop Pipes is a SWIG-compatible C++ API to
    implement Map/Reduce applications (not based on
    JNI™, the Java Native Interface).

Hadoop Components
  • MapReduce concepts
  • Definition
  • Map function: Take a set of (key, value) pairs
    and generate a set of intermediate (key, value)
    pairs by applying some function to all these
    pairs. E.g., (k1, v1) → list(k2, v2)
  • Reduce function: Merge all pairs with the same
    key, applying a reduction function on the values.
    E.g., (k2, list(v2)) → list(k3, v3)
  • Input and Output types of a Map/Reduce job
  • Read a lot of data
  • Map: extract something meaningful from each
    record
  • Shuffle and Sort
  • Reduce: aggregate, summarize, filter, or
    transform
  • Write the results

(input) <k1, v1> → map → <k2, v2> → combine →
<k2, v2> → reduce → <k3, v3> (output)
Hadoop Components
  • MapReduce concepts

Hadoop Components
  • Consider the problem of counting the number of
    occurrences of each word in a large collection of
    documents.

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits each word plus an
associated count of occurrences ("1" in this
example). The reduce function sums together all
the counts emitted for a particular word.
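The same word-count logic can be simulated in ordinary Python; this is a single-process sketch of the programming model (the `map_fn`, `reduce_fn` and shuffle step are illustrative, not Hadoop API calls):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate ("word", "1") pair for every word.
    for w in contents.split():
        yield (w, "1")

def reduce_fn(word, counts):
    # Sum all counts emitted for one word.
    return str(sum(int(c) for c in counts))

docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}

# Shuffle/sort phase: group intermediate values by key.
grouped = defaultdict(list)
for name, text in docs.items():
    for key, value in map_fn(name, text):
        grouped[key].append(value)

result = {word: reduce_fn(word, counts) for word, counts in grouped.items()}
print(result["the"])   # prints 3
```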
Hadoop Components
  • MapReduce Execution Overview

1. The MapReduce library in the user program first
shards the input files into M pieces of typically
16-64 megabytes (MB) per piece. It then starts up
many copies of the program on a cluster of
machines.
2. One of the copies of the program is special:
the master. The rest are workers that are
assigned work by the master. There are M map
tasks and R reduce tasks to assign. The master
picks idle workers and assigns each one a map
task or a reduce task.
3. A worker who is assigned a map task reads the
contents of the corresponding input split. It
parses key/value pairs out of the input data and
passes each pair to the user-defined map
function. The intermediate key/value pairs
produced by the map function are buffered in
memory.
Source: J. Dean and S. Ghemawat. MapReduce:
Simplified Data Processing on Large Clusters.
Communications of the ACM, 51(1):107-113, 2008.
Hadoop Components
  • MapReduce Execution Overview

4. Periodically, the buffered pairs are written
to local disk, partitioned into R regions by the
partitioning function. The locations of these
buffered pairs on the local disk are passed back
to the master, who is responsible for forwarding
these locations to the reduce workers.
5. When a reduce worker is notified by the master
about these locations, it uses remote procedure
calls to read the buffered data from the local
disks of the map workers. When a reduce worker
has read all intermediate data, it sorts it by
the intermediate keys so that all occurrences of
the same key are grouped together. If the amount
of intermediate data is too large to fit in
memory, an external sort is used.
Source: J. Dean and S. Ghemawat. MapReduce:
Simplified Data Processing on Large Clusters.
Communications of the ACM, 51(1):107-113, 2008.
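The partitioning function in step 4 is typically just a hash of the intermediate key modulo R, so every pair with the same key lands in the same region and therefore reaches the same reduce worker. A minimal sketch, with an invented R and sample keys:

```python
R = 4  # number of reduce tasks (illustrative)

def partition(key):
    # Default-style partitioner: hash(key) mod R.
    # Every occurrence of the same key maps to the same region,
    # so one reduce worker sees all values for that key.
    return hash(key) % R

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
regions = [partition(k) for k, _ in pairs]
# Both "apple" pairs land in the same region.
print(regions[0] == regions[2])   # prints True
```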
Hadoop Components
  • MapReduce Execution Overview

6. The reduce worker iterates over the sorted
intermediate data and for each unique
intermediate key encountered, it passes the key
and the corresponding set of intermediate values
to the user's reduce function. The output of the
reduce function is appended to a final output
file for this reduce partition.
7. When all map tasks and reduce tasks have been
completed, the master wakes up the user program.
At this point, the MapReduce call in the user
program returns back to the user code.
Source: J. Dean and S. Ghemawat. MapReduce:
Simplified Data Processing on Large Clusters.
Communications of the ACM, 51(1):107-113, 2008.
Hadoop Components
  • MapReduce Examples
  • Distributed Grep (global search regular
    expression and print out the line)
  • The map function emits a line if it matches a
    given pattern.
  • The reduce function is an identity function that
    just copies the supplied intermediate data to the
    output.
  • Count of URL Access Frequency
  • The map function processes logs of web page
    requests and outputs <URL, 1>.
  • The reduce function adds together all values for
    the same URL and emits a <URL, total count> pair.

Hadoop Components
  • MapReduce Examples
  • Reverse Web-Link Graph
  • The map function outputs <target, source> pairs
    for each link to a target URL found in a page
    named "source".
  • The reduce function concatenates the list of all
    source URLs associated with a given target URL
    and emits the pair <target, list(source)>.
  • Inverted Index
  • The map function parses each document, and emits
    a sequence of <word, document ID> pairs.
  • The reduce function accepts all pairs for a given
    word, sorts the corresponding document IDs and
    emits a <word, list(document ID)> pair.
  • The set of all output pairs forms a simple
    inverted index. It is easy to augment this
    computation to keep track of word positions.
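The inverted-index example can be sketched the same way in plain Python (single process; the document names and the shuffle step are illustrative):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit a <word, document ID> pair for every word in the document.
    for word in text.split():
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    # Emit <word, sorted list of document IDs>.
    return (word, sorted(set(doc_ids)))

docs = {"d1": "hadoop stores data", "d2": "hadoop processes data"}

# Shuffle/sort phase: group document IDs by word.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in map_fn(doc_id, text):
        grouped[word].append(d)

index = dict(reduce_fn(w, ids) for w, ids in grouped.items())
print(index["hadoop"])   # prints ['d1', 'd2']
```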

Hadoop Components
  • MapReduce Examples
  • Term-Vector per Host A term vector summarizes
    the most important words that occur in a document
    or a set of documents as a list of <word,
    frequency> pairs.
  • The map function emits a <hostname, term vector>
    pair for each input document (where the hostname
    is extracted from the URL of the document).
  • The reduce function is passed all per-document
    term vectors for a given host. It adds these term
    vectors together, throwing away infrequent terms,
    and then emits a final <hostname, term vector>
    pair.
  • MapReduce Programs in Google's Source Tree
  • New MapReduce Programs per Month

Source: http://www.cs.virginia.edu/pact2006/progr
Who Uses Hadoop
  • Amazon/A9
  • Facebook
  • Google
  • IBM
  • Joost
  • Last.fm
  • New York Times
  • PowerSet (now Microsoft)
  • Quantcast
  • Veoh
  • Yahoo!
  • More at http://wiki.apache.org/hadoop/PoweredBy

Hadoop Resource
  • http://hadoop.apache.org
  • http://developer.yahoo.net/blogs/hadoop/
  • http://code.google.com/intl/zh-TW/edu/submissions/
    uwspr2007_clustercourse/listing.html
  • http://developer.amazonwebservices.com/connect/ent
  • J. Dean and S. Ghemawat, "MapReduce Simplified
    Data Processing on Large Clusters,"
    Communications of the ACM, 51(1)107-113, 2008.
  • T. White, Hadoop The Definitive Guide (MapReduce
    for the Cloud), O'Reilly, 2009.

Hadoop Download
  • Download mirrors: http://ftp.twaren.net/Unix/Web/apache/hadoop/
  • HTTP
  • http://ftp.stut.edu.tw/var/ftp/pub/OpenSource/apac
  • http://ftp.twaren.net/Unix/Web/apache/hadoop/core/
  • http://ftp.mirror.tw/pub/apache/hadoop/core/
  • http://apache.cdpa.nsysu.edu.tw/hadoop/core/
  • http://ftp.tcc.edu.tw/pub/Apache/hadoop/core/
  • http://apache.ntu.edu.tw/hadoop/core/
  • FTP
  • ftp://ftp.stut.edu.tw/pub/OpenSource/apache/hadoop
  • ftp://ftp.stu.edu.tw/Unix/Web/apache/hadoop/core/
  • ftp://ftp.twaren.net/Unix/Web/apache/hadoop/core/
  • ftp://apache.cdpa.nsysu.edu.tw/Unix/Web/apache/had

Hadoop Virtual Image: http://code.google.com/intl/z
  • Setting up a Hadoop cluster can be an all day
    job. A virtual machine image has been created
    with a preconfigured single-node instance of
    Hadoop.

A virtual machine encapsulates one operating
system within another. (http://developer.yahoo.com
Hadoop Virtual Image: http://code.google.com/intl/z
  • While this doesn't have the power of a full
    cluster, it does allow you to use the resources
    on your local machine to explore the Hadoop
    platform.
  • The virtual machine image is designed to be used
    with the free VMware Player.
  • Hadoop can be run on a single-node in a
    pseudo-distributed mode where each Hadoop daemon
    runs in a separate Java process.

Setting Up the Image
  • The image is packaged as a directory archive. To
    begin setup, extract the image in the directory
    of your choice (you need at least 10GB; the disk
    image can grow to 20GB).
  • The VMware image package contains
  • image.vmx -- The VMware guest OS profile, a
    configuration file that describes the virtual
    machine characteristics (virtual CPU(s), amount
    of memory, etc.).
  • 20GB.vmdk -- A VMware virtual disk used to store
    the content of the virtual machine hard disk;
    this file grows as you store data on the virtual
    image. It is configured to store up to 20GB.
  • The archive contains two other files, image.vmsd
    and nvram; these are not critical for running the
    image but are created by the VMware Player on
    startup.
  • As you run the virtual machine, log files
    (vmware-x.log) will be created.

Setting Up the Image
  • The system image is based on Ubuntu (version
    7.04) and contains a Java virtual machine (Sun
    JRE 6 - DLJ License v1.1) and the latest Hadoop
    distribution (0.13.0).
  • A new window will appear which will print a
    message indicating the IP address allocated to
    the guest OS. This is the IP address you will use
    to submit jobs from the command line or the
    Eclipse environment.
  • The guest OS contains a running Hadoop
    infrastructure which is configured with
  • A GFS (HDFS) infrastructure using a single data
    node (no replication)
  • A single MapReduce worker

Setting Up the Image
  • The guest OS can be reached from the provided
    console or via SSH using the IP address indicated
    above. Log into the guest OS with
  • guest login: guest, password: guest
  • administrator login: root, password: root
  • Once the image is loaded, you can log in with the
    guest account. Hadoop will be installed in the
    guest home directory (/home/guest/hadoop). Three
    scripts are provided for Hadoop maintenance:
  • start-hadoop -- Starts the file-system and
    MapReduce daemons.
  • stop-hadoop -- Stops all Hadoop daemons.
  • reset-hadoop -- Restarts new Hadoop environment
    with entirely empty file system.

Hadoop 0.20 Install
  • Outline
  • Environment
  • Install Hadoop
  • Download packages
  • Install Java
  • Set up SSH
  • Install Hadoop
  • Hadoop examples

  • Ubuntu is a desktop-oriented GNU/Linux
    distribution based on Debian, sponsored by Mark
    Shuttleworth's company Canonical Ltd. The first
    release, in October 2004, was Ubuntu 4.10 (Warty
    Warthog); new releases follow roughly every six
    months, and it has become one of the most popular
    GNU/Linux distributions. The version used here is
    Ubuntu 9.04.
  • To install Hadoop we first need a working Java
    environment, with the relevant paths (such as the
    Java classpath) configured.

Ubuntu Operating System
  • ????
  • 300MHz ? x86 ???
  • 64MB ????? (LiveCD ??? 256MB ?????????)
  • 4GB ????? (???????????)
  • ???? 640x480 ? VGA ?????
  • ???????
  • ????
  • 700MHz ? x86 ???
  • 384MB ?????
  • 8GB ????? (???????????)
  • ???? 1024x768 ? VGA ?????
  • ???
  • ???????

  • Live CD ubuntu 9.04
  • sun-java-6
  • hadoop 0.20.0

  • User (user): cfong
  • User's home directory: /home/cfong
  • Project directory: /home/cfong/workspace
  • Hadoop directory: /opt/hadoop

Install Hadoop
  • The installation below is done inside VMware
    Player, with a guest operating system such as
    Ubuntu or CentOS; here we boot the Ubuntu 9.04
    Live CD.
  • The same steps also apply to a native
    (non-virtualized) installation of the operating
    system.

Update Package
  • sudo -i
  • switch to the superuser
  • sudo apt-get update
  • update the package lists
  • sudo apt-get upgrade
  • upgrade all installed packages

Download Package
  • Download hadoop-0.20.0.tar.gz into /opt/ from
  • http://apache.cdpa.nsysu.edu.tw/hadoop/core/ha
  • Download the Java SE Development Kit (JDK), JDK 6
    Update 14 (jdk-6u10-docs.zip), into /tmp/ from
  • https://cds.sun.com/is-bin/INTERSHOP.enfinity/W

Install Java
  • Install the Java packages:
  • sudo apt-get install java-common
    sun-java6-bin sun-java6-jdk
  • Install the documentation:
  • sudo apt-get install sun-java6-doc

  • sudo apt-get install ssh
  • ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  • cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  • ssh localhost

Install Hadoop
  • cd /opt
  • sudo tar -zxvf hadoop-0.20.0.tar.gz
  • sudo chown -R cfong:cfong /opt/hadoop-0.20.0
  • sudo ln -sf /opt/hadoop-0.20.0 /opt/hadoop

Environment Variables Setup
  • nano /opt/hadoop/conf/hadoop-env.sh
  • export JAVA_HOME=/usr/lib/jvm/java-6-sun
  • export HADOOP_HOME=/opt/hadoop
  • export PATH=$PATH:/opt/hadoop/bin

Environment Variables Setup
  • nano /opt/hadoop/conf/core-site.xml
  • <configuration>
  •   <property>
  •     <name>fs.default.name</name>
  •     <value>hdfs://localhost:9000</value>
  •   </property>
  •   <property>
  •     <name>hadoop.tmp.dir</name>
  •     <value>/tmp/hadoop/hadoop-${user.name}</value>
  •   </property>
  • </configuration>

Environment Variables Setup
  • nano /opt/hadoop/conf/hdfs-site.xml
  • <configuration>
  •   <property>
  •     <name>dfs.replication</name>
  •     <value>1</value>
  •   </property>
  • </configuration>

Environment Variables Setup
  • nano /opt/hadoop/conf/mapred-site.xml
  • <configuration>
  •   <property>
  •     <name>mapred.job.tracker</name>
  •     <value>localhost:9001</value>
  •   </property>
  • </configuration>


  • cd /opt/hadoop
  • source /opt/hadoop/conf/hadoop-env.sh
  • hadoop namenode -format
  • start-all.sh
  • hadoop fs -put conf input
  • hadoop fs -ls

Hadoop Examples
  • Example 1
  • cd /opt/hadoop
  • bin/hadoop version

Hadoop 0.20.0
Subversion https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504
Compiled by ndaley on Thu Apr 9 05:18:40 UTC 2009
Compiled by hadoopqa on Thu May 15 07:22:55 UTC 2008
Hadoop Examples
  • Example 2 /opt/hadoop/bin/hadoop jar
    hadoop-0.20.0-examples.jar pi 4 10000

Number of Maps = 4 Samples per Map = 10000
Wrote input for Map 0
Wrote input for Map 1
Wrote input for Map 2
Wrote input for Map 3
Starting Job
09/08/01 06:56:41 INFO mapred.FileInputFormat: Total input paths to process : 4
09/08/01 06:56:42 INFO mapred.JobClient: Running job: job_200908010505_0002
09/08/01 06:56:43 INFO mapred.JobClient:  map 0% reduce 0%
09/08/01 06:56:53 INFO mapred.JobClient:  map 50% reduce 0%
09/08/01 06:56:56 INFO mapred.JobClient:  map 100% reduce 0%
09/08/01 06:57:05 INFO mapred.JobClient:  map 100% reduce 100%
09/08/01 06:57:07 INFO mapred.JobClient: Job complete: job_200908010505_0002
09/08/01 06:57:07 INFO mapred.JobClient: Counters: 18
09/08/01 06:57:07 INFO mapred.JobClient:   Job Counters
09/08/01 06:57:07 INFO mapred.JobClient:     Launched reduce tasks=1
09/08/01 06:57:07 INFO mapred.JobClient:     Launched map tasks=4
09/08/01 06:57:07 INFO mapred.JobClient:     Data-local map tasks=4
09/08/01 06:57:07 INFO mapred.JobClient:   FileSystemCounters
09/08/01 06:57:07 INFO mapred.JobClient:     FILE_BYTES_READ=94
09/08/01 06:57:07 INFO mapred.JobClient:     HDFS_BYTES_READ=472
09/08/01 06:57:07 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=334
09/08/01 06:57:07 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
09/08/01 06:57:07 INFO mapred.JobClient:   Map-Reduce Framework
09/08/01 06:57:07 INFO mapred.JobClient:     Reduce input groups=8
09/08/01 06:57:07 INFO mapred.JobClient:     Combine output records=0
09/08/01 06:57:07 INFO mapred.JobClient:     Map input records=4
09/08/01 06:57:07 INFO mapred.JobClient:     Reduce shuffle bytes=112
09/08/01 06:57:07 INFO mapred.JobClient:     Reduce output records=0
09/08/01 06:57:07 INFO mapred.JobClient:     Spilled Records=16
09/08/01 06:57:07 INFO mapred.JobClient:     Map output bytes=72
09/08/01 06:57:07 INFO mapred.JobClient:     Map input bytes=96
09/08/01 06:57:07 INFO mapred.JobClient:     Combine input records=0
09/08/01 06:57:07 INFO mapred.JobClient:     Map output records=8
09/08/01 06:57:07 INFO mapred.JobClient:     Reduce input records=8
Job Finished in 25.84 seconds
Estimated value of Pi is 3.14140000000000000000
Hadoop Examples
  • Example 3 /opt/hadoop/bin/start-all.sh

localhost: starting datanode, logging to p.out
localhost: starting secondarynamenode, logging to
/opt/hadoop/logs/hadoop-root-secondarynamenode-cfong-desktop.out
starting jobtracker, logging to
/opt/hadoop/logs/hadoop-root-jobtracker-cfong-desktop.out
localhost: starting tasktracker, logging to
/opt/hadoop/logs/hadoop-ro
Hadoop Examples
  • Example 4 /opt/hadoop/bin/stop-all.sh

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Hadoop Examples
  • Example 5
  • cd /opt/hadoop
  • jps

20911 JobTracker
20582 DataNode
27281 Jps
20792 SecondaryNameNode
21054 TaskTracker
20474
Hadoop Examples
  • Example 6 sudo netstat -plten | grep java

(tcp6 listening sockets, one per Hadoop daemon;
column layout garbled in this copy. Recoverable
port/daemon pairs: 50010 DataNode (pid 20582),
50030 JobTracker (20911), 50060 TaskTracker
(21054), 50090 SecondaryNameNode (20792).)