Title: Cloud Computing -- Hadoop -- MapReduce
1 Cloud Computing -- Hadoop -- MapReduce
- 2009/08/05
2 What Is Large Data?
- From the point of view of the infrastructure required to do analytics, data comes in three sizes:
- Small data
- Medium data
- Large data
Source: http://blog.rgrossman.com/
3 Small Data
- Small data fits into the memory of a single machine.
- Example: the dataset for the Netflix Prize. (The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences.)
- The Netflix Prize dataset consists of over 100 million movie ratings by 480 thousand randomly chosen, anonymous Netflix customers who rated over 17 thousand movie titles.
- This dataset is just 2 GB of data and fits into the memory of a laptop.
Source: http://blog.rgrossman.com/
4 Medium Data
- Medium data fits onto a single disk or disk array and can be managed by a database.
- It is becoming common today for companies to create data warehouses of 1 to 10 TB or larger.
Source: http://blog.rgrossman.com/
5 Large Data
- Large data is so large that it is challenging to manage it in a database, and instead specialized systems are used.
- Scientific experiments, such as the Large Hadron Collider (LHC, the world's largest and highest-energy particle accelerator), produce large datasets.
- Log files produced by Google, Yahoo, Microsoft and similar companies are also examples of large datasets.
Source: http://blog.rgrossman.com/
6 Large Data Sources
- Most large datasets have been produced by the scientific and defense communities. Two things have changed:
- Large datasets are now being produced by a third community: companies that provide internet services, such as search, on-line advertising and social media.
- The ability to analyze these datasets is critical for the advertising systems that produce the bulk of the revenue for these companies.
Source: http://blog.rgrossman.com/
7 Large Data Sources
- Two things have changed (continued):
- This provides a metric by which to measure the effectiveness of analytic infrastructure and analytic models.
- Using this metric, Google settled upon analytic infrastructure that was quite different from the grid-based infrastructure generally used by the scientific community.
Source: http://blog.rgrossman.com/
8 What Is a Large Data Cloud?
- A good working definition is that a large data cloud provides:
- storage services, and
- compute services that are layered over the storage services, that scale to a data center, and that have the reliability associated with a data center.
Source: http://blog.rgrossman.com/
9 Options for Working with Large Data
- The most mature large data cloud application is the open-source Hadoop system, which consists of the Hadoop Distributed File System (HDFS) and Hadoop's implementation of MapReduce.
- An important advantage of Hadoop is that it has a very robust community supporting it, and there are a large number of Hadoop projects, including Pig, which provides simple database-like operations over data managed by HDFS.
Source: http://blog.rgrossman.com/
10 Cloud Computing and Grid Computing
- Remarks from the Academia Sinica Grid Computing team (ASGC) comparing cloud computing with grid computing.
11 Cloud Computing and Grid Computing
- Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2
12 Cloud Computing vs. Grid Computing
- Proponents: cloud computing is promoted by internet companies (e.g., Google, Yahoo, IBM, Amazon); grid computing by research organizations (e.g., CERN, the European Organization for Nuclear Research).
- Software: cloud computing builds on open-source software such as the Hadoop framework and on Google technologies such as the GFS file system and the BigTable database.
- Hardware: cloud computing runs on large numbers of low-cost commodity machines (e.g., x86 servers with about 4 GB of RAM running Linux).
Source: compiled by iThome, June 2008
13 Cloud Computing and Grid Computing
- Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2
14 Cloud Computing, Grid Computing, and Cloud Services
- Cloud Computing: a term promoted by Google for spreading computation across large clusters of commodity machines and delivering the results as services over the internet.
- Grid Computing: an earlier approach, used mainly by the scientific community, to sharing computing resources across multiple organizations and sites.
- In-the-Cloud / Cloud Service: services delivered over the internet, where users do not need to know where the underlying data and computation reside.
15 MapReduce and Hadoop
- MapReduce is a distributed data-processing technique developed by Google: the Map step splits the input into many small pieces that are processed in parallel, and the Reduce step merges the partial results into the final output.
- Hadoop is an open-source distributed computing framework written in Java that follows the designs Google published; it was not developed by Google itself. Since 2006 Yahoo! has been its main backer, after hiring Hadoop's creator Doug Cutting.
- Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2
16 Sawzall and Pig
- To make large-scale data processing easier, higher-level languages have been layered on top of these frameworks, so that engineers can filter, aggregate, and analyze data without writing low-level MapReduce code.
- Higher-level data languages:
- Google's Sawzall language and Yahoo's Pig language both provide simpler, database-like ways to express data processing.
- Google's Sawzall runs on top of MapReduce, while Yahoo's Pig runs on top of Hadoop (Hadoop being an open-source clone of MapReduce).
17 Hadoop: Why?
- Need to process 100 TB datasets with multi-day jobs
- On 1 node:
- Scanning at 50 MB/s: 23 days
- On a 1000-node cluster:
- Scanning at 50 MB/s: 33 min (worked out below)
- Need a framework for distribution
- Efficient, reliable, usable
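A quick check of those numbers (taking 1 TB as 10^6 MB): scanning 100 TB at 50 MB/s takes 100,000,000 MB / 50 MB/s = 2,000,000 seconds, roughly 23 days on a single node. Spread evenly across 1000 nodes scanning in parallel, the same scan takes about 2,000 seconds, roughly 33 minutes.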
18 Hadoop: Where?
- Batch data processing, not real-time / user-facing
- Log processing
- Document analysis and indexing
- Web graphs and crawling
- Highly parallel, data-intensive, distributed applications
- Bandwidth to data is a constraint
- Number of CPUs is a constraint
- Very large production deployments (GRID)
- Several clusters of 1000s of nodes
- LOTS of data (trillions of records, 100 TB datasets)
19 What Is Hadoop?
- The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
- The project includes:
- Core: provides the Hadoop Distributed Filesystem (HDFS) and support for the MapReduce distributed computing framework.
- MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.
- Chukwa: a data collection system for managing large distributed systems. Chukwa is built on top of HDFS and the MapReduce framework and inherits Hadoop's scalability and robustness.
20 What Is Hadoop?
- HBase: builds on Hadoop Core to provide a scalable, distributed database.
- Hive: a data warehouse infrastructure built on Hadoop Core that provides data summarization, ad hoc querying and analysis of datasets.
- Pig: a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
- ZooKeeper: a highly available and reliable coordination service. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.
21 Hadoop History
- 2004 - Initial versions of what is now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting and Mike Cafarella
- December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
- January 2006 - Doug Cutting joins Yahoo!
- February 2006 - Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
- March 2006 - Formation of the Yahoo! Hadoop team
- April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
22 Hadoop History
- May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
- May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
- October 2006 - Research cluster reaches 600 nodes
- December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
- April 2007 - Research clusters - 2 clusters of 1000 nodes
Source: http://hadoop.openfoundry.org/slides/Hadoop_OSDC_08.pdf
23 Hadoop Components
- Hadoop Distributed Filesystem (HDFS)
- is a distributed file system designed to run on commodity hardware.
- is highly fault-tolerant and is designed to be deployed on low-cost hardware.
- provides high-throughput access to application data and is suitable for applications that have large data sets.
- relaxes a few POSIX requirements to enable streaming access to file system data. (POSIX: Portable Operating System Interface for Unix)
- was originally built as infrastructure for the Apache Nutch web search engine project.
- is part of the Apache Hadoop Core project.
24 Hadoop Components
- Hadoop Distributed Filesystem (HDFS)
Source: http://hadoop.apache.org/core/
25 Hadoop Components
- HDFS Assumptions and Goals
- Hardware Failure
- Hardware failure is the norm rather than the exception.
- An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data.
- There are a huge number of components, and each component has a non-trivial probability of failure, which means that some component of HDFS is always non-functional.
- Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
26 Hadoop Components
- HDFS Assumptions and Goals
- Streaming Data Access
- Applications that run on HDFS need streaming access to their data sets.
- They are not general-purpose applications that typically run on general-purpose file systems.
- HDFS is designed more for batch processing than for interactive use by users.
- The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS.
27 Hadoop Components
- HDFS Assumptions and Goals
- Large Data Sets
- Applications that run on HDFS have large data sets.
- A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files.
- It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
28 Hadoop Components
- HDFS Assumptions and Goals
- Simple Coherency Model
- HDFS applications need a write-once-read-many access model for files.
- A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access.
- A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future.
30 Hadoop Components
- HDFS Assumptions and Goals
- "Moving Computation is Cheaper than Moving Data"
- A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge.
- This minimizes network congestion and increases the overall throughput of the system.
- It is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
31 Hadoop Components
- HDFS Assumptions and Goals
- Portability Across Heterogeneous Hardware and Software Platforms
- HDFS has been designed to be easily portable from one platform to another.
- This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
32 Hadoop Components
- HDFS NameNode and DataNodes
- HDFS has a master/slave architecture:
- An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
- In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.
- HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
33 Hadoop Components
- HDFS NameNode and DataNodes
- HDFS has a master/slave architecture:
- The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
- The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
- The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
34 Hadoop Components
- Hadoop Distributed Filesystem (HDFS)
Source: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
35 Hadoop Components
- HDFS: The File System Namespace
- HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories.
- The file system namespace hierarchy is similar to most other existing file systems:
- one can create and remove files, move a file from one directory to another, or rename a file (see the sketch after this list).
- The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode.
- An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
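These namespace operations (create, rename, remove, set the replication factor) are exposed to Java programs through Hadoop's FileSystem API. A minimal sketch, assuming the single-node configuration used later in these slides; the paths and the replication factor of 2 are arbitrary example values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceDemo {
  public static void main(String[] args) throws Exception {
    // Reads fs.default.name from core-site.xml (e.g. hdfs://localhost:9000)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Create a directory and a file inside it (example paths only)
    Path dir = new Path("/user/guest/demo");
    fs.mkdirs(dir);
    FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"));
    out.writeUTF("hello HDFS");
    out.close();

    // Rename (move) the file and change its replication factor
    Path renamed = new Path(dir, "hello-renamed.txt");
    fs.rename(new Path(dir, "hello.txt"), renamed);
    fs.setReplication(renamed, (short) 2);

    // Remove the file
    fs.delete(renamed, false);
  }
}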
36 Hadoop Components
- Hadoop Distributed Processing Framework
- Using the MapReduce metaphor
- MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters of commodity hardware.
- A simple programming model that applies to many large-scale computing problems
- Hides messy details in the MapReduce runtime library:
- Automatic parallelization
- Load balancing
- Network and disk transfer optimization
- Handling of machine failures
- Robustness
37 Hadoop Components
- A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
- The framework sorts the outputs of the maps, which are then input to the reduce tasks. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
- The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node.
- The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.
- The slaves execute the tasks as directed by the master.
38 Hadoop Components
- Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in Java.
- Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
- Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications (not based on JNI, the Java Native Interface).
39 Hadoop Components
- MapReduce concepts
- Definitions:
- Map function: takes a set of (key, value) pairs and generates a set of intermediate (key, value) pairs by applying some function to all these pairs. E.g., (k1, v1) → list(k2, v2)
- Reduce function: merges all pairs with the same key, applying a reduction function to the values. E.g., (k2, list(v2)) → list(k3, v3)
- Input and output types of a MapReduce job:
- Read a lot of data
- Map: extract something meaningful from each record
- Shuffle and sort
- Reduce: aggregate, summarize, filter, or transform
- Write the results
(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
40 Hadoop Components
41 Hadoop Components
- Consider the problem of counting the number of occurrences of each word in a large collection of documents:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits each word plus an associated count of occurrences ("1" in this example).
The reduce function sums together all the counts emitted for a particular word. A version written against Hadoop's Java API follows below.
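For comparison with the pseudocode above, here is the same word count written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce classes in release 0.20). This is a minimal sketch modeled on the WordCount example that ships with Hadoop; the class names and the command-line input/output paths are just conventions for this illustration.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: (offset, line) -> list of (word, 1)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // reduce: (word, list of counts) -> (word, total)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // the "combine" step shown above
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it would be run much like the bundled examples shown later in these slides, e.g. bin/hadoop jar wordcount.jar WordCount input output.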
42 Hadoop Components
- MapReduce Execution Overview
1. The MapReduce library in the user program first shards the input files into M pieces of typically 16-64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined map function. The intermediate key/value pairs produced by the map function are buffered in memory.
Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.
43 Hadoop Components
- MapReduce Execution Overview
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function (a sketch of such a function follows below). The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.
Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.
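In Hadoop, the "partitioning function" of step 4 corresponds to the Partitioner class. A minimal sketch of a hash partitioner, mirroring what Hadoop's default HashPartitioner does (the class name here is only illustrative):

import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each intermediate (key, value) pair to one of numPartitions
// reduce tasks, i.e. to one of the R regions described above.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numPartitions) {
    // Mask off the sign bit so the result is non-negative, then take modulo R
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}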
44 Hadoop Components
- MapReduce Execution Overview
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.
45 Hadoop Components
- MapReduce Examples
- Distributed Grep (global search for a regular expression and print the matching lines)
- The map function emits a line if it matches a given pattern.
- The reduce function is an identity function that just copies the supplied intermediate data to the output (see the sketch following this slide).
- Count of URL Access Frequency
- The map function processes logs of web page requests and outputs <URL, 1>.
- The reduce function adds together all values for the same URL and emits a <URL, total count> pair.
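A minimal sketch of the Distributed Grep example using Hadoop's Java API: the mapper keeps only lines matching a regular expression, and the reducer is an identity pass-through. The configuration key grep.pattern and the choice of writable types are assumptions made for this sketch, not taken from the slides.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits a line if it matches the pattern passed in the job configuration.
class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    // "grep.pattern" is a made-up configuration key for this sketch
    pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", ".*"));
  }

  @Override
  protected void map(LongWritable key, Text line, Context context)
      throws IOException, InterruptedException {
    if (pattern.matcher(line.toString()).find()) {
      context.write(line, NullWritable.get());
    }
  }
}

// Identity reduce: copies each matching line to the output unchanged.
class GrepReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
  @Override
  protected void reduce(Text line, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    context.write(line, NullWritable.get());
  }
}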
46 Hadoop Components
- MapReduce Examples
- Reverse Web-Link Graph
- The map function outputs <target, source> pairs for each link to a target URL found in a page named "source".
- The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>.
- Inverted Index
- The map function parses each document and emits a sequence of <word, document ID> pairs.
- The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair (see the sketch following this slide).
- The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
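A minimal sketch of the Inverted Index example in the same API. Using the input file name as the document ID is an assumption made for this sketch.

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.TreeSet;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// map: (offset, line) -> (word, document ID), using the input file name as the ID
class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      context.write(new Text(itr.nextToken()), new Text(docId));
    }
  }
}

// reduce: (word, list of document IDs) -> (word, sorted, de-duplicated ID list)
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text word, Iterable<Text> docIds, Context context)
      throws IOException, InterruptedException {
    TreeSet<String> ids = new TreeSet<String>();
    for (Text id : docIds) {
      ids.add(id.toString());
    }
    context.write(word, new Text(ids.toString()));
  }
}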
47 Hadoop Components
- MapReduce Examples
- Term-Vector per Host: a term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs.
- The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document).
- The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.
48 MapReduce Programs in Google's Source Tree
- New MapReduce Programs per Month
Source: http://www.cs.virginia.edu/pact2006/program/mapreduce-pact06-keynote.pdf
49 Who Uses Hadoop
- Amazon/A9
- Facebook
- Google
- IBM
- Joost
- Last.fm
- New York Times
- PowerSet (now Microsoft)
- Quantcast
- Veoh
- Yahoo!
- More at http://wiki.apache.org/hadoop/PoweredBy
50 Hadoop Resources
- http://hadoop.apache.org
- http://developer.yahoo.net/blogs/hadoop/
- http://code.google.com/intl/zh-TW/edu/submissions/uwspr2007_clustercourse/listing.html
- http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873
- J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, 51(1):107-113, 2008.
- T. White, Hadoop: The Definitive Guide (MapReduce for the Cloud), O'Reilly, 2009.
51 Hadoop Download
- Download site: http://ftp.twaren.net/Unix/Web/apache/hadoop/core/
- HTTP
- http://ftp.stut.edu.tw/var/ftp/pub/OpenSource/apache/hadoop/core/
- http://ftp.twaren.net/Unix/Web/apache/hadoop/core/
- http://ftp.mirror.tw/pub/apache/hadoop/core/
- http://apache.cdpa.nsysu.edu.tw/hadoop/core/
- http://ftp.tcc.edu.tw/pub/Apache/hadoop/core/
- http://apache.ntu.edu.tw/hadoop/core/
- FTP
- ftp://ftp.stut.edu.tw/pub/OpenSource/apache/hadoop/core/
- ftp://ftp.stu.edu.tw/Unix/Web/apache/hadoop/core/
- ftp://ftp.twaren.net/Unix/Web/apache/hadoop/core/
- ftp://apache.cdpa.nsysu.edu.tw/Unix/Web/apache/hadoop/core/
52 Hadoop Virtual Image (http://code.google.com/intl/zh-TW/edu/parallel/tools/hadoopvm/)
- Setting up a Hadoop cluster can be an all-day job. A virtual machine image has been created with a preconfigured single-node instance of Hadoop.
- A virtual machine encapsulates one operating system within another. (http://developer.yahoo.com/hadoop/tutorial/module3.html)
53 Hadoop Virtual Image (http://code.google.com/intl/zh-TW/edu/parallel/tools/hadoopvm/)
- While this doesn't have the power of a full cluster, it does allow you to use the resources on your local machine to explore the Hadoop platform.
- The virtual machine image is designed to be used with the free VMware Player.
- Hadoop can be run on a single node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
54 Setting Up the Image
- The image is packaged as a directory archive. To begin set-up, deflate the image in the directory of your choice (you need at least 10 GB; the disk image can grow to 20 GB).
- The VMware image package contains:
- image.vmx -- the VMware guest OS profile, a configuration file that describes the virtual machine characteristics (virtual CPU(s), amount of memory, etc.).
- 20GB.vmdk -- a VMware virtual disk used to store the content of the virtual machine hard disk; this file grows as you store data on the virtual image. It is configured to store up to 20 GB.
- The archive contains two other files, image.vmsd and nvram; these are not critical for running the image but are created by the VMware Player on startup.
- As you run the virtual machine, log files (vmware-x.log) will be created.
55 Setting Up the Image
- The system image is based on Ubuntu (version 7.04) and contains a Java machine (Sun JRE 6 - DLJ License v1.1) and the latest Hadoop distribution (0.13.0).
- A new window will appear which will print a message indicating the IP address allocated to the guest OS. This is the IP address you will use to submit jobs from the command line or the Eclipse environment.
- The guest OS contains a running Hadoop infrastructure which is configured with:
- a GFS (HDFS) infrastructure using a single data node (no replication)
- a single MapReduce worker
56 Setting Up the Image
- The guest OS can be reached from the provided console or via SSH using the IP address indicated above. Log into the guest OS with:
- guest login: guest, guest password: guest
- administrator login: root, administrator password: root
- Once the image is loaded, you can log in with the guest account. Hadoop is installed in the guest home directory (/home/guest/hadoop). Three scripts are provided for Hadoop maintenance purposes:
- start-hadoop -- starts the file-system and MapReduce daemons.
- stop-hadoop -- stops all Hadoop daemons.
- reset-hadoop -- restarts a new Hadoop environment with an entirely empty file system.
57 Hadoop 0.20 Install
- Introduction
- Required packages
- Install Hadoop
- Environment settings
- Install Java
- SSH setup
- Configure Hadoop
- Hadoop examples
58 Introduction
- Ubuntu is a Linux distribution based on Debian GNU/Linux, started by Mark Shuttleworth and developed by Canonical Ltd. Its first release, in October 2004, was Ubuntu 4.10 (Warty Warthog); by 2005 it had become one of the most widely used GNU/Linux distributions. The current release is Ubuntu 9.04.
- Hadoop requires a Java runtime, so Java must be installed and the classpath set up before installing Hadoop.
59 Ubuntu Operating System
- Minimum requirements:
- 300 MHz x86 processor
- 64 MB of memory (the LiveCD requires at least 256 MB)
- 4 GB of disk space (for a full installation)
- VGA graphics card capable of 640x480
- CD-ROM drive or network card
- Recommended requirements:
- 700 MHz x86 processor
- 384 MB of memory
- 8 GB of disk space (for a full installation)
- VGA graphics card capable of 1024x768
- Sound card
- Internet connection
60 Required Packages
- Live CD Ubuntu 9.04
- sun-java-6
- hadoop 0.20.0
61 Environment Settings
- User: cfong
- User's home directory: /home/cfong
- Project directory: /home/cfong/workspace
- Hadoop directory: /opt/hadoop
62 Install Hadoop
- The installation can be done on a physical machine or inside VMware Player, using an Ubuntu or CentOS guest; this walkthrough uses the Live CD Ubuntu 9.04 operating system.
- The following slides go through the installation and configuration step by step.
63 Update Packages
- sudo -i
- become the super user
- sudo apt-get update
- update the package lists
- sudo apt-get upgrade
- upgrade all installed packages
64 Download Packages
- Download hadoop-0.20.0.tar.gz into /opt/:
- http://apache.cdpa.nsysu.edu.tw/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz
- Download the Java SE Development Kit (JDK), JDK 6 Update 14 (jdk-6u10-docs.zip), into /tmp/:
- https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/ViewProductDetail-Start?ProductRef=jdk-6u10-docs-oth-JPR@CDS-CDS_Developer
65 Install Java
- Install the basic Java packages:
- sudo apt-get install java-common sun-java6-bin sun-java6-jdk
- Install sun-java6-doc:
- sudo apt-get install sun-java6-doc
66 SSH Passwordless Login Setup
- apt-get install ssh
- ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- ssh localhost
67 Install Hadoop
- cd /opt
- sudo tar -zxvf hadoop-0.20.0.tar.gz
- sudo chown -R cfong:cfong /opt/hadoop-0.20.0
- sudo ln -sf /opt/hadoop-0.20.0 /opt/hadoop
68 Environment Variables Setup
- nano /opt/hadoop/conf/hadoop-env.sh
- export JAVA_HOME=/usr/lib/jvm/java-6-sun
- export HADOOP_HOME=/opt/hadoop
- export PATH=$PATH:/opt/hadoop/bin
69 Environment Variables Setup
- nano /opt/hadoop/conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop/hadoop-${user.name}</value>
  </property>
</configuration>
70 Environment Variables Setup
- nano /opt/hadoop/conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
71 Environment Variables Setup
- nano /opt/hadoop/conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
73 Start Hadoop
- cd /opt/hadoop
- source /opt/hadoop/conf/hadoop-env.sh
- hadoop namenode -format
- start-all.sh
- hadoop fs -put conf input
- hadoop fs -ls
74 Hadoop Examples
- Example 1:
- cd /opt/hadoop
- bin/hadoop version
Hadoop 0.20.0
Subversion https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504
Compiled by ndaley on Thu Apr 9 05:18:40 UTC 2009
75 Hadoop Examples
- Example 2: /opt/hadoop/bin/hadoop jar hadoop-0.20.0-examples.jar pi 4 10000
Number of Maps = 4 Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
09/08/01 06:56:41 INFO mapred.FileInputFormat: Total input paths to process : 4
09/08/01 06:56:42 INFO mapred.JobClient: Running job: job_200908010505_0002
09/08/01 06:56:43 INFO mapred.JobClient: map 0% reduce 0%
09/08/01 06:56:53 INFO mapred.JobClient: map 50% reduce 0%
09/08/01 06:56:56 INFO mapred.JobClient: map 100% reduce 0%
09/08/01 06:57:05 INFO mapred.JobClient: map 100% reduce 100%
09/08/01 06:57:07 INFO mapred.JobClient: Job complete: job_200908010505_0002
09/08/01 06:57:07 INFO mapred.JobClient: Counters: 18
09/08/01 06:57:07 INFO mapred.JobClient:   Job Counters
09/08/01 06:57:07 INFO mapred.JobClient:     Launched reduce tasks=1
09/08/01 06:57:07 INFO mapred.JobClient:     Launched map tasks=4
09/08/01 06:57:07 INFO mapred.JobClient:     Data-local map tasks=4
09/08/01 06:57:07 INFO mapred.JobClient:   FileSystemCounters
09/08/01 06:57:07 INFO mapred.JobClient:     FILE_BYTES_READ=94
09/08/01 06:57:07 INFO mapred.JobClient:     HDFS_BYTES_READ=472
09/08/01 06:57:07 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=334
09/08/01 06:57:07 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
09/08/01 06:57:07 INFO mapred.JobClient:   Map-Reduce Framework
09/08/01 06:57:07 INFO mapred.JobClient:     Reduce input groups=8
09/08/01 06:57:07 INFO mapred.JobClient:     Combine output records=0
09/08/01 06:57:07 INFO mapred.JobClient:     Map input records=4
09/08/01 06:57:07 INFO mapred.JobClient:     Reduce shuffle bytes=112
09/08/01 06:57:07 INFO mapred.JobClient:     Reduce output records=0
09/08/01 06:57:07 INFO mapred.JobClient:     Spilled Records=16
09/08/01 06:57:07 INFO mapred.JobClient:     Map output bytes=72
09/08/01 06:57:07 INFO mapred.JobClient:     Map input bytes=96
09/08/01 06:57:07 INFO mapred.JobClient:     Combine input records=0
09/08/01 06:57:07 INFO mapred.JobClient:     Map output records=8
09/08/01 06:57:07 INFO mapred.JobClient:     Reduce input records=8
Job Finished in 25.84 seconds
Estimated value of Pi is 3.14140000000000000000
76 Hadoop Examples
- Example 3: /opt/hadoop/bin/start-all.sh
localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-cfong-desktop.out
localhost: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-root-secondarynamenode-cfong-desktop.out
starting jobtracker, logging to /opt/hadoop/logs/hadoop-root-jobtracker-cfong-desktop.out
localhost: starting tasktracker, logging to /opt/hadoop/logs/hadoop-root-tasktracker-cfong-desktop.out
77 Hadoop Examples
- Example 4: /opt/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
78 Hadoop Examples
- Example 5:
- cd /opt/hadoop
- jps
20911 JobTracker
20582 DataNode
27281 Jps
20792 SecondaryNameNode
21054 TaskTracker
20474 NameNode
79 Hadoop Examples
- Example 6: sudo netstat -plten | grep java
- Each Hadoop daemon appears as a listening java process (PIDs match the jps output above):
- NameNode (20474): 127.0.0.1:9000 (HDFS), :::50070 (web UI), :::50397
- DataNode (20582): :::50010, :::50020, :::50075, :::53866
- SecondaryNameNode (20792): :::50090, :::50145
- JobTracker (20911): 127.0.0.1:9001 (MapReduce), :::50030 (web UI), :::45538
- TaskTracker (21054): :::50060 (web UI), 127.0.0.1:37870