Transcript and Presenter's Notes

Title: Nutch in a Nutshell


1
Nutch in a Nutshell
  • Presented by
  • Liew Guo Min
  • Zhao Jin

2
Outline
  • Recap
  • Special features
  • Running Nutch in a distributed environment (with
    demo)
  • Q&A
  • Discussion

3
Recap
  • Complete web search engine
  • Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
  • Plugins
  • MapReduce + Distributed FS (Hadoop)
  • Java based, open source
  • Features
  • Customizable
  • Extensible
  • Distributed

4
Nutch as a crawler
[Diagram: the crawl cycle. Initial URLs are used to update the CrawlDB; a generate step reads the CrawlDB and writes a fetch list into a Segment; the fetcher gets webpages/files and reads/writes the Segment, which is then used to update the CrawlDB.]
5
Special Features
  • Extensible (Plugin system)
  • Most of the essential functionalities of Nutch
    are implemented as plugins
  • Three layers
  • Extension points
  • What can be extended: Protocol, Parser,
    ScoringFilter, etc.
  • Extensions
  • The interfaces to be implemented for the
    extension points
  • Plugins
  • The actual implementation
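
As a concrete illustration of the three layers, the sketch below implements the URLFilter extension point. It is a hypothetical plugin (class name and filtering rule are invented here, and the interface details vary slightly between Nutch versions), not code from the presentation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    // Hypothetical extension for the org.apache.nutch.net.URLFilter extension point.
    public class MyURLFilter implements URLFilter {
      private Configuration conf;

      // Return the URL to keep it, or null to drop it from the crawl.
      public String filter(String urlString) {
        return urlString.contains("/private/") ? null : urlString;
      }

      // URLFilter is Configurable, so the plugin receives Nutch's configuration.
      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }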

6
Special Features
  • Extensible (Plugin system)
  • Anyone can write a plugin
  • Write the code
  • Prepare metadata files
  • plugin.xml: what has been extended by what (see
    the sketch below)
  • build.xml: how ant can build your source code
  • Ask Nutch to include your plugin in
    conf/nutch-site.xml
  • Tell ant to build your plugin in
    src/plugin/build.xml
  • More details @ http://wiki.apache.org/nutch/PluginCentral
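
A minimal plugin.xml might look like the sketch below. The plugin id, names, jar file and implementation class are hypothetical; the extension point shown (Nutch's URLFilter) is real.

    <?xml version="1.0" encoding="UTF-8"?>
    <plugin id="urlfilter-my" name="My URL Filter" version="1.0.0"
            provider-name="example.org">
      <runtime>
        <!-- the jar that this plugin's build.xml produces -->
        <library name="urlfilter-my.jar">
          <export name="*"/>
        </library>
      </runtime>
      <requires>
        <import plugin="nutch-extensionpoints"/>
      </requires>
      <!-- "what has been extended by what": the URLFilter extension point
           is extended by the MyURLFilter implementation -->
      <extension id="org.example.urlfilter.my" name="My URL Filter"
                 point="org.apache.nutch.net.URLFilter">
        <implementation id="MyURLFilter" class="org.example.MyURLFilter"/>
      </extension>
    </plugin>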

7
Special Features
  • Extensible (Plugin system)
  • To use a plugin
  • Make sure you have modified nutch-site.xml to
    include the plugin (see the snippet below)
  • Then, either
  • Nutch would automatically call it when needed, or
  • You can write code that loads it by its class name
    and then uses it
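
Concretely, including a plugin means adding its id to the plugin.includes property in conf/nutch-site.xml. The snippet below assumes the hypothetical urlfilter-my plugin from the previous slide; the default value of plugin.includes differs between Nutch versions, so extend the value shipped with your version rather than copying this one.

    <configuration>
      <property>
        <name>plugin.includes</name>
        <!-- the default regular expression, extended with the new plugin's id -->
        <value>protocol-http|urlfilter-(regex|my)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
      </property>
    </configuration>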

8
Special Features
  • Distributed (Hadoop)
  • Map-Reduce (see the excursion, slides 18-20)
  • A framework for distributed programming
  • Map -- Process the splits of data to get
    intermediate results and the keys to indicate
    what should be put together later
  • Reduce -- Process the intermediate results that
    share the same key and output the final result

9
Special Features
  • Distributed (Hadoop)
  • MapReduce in Nutch
  • Example 1: Parsing
  • Input: <url, content> files from fetch
  • Map(url, content) → <url, parse> by calling parser
    plugins
  • Reduce is identity
  • Example 2: Dumping a segment
  • Input: <url, CrawlDatum>, <url, ParseText>, etc.
    files from the segment
  • Map is identity
  • Reduce(url, values) → <url, ConcatenatedValue> by
    simply concatenating the text representations of
    the values
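
As a rough sketch of the second example (not Nutch's actual segment-dump code, and written against the current org.apache.hadoop.mapreduce API), a reduce step that concatenates all values sharing a URL key could look like this:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // All values that arrive under the same URL key (CrawlDatum, ParseText, ...,
    // assumed here to be already rendered as text) are concatenated into one record.
    public class ConcatValuesReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text url, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        StringBuilder dump = new StringBuilder();
        for (Text value : values) {
          dump.append(value.toString()).append('\n');
        }
        context.write(url, new Text(dump.toString()));
      }
    }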

10
Special Features
  • Distributed (Hadoop)
  • Distributed File system
  • Write-once-read-many coherence model
  • High throughput
  • Master/slave
  • Simple architecture
  • Single point of failure
  • Transparent
  • Access via Java API
  • More info @ http://lucene.apache.org/hadoop/hdfs_design.html
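
A small sketch of the "access via Java API" point, using Hadoop's FileSystem API (the path and contents are made up); in line with the write-once-read-many model, the file is written once and can then be read any number of times:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsExample {
      public static void main(String[] args) throws Exception {
        // Picks up fs.default.name from the Hadoop configuration on the classpath,
        // so the same code runs against the local FS or a DFS namenode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/nutch/example.txt");  // hypothetical path

        // Write once ...
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("stored in the distributed file system");
        out.close();

        // ... read many times.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
      }
    }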

11
Running Nutch in a distributed environment
  • MapReduce
  • In hadoop-site.xml
  • Specify job tracker host and port
  • mapred.job.tracker
  • Specify task numbers
  • mapred.map.tasks
  • mapred.reduce.tasks
  • Specify location for temporary files
  • mapred.local.dir
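
A corresponding hadoop-site.xml fragment might look as follows; the host name, port, task counts and directory are placeholders to be adapted to the actual cluster:

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>master.example.org:9001</value>   <!-- job tracker host:port -->
      </property>
      <property>
        <name>mapred.map.tasks</name>
        <value>4</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>2</value>
      </property>
      <property>
        <name>mapred.local.dir</name>
        <value>/tmp/hadoop/mapred/local</value>  <!-- temporary files -->
      </property>
    </configuration>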

12
Running Nutch in a distributed environment
  • DFS
  • In hadoop-site.xml
  • Specify namenode host, port and directory
  • fs.default.name
  • dfs.name.dir
  • Specify location for files on each datanode
  • dfs.data.dir
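
And the DFS part of hadoop-site.xml, again with placeholder host name and directories:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>master.example.org:9000</value>  <!-- namenode host:port -->
      </property>
      <property>
        <name>dfs.name.dir</name>
        <value>/data/hadoop/dfs/name</value>    <!-- namenode metadata directory -->
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/data/hadoop/dfs/data</value>    <!-- block storage on each datanode -->
      </property>
    </configuration>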

13
Demo time!
14
Q&A
15
Discussion
16
Exercises
  • Hands-on exercises
  • Install Nutch, crawl a few webpages using the
    crawl command (an example invocation follows this
    list) and perform a search on it using the GUI
  • Repeat the crawling process without using the
    crawl command
  • Modify your configuration to perform each of the
    following crawl jobs and think about when they
    would be useful:
  • To crawl only webpages and PDFs but nothing else
  • To crawl the files on your hard disk
  • To crawl but not to parse
  • (Challenging) Modify Nutch such that you can
    unpack the crawled files in the segments back
    into their original state
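
For the first exercise, the one-step crawl command of the Nutch version current at the time of this talk looked roughly like the line below, where urls/ is a directory containing seed-URL files and the depth and topN values are only illustrative:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50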

17
Reference
  • http://wiki.apache.org/nutch/PluginCentral --
    Information on Nutch plugins
  • http://lucene.apache.org/hadoop/ -- Hadoop
    homepage
  • http://wiki.apache.org/lucene-hadoop/ -- Hadoop
    Wiki
  • http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/mapred.pdf
    -- "MapReduce in Nutch"
  • http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
    -- "Scalable Computing with MapReduce"
  • http://www.mail-archive.com/nutch-commits@lucene.apache.org/msg01951.html
    -- Updated tutorial on setting up Nutch, Hadoop
    and Lucene together

18
Excursion: MapReduce
  • Problem
  • Find the number of occurrences of "cat" in a file
  • What if the file is 20 GB in size?
  • Why not do it with more computers?
  • Solution

[Diagram: the file is split into Split 1 and Split 2; PC1 counts 200 occurrences in Split 1 and PC2 counts 300 in Split 2; PC1 then adds the two partial counts to get 500.]
19
Excursion: MapReduce
  • Problem
  • Find the number of occurrences of both "cat" and
    "dog" in a very large file
  • Solution

[Diagram: the input files are split into Split 1 and Split 2. Map: PC1 emits "cat 200, dog 250" for Split 1 and PC2 emits "cat 300, dog 250" for Split 2 into intermediate files. Sort/Group: the counts are grouped by word, giving cat: 200, 300 and dog: 250, 250. Reduce: PC1 sums the cat counts to "cat 500" and PC2 sums the dog counts to "dog 500" in the output files.]
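
The cat/dog count can be written as the sketch below. It uses the current org.apache.hadoop.mapreduce API (the API available when this presentation was given, org.apache.hadoop.mapred, uses different class names), and the mapper, the sort/group step performed by the framework, and the reducer correspond to the stages in the diagram above.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordOccurrenceCount {

      // Map: for every occurrence of "cat" or "dog" in a split, emit <word, 1>.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer tokens = new StringTokenizer(value.toString());
          while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.equals("cat") || token.equals("dog")) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce: the framework has grouped the 1s by word; sum them per word.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable count : counts) {
            sum += count.get();
          }
          context.write(word, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cat/dog count");
        job.setJarByClass(WordOccurrenceCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // partial sums on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output files
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }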
20
Excursion: MapReduce
  • Generalized Framework

[Diagram: generalized framework. A Master coordinates the Workers. Map: Workers read the input files (Split 1 to Split 4) and write intermediate files of key/value pairs (k1 v1, k3 v2; k1 v3, k2 v4; k2 v5, k4 v6). Sort/Group: the pairs are grouped by key (k1: v1, v2; k2: v4, v5; k3: v2; k4: v6). Reduce: Workers process each group and write the output files (Output 1 to Output 3).]