1
Nutch in a Nutshell (part I)
  • Presented by
  • Liew Guo Min
  • Zhao Jin

2
Outline
  • Overview
  • Nutch as a web crawler
  • Nutch as a complete web search engine
  • Special features
  • Installation/Usage (with Demo)
  • Exercises

3
Overview
  • Complete web search engine
    • Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
      + Plugins
      + MapReduce + Distributed FS (Hadoop)
  • Java-based, open source
  • Features
    • Customizable
    • Extensible (Next meeting)
    • Distributed (Next meeting)

4
Nutch as a crawler
[Diagram: the crawl cycle. Initial URLs seed the CrawlDB; a generate step reads/writes the CrawlDB and creates a Segment (the fetch list); the fetcher gets webpages/files and reads/writes the Segment; an update step feeds the results back into the CrawlDB.]
5
Nutch as a complete web search engine
[Diagram: the Segments are indexed (Lucene) into an Index, which is searched (Lucene) through the GUI (Tomcat).]
6
Special Features
  • Customizable
    • Configuration files (XML)
      • Required user parameters (see the sketch below)
        • http.agent.name
        • http.agent.description
        • http.agent.url
        • http.agent.email
      • Adjustable parameters for every component
        • E.g. for the fetcher:
          • Threads per host
          • Threads per IP
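A minimal conf/nutch-site.xml sketch that sets these required user parameters (the agent values shown are placeholders, not from the slides):

  <?xml version="1.0"?>
  <configuration>
    <!-- Identify the crawler to the sites it visits; all values below are placeholders -->
    <property>
      <name>http.agent.name</name>
      <value>MyNutchCrawler</value>
    </property>
    <property>
      <name>http.agent.description</name>
      <value>Test crawler for a course exercise</value>
    </property>
    <property>
      <name>http.agent.url</name>
      <value>http://example.com/crawler.html</value>
    </property>
    <property>
      <name>http.agent.email</name>
      <value>crawler-admin@example.com</value>
    </property>
  </configuration>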

7
Special Features
  • URL filters (text file)
    • Regular expressions to filter URLs during crawling
    • E.g.
      • To ignore files with certain suffixes:
        -\.(gif|exe|zip|ico)$
      • To accept hosts in a certain domain:
        +^http://([a-z0-9]*\.)*apache.org/
      (a combined sketch of the filter file follows below)
  • Plugin information (XML)
    • The metadata of the plugins (more details next
      week)
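Putting those two rules together, a sketch of the relevant lines in conf/crawl-urlfilter.txt (rules are applied top to bottom; the final rule rejects anything not accepted above):

  # skip URLs ending with these suffixes
  -\.(gif|exe|zip|ico)$
  # accept hosts in the apache.org domain
  +^http://([a-z0-9]*\.)*apache.org/
  # reject everything else
  -.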

8
Installation / Usage
  • Installation
    • Software needed
      • Nutch release
      • Java
      • Apache Tomcat (for GUI)
      • Cygwin (for Windows)

9
Installation / Usage
  • Usage
    • Crawling (see the command sketch below)
      • Initial URLs (text file or DMOZ file)
      • Required parameters (conf/nutch-site.xml)
      • URL filters (conf/crawl-urlfilter.txt)
    • Indexing
      • Automatic
    • Searching
      • Location of files (WAR file, index)
      • The Tomcat server
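As a sketch, the one-shot crawl on a Nutch 0.8.x-style release looks roughly like this (directory names, depth, and topN are arbitrary):

  # crawl the seed URLs in ./urls, 3 links deep, keeping at most 50 URLs per level
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50

  # searching: deploy the Nutch WAR into Tomcat's webapps/ directory and point the
  # webapp at the crawl directory (e.g. via the searcher.dir property), then restart Tomcat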

10
Demo time!
11
Exercises
  • Questions
  • What are the things that need to be done before
    starting a crawl job with Nutch?
  • What are the ways to tell Nutch what to crawl and
    what not to? What can you do if you are the owner
    of a website?
  • Starting from v0.8, Nutch won't run unless some
    minimum user parameters, such as
    http.robots.agents, are set. What do you think is
    the reason behind this?
  • What do you think are good crawling behaviors?
  • Do you think an open-source search engine like
    Nutch would make it easier for spammers to
    manipulate the search index ranking?
  • What are the advantages of using Nutch instead of
    commercial search engines?

12
Answers
  • What are the things that need to be done before
    starting a crawl job with Nutch?
  • Set the CLASSPATH to the Lucene Core
  • Set the JAVA_HOME path
  • Create a folder containing the URLs to be crawled
  • Amend the crawl-urlfilter file
  • Amend the nutch-site.xml file to include the user
    parameters (see the sketch below)
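A minimal command-line sketch of these setup steps (the paths, release version, and seed URL are placeholders):

  export JAVA_HOME=/usr/lib/jvm/java        # point to your JDK installation
  cd nutch-0.9                              # the unpacked Nutch release
  mkdir urls
  echo "http://lucene.apache.org/nutch/" > urls/seed.txt   # one seed URL per line
  # then edit conf/crawl-urlfilter.txt and conf/nutch-site.xml as listed above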

13
  • What are the ways to tell Nutch what to crawl and
    what not to?
  • URL filters
  • Crawl depth
  • Scoring function for URLs
  • What can you do if you are the owner of a
    website?
  • Web server administrators
  • Use the Robot Exclusion Protocol by adding the
    following to /robots.txt (example below)
  • HTML authors
  • Add the Robots META tag (example below)
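Illustrative examples of both mechanisms (standard robots conventions, not specific to Nutch):

  # /robots.txt: ask all robots to stay out of a directory
  User-agent: *
  Disallow: /private/

  <!-- Robots META tag placed in a page's <head> -->
  <meta name="robots" content="noindex, nofollow">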

14
  • Starting from v0.8, Nutch won't run unless some
    minimum user parameters, such as
    http.robots.agents, are set. What do you think is
    the reason behind this?
  • To ensure accountability (although tracing is
    still possible without them)
  • What do you think are good crawling behaviors?
  • Be Accountable
  • Test Locally
  • Don't hog resources
  • Stay with it
  • Share results

15
  • Do you think an open-source search engine like
    Nutch would make it easier for spammers to
    manipulate the search index ranking?
  • True, but one can always make changes in Nutch to
    minimize the effect.
  • What are the advantages of using Nutch instead of
    commercial search engines?
  • Open-source
  • Transparent
  • Able to define what is returned in searches and
    how the index ranking works

16
Exercises
  • Hands-on exercises
  • Install Nutch, crawl a few webpages using the
    crawl command and perform a search on it using
    the GUI
  • Repeat the crawling process without using the
    crawl command (see the command sketch after this
    list)
  • Modify your configuration to perform each of the
    following crawl jobs and think about when each
    would be useful.
  • To crawl only webpages and PDFs but not anything
    else
  • To crawl the files on your hard disk
  • To crawl but not to parse
  • (Challenging) Modify Nutch such that you can
    unpack the crawled files in the segments back
    into their original state
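A sketch of the step-by-step alternative to the one-shot crawl command, assuming Nutch 0.8/0.9-style tool names (check bin/nutch for the exact commands and arguments in your release):

  bin/nutch inject crawl/crawldb urls              # seed the CrawlDB with the initial URLs
  bin/nutch generate crawl/crawldb crawl/segments  # generate a fetch list as a new segment
  s=`ls -d crawl/segments/* | tail -1`             # pick the newest segment
  bin/nutch fetch $s                               # fetch (and parse) the listed pages
  bin/nutch updatedb crawl/crawldb $s              # fold the results back into the CrawlDB
  # repeat generate/fetch/updatedb for more depth, then build the link database and the index
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*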

17
Q&A
18
Next Meeting
  • Special Features
  • Extensible
  • Distributed
  • Feedback and discussion

19
References
  • http://lucene.apache.org/nutch/ -- Official
    website
  • http://wiki.apache.org/nutch/ -- Nutch wiki
    (seriously outdated; take with a grain of salt)
  • http://lucene.apache.org/nutch/release/ -- Nutch
    source code
  • http://www.nutchinstall.blogspot.com -- Installation
    guide
  • http://www.robotstxt.org/wc/robots.html -- The Web
    Robots Pages

20
Thank you!