Web Scale Crawling with Apache Nutch

Web Scale Crawling with Apache Nutch. Julien Nioche (julien@digitalpebble.com), Berlin Buzzwords, 08/06/11.

1

Web Scale Crawling with Apache Nutch

Julien Nioche, julien@digitalpebble.com
Berlin Buzzwords, 08/06/11
2
DigitalPebble Ltd

Based in Bristol (UK)
Specialised in Text Engineering
Web Crawling
Natural Language Processing
Information Retrieval
Data Mining
Strong focus on Open Source Apache ecosystem
User / Contributor / Committer
Nutch, SOLR, Lucene
Tika
GATE, UIMA
Mahout
Behemoth

3
Outline

Overview
Features
Data Structures
Use cases
What's new in Nutch 1.3
Nutch 2.0
GORA
Conclusion

4
Nutch?

Distributed framework for large-scale web crawling
but does not have to be large scale at all
or even on the web (file protocol)

5
Short history

2002/2003: Started by Doug Cutting and Mike Cafarella
2004: Sub-project of Lucene @ Apache
2005: MapReduce implementation in Nutch
2006: Hadoop sub-project of Lucene @ Apache
2006/7: Parser and MimeType in Tika
2008: Tika sub-project of Lucene @ Apache
May 2010: Top-level project (TLP) at Apache
June 2011 (?): Nutch 1.3
Q4 2011 (?): Nutch 2.0

6
In a Nutch Shell (1.3)

Step by Step

1. Inject → populates CrawlDB from seed list
2. Generate → selects URLs to fetch in segment
3. Fetch → fetches URLs from segment
4. Parse → parses content (text + metadata)
5. UpdateDB → updates CrawlDB (new URLs, new status...)
6. InvertLinks → builds Webgraph
7. SOLRIndex → sends docs to SOLR
8. SOLRDedup → removes duplicate docs based on signature

Repeat steps 2 to 8

Or use the all-in-one 'nutch crawl' command
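
The cycle above can be sketched as a toy in-memory loop. This is not Nutch code: the fake web, the status strings and every function name are assumptions made for illustration, and only steps 1 to 5 are modelled (link inversion and SOLR indexing are left out).

```python
# Toy in-memory model of the crawl cycle; NOT Nutch code.
# FAKE_WEB and all function names are hypothetical.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://b.example/x"],
    "http://b.example/x": [],
}


def inject(crawldb, seeds):
    """Step 1: add seed URLs to the CrawlDB as unfetched."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Step 2: select up to top_n unfetched URLs for a new segment."""
    return sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]


def fetch_and_parse(segment):
    """Steps 3 and 4: 'fetch' each URL and extract its outlinks."""
    return {url: FAKE_WEB.get(url, []) for url in segment}


def update_db(crawldb, parsed):
    """Step 5: mark fetched URLs and add newly discovered ones."""
    for url, outlinks in parsed.items():
        crawldb[url] = "fetched"
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")


def crawl(seeds, rounds, top_n):
    """Inject once, then repeat generate/fetch/parse/update."""
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        segment = generate(crawldb, top_n)
        if not segment:
            break  # nothing left to fetch
        update_db(crawldb, fetch_and_parse(segment))
    return crawldb
```

Each round only sees URLs discovered in earlier rounds, which is why the real steps are repeated until the crawl is deep enough.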

7
Frontier expansion

Manual discovery
Adding new URLs by hand, seeding
Automatic discovery of new resources (frontier expansion)
Not all outlinks are equally useful - control
Requires content parsing and link extraction
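
A minimal sketch of frontier expansion, assuming plain HTML input and a simple host whitelist as the "control". The class and function names are hypothetical; Nutch itself does this through its parser and URL filter plugins.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class OutlinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop the #fragment part.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.outlinks.append(absolute)


def expand_frontier(base_url, html, known, allowed_hosts):
    """Return outlinks that are new, http(s), and on an allowed host."""
    parser = OutlinkExtractor(base_url)
    parser.feed(html)
    new_urls = []
    for url in parser.outlinks:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue
        if parts.hostname not in allowed_hosts:
            continue
        if url in known or url in new_urls:
            continue
        new_urls.append(url)
    return new_urls
```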

Slide courtesy of A. Bialecki
8
Outline

Overview
Features
Data Structures
Use cases
What's new in Nutch 1.3
Nutch 2.0
GORA
Conclusion

9
An extensible framework

Plugins
Activated with parameter 'plugin.includes'
Implement one or more endpoints

Endpoints
Protocol
Parser
HtmlParseFilter
ScoringFilter (used in various places)
URLFilter (ditto)
URLNormalizer (ditto)
IndexingFilter
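
As a rough illustration of plugin activation, the sketch below treats 'plugin.includes' as a regular expression that must match a plugin id in full. The list of available ids is written in the style of Nutch plugin directory names but is an assumption here, as is the helper function.

```python
import re

# Hypothetical set of installed plugin ids.
AVAILABLE_PLUGINS = [
    "protocol-http", "protocol-file", "protocol-ftp",
    "urlfilter-regex", "urlfilter-prefix",
    "parse-tika", "parse-html",
    "index-basic", "scoring-opic",
]


def active_plugins(plugin_includes, available=AVAILABLE_PLUGINS):
    """Return the plugin ids fully matched by the plugin.includes regex."""
    pattern = re.compile(plugin_includes)
    return [pid for pid in available if pattern.fullmatch(pid)]
```

With a value such as `protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|scoring-opic`, the file and ftp protocols and the prefix filter stay inactive.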

10
Features

Fetcher
Multi-threaded fetcher
Follows robots.txt
Groups URLs per hostname / domain / IP
Limits the number of URLs per round of fetching
Default values are polite but can be made more aggressive
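
A small sketch of the grouping and limiting idea, assuming grouping by hostname only. The function and its two limits are hypothetical names, not Nutch configuration properties.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_for_fetch(urls, max_per_host, max_total):
    """Pick URLs in order, capping both the per-host and the total count."""
    per_host = defaultdict(list)
    selected = []
    for url in urls:
        if len(selected) >= max_total:
            break  # the round is full
        host = urlparse(url).hostname
        if host is None or len(per_host[host]) >= max_per_host:
            continue  # unparsable URL, or this host already has its share
        per_host[host].append(url)
        selected.append(url)
    return selected, dict(per_host)
```

Keeping the per-host cap low is what makes a round polite: no single server receives more than a handful of requests per round.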

Crawl Strategy
Breadth-first but can be depth-first
Configurable via custom scoring plugins

Scoring
OPIC (On-line Page Importance Calculation) by default
LinkRank
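
The core OPIC idea can be sketched as follows: every page holds some "cash"; when a page is fetched, its cash is recorded in its history and split equally among its outlinks. This is a simplified illustration under stated assumptions (cash of pages without outlinks is simply dropped), not the code of the Nutch scoring plugin; all names are hypothetical.

```python
def opic_step(cash, history, graph, page):
    """Fetch one page: bank its cash in history, share it among outlinks."""
    amount = cash[page]
    history[page] += amount
    cash[page] = 0.0
    outlinks = graph[page]
    if outlinks:
        share = amount / len(outlinks)
        for target in outlinks:
            cash[target] += share
    # Assumption: a page without outlinks loses its cash in this sketch.


def next_page(cash):
    """Greedy strategy: fetch the page currently holding the most cash."""
    return max(sorted(cash), key=lambda page: cash[page])
```

Pages that keep receiving cash from many inlinks accumulate a large history, which serves as the importance estimate.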

11
Features (cont.)

Protocols
Http, file, ftp, https

Scheduling
Specified or adaptive

URL filters
Regex, FSA, TLD domain, prefix, suffix
URL normalisers
Default, regex
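
A hedged sketch of what a regex URL filter and a basic normaliser do. The rule syntax mimics the "+regex / -regex, first match wins" style of a regex filter file; the rules themselves and both function names are assumptions for illustration. The normaliser drops any user-info part of the URL, which a real one may handle differently.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rules: '-' rejects, '+' accepts, first matching rule decides.
RULES_TEXT = r"""
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# skip common image suffixes
-\.(gif|jpg|png)$
# accept anything else
+.
"""

DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def url_filter(url, rules):
    """Return the URL if accepted, None if rejected or unmatched."""
    for accept, pattern in rules:
        if pattern.search(url):
            return url if accept else None
    return None


def normalise(url):
    """Lower-case scheme and host, drop default port and fragment."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))
```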

12
Features (cont.)

Parsing with Apache Tika
But some legacy parsers as well

Other plugins
CreativeCommons
Feeds
Language Identification
Rel tags
Arbitrary Metadata

Indexing to SOLR
Bespoke schema

13
Outline

Overview
Features
Data Structures
Use cases
What's new in Nutch 1.3
Nutch 2.0
GORA
Conclusion

14
Data Structures

MapReduce jobs => I/O: Hadoop Sequence/MapFiles
CrawlDB => status of known pages

Input of generate - index
Output of inject - update

15
Data Structures 2

Segment => round of fetching
Identified by a timestamp

Segment
/crawl_generate/ → SequenceFile<Text,CrawlDatum>
/crawl_fetch/ → MapFile<Text,CrawlDatum>
/content/ → MapFile<Text,Content>
/crawl_parse/ → SequenceFile<Text,CrawlDatum>
/parse_data/ → MapFile<Text,ParseData>
/parse_text/ → MapFile<Text,ParseText>
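
To make the segment layout concrete, the sketch below builds the sub-directory paths for one segment. The timestamp format (yyyyMMddHHmmss), the "written by" column and the helper name are assumptions based on general knowledge of Nutch 1.x rather than on the slide; the directory names and file types are the ones listed above.

```python
from datetime import datetime
from pathlib import PurePosixPath

# part -> (file type, key type, value type, step assumed to write it)
SEGMENT_PARTS = {
    "crawl_generate": ("SequenceFile", "Text", "CrawlDatum", "generate"),
    "crawl_fetch": ("MapFile", "Text", "CrawlDatum", "fetch"),
    "content": ("MapFile", "Text", "Content", "fetch"),
    "crawl_parse": ("SequenceFile", "Text", "CrawlDatum", "parse"),
    "parse_data": ("MapFile", "Text", "ParseData", "parse"),
    "parse_text": ("MapFile", "Text", "ParseText", "parse"),
}


def segment_layout(segments_dir, created):
    """Map each segment part to its path under a timestamp-named segment."""
    name = created.strftime("%Y%m%d%H%M%S")
    root = PurePosixPath(segments_dir) / name
    return {part: str(root / part) for part in SEGMENT_PARTS}


def parts_written_by(step):
    """List the segment parts attributed to a given step in this sketch."""
    return [part for part, spec in SEGMENT_PARTS.items() if spec[3] == step]
```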