Web Scale Crawling with Apache Nutch

Web Scale Crawling with Apache Nutch. Julien Nioche (julien_at_digitalpebble.com), Berlin Buzzwords, 08/06/11.

1
Web Scale Crawling with Apache Nutch
  • Julien Nioche, julien_at_digitalpebble.com
  • Berlin Buzzwords, 08/06/11
2
DigitalPebble Ltd
  • Based in Bristol (UK)
  • Specialised in Text Engineering
    • Web Crawling
    • Natural Language Processing
    • Information Retrieval
    • Data Mining
  • Strong focus on Open Source & Apache ecosystem
  • User, contributor, committer
    • Nutch, SOLR, Lucene
    • Tika
    • GATE, UIMA
    • Mahout
    • Behemoth

3
Outline
  • Overview
  • Features
  • Data Structures
  • Use cases
  • What's new in Nutch 1.3
  • Nutch 2.0
  • GORA
  • Conclusion

4
Nutch?
  • Distributed framework for large scale web
    crawling
  • but does not have to be large scale at all
  • or even on the web (file-protocol)

5
Short history
  • 2002/2003: Started by Doug Cutting & Mike Cafarella
  • 2004: Sub-project of Lucene @Apache
  • 2005: MapReduce implementation in Nutch
  • 2006: Hadoop sub-project of Lucene @Apache
  • 2006/7: Parser and MimeType in Tika
  • 2008: Tika sub-project of Lucene @Apache
  • May 2010: Top-level project (TLP) at Apache
  • June 2011 (?): Nutch 1.3
  • Q4 2011 (?): Nutch 2.0

6
In a Nutch Shell (1.3)
  • Step by Step
    1. Inject → populates CrawlDB from seed list
    2. Generate → selects URLs to fetch in segment
    3. Fetch → fetches URLs from segment
    4. Parse → parses content (text & metadata)
    5. UpdateDB → updates CrawlDB (new URLs, new status...)
    6. InvertLinks → builds Webgraph
    7. SOLRIndex → sends docs to SOLR
    8. SOLRDedup → removes duplicate docs based on signature
    9. Repeat steps 2 to 8
  • Or use the all-in-one 'nutch crawl' command
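
The cycle above can be sketched as a toy, in-memory model. This is not Nutch code: the dictionary used as a CrawlDB, the `WEB` stand-in for the network and all function names are assumptions made purely for illustration, and the indexing and deduplication steps are left out.

```python
# Toy model of the inject / generate / fetch / parse / updatedb loop.
# Hypothetical names throughout; nothing here is Nutch's real API.

# Stand-in for the web: URL -> outlinks found on that page (made-up data).
WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def inject(crawldb, seeds):
    """Populate the CrawlDB from a seed list."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Select up to top_n unfetched URLs for the next segment."""
    due = sorted(u for u, status in crawldb.items() if status == "unfetched")
    return due[:top_n]


def fetch_and_parse(segment):
    """'Fetch' each URL in the segment and extract its outlinks."""
    return {url: WEB.get(url, []) for url in segment}


def updatedb(crawldb, parsed):
    """Record new statuses and add newly discovered URLs."""
    for url, outlinks in parsed.items():
        crawldb[url] = "fetched"
        for link in outlinks:
            # setdefault never overwrites a status that is already known
            crawldb.setdefault(link, "unfetched")


def crawl(seeds, rounds, top_n=10):
    """Run inject once, then repeat generate/fetch/parse/update."""
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        segment = generate(crawldb, top_n)
        if not segment:
            break  # frontier exhausted
        updatedb(crawldb, fetch_and_parse(segment))
    return crawldb
```

With this toy data, `crawl(["http://a.example/"], rounds=1)` leaves the two outlinks of the seed as unfetched, and a further round fetches them.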

7
Frontier expansion
  • Manual discovery
    • Adding new URLs by hand, seeding
  • Automatic discovery of new resources (frontier expansion)
    • Not all outlinks are equally useful - control
    • Requires content parsing and link extraction

Slide courtesy of A. Bialecki
8
Outline
  • Overview
  • Features
  • Data Structures
  • Use cases
  • What's new in Nutch 1.3
  • Nutch 2.0
  • GORA
  • Conclusion

9
An extensible framework
  • Plugins
    • Activated with parameter 'plugin.includes'
    • Implement one or more endpoints
  • Endpoints
    • Protocol
    • Parser
    • HtmlParseFilter
    • ScoringFilter (used in various places)
    • URLFilter (ditto)
    • URLNormalizer (ditto)
    • IndexingFilter
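
As a rough illustration of how a 'plugin.includes' style parameter can switch plugins on, the sketch below treats the parameter value as a regular expression over plugin identifiers. The plugin ids and the whole-id matching rule are assumptions for this example, not a statement of how Nutch implements it.

```python
import re


def active_plugins(available, includes):
    """Return the plugin ids that fully match the 'includes' regex."""
    pattern = re.compile(includes)
    return [pid for pid in available if pattern.fullmatch(pid)]


# Hypothetical plugin ids and setting, for illustration only.
AVAILABLE = ["protocol-http", "protocol-ftp", "parse-tika",
             "parse-html", "urlfilter-regex", "index-basic"]
INCLUDES = r"protocol-http|urlfilter-regex|parse-(html|tika)|index-basic"
```

Here `active_plugins(AVAILABLE, INCLUDES)` keeps every id except `protocol-ftp`, which no alternative in the expression matches.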

10
Features
  • Fetcher
    • Multi-threaded fetcher
    • Follows robots.txt
    • Groups URLs per hostname / domain / IP
    • Limits the number of URLs per round of fetching
    • Default values are polite but can be made more aggressive
  • Crawl Strategy
    • Breadth-first but can be depth-first
    • Configurable via custom scoring plugins
  • Scoring
    • OPIC (On-line Page Importance Calculation) by default
    • LinkRank
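
Grouping URLs per host is what lets a fetcher apply politeness limits to each site separately. A minimal sketch of that idea follows; the function name and the `max_per_host` parameter are hypothetical, and only the hostname mode is shown (not domain or IP).

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_fetch_queues(urls, max_per_host):
    """Group URLs by hostname, keeping at most max_per_host per group."""
    queues = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        if len(queues[host]) < max_per_host:
            queues[host].append(url)
    return dict(queues)
```

A real fetcher would then drain each queue with its own delay between requests; that part is deliberately omitted here.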

11
Features (cont.)
  • Protocols
    • Http, file, ftp, https
  • Scheduling
    • Specified or adaptive
  • URL filters
    • Regex, FSA, TLD domain, prefix, suffix
  • URL normalisers
    • Default, regex
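
To make the filter and normaliser roles concrete, here is a small sketch of a regex-based URL filter with ordered accept/reject rules, plus a basic normaliser. The rules, the function names and the "return None to reject" convention are assumptions chosen for this example, loosely modelled on the +/- style of regex filter rules.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Ordered rules: '+' accepts, '-' rejects; the first matching rule wins.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    ("-", re.compile(r"[?*!@=]")),
    ("+", re.compile(r"^https?://")),
]


def normalise(url):
    """Lower-case scheme and host, drop the fragment, default path to '/'."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def url_filter(url):
    """Return the URL if it is accepted, or None if it is rejected."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return url if sign == "+" else None
    return None  # no rule matched: reject
```

Normalising before filtering means that equivalent spellings of a URL are judged, and later stored, only once.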

12
Features (cont.)
  • Parsing with Apache Tika
  • But some legacy parsers as well
  • Other plugins
  • CreativeCommons
  • Feeds
  • Language Identification
  • Rel tags
  • Arbitrary Metadata
  • Indexing to SOLR
  • Bespoke schema

13
Outline
  • Overview
  • Features
  • Data Structures
  • Use cases
  • What's new in Nutch 1.3
  • Nutch 2.0
  • GORA
  • Conclusion

14
Data Structures
  • MapReduce jobs => I/O: Hadoop SequenceFiles / MapFiles
  • CrawlDB => status of known pages
    • Input of generate and index
    • Output of inject and update
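
One way to picture the CrawlDB is as a map from URL to a small status record, where each update produces a new copy of the whole map. The sketch below follows that picture; the `CrawlDatum` fields shown and the `update_crawldb` function are simplified assumptions, not the actual Nutch classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlDatum:
    """Simplified, hypothetical per-URL record."""
    status: str          # e.g. "unfetched", "fetched", "gone"
    fetch_time: int = 0  # when the page was last fetched (epoch seconds)
    score: float = 1.0


def update_crawldb(old_db, fetch_results, discovered):
    """Return a NEW CrawlDB: old entries merged with one round's output."""
    new_db = dict(old_db)  # the whole structure is copied and rewritten
    for url, datum in fetch_results.items():
        new_db[url] = datum
    for url in discovered:
        # newly discovered URLs never overwrite an existing record
        new_db.setdefault(url, CrawlDatum(status="unfetched"))
    return new_db
```

The old map is left untouched, which mirrors a file-based store where an update writes a fresh version rather than editing records in place.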

15
Data Structures 2
  • Segment => round of fetching
  • Identified by a timestamp

Segment
  /crawl_generate/ → SequenceFile<Text,CrawlDatum>
  /crawl_fetch/    → MapFile<Text,CrawlDatum>
  /content/        → MapFile<Text,Content>
  /crawl_parse/    → SequenceFile<Text,CrawlDatum>
  /parse_data/     → MapFile<Text,ParseData>
  /parse_text/     → MapFile<Text,ParseText>

  • Can have multiple versions of a page in different segments

16
Data Structures 3
  • LinkDB => storage for Web Graph
  • Output of invertlinks
  • Input of SOLRIndex
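
Inverting links means turning "page X links to Y" into "Y is linked from X", so that each page's incoming links and anchor texts are available at indexing time. A minimal sketch, with a hypothetical function name and input shape:

```python
from collections import defaultdict


def invert_links(outlinks):
    """Turn {source: [(target, anchor), ...]} into
    {target: [(source, anchor), ...]}."""
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            inlinks[target].append((source, anchor))
    return dict(inlinks)
```

Pages that nobody links to simply do not appear as keys in the result.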

17
Outline
  • Overview
  • Features
  • Data Structures
  • Use cases
  • What's new in Nutch 1.3
  • Nutch 2.0
  • GORA
  • Conclusion

18
Use cases
  • Crawl for Search Systems
    • Web wide or vertical
    • Single node to large clusters
    • Legacy Lucene-based search or SOLR
  • ... but not necessarily
    • NLP (e.g. Sentiment Analysis)
    • ML, Classification / Clustering
    • Data Mining
  • SimilarPages.com
    • Large cluster on Amazon EC2 (up to 400 nodes)
    • Fetched & parsed 3 billion pages
    • 10 billion pages in crawlDB (100TB data)
    • 200 million lists of similarities
    • No indexing / search involved
  • MAHOUT / UIMA / GATE
    • Use Behemoth as glueware (http://github.com/jnioche/behemoth)

19
Outline
  • Overview
  • Features
  • Data Structures
  • Use cases
  • What's new in Nutch 1.3
  • Nutch 2.0
  • GORA
  • Conclusion

20
NUTCH 1.3
  • Transition between 1.x and 2.0
  • http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/
  • 1.3-RC3 => imminent
  • Removed Lucene-based indexing and search webapp
    • delegate indexing / search remotely to SOLR
    • change of focus: Web search application → Crawler
  • Removed deprecated parse plugins
    • delegate most parsing to Tika
  • Separate local / distributed runtimes
  • Ivy-based dependency management
21
NUTCH 2.0
  • Became trunk in 2010
  • Same features as 1.3
    • delegation to SOLR, TIKA, etc.
  • Moved to table-based architecture
    • Wealth of NoSQL projects in last 2 years
  • Preliminary version known as NutchBase (Dogacan Güney)
  • Moved storage layer to subproject in incubator → GORA

22
GORA
  • http://incubator.apache.org/gora/
  • ORM for NoSQL databases
    • and limited SQL support
  • 0.1 released in April 2011
  • Backend implementations
    • HBase
    • Cassandra
    • SQL
    • Memory
  • Serialization with Apache AVRO
  • Object-to-datastore mappings (backend-specific)

23
GORA (cont.)
  • Atomic operations
    • Get
    • Put
    • Delete
  • Querying
    • Execute
    • deleteByQuery
  • Wrappers for Apache Hadoop
    • GORAInputFormat / GORAOutputFormat
    • GORAMapper / GORAReducer
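
The operations listed above can be illustrated with a toy in-memory store. This is a Python sketch of the concepts only, not GORA's Java API: the class name, the key-range form of queries and the return values are all assumptions made for the example.

```python
class MemoryStore:
    """Toy key-value store exposing get / put / delete and range queries."""

    def __init__(self):
        self._rows = {}

    def get(self, key):
        """Return the row stored under key, or None if absent."""
        return self._rows.get(key)

    def put(self, key, row):
        """Insert or overwrite a single row."""
        self._rows[key] = row

    def delete(self, key):
        """Remove a row; return True if it existed."""
        return self._rows.pop(key, None) is not None

    def execute(self, start_key=None, end_key=None):
        """Return (key, row) pairs with start_key <= key <= end_key,
        in key order. A bound of None means 'unbounded'."""
        return [(k, self._rows[k]) for k in sorted(self._rows)
                if (start_key is None or k >= start_key)
                and (end_key is None or k <= end_key)]

    def delete_by_query(self, start_key=None, end_key=None):
        """Delete every row in the key range; return how many were removed."""
        matched = [k for k, _ in self.execute(start_key, end_key)]
        for k in matched:
            del self._rows[k]
        return len(matched)
```

The point of the sketch is that a single row can be read or rewritten on its own with `get` and `put`, without touching the rest of the table.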

24
Benefits for Nutch
  • Storage still distributed and replicated
  • ... but one big table
    • status, metadata, content, text → one place
  • Simplified logic in Nutch
    • Simpler code for updating / merging information
  • More efficient
    • No need to read / write entire structure to update records
    • e.g. update step in 1.x
  • Easier interaction with other resources
    • Third-party code just needs to use GORA and schema

25
Status Nutch 2.0
  • Beta stage
    • debugging / testing required
  • Compare performance of GORA backends
  • Need to update documentation / WIKI
  • Enthusiasm from community
  • GORA: next great project coming out of Nutch?

26
Future
  • Definitive move to 2.0?
    • Contribute backends and functionalities to GORA
  • New functionalities
    • Sitemap
    • Canonical tag
    • More indexers (e.g. ElasticSearch), pluggable indexers?
  • Delegate code to crawler-commons (http://code.google.com/p/crawler-commons/)
    • Fetcher / protocol handling
    • Robots.txt parsing
    • URL normalisation / filtering

27
Outline
  • Overview
  • Features
  • Data Structures
  • Use cases
  • What's new in Nutch 1.3
  • Nutch 2.0
  • GORA
  • Conclusion

28
Where to find out more?
  • Project page: http://nutch.apache.org/
  • Wiki: http://wiki.apache.org/nutch/
  • Mailing lists
    • user_at_nutch.apache.org
    • dev_at_nutch.apache.org
  • Chapter in 'Hadoop: The Definitive Guide' (T. White)
    • Understanding Hadoop is essential anyway...
  • Support / consulting
    • http://wiki.apache.org/nutch/Support
    • nutch_at_digitalpebble.com

29
Questions?