1
Adaptive Focused Crawling
  • Dr. Alexandra Cristea
  • a.i.cristea@warwick.ac.uk
  • http://www.dcs.warwick.ac.uk/acristea/


2
1. Contents
  • Introduction
  • Crawlers and the World Wide Web
  • Focused Crawling
  • Agent-Based Adaptive Focused Crawling
  • Machine Learning-Based Adaptive Focused Crawling
  • Evaluation Methodologies
  • Conclusions

3
Motivation
  • Large amount of information on the Web
  • A standard crawler traverses the Web, downloading
    everything
  • A focused, adaptive crawler selects only related
    documents and ignores the rest

4
Introduction
5
A focused crawler's retrieval (figure)
6
Adaptive Focused Crawler
  • Traditional non-adaptive focused crawlers:
    suitable for user communities with shared
    interests and goals that do not change over time
  • A focused crawler that uses learning methods to
    adapt its behaviour to the particular environment
    and its relationships with the given input
    parameters (e.g., the set of retrieved pages and
    the user-defined topic) >> adaptive
  • Adaptive focused crawlers:
  • for personalized search systems with info needs,
    user interests, goals, preferences, etc.
  • for single users rather than communities of people
  • sensitive to potential alterations in the
    environment

7
Crawlers and the WWW
8
Growth and size of the Web
  • Growth
  • 2005: at least 11.5 billion pages
  • Doubling in less than 2 years
  • http://www.worldwidewebsize.com/
  • Today (2008): indexable Web of 23 billion pages
  • Change
  • 23% of pages change daily, 40% within a week
  • Challenge: search engines keep local copies
  • Crawls are time-consuming → trade-offs needed
  • Alternatives
  • Google Sitemaps: an XML file that lists a site's
    pages and how often they change (push instead of
    pull); see the example below
  • distributed crawling
  • Truism: the Web is growing faster than search
    engines
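
For illustration, a minimal Sitemap file in the
sitemaps.org XML format (the URL, date and change
frequency are made-up values):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/index.html</loc>
        <lastmod>2008-05-01</lastmod>
        <changefreq>daily</changefreq>
      </url>
    </urlset>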

9
Reaching the Web Hypertextual Connectivity and
Deep Web
  • Dark matter: info not accessible to search
    engines
  • Page sets: IN, OUT, SCC (Strongly Connected
    Component)

What happens if you crawl from Out?
10
Deep Web
  • dynamic page generators
  • a 2001 estimate: public information on the deep
    Web is up to 550 times larger than the normally
    accessible Web
  • databases

11
Crawling Strategies
  • Important pages first: ordering metrics, e.g.,
    Breadth-First, Backlink count, PageRank

12
Backlink
  • Number of pages linking in (ordering sketch below)
  • Based on bibliographic (citation) research
  • Local minima issue
  • Built on this: PageRank, HITS
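
A minimal sketch of backlink-count ordering for a
crawl frontier; fetch_links is a hypothetical helper
returning a page's out-links, and this is only one
plausible reading of the metric, not a reference
implementation:

    def crawl_by_backlink(seeds, fetch_links, max_pages=100):
        # backlinks: URL -> number of in-links discovered so far
        backlinks = {url: 0 for url in seeds}
        visited = set()
        while backlinks and len(visited) < max_pages:
            # greedily visit the unvisited URL with the most known in-links
            url = max(backlinks, key=backlinks.get)
            del backlinks[url]
            visited.add(url)
            for out in fetch_links(url):
                if out not in visited:
                    backlinks[out] = backlinks.get(out, 0) + 1
        return visited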

13
Focused Crawling
  • Exploits additional information on web pages,
    such as anchor text or the text surrounding
    links, to skip some of the pages encountered

14
Exploiting the Hypertextual Info
  • Links and (topical) locality are just as
    important as the IR (page content) information

15
Fish search
  • Input: user's query + starting URLs (e.g.,
    bookmarks) = priority list
  • The first URL in the list is downloaded and
    scored; a heuristic decides whether to continue
    in that direction; if not, its links are ignored
  • If yes, its links are scanned, each with a depth
    value (e.g., parent's depth - 1); when depth
    reaches zero, the direction is dropped
  • A timeout or a maximum number of pages is also
    possible
  • Very heavy and demanding: a big burden on the Web
    (sketch below)
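
A minimal runnable sketch of this loop. The helpers
fetch(url) -> (text, out_links) and
score(text, query) -> bool are caller-supplied
stand-ins, not part of the original system; the depth
policy follows the bullets above:

    def fish_search(query, start_urls, fetch, score,
                    max_pages=100, init_depth=3):
        frontier = [(url, init_depth) for url in start_urls]  # priority list
        seen, results = set(start_urls), []
        while frontier and len(results) < max_pages:
            url, depth = frontier.pop(0)       # first in list is downloaded
            text, links = fetch(url)
            relevant = score(text, query)      # heuristic: continue this direction?
            if relevant:
                results.append(url)
            child_depth = init_depth if relevant else depth - 1
            if child_depth > 0:                # depth zero: direction is dropped
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append((link, child_depth))
        return results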

16
Other focused crawlers
  • Taxonomy and distillation
  • Classifier: evaluates relevance of hypertext docs
    with regard to the topic
  • Distiller: identifies nodes that are access
    points to many relevant pages (via the HITS
    algorithm)
  • Tunneling
  • allows a limited number of bad pages, to avoid
    losing info (pages close in topic may not point
    to each other)
  • Contextual crawling (sketch after this list)
  • Context graph for each page, with an associated
    distance (min number of links to traverse from
    the initial set)
  • Naïve Bayes classifiers: category identification
    according to distance; predictions of a generic
    document's distance are possible
  • Problem: requires reverse-link info
  • Semantic Web
  • Ontologies
  • Improvements in performance
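
A minimal sketch of the contextual idea using
scikit-learn (assumed available): pages are labelled
with their link distance (layer) from the target set,
a Naïve Bayes classifier learns the layers, and its
prediction for a new document estimates how far it is
from the topic. The training texts and layers below
are made up:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical context-graph training data: text of pages at link
    # distance 0 (on-topic), 1 and 2 from the initial set.
    texts = ["focused crawling adaptive agents",
             "web information retrieval course",
             "computer science department news",
             "university campus events calendar"]
    layers = [0, 1, 2, 2]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, layers)

    # A low predicted layer suggests the page is close to the topic,
    # so the crawler should expand its links first.
    print(model.predict(["adaptive web crawling tutorial"]))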

17
Agent-based Adaptive Focused Crawling
  • Genetic-based
  • Ants

18
Genetic-based crawling
  • GA:
  • approximate solutions to hard-to-solve
    combinatorial optimization problems
  • genetic operators: inheritance, mutation,
    crossover; population evolution
  • GA crawler agents (InfoSpiders,
    http://www.informatics.indiana.edu/fil/IS/)
  • genotype (chromosome set) defining search
    behaviour:
  • trust in out-links
  • query terms
  • weights (uniform distribution initially; the
    feed-forward NN learns later via
    supervised/unsupervised backpropagation)
  • Energy = Benefit() - Cost()  (fitness)

19
Genotype and NN
(figure: the network's output classifies pages as
relevant / irrelevant)
20
Algorithm 1. Pseudo-code of the InfoSpiders algorithm
  • initialize each agent's genotype, energy and
    starting page
  • PAGES ← maximum number of pages to visit
  • while number of visited pages < PAGES do
  •     for each agent a do
  •         pick and visit an out-link from the
            current agent's page
  •         update the energy estimating benefit() -
            cost()
  •         update the genotype as a function of the
            current benefit
  •         if agent's energy > THRESHOLD then
  •             apply the genetic operators to
                produce offspring
  •         else
  •             kill the agent
  •         end if
  •     end for
  • end while
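
The same control flow as a minimal Python sketch.
Agent internals are heavily simplified: benefit and
cost are caller-supplied stand-ins, the genotype is a
plain weight list, and the genetic operators shrink
to copy-with-mutation, so this mirrors only the shape
of Algorithm 1, not the original system:

    import random

    def infospiders(agents, fetch_links, benefit, cost,
                    max_pages=100, threshold=10.0):
        # agents: list of dicts {"page": url, "energy": float, "genotype": [floats]}
        visited = set()
        while agents and len(visited) < max_pages:
            for agent in list(agents):
                links = fetch_links(agent["page"])
                if not links:
                    agents.remove(agent)               # dead end: kill the agent
                    continue
                agent["page"] = random.choice(links)   # pick and visit an out-link
                visited.add(agent["page"])
                gain = benefit(agent) - cost(agent)    # energy = benefit() - cost()
                agent["energy"] += gain
                # adapt the genotype as a function of the current benefit
                agent["genotype"] = [w + 0.1 * gain for w in agent["genotype"]]
                if agent["energy"] > threshold:        # reproduce: offspring inherits
                    child = dict(agent, energy=agent["energy"] / 2,
                                 genotype=[w + random.gauss(0, 0.05)
                                           for w in agent["genotype"]])
                    agent["energy"] /= 2
                    agents.append(child)
                elif agent["energy"] <= 0:
                    agents.remove(agent)               # kill the agent
        return visited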

21
Ant-based Crawling
  • Collective intelligence
  • Simple individual behaviour, complex results
    (shortest path)
  • Pheromone trail

22
Ant crawling: preferred path (figure)
23
Transition probabilities (p) according to
pheromone trails (τ)
24
Task accomplishing behaviors
  • 1. at the end of the cycle, the agent updates the
    pheromone trails of the followed path and places
    itself in one of the start resources
  • 2. if an ant trail exists, the agent decides to
    follow it with a probability that is a function
    of the respective pheromone intensity
  • 3. if the agent does not have any available
    information, it moves randomly

25
Transition probability
  P_ij(t) = τ_ij(t) / Σ_l τ_il(t)
  (the sum runs over every link between url_i and url_l)
where τ_ij(t) corresponds to the pheromone trail
between url_i and url_j
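
A direct reading of this formula as a sketch; trails
maps (url_i, url_j) pairs to pheromone values and
out_links lists each page's out-going links (both
hypothetical structures):

    def transition_probability(i, j, out_links, trails):
        # P_ij(t) = tau_ij(t) / sum over out-links l of tau_il(t)
        total = sum(trails[(i, l)] for l in out_links[i])
        return trails[(i, j)] / total if total > 0 else 0.0
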
26
Trail updating
  τ_ij(t+1) = (1 - ρ) · τ_ij(t) + Σ_{k=1..M} Δτ_ij(k)
where
Δτ_ij(k) is the pheromone the k-th ant deposits on
the link url_i → url_j, derived from score(·)
p(k) is the ordered set of pages visited by the
k-th ant
p(k)_i is the i-th element of p(k)
score(p) returns, for each page p, the similarity
measure with the current info needs, in [0, 1],
where 1 is the highest similarity
M is the number of ants
ρ is the trail evaporation coefficient
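
One evaporation-plus-deposit step as a sketch under
these definitions; since the exact deposit term is
not given here, each ant simply deposits score(page)
on every link of its path p(k):

    def update_trails(trails, paths, score, rho=0.1):
        # trails: (url_i, url_j) -> pheromone value; paths: one page list p(k) per ant
        for edge in trails:
            trails[edge] *= (1.0 - rho)                 # evaporation
        for path in paths:                              # deposit along each ant's path
            for i in range(len(path) - 1):
                edge = (path[i], path[i + 1])
                trails[edge] = trails.get(edge, 0.0) + score(path[i + 1])
        return trails
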
27
Algorithm 2. Pseudo-code of the Ant-based crawler.
  • initialize each agent's starting page
  • PAGES ← maximum number of pages to visit
  • cycle ← 1
  • t ← 1
  • while number of visited pages < PAGES do
  •     for each agent a do
  •         for move = 0 to cycle do
  •             calculate the probabilities
                P_ij(t) of the out-going links as in
                the equation above
  •             select the next page to visit
                for the agent a
  •         end for
  •     end for
  •     update all the pheromone trails
  •     initialize each agent's starting page
  •     cycle ← cycle + 1
  •     t ← t + 1
  • end while

28
Machine Learning-Based Adaptive Focused Crawling
  • Intelligent Crawling Statistical Model
  • Reinforcement Learning-Based Approaches

29
Intelligent Crawling Statistical Model
  • statistically learn the characteristics of the
    Web's linkage structure while performing the
    search
  • Unseen-page predicates (content of pages linking
    in, tokens on the unseen page)
  • Evidence E is used to update the probability that
    a page satisfies the user's needs (the event C)

30
Evidence-based update
  • P(C|E) > P(C)
  • Interest Ratio
  • I(C,E) = P(C|E) / P(C)
  •        = P(C ∩ E) / (P(C) · P(E))

e.g., E = 10% of the pages pointing in contain
"Bach"
No initial collection needed. At the beginning,
users specify their needs via predicates, e.g.,
the page content or the title must contain a
given set of keywords.
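
A sketch of estimating the interest ratio from simple
crawl counts (the counts and their names are
hypothetical):

    def interest_ratio(n_pages, n_relevant, n_evidence, n_relevant_with_evidence):
        # I(C,E) = P(C|E) / P(C), estimated from counts
        p_c = n_relevant / n_pages
        p_c_given_e = n_relevant_with_evidence / n_evidence
        return p_c_given_e / p_c

    # 1000 pages crawled, 50 relevant; E seen on 200 pages, 25 of them relevant:
    # I = (25/200) / (50/1000) = 2.5, i.e. pages showing E are
    # 2.5 times as likely to be relevant
    print(interest_ratio(1000, 50, 200, 25))
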
31
Reinforcement Learning-Based Approaches
  • Traditional focused crawler +
  • apprentice: assigns priorities to unvisited URLs
    (based on DOM features) for the next steps of
    crawling
  • Naïve Bayes text classifiers compare the text
    around links to choose the next steps

DOM Document Object Model
32
Evaluation Methodologies
  • For a fixed Web corpus and a standard crawl:
  • computation time to complete the crawl, or
  • the number of downloaded resources per time unit
  • For a focused crawl:
  • we need only the correctly retrieved documents,
    not all of them, so...

33
Focused Crawling
  • Precision
  • P_r = found / (found + false alarms)
  • Recall
  • R_r = found / (found + misses)
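
The two measures as a sketch in Python; the counts
are assumed to come from a crawl judged against a set
of known relevant documents:

    def precision(found, false_alarm):
        # fraction of retrieved documents that are relevant
        return found / (found + false_alarm)

    def recall(found, miss):
        # fraction of relevant documents that were retrieved
        return found / (found + miss)

    # e.g., 80 relevant retrieved, 20 irrelevant retrieved, 40 relevant missed:
    print(precision(80, 20))   # 0.8
    print(recall(80, 40))      # ~0.667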

34
Precision / Recall
35
Conclusions
  • Focused Crawling: an interesting alternative to
    Web search
  • Adaptive Focused Crawlers:
  • learning methods are able to adapt the system
    behaviour to a particular environment and input
    parameters during the search
  • Dark matter research, NLP, Semantic Web

36
  • Any questions?