Cha Cha: A System for Organizing Intranet Search Results. Contextualized Hierarchical Information Access, by Chen, Hearst

Transcript and Presenter's Notes

1
Cha Cha: A System for Organizing Intranet Search Results
Contextualized Hierarchical Information Access
by Chen, Hearst Associates
2
Problems
  • Most search engines present search results as a
    ranked list
  • No information about the context in which pages
    exist or their relationships with each other
  • Difficult to tell how the search results relate
    to one another
  • Queries are vague, often only a few words, which
    produces a heterogeneous result set

3
(No Transcript)
4
Prior work
  • SuperBook
  • AMIT
  • WebTOC
  • WebGlimpse
  • Connectivity Server
  • WebCutter (Mapuccino)

5
(No Transcript)
6
Off-Line Processing
Phase 1: Web crawler
Phase 2: Meta File Generator
Phase 3: Inverted Index Generator
Diagram labels: WWW, Web Pages, Meta Files, Text, Inverted Index
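
As a rough illustration of what Phase 3 produces, the sketch below builds a toy inverted index that maps each term to the pages containing it. The function name, the simple regex tokenizer, and the sample pages are assumptions made for this example; the slides do not describe how the Cha Cha index is actually built.

```python
import re
from collections import defaultdict

def build_inverted_index(pages):
    """Map each lowercase term to the sorted list of page ids containing it.

    `pages` is a hypothetical dict of page id -> extracted text.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        # Assumed tokenizer: runs of ASCII letters and digits.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Made-up sample pages, for illustration only.
pages = {
    "/": "Campus home page",
    "/library": "Library hours and campus maps",
}
index = build_inverted_index(pages)
```

Looking up a query term is then a dictionary access, for example `index["campus"]`, and multi-word queries can intersect or union the returned page lists.
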
7
On-Line Processing
Diagram components: USER, Cha Cha Front end, Query Processor, Graph Generator, Abstract Generator, HTML Generator, Inverted Index
8
Goals
  • Group as many pages together within a
    subhierarchy as possible; avoid branches that
    terminate in only one leaf
  • Remove as many internal nodes as possible while
    retaining at least one path to every leaf

9
Algorithm
  • Form a graph.
  • Top-down first pass: set the number of children
    and the depth of each node.
  • Bottom-up pass: for each level, from the bottom
    to the root:
      • Find all the non-leaf nodes at this level.
      • Sort them according to their number of
        active children.
      • For each node at this level:
          i.  Check whether it has at least one
              active child that does not have
              another parent.
          ii. If not, eliminate this node.
  • Sort the leaf nodes in ascending order according
    to their rank in the search results.
  • Start with an empty tree and the first hit: find
    a path from that hit through the active nodes at
    each level above it, up to the root. Whenever
    there is a choice of parent, choose the parent
    with the largest number of nodes.
  • Repeat this for all other leaf nodes, in order of
    rank. Whenever there is a choice of parent,
    choose the parent that is already in the tree.
  • Convert the tree to an HTML hierarchy by
    traversing it depth first. The choice of which
    sibling to traverse next is determined by the
    order in which the siblings were entered into
    the tree.
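
The steps above can be sketched in Python as follows. This is one reading of the slide, not the authors' implementation. It assumes the graph is a layered DAG (every edge goes from one level to the next), that every hit is a leaf reachable from the root, that "another parent" means another parent that is still active, and that non-leaf nodes are sorted in ascending order of active children (the slide does not give the direction). The function name `build_hierarchy` and the sample node names are made up for the example, and the output is a list of (depth, node) pairs rather than HTML.

```python
from collections import defaultdict

def build_hierarchy(edges, root, hits):
    """Return (depth, node) pairs of the pruned tree in depth-first order.

    edges: (parent, child) pairs of a layered DAG rooted at `root`.
    hits:  leaf nodes, best-ranked first.
    """
    children, parents = defaultdict(list), defaultdict(list)
    for p, c in edges:
        children[p].append(c)
        parents[c].append(p)

    # Top-down pass: breadth-first walk that records each node's depth.
    depth, frontier = {root: 0}, [root]
    while frontier:
        nxt = []
        for node in frontier:
            for c in children[node]:
                if c not in depth:
                    depth[c] = depth[node] + 1
                    nxt.append(c)
        frontier = nxt

    active = set(depth)

    def active_children(node):
        return [c for c in children[node] if c in active]

    def has_exclusive_child(node):
        # True if some active child has no other active parent.
        return any(
            all(p == node or p not in active for p in parents[c])
            for c in active_children(node)
        )

    # Bottom-up pass: drop internal nodes whose children are all
    # reachable through some other active parent. The root is kept.
    for level in range(max(depth.values()) - 1, 0, -1):
        internal = [n for n in depth if depth[n] == level and children[n]]
        # Assumption: ascending order of active children.
        internal.sort(key=lambda n: len(active_children(n)))
        for node in internal:
            if not has_exclusive_child(node):
                active.discard(node)

    # Attach hits in rank order. Prefer a parent already in the tree,
    # otherwise the parent with the most active children.
    tree = defaultdict(list)
    in_tree = {root}
    for hit in hits:
        path, node = [], hit
        while node not in in_tree:
            path.append(node)
            candidates = [p for p in parents[node] if p in active]
            node = max(candidates,
                       key=lambda p: (p in in_tree, len(active_children(p))))
        for child in reversed(path):
            tree[node].append(child)
            in_tree.add(child)
            node = child

    # Depth-first traversal; siblings keep their insertion order.
    outline, stack = [], [(0, root)]
    while stack:
        d, node = stack.pop()
        outline.append((d, node))
        stack.extend((d + 1, c) for c in reversed(tree[node]))
    return outline

# Made-up example: "people" is pruned because its only child, h2,
# can also be reached through "dept".
edges = [("root", "dept"), ("root", "people"),
         ("dept", "h1"), ("dept", "h2"), ("people", "h2")]
outline = build_hierarchy(edges, "root", ["h1", "h2"])
```

In this example both hits end up grouped under `dept`, which matches the stated goals of grouping pages together and avoiding branches that end in a single leaf.
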

10
Graph Merging Example
[Figure: the example graph shown before pruning and after pruning; nodes are numbered 1 through 8.]
11
Results
  • 5 hours to build the inverted index for an
    intranet with 200,000 web pages
  • Query time: 3.02 seconds, of which 1 second is
    spent in the Java front end (Sparc Ultra II)
  • 2.4 seconds on a Sun Enterprise 450

12
Questions