How to Cha-Cha: Looking under the hood of the Cha-Cha Intranet Search Engine


1
How to Cha-Cha: Looking under the hood of the
Cha-Cha Intranet Search Engine
  • Marti Hearst
  • SIMS
  • SIMposium, April 21, 1999

2
This Talk
  • Overview of goals
  • System implementation details
  • Not covered
  • UI evaluation
  • related work
  • etc

3
People
  • Principals: Mike Chen and Marti Hearst
  • Early coding: Jason Hong
  • Early UI evaluation: Jimmy Lin, Mike Chen
  • Current UI evaluation: Shiang-Ling Chen

4
Cha-Cha Goals
  • Better Intranet search
  • integrate searching and browsing
  • provide context for search results
  • familiarize users with the site structure
  • UI
  • minimal browser requirement
  • widely usable HTML interface
  • build on user familiarity with existing systems

5
Intranet Search
  • Documents used in a large, diverse Intranet,
    e.g.,
  • University.edu
  • Corporation.com
  • Government.gov
  • Hypothesis: It is meaningful to group search
    results according to organizational structure

6
Searching Earthquakes at UCB: Standard Way
7
Searching Earthquakes at UCB with Cha-Cha
8
Cha-Cha and Source Selection
  • Shows available sources
  • Sources are major web sites
  • User may want to navigate the source rather than
    go directly to the search hits
  • Gives hints about relative importance of various
    sources
  • Reveals the structure of the site while tightly
    integrating this structure with search
  • Users tell us anecdotally that the outline view
    is useful for finding starting points

9
System Overview
  • Collect shortest paths for each page.
  • Global paths from root of the domain
  • Local paths from root of the server
  • Select the best path based on the query
  • User interaction with the system

4. Select paths, generate HTML
10
Current Status
  • Over 200,000 pages indexed
  • About 2500 queries/weekday
  • Less than 3 sec/query on average
  • Five subdomains using it as site search engine
  • eecs
  • millennium project
  • sims
  • law
  • career center

11
Cha-Cha Preprocessing
12
Overview of Cha-Cha Preprocessing
  • Crawl entire Intranet
  • Store copies of pages locally
  • 200,000 pages on the UCB Intranet
  • Revisit all the pages again (on disk)
  • Create metadata for each page
  • Compute the shortest hyperlink path from a
    certain root page to every web page
  • both global and local paths
  • Index all the pages
  • Using Cheshire II (Ray Larson, SIMS)
  • Index full text, titles, shortest paths separately

13
Web Crawling Algorithm
  • Start with a list of servers to crawl
  • for UCB, simply start with www.berkeley.edu
  • Restrict crawl to certain domain(s)
  • .berkeley.edu
  • Obey the No Robots standard (robots.txt exclusion rules)
  • Follow hyperlinks only
  • do not read local filesystems
  • links are placed on a queue
  • traversal is breadth-first
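
The robots check can be sketched with Python's standard urllib.robotparser module. This is only an illustration, not the crawler described in the talk: the robots.txt body, the user-agent name "cha-cha-sketch", and the two URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would fetch it from each server.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)
# "cha-cha-sketch" is an assumed user-agent name, not the real crawler's.
allowed = checker.can_fetch("cha-cha-sketch", "http://www.berkeley.edu/index.html")
blocked = checker.can_fetch("cha-cha-sketch", "http://www.berkeley.edu/private/x.html")
```

A crawler would run a check like this before placing a link on the queue.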

14
Web Crawling Algorithm (cont.)
  • Interpret the HTML on each web page
  • Record the text of the page in a file on disk.
  • Make a list of all the pages that this page links
    to (outlinks)
  • Follow those links one at a time, repeating this
    procedure for each page found, until no
    unexplored pages are left.
  • links are placed on a queue
  • traversal is breadth-first
  • urls that have been crawled are stored in a hash
    table in memory, to avoid repeats
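
The crawl loop above can be sketched as follows. This is a minimal illustration under stated assumptions: a toy in-memory link graph (TOY_WEB) stands in for HTTP fetching and HTML parsing, and the function and variable names are invented for the example.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, fetch_outlinks, allowed_suffix=".berkeley.edu"):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    queue = deque([start_url])
    seen = {start_url}  # in-memory hash set of URLs already queued, to avoid repeats
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_outlinks(url):
            host = urlparse(link).hostname or ""
            # Restrict the crawl to the chosen domain.
            if host.endswith(allowed_suffix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Toy link graph (hypothetical pages) standing in for real fetches.
TOY_WEB = {
    "http://www.berkeley.edu/": [
        "http://www.sims.berkeley.edu/",
        "http://www.eecs.berkeley.edu/",
        "http://example.com/",
    ],
    "http://www.sims.berkeley.edu/": [
        "http://www.berkeley.edu/",
        "http://www.sims.berkeley.edu/about.html",
    ],
    "http://www.eecs.berkeley.edu/": [],
    "http://www.sims.berkeley.edu/about.html": [],
}

visited = crawl("http://www.berkeley.edu/", lambda u: TOY_WEB.get(u, []))
```

Because the queue is first-in, first-out, every page one link from the start is visited before any page two links away.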

15
Custom Web Crawler
  • Special considerations
  • full coverage
  • web search engines don't go very deep
  • web search engines skip problematic sites
  • search on "Berdahl" at Snap: 430 hits
  • search on "Berdahl" on Cha-Cha: XXX hits
  • solution
  • tag each URL with a retry counter
  • if server is down, put URL at the end of the
    queue and decrement the retry counter
  • if the counter is 0, give up on the URL
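
A minimal sketch of the retry scheme, assuming a starting counter of 3 (the talk does not give the value). The fetch function is simulated and the host names are hypothetical.

```python
from collections import deque


def crawl_with_retries(urls, fetch, max_retries=3):
    """fetch(url) returns True on success, False if the server is down.
    Returns (fetched, abandoned) lists of URLs."""
    queue = deque((url, max_retries) for url in urls)
    fetched, abandoned = [], []
    while queue:
        url, retries_left = queue.popleft()
        if fetch(url):
            fetched.append(url)
            continue
        retries_left -= 1  # server down: decrement the retry counter
        if retries_left == 0:
            abandoned.append(url)  # counter exhausted: give up on the URL
        else:
            queue.append((url, retries_left))  # try again at the end of the queue
    return fetched, abandoned


# Simulated servers (hypothetical names): failures before the first success.
failures_before_success = {
    "http://a.berkeley.edu/": 0,
    "http://b.berkeley.edu/": 2,
    "http://c.berkeley.edu/": 99,
}
attempts = {}


def flaky_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    return attempts[url] > failures_before_success[url]


fetched, abandoned = crawl_with_retries(list(failures_before_success), flaky_fetch)
```

Putting a failed URL at the end of the queue gives a temporarily unavailable server time to recover before the next attempt.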

16
Custom Web Crawler
  • Special considerations
  • servers with multiple names
  • e.g., info.berkeley.edu and www.sims.berkeley.edu
  • solution
  • hash the home page of the server into a table
  • whenever a new server is found, compare its
    homepage to those in the table
  • if a duplicate, record the new server's name as
    being the same as the original server's
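
A minimal sketch of the duplicate-server check. The talk does not say which hash is used, so SHA-256 here is an assumption, as are the function name and the sample home-page contents.

```python
import hashlib


def find_aliases(home_pages):
    """home_pages: iterable of (server_name, home_page_html) in discovery order.
    Returns {server_name: canonical_name}, where the canonical name is the
    first server seen with that home page."""
    canonical_by_digest = {}
    alias_of = {}
    for server, html in home_pages:
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        # The first server with this digest becomes the canonical name.
        canonical = canonical_by_digest.setdefault(digest, server)
        alias_of[server] = canonical
    return alias_of


# Made-up page contents for illustration.
aliases = find_aliases([
    ("www.sims.berkeley.edu", "<html>Welcome to SIMS</html>"),
    ("www.berkeley.edu", "<html>Welcome to UC Berkeley</html>"),
    ("info.berkeley.edu", "<html>Welcome to SIMS</html>"),
])
```

Comparing digests rather than full pages keeps the table small, at the cost of missing aliases whose home pages differ by even one byte.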

17
Cha-Cha Metadata
  • Information about web pages
  • Title
  • Length
  • Inlinks
  • Outlinks
  • Shortest paths from a root home page

18
Metafile Generator
  • Main task: find shortest path information
  • Two passes: global and local
  • Global pass
  • start with main home page H (www.berkeley.edu)
  • find shortest path from H to every page in the
    system
  • for each page, keep track of how far it is from H
  • also keep track of the path that got you there
  • store this information in a disk-based storage
    manager (we use Sleepycat, based on Berkeley DB)
  • if a page is re-encountered using a path with a
    shorter distance, record that distance and the
    new path
  • when this is done, write out a metafile for each
    page
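
The global pass can be sketched as a breadth-first search that records a path for every page. This is an illustration only: a Python dict stands in for the disk-based storage manager, and the link graph is hypothetical apart from the host names that appear in the talk.

```python
from collections import deque


def shortest_paths(root, outlinks):
    """Return {url: [root, ..., url]} with a shortest hyperlink path per page."""
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in outlinks.get(page, []):
            candidate = paths[page] + [target]
            # Record the path if the page is new, or if it was re-encountered
            # via a shorter path than the one already stored.
            if target not in paths or len(candidate) < len(paths[target]):
                paths[target] = candidate
                queue.append(target)
    return paths


# Hypothetical link graph for illustration.
LINKS = {
    "www.berkeley.edu": ["search.berkeley.edu", "www.sims.berkeley.edu"],
    "search.berkeley.edu": ["cha-cha.berkeley.edu"],
    "cha-cha.berkeley.edu": ["www.sims.berkeley.edu/hearst"],
    "www.sims.berkeley.edu": ["www.sims.berkeley.edu/people"],
    "www.sims.berkeley.edu/people": ["www.sims.berkeley.edu/people/faculty"],
    "www.sims.berkeley.edu/people/faculty": ["www.sims.berkeley.edu/hearst"],
}

global_paths = shortest_paths("www.berkeley.edu", LINKS)
depth = len(global_paths["www.sims.berkeley.edu/hearst"]) - 1
```

The distance of a page from H is the number of links on its stored path.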

19
Metafile Generator (cont.)
  • Local pass
  • start with a list of all the servers found during
    the crawl
  • for each server S
  • find shortest path from S to every page in the
    system
  • do this the same way as in the global pass but
    store the results in a different database
  • when done, write out a metafile for each page, in
    a different directory than for the global pass

20
Metafile Generator (cont.)
  • Combine local and global path information
  • Purpose
  • locality should trump global paths, but not all
    local pages are reachable locally
  • example
  • the shortest path from www.berkeley.edu to
    www.sims.berkeley.edu/hearst is
  • www.berkeley.edu -> search.berkeley.edu ->
    cha-cha.berkeley.edu -> www.sims.berkeley.edu/hearst
  • but we want my home page to be under the SIMS
    faculty listing
  • solution: let local trump global
  • example

21
Metafile Generator (cont.)
  • Combine local and global path information
  • How to do it
  • go through the metafiles in the global directory
  • for each metafile
  • if there already is a metafile for that url in
    the local directory, skip this metafile
  • otherwise (there is no metafile for this url
    locally) copy the metafile into the local
    directory
  • Why not just use local metafiles?
  • some pages are not linked to within their own
    domain
  • e.g., a student association hosted within a
    particular student's domain
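
The combining step can be sketched as below. To keep the example self-contained, two dicts stand in for the local and global metafile directories, and the URLs and path strings are hypothetical.

```python
def combine_metafiles(local, global_):
    """Local metafiles win; global ones only fill in URLs with no local metafile."""
    combined = dict(local)
    for url, metafile in global_.items():
        if url not in combined:  # no local metafile for this URL
            combined[url] = metafile
    return combined


# Hypothetical metafile contents, reduced to the path for brevity.
local_metafiles = {
    "www.sims.berkeley.edu/hearst": "SIMS > People > Faculty > hearst",
}
global_metafiles = {
    "www.sims.berkeley.edu/hearst": "UC Berkeley > Search > Cha-Cha > hearst",
    "www.sims.berkeley.edu/student/club": "UC Berkeley > Student Groups > club",
}

combined = combine_metafiles(local_metafiles, global_metafiles)
```

The page that is unreachable from its own server's root still gets a path, taken from the global pass.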

22
Sample Cha-Cha Metadata file
<METAFILE>
<Url>http://www.sims.berkeley.edu/</Url>
<Title>Welcome to SIMS</Title>
<Date>null</Date>
<Size>4865</Size>
<!-- INLINKS -->
<InlinkCount>1</InlinkCount>
<Inlinks>http://www-resources.berkeley.edu/nhpteaching/</Inlinks>
<!-- OUTLINKS -->
<OutlinkCount>21</OutlinkCount>
<Outlinks>http://www.sims.berkeley.edu/about.html
http://www.sims.berkeley.edu/search.html
http://www.sims.berkeley.edu/events/conferences/
http://www.sims.berkeley.edu/resources/sites.html
http://www.sims.berkeley.edu/people/masters.html
23
Cha-Cha Metadata File, cont.
<!-- SHORTEST_PATHS -->
<Depth>2</Depth>
<ShortestPathsCount>1</ShortestPathsCount>
<ShortestPaths>
Welcome to UC Berkeley http://www.berkeley.edu/
UC Berkeley Teaching Units http://www-resources.berkeley.edu/nhpteaching/
</ShortestPaths>
<!-- MIRROR URLS -->
<MirrorCount>0</MirrorCount>
<!-- DATA_FILE -->
<File>/projects/cha-cha/development/data/done/text/www.sims.berkeley.edu/index.html</File>
</METAFILE>
24
CHESHIRE II
  • Search back-end for Cha-Cha
  • Ray Larson et al., ASIS '95, JASIS '96
  • CHESHIRE II system
  • Full Service Full Text Search
  • Client/Server architecture
  • Z39.50 IR protocol
  • Interprets documents written in SGML
  • Probabilistic Ranking
  • Flexible data representation

25
CHESHIRE II (cont.)
  • A big advantage of Cheshire
  • don't have to write a special parser for special
    document types
  • instead, simply create one DTD and the system
    takes care of parsing the metafiles for us
  • A related advantage
  • can create indexes on individual components of
    the document
  • allows efficient title search, home page search,
    domain-based search, without extra programming

26
Cha-Cha Document Type Definition
<!SGML "ISO 8879:1986"
  -- --
  CHARSET
  BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
  DESCSET   0   9  UNUSED
            9   2  9
           11   2  UNUSED
           13   1  13
           14  18  UNUSED
           32  95  32
          127   1  UNUSED
  BASESET "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
  DESCSET 128  32  UNUSED
          160  95  32
          255   1  UNUSED
27
Cha-Cha DTD, cont. (parts omitted)
<!doctype METADATA
<!-- This is a DTD for metadata records extracted from the HTML
     files in the cha-cha system. The tagging is simple with
     nothing particular about it. The structure has been kept
     flat within the individual records. The only somewhat
     interesting thing is the TEXT-REF tag which is used to
     contain a reference to the full text of entry stored in
     the raw HTML form. -->
<!ELEMENT METADATA o o (METAFILE)>
28
Cha-Cha DTD, cont. (parts omitted)
<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT METAFILE - - (URL, TITLE, DATE, SIZE, INLINKCOUNT, INLINKS,
          OUTLINKCOUNT, OUTLINKS, DEPTH?, SHORTESTPATHSCOUNT?,
          SHORTESTPATHS?, MIRRORCOUNT?, MIRRORURLS?, TYPE?, DOMAIN?,
          FILE?)>
<!-- We won't make any assumptions about content... all PCDATA -->
<!ELEMENT URL - o (#PCDATA)>
<!ELEMENT DATE - o (#PCDATA)>
<!ELEMENT TITLE - o (#PCDATA)>
<!ELEMENT SIZE - o (#PCDATA)>
<!ELEMENT INLINKCOUNT - o (#PCDATA)>
<!ELEMENT INLINKS - o (#PCDATA)>
<!ELEMENT OUTLINKCOUNT - o (#PCDATA)>
<!ELEMENT OUTLINKS - o (#PCDATA)>
<!ELEMENT DEPTH - o (#PCDATA)>
<!ELEMENT SHORTESTPATHSCOUNT - o (#PCDATA)>
<!ELEMENT SHORTESTPATHS - o (#PCDATA)>
29
Cha-Cha Online Processing
30
Responding to the User Query
  • User searches on "pam samuelson"
  • Search Engine looks up documents indexed with one
    or both terms in its inverted index
  • Search Engine looks up titles and shortest paths
    in the metadata index
  • User Interface combines the information and
    presents the results as HTML
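
The lookup steps above can be sketched with a toy inverted index. This is not how Cheshire II ranks (it uses probabilistic ranking); counting matching terms is a stand-in chosen for brevity, and the documents, URLs, titles, and paths are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {url: text}. Returns {term: set of urls containing the term}."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(query, index, metadata):
    """Return (url, title, path) for pages matching one or both terms,
    pages matching more terms first."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for url in index.get(term, ()):
            scores[url] += 1
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return [(u, metadata[u]["title"], metadata[u]["path"]) for u in ranked]


# Hypothetical pages and metadata.
DOCS = {
    "http://www.sims.berkeley.edu/~pam/": "pam samuelson home page",
    "http://www.law.berkeley.edu/faculty/": "law faculty samuelson",
    "http://www.berkeley.edu/": "welcome to uc berkeley",
}
METADATA = {
    "http://www.sims.berkeley.edu/~pam/": {"title": "Pam Samuelson", "path": "UC Berkeley > SIMS"},
    "http://www.law.berkeley.edu/faculty/": {"title": "Law Faculty", "path": "UC Berkeley > Law"},
    "http://www.berkeley.edu/": {"title": "Welcome to UC Berkeley", "path": ""},
}

hits = search("pam samuelson", build_inverted_index(DOCS), METADATA)
```

The user interface layer would then turn the titles and paths in hits into the HTML outline.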

31
Building the Outline View
  • Main issue: how to combine shortest paths
  • There are approximately three shortest paths per
    web page
  • We assume users do not want to see the page
    multiple times
  • Strategy
  • Group hits together within the hierarchy
  • Try to avoid showing subhierarchies with
    singleton hits
  • This assumption is based in part on evidence from
    our earlier clustering research that relevant
    documents tend to cluster near one another

32
Building the Outline View (cont.)
  • Goals of the algorithm
  • (I) Group (recursively) as many pages together
    within a subhierarchy as possible
  • Avoid (recursively) branches that terminate in
    only one hit (leaf)
  • (II) Remove as many internal nodes as possible
    while still retaining at least one valid
    path to every leaf
  • (III) Remove as many edges as possible while
    retaining at least one path to every leaf

33
Building the Outline View (cont.)
  • To achieve these goals we need a non-standard
    graph algorithm
  • To do it properly, every possible subset of nodes
    at depth D should be considered to determine the
    minimal subset which covers all nodes at depth
    D+1
  • This is inefficient -- would require 2^k checks
    for k nodes at depth D
  • Instead, we use a heuristic approach which
    approximates the optimal results
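
To make the cost of the exhaustive approach concrete, here is a sketch that tries subsets of the depth-D nodes from smallest to largest. It illustrates the exhaustive search, not the heuristic Cha-Cha uses, and the node names are hypothetical.

```python
from itertools import combinations


def minimal_cover(children_of):
    """children_of: {node at depth D: set of children at depth D+1}.
    Returns a smallest set of depth-D nodes whose children include every
    depth-D+1 node. May examine up to 2^k subsets for k nodes."""
    parents = sorted(children_of)
    all_children = set().union(*children_of.values())
    for size in range(len(parents) + 1):
        for subset in combinations(parents, size):
            covered = set().union(*(children_of[p] for p in subset))
            if covered == all_children:
                return set(subset)
    return set(parents)  # not reached: the full set always covers


# Hypothetical nodes A, B, C at depth D and hits x, y, z at depth D+1.
cover = minimal_cover({
    "A": {"x", "y"},
    "B": {"y", "z"},
    "C": {"x", "y", "z"},
})
```

With k = 3 this is trivial, but each added node at depth D doubles the number of subsets.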

34
Building the Outline View (cont.)
  • First, a top-down pass
  • record depth of each node and the number of
    children it links to directly
  • Second, a bottom-up pass
  • identify the deepest nodes (the leaves)
  • D <- the set of nodes that are parents of leaves
  • Sort D ascending according to how many active
    children they link to at depth D+1
  • A node is active if it has not been eliminated

35
Building the Outline View (cont.)
  • Bottom-up pass, continued
  • every node is a candidate to be eliminated
  • those nodes with the fewest children are
    eliminated first
  • because of goal (I)
  • for each candidate C, if C links to one or more
    active nodes at depth D+1 that are not covered by
    any other active node, then C cannot be eliminated.
    Otherwise, C is removed from the active list
  • After a level D is complete, there are no active
    nodes at depth D that cover exclusively nodes
    that are also covered by another node at depth D
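
One level of the elimination pass can be sketched as follows. This is a reading of the steps on the slides, not the actual Cha-Cha code: it handles a single depth D, takes the child sets as given, and uses hypothetical node names.

```python
def prune_level(children_of):
    """children_of: {node at depth D: set of its active children at depth D+1}.
    Returns the depth-D nodes that remain active after the pass."""
    active = set(children_of)
    # Candidates with the fewest children are considered for elimination first.
    for candidate in sorted(children_of, key=lambda n: (len(children_of[n]), n)):
        others = active - {candidate}
        covered_by_others = set().union(*(children_of[n] for n in others))
        # Eliminate only if every child is also covered by another active node.
        if children_of[candidate] <= covered_by_others:
            active.remove(candidate)
    return active


# Hypothetical nodes A, B, C at depth D and hits x, y, z at depth D+1.
kept = prune_level({
    "A": {"x", "y"},
    "B": {"y", "z"},
    "C": {"x", "y", "z"},
})
```

Each candidate needs one pass over the remaining active nodes, so a level with k nodes takes on the order of k squared set operations instead of 2^k subset checks.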

36
Building the Outline View (cont.)
  • Retaining rank ordering
  • Build up the tree by first placing in the tree
    the hit (leaf) that is highest ranked
  • As more leaves are added, more parts of the
    hierarchy are added, but the order in which the
    parts of the hierarchy are added is retained
  • When the hierarchy has been built, it is
    traversed to create the HTML listing
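
A minimal sketch of building the hierarchy in rank order and traversing it into HTML. It relies on Python dicts preserving insertion order; the page titles and the nested-list markup are assumptions for illustration, not the actual Cha-Cha output format.

```python
from html import escape


def build_outline(ranked_paths):
    """ranked_paths: root-to-hit title lists, highest-ranked hit first.
    Returns a nested dict; branches appear in the order they were first added."""
    tree = {}
    for path in ranked_paths:
        node = tree
        for title in path:
            node = node.setdefault(title, {})
    return tree


def to_html(tree):
    """Depth-first traversal that renders the hierarchy as nested lists."""
    if not tree:
        return ""
    items = "".join(
        f"<li>{escape(title)}{to_html(subtree)}</li>" for title, subtree in tree.items()
    )
    return f"<ul>{items}</ul>"


# Hypothetical hits, best first.
outline = build_outline([
    ["UC Berkeley", "SIMS", "Faculty", "Marti Hearst"],
    ["UC Berkeley", "EECS", "Research"],
    ["UC Berkeley", "SIMS", "Courses"],
])
outline_html = to_html(outline)
```

The third hit joins the existing SIMS branch, so SIMS stays ahead of EECS because it was introduced by the highest-ranked hit.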

37
Summary
  • Better user interfaces for search should
  • Help users understand starting points/sources
  • Place results of search into an organizing
    context
  • One (of many) approaches
  • Cha-Cha: simultaneously browse and search
    intranet site context
  • Future work
  • Special handling for short queries
  • Spelling correction suggestions
  • Smarter paths