How to Cha-Cha: Looking under the hood of the Cha-Cha Intranet Search Engine

Slides: 38

Provided by: mike90

Learn more at: https://people.ischool.berkeley.edu

Transcript and Presenter's Notes

1
How to Cha-Cha: Looking under the hood of the
Cha-Cha Intranet Search Engine

Marti Hearst
SIMS
SIMposium, April 21, 1999

2
This Talk

Overview of goals
System implementation details
Not covered:
UI evaluation
related work
etc.

3
People

Principals: Mike Chen and Marti Hearst
Early coding: Jason Hong
Early UI evaluation: Jimmy Lin, Mike Chen
Current UI evaluation: Shiang-Ling Chen

4
Cha-Cha Goals

Better Intranet search
integrate searching and browsing
provide context for search results
familiarize users with the site structure
UI
minimal browser requirement
widely usable HTML interface
build on user familiarity with existing systems

5
Intranet Search

Documents used in a large, diverse Intranet,
e.g.,
University.edu
Corporation.com
Government.gov
Hypothesis: It is meaningful to group search
results according to organizational structure

6
Searching Earthquakes at UCB: Standard Way
7
Searching Earthquakes at UCB with Cha-Cha
8
Cha-Cha and Source Selection

Shows available sources
Sources are major web sites
User may want to navigate the source rather than
go directly to the search hits
Gives hints about relative importance of various
sources
Reveals the structure of the site while tightly
integrating this structure with search
Users tell us anecdotally that the outline view
is useful for finding starting points

9
System Overview

Collect shortest paths for each page.
Global paths from root of the domain
Local paths from root of the server
Select the best path based on the query
User interaction with the system

4. Select paths, generate HTML
10
Current Status

Over 200,000 pages indexed
About 2500 queries/weekday
Less than 3 sec/query on average
Five subdomains using it as site search engine
eecs
millennium project
sims
law
career center

11
Cha-Cha Preprocessing
12
Overview of Cha-Cha Preprocessing

Crawl entire Intranet
Store copies of pages locally
200,000 pages on the UCB Intranet
Revisit all the pages again (on disk)
Create metadata for each page
Compute the shortest hyperlink path from a
certain root page to every web page
both global and local paths
Index all the pages
Using Cheshire II (Ray Larson, SIMS)
Index full text, titles, shortest paths separately

13
Web Crawling Algorithm

Start with a list of servers to crawl
for UCB, simply start with www.berkeley.edu
Restrict crawl to certain domain(s)
.berkeley.edu
Obey the "No Robots" standard (robots.txt exclusions)
Follow hyperlinks only
do not read local filesystems
links are placed on a queue
traversal is breadth-first

14
Web Crawling Algorithm (cont.)

Interpret the HTML on each web page
Record the text of the page in a file on disk.
Make a list of all the pages that this page links
to (outlinks)
Follow those links one at a time, repeating this
procedure for each page found, until no
unexplored pages are left.
links are placed on a queue
traversal is breadth-first
urls that have been crawled are stored in a hash
table in memory, to avoid repeats
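
The crawl loop described on these two slides can be sketched roughly as follows. This is a minimal illustration, not the Cha-Cha crawler itself: `fetch_outlinks` is a hypothetical callback standing in for "fetch the page, parse the HTML, return its outlinks", the toy link table replaces real network access, URLs are marked as seen when they are queued rather than after they are crawled, and robots exclusion handling is left out.

```python
from collections import deque
from urllib.parse import urlparse


def in_domain(url, suffix=".berkeley.edu"):
    """True if the URL's host is inside the allowed domain suffix."""
    host = urlparse(url).hostname or ""
    return host == suffix.lstrip(".") or host.endswith(suffix)


def crawl(start_urls, fetch_outlinks, suffix=".berkeley.edu"):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    queue = deque(start_urls)   # links waiting to be crawled (FIFO => breadth-first)
    seen = set(start_urls)      # in-memory hash table that prevents repeats
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_outlinks(url):
            if link not in seen and in_domain(link, suffix):
                seen.add(link)
                queue.append(link)
    return order


# Toy link table used instead of the network (illustrative data only).
TOY_WEB = {
    "http://www.berkeley.edu/": [
        "http://www.sims.berkeley.edu/",
        "http://www.eecs.berkeley.edu/",
        "http://example.com/",
    ],
    "http://www.sims.berkeley.edu/": [
        "http://www.berkeley.edu/",
        "http://www.sims.berkeley.edu/about.html",
    ],
}

visited = crawl(["http://www.berkeley.edu/"], lambda u: TOY_WEB.get(u, []))
```

Using a FIFO queue is what makes the traversal breadth-first; swapping it for a stack would give a depth-first crawl.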

15
Custom Web Crawler

Special considerations
full coverage
web search engines don't go very deep
web search engines skip problematic sites
search on Berdahl at Snap: 430 hits
search on Berdahl on Cha-Cha: XXX hits
solution
tag each URL with a retry counter
if server is down, put URL at the end of the
queue and decrement the retry counter
if the counter is 0, give up on the URL
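
A rough sketch of the retry-counter idea follows. The names `ServerDown`, `fetch`, and `max_retries` are assumptions made for illustration; the slides do not say what the initial counter value was or how a down server was detected.

```python
from collections import deque


class ServerDown(Exception):
    """Hypothetical error raised by fetch() when a server does not respond."""


def crawl_with_retries(urls, fetch, max_retries=3):
    """Try each URL; requeue failures at the end until their counter hits 0."""
    queue = deque((url, max_retries) for url in urls)  # each URL tagged with a counter
    fetched, given_up = [], []
    while queue:
        url, retries = queue.popleft()
        try:
            fetch(url)
            fetched.append(url)
        except ServerDown:
            retries -= 1                      # decrement the retry counter
            if retries == 0:
                given_up.append(url)          # counter exhausted: give up on the URL
            else:
                queue.append((url, retries))  # put the URL at the end of the queue
    return fetched, given_up
```

Requeuing at the end rather than retrying immediately gives a temporarily unavailable server time to come back while the rest of the crawl proceeds.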

16
Custom Web Crawler

Special considerations
servers with multiple names
e.g., info.berkeley.edu and www.sims.berkeley.edu
solution
hash the home page of the server into a table
whenever a new server is found, compare its
homepage to those in the table
if a duplicate, record the new server's name as
being the same as the original server's
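
One way to picture the alias check is the sketch below. It assumes the home pages have already been fetched as bytes and uses SHA-1 as the hash; the slides do not say which hash function Cha-Cha used, and the function name is made up for this example.

```python
import hashlib


def canonical_servers(homepages):
    """Map every server name to the first-seen server with an identical home page.

    homepages: dict of server name -> home page content (bytes),
    in the order the servers were discovered.
    """
    first_with_digest = {}  # digest of home page -> original server name
    alias_of = {}           # server name -> canonical (original) server name
    for server, content in homepages.items():
        digest = hashlib.sha1(content).hexdigest()
        original = first_with_digest.setdefault(digest, server)
        alias_of[server] = original
    return alias_of


aliases = canonical_servers({
    "www.sims.berkeley.edu": b"<html>Welcome to SIMS</html>",
    "info.berkeley.edu": b"<html>Welcome to SIMS</html>",
    "www.eecs.berkeley.edu": b"<html>EECS</html>",
})
```

Comparing exact hashes only catches byte-identical home pages; a page that embeds a timestamp or the server name would need a looser comparison.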

17
Cha-Cha Metadata

Information about web pages
Title
Length
Inlinks
Outlinks
Shortest paths from a root home page

18
Metafile Generator

Main task: find shortest path information
Two passes: global and local
Global pass
start with main home page H (www.berkeley.edu)
find shortest path from H to every page in the
system
for each page, keep track of how far it is from H
also keep track of the path that got you there
store this information in a disk-based storage
manager (we use Sleepycat, based on Berkeley DB)
if a page is re-encountered using a path with a
shorter distance, record that distance and the
new path
when this is done, write out a metafile for each
page
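
The global pass can be approximated with a breadth-first search over the recorded outlinks, as in the sketch below. It keeps everything in an in-memory dict rather than a disk-based store, and the function name and link table are illustrative. With a breadth-first traversal the first path found to a page is already a shortest one, so the "re-encountered with a shorter distance" update is only needed when pages are visited in some other order.

```python
from collections import deque


def shortest_paths(root, outlinks):
    """Return {page: [root, ..., page]} with a shortest hyperlink path to each page."""
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for link in outlinks.get(page, []):
            if link not in paths:               # first visit is a shortest path in BFS
                paths[link] = paths[page] + [link]
                queue.append(link)
    return paths


# Toy outlink table (illustrative only).
links = {
    "www.berkeley.edu": ["search.berkeley.edu"],
    "search.berkeley.edu": ["cha-cha.berkeley.edu"],
    "cha-cha.berkeley.edu": ["www.sims.berkeley.edu/hearst"],
}
paths = shortest_paths("www.berkeley.edu", links)
depth = len(paths["www.sims.berkeley.edu/hearst"]) - 1  # distance from the root
```

The local pass would call the same function once per server, using that server's home page as the root.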

19
Metafile Generator (cont.)

Local pass
start with a list of all the servers found during
the crawl
for each server S
find shortest path from S to every page in the
system
do this the same way as in the global pass but
store the results in a different database
when done, write out a metafile for each page, in
a different directory than for the global pass

20
Metafile Generator (cont.)

Combine local and global path information
Purpose
locality should trump global paths, but not all
local pages are reachable locally
example
the shortest path from www.berkeley.edu to
www.sims.berkeley.edu/hearst is
www.berkeley.edu -> search.berkeley.edu ->
cha-cha.berkeley.edu -> www.sims.berkeley.edu/hearst
but we want my home page to be under the SIMS
faculty listing
solution: let local trump global
example

21
Metafile Generator (cont.)

Combine local and global path information
How to do it
go through the metafiles in the global directory
for each metafile
if there already is a metafile for that URL in
the local directory, skip this metafile
otherwise (there is no metafile for this URL
locally), copy the metafile into the local
directory
Why not just use local metafiles?
some pages are not linked to within their own
domain
e.g., a student association hosted within a
particular student's domain
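
The "local trumps global" merge amounts to a dictionary union in which local entries win. The sketch below uses in-memory dicts keyed by URL in place of the two metafile directories; the function name, URLs, and path strings are invented for illustration.

```python
def combine_metafiles(local_meta, global_meta):
    """Keep every local metafile; add a global one only if no local one exists."""
    combined = dict(local_meta)
    for url, meta in global_meta.items():
        combined.setdefault(url, meta)  # skipped when the URL already has a local metafile
    return combined


local_meta = {"faculty-page": "SIMS > Faculty > faculty-page"}
global_meta = {
    "faculty-page": "UC Berkeley > Search > Cha-Cha > faculty-page",
    "student-assoc": "UC Berkeley > Student Groups > student-assoc",
}
combined = combine_metafiles(local_meta, global_meta)
```

The second entry shows why the global pass is still needed: a page with no local path would otherwise have no metafile at all.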

22
Sample Cha-Cha Metadata file
<METAFILE>
<Url>http://www.sims.berkeley.edu/</Url>
<Title>Welcome to SIMS</Title>
<Date>null</Date>
<Size>4865</Size>
<!-- INLINKS -->
<InlinkCount>1</InlinkCount>
<Inlinks>http://www-resources.berkeley.edu/nhpteaching/</Inlinks>
<!-- OUTLINKS -->
<OutlinkCount>21</OutlinkCount>
<Outlinks>http://www.sims.berkeley.edu/about.html
http://www.sims.berkeley.edu/search.html
http://www.sims.berkeley.edu/events/conferences/
http://www.sims.berkeley.edu/resources/sites.html
http://www.sims.berkeley.edu/people/masters.html
23
Cha-Cha Metadata File, cont.
<!-- SHORTEST_PATHS -->
<Depth>2</Depth>
<ShortestPathsCount>1</ShortestPathsCount>
<ShortestPaths>Welcome to UC Berkeley http://www.berkeley.edu/
UC Berkeley Teaching Units http://www-resources.berkeley.edu/nhpteaching/
</ShortestPaths>
<!-- MIRROR URLS -->
<MirrorCount>0</MirrorCount>
<!-- DATA_FILE -->
<File>/projects/cha-cha/development/data/done/text/www.sims.berkeley.edu/index.html</File>
</METAFILE>
24
CHESHIRE II

Search back-end for Cha-Cha
Ray Larson et al., ASIS '95, JASIS '96
CHESHIRE II system
Full Service Full Text Search
Client/Server architecture
Z39.50 IR protocol
Interprets documents written in SGML
Probabilistic Ranking
Flexible data representation

25
CHESHIRE II (cont.)

A big advantage of Cheshire
don't have to write a special parser for special
document types
instead, simply create one DTD and the system
takes care of parsing the metafiles for us
A related advantage
can create indexes on individual components of
the document
allows efficient title search, home page search,
domain-based search, without extra programming

26
Cha-Cha Document Type Definition
<!SGML "ISO 8879:1986" -- --
CHARSET
BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET   0  9 UNUSED
          9  2 9
         11  2 UNUSED
         13  1 13
         14 18 UNUSED
         32 95 32
        127  1 UNUSED
BASESET "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
DESCSET 128 32 UNUSED
        160 95 32
        255  1 UNUSED
27
Cha-Cha DTD, cont. (parts omitted)
<!doctype METADATA
<!-- This is a DTD for metadata records extracted from the HTML
     files in the cha-cha system. The tagging is simple with
     nothing particular about it. The structure has been kept
     flat within the individual records. The only somewhat
     interesting thing is the TEXT-REF tag which is used to
     contain a reference to the full text of entry stored in
     the raw HTML form. -->
<!ELEMENT METADATA o o (METAFILE)>
28
Cha-Cha DTD, cont. (parts omitted)
<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT METAFILE - - (URL, TITLE, DATE, SIZE, INLINKCOUNT, INLINKS,
    OUTLINKCOUNT, OUTLINKS, DEPTH?, SHORTESTPATHSCOUNT?, SHORTESTPATHS?,
    MIRRORCOUNT?, MIRRORURLS?, TYPE?, DOMAIN?, FILE?)>
<!-- We won't make any assumptions about content... all PCDATA -->
<!ELEMENT URL - o (#PCDATA)>
<!ELEMENT DATE - o (#PCDATA)>
<!ELEMENT TITLE - o (#PCDATA)>
<!ELEMENT SIZE - o (#PCDATA)>
<!ELEMENT INLINKCOUNT - o (#PCDATA)>
<!ELEMENT INLINKS - o (#PCDATA)>
<!ELEMENT OUTLINKCOUNT - o (#PCDATA)>
<!ELEMENT OUTLINKS - o (#PCDATA)>
<!ELEMENT DEPTH - o (#PCDATA)>
<!ELEMENT SHORTESTPATHSCOUNT - o (#PCDATA)>
<!ELEMENT SHORTESTPATHS - o (#PCDATA)>
29
Cha-Cha Online Processing
30
Responding to the User Query

User searches on pam samuelson
Search Engine looks up documents indexed with one
or both terms in its inverted index
Search Engine looks up titles and shortest paths
in the metadata index
User Interface combines the information and
presents the results as HTML
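
The query flow above might look roughly like this in miniature. This is only a toy: Cha-Cha delegates retrieval to Cheshire II, which uses probabilistic ranking, whereas this sketch ranks by the number of matching query terms. The index contents, URLs, and field names are all invented for the example.

```python
def search(query, inverted_index, metadata):
    """Return (url, title, path) for pages containing one or more query terms."""
    scores = {}
    for term in query.lower().split():
        for url in inverted_index.get(term, ()):
            scores[url] = scores.get(url, 0) + 1   # count matching terms per page
    # Assumed ranking: more matching terms first, ties broken by URL.
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return [(url, metadata[url]["title"], metadata[url]["path"]) for url in ranked]


# Hypothetical full-text index and metadata index.
inverted_index = {
    "pam": {"http://example.edu/a", "http://example.edu/b"},
    "samuelson": {"http://example.edu/a", "http://example.edu/c"},
}
metadata = {
    "http://example.edu/a": {"title": "Page A", "path": ["Home", "Faculty"]},
    "http://example.edu/b": {"title": "Page B", "path": ["Home", "News"]},
    "http://example.edu/c": {"title": "Page C", "path": ["Home", "Courses"]},
}
hits = search("pam samuelson", inverted_index, metadata)
```

The returned titles and shortest paths are what the user interface layer would then arrange into the outline view.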

31
Building the Outline View

Main issue: how to combine shortest paths
There are approximately three shortest paths per
web page
We assume users do not want to see the page
multiple times
Strategy
Group hits together within the hierarchy
Try to avoid showing subhierarchies with
singleton hits
This assumption is based in part on evidence from
our earlier clustering research that relevant
documents tend to cluster near one another

32
Building the Outline View (cont.)

Goals of the algorithm
(I) Group (recursively) as many pages together
within a subhierarchy as possible
Avoid (recursively) branches that terminate in
only one hit (leaf)
(II) Remove as many internal nodes as possible
while still retaining at least one valid
path to every leaf
(III) Remove as many edges as possible while
retaining at least one path to every leaf

33
Building the Outline View (cont.)

To achieve these goals we need a non-standard
graph algorithm
To do it properly, every possible subset of nodes
at depth D should be considered to determine the
minimal subset which covers all nodes at depth
D+1
This is inefficient -- would require 2^k checks
for k nodes at depth D
Instead, we use a heuristic approach which
approximates the optimal results
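
To make the cost of the exact approach concrete, here is a brute-force sketch that tries subsets of parent nodes in order of increasing size and returns the first one that covers every child. In the worst case it examines all 2^k - 1 non-empty subsets of k parents. The function and variable names are illustrative and not taken from Cha-Cha.

```python
from itertools import combinations


def minimal_cover(parents, children_of, targets):
    """Smallest set of parents whose children include every target, or None."""
    for size in range(1, len(parents) + 1):
        for subset in combinations(parents, size):
            covered = set()
            for parent in subset:
                covered |= children_of[parent]
            if targets <= covered:       # every target node is reachable
                return set(subset)
    return None                          # no subset covers all targets


children_of = {"A": {1, 2}, "B": {2, 3}, "C": {1, 2, 3}}
best = minimal_cover(["A", "B", "C"], children_of, {1, 2, 3})
```

This is an instance of the set cover problem, which is why a heuristic is the practical choice when a level has many nodes.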

34
Building the Outline View (cont.)

First, a top-down pass
record depth of each node and the number of
children it links to directly
Second, a bottom-up pass
identify the deepest nodes (the leaves)
D <- the set of nodes that are parents of leaves
Sort D ascending according to how many active
children they link to at depth D+1
A node is active if it has not been eliminated

35
Building the Outline View (cont.)

Bottom-up pass, continued
every node is a candidate to be eliminated
those nodes with the fewest children are
eliminated first
because of goal (I)
for each candidate C, if C links to one or more
active nodes at depth D+1 that are not covered by
any other active node, then C cannot be eliminated.
Otherwise, C is removed from the active list
After a level D is complete, there are no active
nodes at depth D that cover exclusively nodes
that are also covered by another node at depth D
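
One level of the bottom-up pass could be sketched as below, under the reading that a candidate is kept only if it is the sole active parent of some active child at depth D+1. Names are hypothetical, and ties between candidates with equally many children are broken alphabetically here, which the slides do not specify.

```python
def prune_level(parents, children_of, active_children):
    """Return the parents at depth D that remain active after elimination.

    parents: node names at depth D
    children_of: dict of parent -> set of children at depth D+1
    active_children: set of children at depth D+1 that are still active
    """
    active = set(parents)

    def active_child_count(parent):
        return len(children_of[parent] & active_children)

    # Candidates with the fewest active children are considered first (goal I).
    for cand in sorted(parents, key=lambda p: (active_child_count(p), p)):
        covered_by_others = set()
        for other in active - {cand}:
            covered_by_others |= children_of[other]
        uncovered = (children_of[cand] & active_children) - covered_by_others
        if not uncovered:
            active.discard(cand)   # all of its children are reachable elsewhere
    return active


children_of = {"A": {1, 2}, "B": {2, 3}, "C": {1, 2, 3}}
kept = prune_level(["A", "B", "C"], children_of, {1, 2, 3})
```

Because the result depends on the order in which candidates are visited, this greedy pass approximates the minimal cover rather than guaranteeing it.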

36
Building the Outline View (cont.)

Retaining rank ordering
Build up the tree by first placing in the tree
the hit (leaf) that is highest ranked
As more leaves are added, more parts of the
hierarchy are added, but the order in which the
parts of the hierarchy are added is retained
When the hierarchy has been built, it is
traversed to create the HTML listing
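
The rank-preserving build can be illustrated with nested dicts, which keep insertion order in Python 3.7+. Each hit arrives as its chosen path of page titles, best hit first. The rendering below emits indented text rather than HTML, and the titles are made up for the example.

```python
def build_outline(ranked_paths):
    """Insert each hit's path into a nested dict, highest-ranked hit first."""
    tree = {}
    for path in ranked_paths:
        node = tree
        for title in path:
            node = node.setdefault(title, {})  # reuse a branch if it already exists
    return tree


def render(tree, depth=0):
    """Depth-first traversal producing one indented line per node."""
    lines = []
    for title, subtree in tree.items():
        lines.append("  " * depth + title)
        lines.extend(render(subtree, depth + 1))
    return lines


ranked_paths = [
    ["UC Berkeley", "SIMS", "Faculty", "Hit 1"],
    ["UC Berkeley", "EECS", "Hit 2"],
    ["UC Berkeley", "SIMS", "Hit 3"],
]
outline = render(build_outline(ranked_paths))
```

Here the SIMS branch is listed before EECS because the top-ranked hit lives under it, and Hit 3 joins the existing SIMS branch instead of starting a new one.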

37
Summary