Title: How to Cha-Cha: Looking under the hood of the Cha-Cha Intranet Search Engine
1How to Cha-Cha: Looking under the hood of the Cha-Cha Intranet Search Engine
- Marti Hearst
- SIMS
- SIMposium, April 21, 1999
2This Talk
- Overview of goals
- System implementation details
- Not covered:
- UI evaluation
- related work
- etc
3People
- Principals: Mike Chen and Marti Hearst
- Early coding: Jason Hong
- Early UI evaluation: Jimmy Lin, Mike Chen
- Current UI evaluation: Shiang-Ling Chen
4Cha-Cha Goals
- Better Intranet search
- integrate searching and browsing
- provide context for search results
- familiarize users with the site structure
- UI
- minimal browser requirement
- widely usable HTML interface
- build on user familiarity with existing systems
5Intranet Search
- Documents used in a large, diverse Intranet, e.g.,
- University.edu
- Corporation.com
- Government.gov
- Hypothesis: It is meaningful to group search results according to organizational structure
6Searching Earthquakes at UCB: Standard Way
7Searching Earthquakes at UCB with Cha-Cha
8Cha-Cha and Source Selection
- Shows available sources
- Sources are major web sites
- User may want to navigate the source rather than go directly to the search hits
- Gives hints about relative importance of various sources
- Reveals the structure of the site while tightly integrating this structure with search
- Users tell us anecdotally that the outline view is useful for finding starting points
9System Overview
- Collect shortest paths for each page.
- Global paths from root of the domain
- Local paths from root of the server
- Select the best path based on the query
- User interaction with the system
- Step 4: select paths, generate HTML
10Current Status
- Over 200,000 pages indexed
- About 2500 queries/weekday
- Less than 3 sec/query on average
- Five subdomains using it as site search engine
- eecs
- millennium project
- sims
- law
- career center
11Cha-Cha Preprocessing
12Overview of Cha-Cha Preprocessing
- Crawl entire Intranet
- Store copies of pages locally
- 200,000 pages on the UCB Intranet
- Revisit all the pages again (on disk)
- Create metadata for each page
- Compute the shortest hyperlink path from a certain root page to every web page
- both global and local paths
- Index all the pages
- Using Cheshire II (Ray Larson, SIMS)
- Index full text, titles, shortest paths separately
13Web Crawling Algorithm
- Start with a list of servers to crawl
- for UCB, simply start with www.berkeley.edu
- Restrict crawl to certain domain(s)
- .berkeley.edu
- Obey the "No Robots" (robots exclusion) standard
- Follow hyperlinks only
- do not read local filesystems
- links are placed on a queue
- traversal is breadth-first
14Web Crawling Algorithm (cont.)
- Interpret the HTML on each web page
- Record the text of the page in a file on disk.
- Make a list of all the pages that this page links to (outlinks)
- Follow those links one at a time, repeating this procedure for each page found, until no unexplored pages are left.
- links are placed on a queue
- traversal is breadth-first
- URLs that have been crawled are stored in a hash table in memory, to avoid repeats
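The crawl loop on the two slides above can be sketched as follows. This is a minimal illustration, not the Cha-Cha crawler itself: the `crawl` and `in_domain` functions, the `fetch_links` callback, and the toy in-memory link graph are all assumptions made for the example, and robots handling and HTML parsing are left out.

```python
from collections import deque
from urllib.parse import urlparse

def in_domain(url, domain_suffix):
    """Return True if the URL's host is the domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain_suffix.lstrip(".") or host.endswith(domain_suffix)

def crawl(start_urls, fetch_links, domain_suffix=".berkeley.edu"):
    """Breadth-first crawl. fetch_links(url) is a hypothetical callback
    that returns the outlinks of a page."""
    queue = deque(start_urls)
    seen = set(start_urls)       # in-memory table of URLs already queued
    order = []
    while queue:
        url = queue.popleft()    # FIFO queue gives breadth-first traversal
        order.append(url)
        for link in fetch_links(url):
            if link not in seen and in_domain(link, domain_suffix):
                seen.add(link)
                queue.append(link)
    return order

# Toy link graph standing in for real HTTP fetches (illustrative only).
toy_web = {
    "http://www.berkeley.edu/": [
        "http://www.sims.berkeley.edu/",
        "http://www.example.com/",
    ],
    "http://www.sims.berkeley.edu/": [
        "http://www.berkeley.edu/",
        "http://www.sims.berkeley.edu/about.html",
    ],
    "http://www.sims.berkeley.edu/about.html": [],
}
visited = crawl(["http://www.berkeley.edu/"], lambda u: toy_web.get(u, []))
```

Here the link to `www.example.com` is dropped by the domain restriction, and the back-link to the home page is skipped because it is already in the `seen` set.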
15Custom Web Crawler
- Special considerations
- full coverage
- web search engines don't go very deep
- web search engines skip problematic sites
- search on Berdahl at Snap: 430 hits
- search on Berdahl on Cha-Cha: XXX hits
- solution
- tag each URL with a retry counter
- if server is down, put URL at the end of the queue and decrement the retry counter
- if the counter is 0, give up on the URL
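A rough sketch of the retry-counter idea, under stated assumptions: the `crawl_with_retries` name, the `try_fetch` callback, the default of three attempts, and the example hostnames are hypothetical, and the slide does not say what initial counter value Cha-Cha uses.

```python
from collections import deque

def crawl_with_retries(urls, try_fetch, max_attempts=3):
    """try_fetch(url) is a hypothetical callback that returns True on
    success and False when the server is down."""
    queue = deque((url, max_attempts) for url in urls)
    fetched, abandoned = [], []
    while queue:
        url, counter = queue.popleft()
        if try_fetch(url):
            fetched.append(url)
            continue
        counter -= 1                      # server down: decrement the counter
        if counter == 0:
            abandoned.append(url)         # counter is 0: give up on the URL
        else:
            queue.append((url, counter))  # otherwise retry later, at the end
    return fetched, abandoned

# Simulated servers: one healthy, one that recovers on the third attempt,
# one that never answers (all names are made up).
attempts = {}

def simulated_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if "dead" in url:
        return False
    if "flaky" in url:
        return attempts[url] >= 3
    return True

fetched, abandoned = crawl_with_retries(
    ["http://ok.example/", "http://flaky.example/", "http://dead.example/"],
    simulated_fetch,
)
```

Putting a failed URL at the end of the queue gives a temporarily unavailable server time to come back before it is tried again.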
16Custom Web Crawler
- Special considerations
- servers with multiple names
- info.berkeley.edu www.sims.berkeley.edu
- solution
- hash the home page of the server into a table
- whenever a new server is found, compare its homepage to those in the table
- if a duplicate, record the new server's name as being the same as the original server's
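The home-page hashing trick might look like the sketch below. The function name, the `fetch_homepage` callback, the choice of SHA-1 as the hash, and the example hostnames are assumptions for illustration; the slides do not say which hash function Cha-Cha uses.

```python
import hashlib

def find_server_aliases(servers, fetch_homepage):
    """fetch_homepage(server) is a hypothetical callback returning the
    server's home page content as bytes."""
    canonical_by_hash = {}   # home page digest -> first server seen with it
    alias_of = {}            # duplicate server name -> original server name
    for server in servers:
        digest = hashlib.sha1(fetch_homepage(server)).hexdigest()
        if digest in canonical_by_hash:
            alias_of[server] = canonical_by_hash[digest]
        else:
            canonical_by_hash[digest] = server
    return alias_of

# Made-up home pages: two names serve identical content.
pages = {
    "www.example.edu": b"<html>Welcome</html>",
    "info.example.edu": b"<html>Welcome</html>",
    "library.example.edu": b"<html>Library</html>",
}
aliases = find_server_aliases(list(pages), pages.__getitem__)
```

Comparing digests rather than full pages keeps the table small, at the cost of treating any byte-level difference between two home pages as a different server.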
17Cha-Cha Metadata
- Information about web pages
- Title
- Length
- Inlinks
- Outlinks
- Shortest paths from a root home page
18Metafile Generator
- Main task: find shortest path information
- Two passes: global and local
- Global pass
- start with main home page H (www.berkeley.edu)
- find shortest path from H to every page in the system
- for each page, keep track of how far it is from H
- also keep track of the path that got you there
- store this information in a disk-based storage manager (we use Sleepycat, based on Berkeley DB)
- if a page is re-encountered using a path with a shorter distance, record that distance and the new path
- when this is done, write out a metafile for each page
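The global pass can be approximated with a breadth-first search that records a depth and a path per page. This is a sketch only: an in-memory dict stands in for the disk-based store, the function name and toy graph are made up, and only one shortest path per page is kept.

```python
from collections import deque

def shortest_paths(root, outlinks):
    """Return {url: (depth, path)} for every page reachable from root.
    outlinks maps each URL to the list of URLs it links to."""
    best = {root: (0, [root])}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        depth, path = best[page]
        for target in outlinks.get(page, []):
            # Record the page the first time it is seen, or again if it is
            # re-encountered via a shorter path. With strict breadth-first
            # order the first visit is already shortest; the second test
            # mirrors the rule on the slide.
            if target not in best or depth + 1 < best[target][0]:
                best[target] = (depth + 1, path + [target])
                queue.append(target)
    return best

# Toy graph: H is the main home page (node names are illustrative).
links = {
    "H": ["A", "B"],
    "A": ["C"],
    "B": ["C", "D"],
    "C": ["D"],
}
paths = shortest_paths("H", links)
```

The local pass would run the same function once per server, using that server's home page as `root` and writing to a separate store.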
19Metafile Generator (cont.)
- Local pass
- start with a list of all the servers found during the crawl
- for each server S
- find shortest path from S to every page in the system
- do this the same way as in the global pass but store the results in a different database
- when done, write out a metafile for each page, in a different directory than for the global pass
20Metafile Generator (cont.)
- Combine local and global path information
- Purpose
- locality should trump global paths, but not all local pages are reachable locally
- example
- the shortest path from www.berkeley.edu to www.sims.berkeley.edu/hearst is
- www.berkeley.edu -> search.berkeley.edu -> cha-cha.berkeley.edu -> www.sims.berkeley.edu/hearst
- but we want my home page to be under the SIMS faculty listing
- solution: let local trump global
21Metafile Generator (cont.)
- Combine local and global path information
- How to do it
- go through the metafiles in the global directory
- for each metafile
- if there already is a metafile for that url in the local directory, skip this metafile
- otherwise (there is no metafile for this url locally) copy the metafile into the local directory
- Why not just use local metafiles?
- some pages are not linked to within their own domain
- e.g., student association hosted within a particular student's domain
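The merge step described above amounts to "copy unless a local file already exists". A small sketch, assuming one metafile per page with the same file name in both directories; the `merge_metafiles` name, the directory layout, and the `.meta` file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def merge_metafiles(global_dir, local_dir):
    """Copy each global metafile into local_dir unless a local metafile
    with the same name already exists (local trumps global)."""
    copied = []
    for metafile in sorted(Path(global_dir).iterdir()):
        target = Path(local_dir) / metafile.name
        if target.exists():
            continue                     # local path information wins
        shutil.copy(metafile, target)    # page has no local path: use global
        copied.append(metafile.name)
    return copied

# Demo with throwaway directories and made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    global_dir, local_dir = Path(tmp, "global"), Path(tmp, "local")
    global_dir.mkdir()
    local_dir.mkdir()
    (global_dir / "page_a.meta").write_text("global path to A")
    (global_dir / "page_b.meta").write_text("global path to B")
    (local_dir / "page_a.meta").write_text("local path to A")
    copied = merge_metafiles(global_dir, local_dir)
    merged = {p.name: p.read_text() for p in local_dir.iterdir()}
```

After the merge, the local directory holds exactly one metafile per page: the local one where it exists, the global one otherwise.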
22Sample Cha-Cha Metadata file
<METAFILE>
<Url>http://www.sims.berkeley.edu/</Url>
<Title>Welcome to SIMS</Title>
<Date>null</Date>
<Size>4865</Size>
<!-- INLINKS -->
<InlinkCount>1</InlinkCount>
<Inlinks>http://www-resources.berkeley.edu/nhpteaching/</Inlinks>
<!-- OUTLINKS -->
<OutlinkCount>21</OutlinkCount>
<Outlinks>http://www.sims.berkeley.edu/about.html
http://www.sims.berkeley.edu/search.html
http://www.sims.berkeley.edu/events/conferences/
http://www.sims.berkeley.edu/resources/sites.html
http://www.sims.berkeley.edu/people/masters.html
23Cha-Cha Metadata File, cont.
<!-- SHORTEST_PATHS -->
<Depth>2</Depth>
<ShortestPathsCount>1</ShortestPathsCount>
<ShortestPaths>
Welcome to UC Berkeley http://www.berkeley.edu/
UC Berkeley Teaching Units http://www-resources.berkeley.edu/nhpteaching/
</ShortestPaths>
<!-- MIRROR URLS -->
<MirrorCount>0</MirrorCount>
<!-- DATA_FILE -->
<File>/projects/cha-cha/development/data/done/text/www.sims.berkeley.edu/index.html</File>
</METAFILE>
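To make the flat record structure concrete, here is a small sketch that reads such a record into a dictionary. In Cha-Cha the parsing is done by Cheshire II from the DTD, so this regular-expression approach is only an illustrative stand-in; the `parse_metafile` name is hypothetical and the sample is a trimmed subset of the fields shown on the slides.

```python
import re

SAMPLE = """<METAFILE>
<Url>http://www.sims.berkeley.edu/</Url>
<Title>Welcome to SIMS</Title>
<Size>4865</Size>
<!-- INLINKS -->
<InlinkCount>1</InlinkCount>
<Depth>2</Depth>
</METAFILE>"""

def parse_metafile(text):
    """Return {tag: content} for each flat <Tag>content</Tag> element.
    Content may not contain '<', so the enclosing METAFILE element and
    the comment lines are skipped."""
    fields = {}
    for tag, content in re.findall(r"<(\w+)>([^<]*)</\1>", text):
        fields[tag] = content.strip()
    return fields

record = parse_metafile(SAMPLE)
```

Because the records are flat, a lookup such as `record["Title"]` or `int(record["Depth"])` is all that is needed to use a field.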
24CHESHIRE II
- Search back-end for Cha-Cha
- Ray Larson et al., ASIS '95, JASIS '96
- CHESHIRE II system
- Full Service Full Text Search
- Client/Server architecture
- Z39.50 IR protocol
- Interprets documents written in SGML
- Probabilistic Ranking
- Flexible data representation
25CHESHIRE II (cont.)
- A big advantage of Cheshire
- don't have to write a special parser for special document types
- instead, simply create one DTD and the system takes care of parsing the metafiles for us
- A related advantage
- can create indexes on individual components of the document
- allows efficient title search, home page search, domain-based search, without extra programming
26Cha-Cha Document Type Definition
<!SGML "ISO 8879:1986"
  -- --
  CHARSET
    BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
    DESCSET   0   9 UNUSED
              9   2   9
             11   2 UNUSED
             13   1  13
             14  18 UNUSED
             32  95  32
            127   1 UNUSED
    BASESET "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
    DESCSET 128  32 UNUSED
            160  95  32
            255   1 UNUSED
27Cha-Cha DTD, cont. (parts omitted)
<!doctype METADATA
<!-- This is a DTD for metadata records extracted from the HTML
     files in the cha-cha system. The tagging is simple with
     nothing particular about it. The structure has been kept
     flat within the individual records. The only somewhat
     interesting thing is the TEXT-REF tag which is used to
     contain a reference to the full text of entry stored in
     the raw HTML form. -->
<!ELEMENT METADATA o o (METAFILE)>
28Cha-Cha DTD, cont. (parts omitted)
<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT METAFILE - - (URL, TITLE, DATE, SIZE, INLINKCOUNT, INLINKS,
    OUTLINKCOUNT, OUTLINKS, DEPTH?, SHORTESTPATHSCOUNT?, SHORTESTPATHS?,
    MIRRORCOUNT?, MIRRORURLS?, TYPE?, DOMAIN?, FILE?)>
<!-- We won't make any assumptions about content... all PCDATA -->
<!ELEMENT URL - o (#PCDATA)>
<!ELEMENT DATE - o (#PCDATA)>
<!ELEMENT TITLE - o (#PCDATA)>
<!ELEMENT SIZE - o (#PCDATA)>
<!ELEMENT INLINKCOUNT - o (#PCDATA)>
<!ELEMENT INLINKS - o (#PCDATA)>
<!ELEMENT OUTLINKCOUNT - o (#PCDATA)>
<!ELEMENT OUTLINKS - o (#PCDATA)>
<!ELEMENT DEPTH - o (#PCDATA)>
<!ELEMENT SHORTESTPATHSCOUNT - o (#PCDATA)>
<!ELEMENT SHORTESTPATHS - o (#PCDATA)>
29Cha-Cha Online Processing
30Responding to the User Query
- User searches on "pam samuelson"
- Search Engine looks up documents indexed with one or both terms in its inverted index
- Search Engine looks up titles and shortest paths in the metadata index
- User Interface combines the information and presents the results as HTML
31Building the Outline View
- Main issue: how to combine shortest paths
- There are approximately three shortest paths per web page
- We assume users do not want to see the page multiple times
- Strategy
- Group hits together within the hierarchy
- Try to avoid showing subhierarchies with singleton hits
- This assumption is based in part on evidence from our earlier clustering research that relevant documents tend to cluster near one another
32Building the Outline View (cont.)
- Goals of the algorithm
- (I) Group (recursively) as many pages together within a subhierarchy as possible
- Avoid (recursively) branches that terminate in only one hit (leaf)
- (II) Remove as many internal nodes as possible while still retaining at least one valid path to every leaf
- (III) Remove as many edges as possible while retaining at least one path to every leaf
33Building the Outline View (cont.)
- To achieve these goals we need a non-standard graph algorithm
- To do it properly, every possible subset of nodes at depth D should be considered to determine the minimal subset which covers all nodes at depth D+1
- This is inefficient -- would require 2^k checks for k nodes at depth D
- Instead, we use a heuristic approach which approximates the optimal results
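For comparison, the exhaustive approach that the slide rules out can be written in a few lines. This is a sketch with made-up node names and a hypothetical `minimal_cover` function; it tries subsets in order of size, so in the worst case it examines all 2^k subsets of the k nodes at depth D.

```python
from itertools import combinations

def minimal_cover(parents, children_of, targets):
    """Smallest subset of parents (nodes at depth D) whose children
    include every target (node at depth D+1), or None if impossible."""
    for size in range(len(parents) + 1):
        for subset in combinations(parents, size):
            covered = set()
            for parent in subset:
                covered |= children_of[parent]
            if targets <= covered:
                return set(subset)
    return None

children_of = {"a": {"x", "y"}, "b": {"y"}, "c": {"z"}}
cover = minimal_cover(["a", "b", "c"], children_of, {"x", "y", "z"})
```

With three parents this is instant, but each additional parent doubles the number of subsets, which is why a heuristic is used instead.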
34Building the Outline View (cont.)
- First, a top-down pass
- record depth of each node and the number of children it links to directly
- Second, a bottom-up pass
- identify the deepest nodes (the leaves)
- D <- the set of nodes that are parents of leaves
- Sort D ascending according to how many active children they link to at depth D+1
- A node is active if it has not been eliminated
- Bottom-up pass, continued
- every node is a candidate to be eliminated
- those nodes with the fewest children are eliminated first
- because of goal (I)
- for each candidate C, if C links to one or more active nodes at depth D+1 that are not covered by any other active node, then C cannot be eliminated. Otherwise, C is removed from the active list
- After a level D is complete, there are no active nodes at depth D that cover exclusively nodes that are also covered by another node at depth D
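One level of the bottom-up pass might be sketched as below. This is an interpretation of the slides, not the actual Cha-Cha code: the `prune_level` name, the argument layout, and the alphabetical tie-break among nodes with equal child counts are assumptions.

```python
def prune_level(candidates, children_of, active_children):
    """Decide which nodes at depth D stay active.

    candidates      -- nodes at depth D
    children_of     -- node -> set of nodes it links to at depth D+1
    active_children -- nodes at depth D+1 that are still active
    """
    active = set(candidates)

    def active_count(node):
        return len(children_of[node] & active_children)

    # Nodes with the fewest active children are considered first (goal I).
    for node in sorted(candidates, key=lambda n: (active_count(n), n)):
        covered_by_others = set()
        for other in active - {node}:
            covered_by_others |= children_of[other] & active_children
        needed = (children_of[node] & active_children) - covered_by_others
        if not needed:
            active.discard(node)   # every child is covered by another node
    return active

children_of = {"a": {"x", "y"}, "b": {"y"}, "c": {"y", "z"}}
kept = prune_level(["a", "b", "c"], children_of, {"x", "y", "z"})
```

In this toy case `b` is eliminated because its only child `y` is also covered by `a` and `c`, while `a` and `c` survive because each is the sole parent of one child. The result is a greedy approximation and is not guaranteed to be the minimal cover.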
36Building the Outline View (cont.)
- Retaining rank ordering
- Build up the tree by first placing in the tree the hit (leaf) that is highest ranked
- As more leaves are added, more parts of the hierarchy are added, but the order in which the parts of the hierarchy are added is retained
- When the hierarchy has been built, it is traversed to create the HTML listing
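A minimal sketch of rank-preserving tree construction and traversal, assuming each hit arrives as a path of page titles ordered best hit first. The function names, the nested-dict representation, and the sample titles are illustrative only; the real system also emits links and other markup.

```python
from html import escape

def build_outline(ranked_hits):
    """ranked_hits: list of paths (lists of titles), best hit first.
    Returns a nested dict. Python dicts preserve insertion order, so each
    branch appears in the order its first hit was added."""
    tree = {}
    for path in ranked_hits:
        node = tree
        for title in path:
            node = node.setdefault(title, {})
    return tree

def render(tree):
    """Depth-first traversal producing a nested HTML list."""
    if not tree:
        return ""
    items = "".join(
        f"<li>{escape(title)}{render(subtree)}</li>"
        for title, subtree in tree.items()
    )
    return f"<ul>{items}</ul>"

hits = [
    ["UC Berkeley", "SIMS", "Faculty", "Marti Hearst"],
    ["UC Berkeley", "EECS", "Research"],
    ["UC Berkeley", "SIMS", "Courses"],
]
outline = build_outline(hits)
html_listing = render(outline)
```

The third hit joins the existing SIMS branch rather than starting a new one, and SIMS stays ahead of EECS because the top-ranked hit placed it in the tree first.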
37Summary
- Better user interfaces for search should
- Help users understand starting points/sources
- Place results of search into an organizing context
- One (of many) approaches
- Cha-Cha: simultaneously browse and search intranet site context
- Future work
- Special handling for short queries
- Spelling correction suggestions
- Smarter paths