Title: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo Inc. Search and Marketplace
1Web Search Tutorial
Jan Pedersen and Knut Magne Risvik
Yahoo! Inc. Search and Marketplace
2Agenda
- A Short History
- Internet Search Fundamentals
- Web Pages
- Indexing
- Ranking and Evaluation
- Third Generation Technologies
3A Short History
4Precursors
- Information Retrieval (IR) Systems
- online catalogs, and News
- Limited scale, homogeneous text
- recall focus
- empirical
- Driven by results on evaluation collections
- free text queries shown to win over Boolean
- Specialized Internet access
- Gopher, Wais, Archie
- FTP archives and special databases
- Never achieved critical mass
5First Generation Systems
- 1993 Mosaic opens the WWW
- 1993 Architext/Excite (Stanford/Kleiner Perkins)
- 1994 Webcrawler (full text Indexing)
- 1994 Yahoo! (human edited Directory)
- 1994 Lycos (400K indexed pages)
- 1994 Infoseek (subscription service)
- Power systems
- 1995 AltaVista (DEC Labs, advanced query syntax, large index)
- 1996 Inktomi (massively distributed solution)
6Second Generation Systems
- Relevance matters
- 1998 Direct Hit (clickthrough based re-ranking)
- 1998 Google (link authority based re-ranking)
- Size matters
- 1999 FAST/AllTheWeb (scalable architecture)
- The user matters
- 1996 Ask Jeeves (question answering)
- Money matters
- 1997 Goto/Overture (pay-for-performance search)
7Third Generation Systems
- Market consolidation
- 2002 Yahoo! Purchases Inktomi
- 2003 Overture purchases AV and FAST/AllTheWeb
- 2003 MSN announces intention to build a Search Engine
- Search matures
- $2B market projected to grow to $6B by 2005
- required capital investment limits new players
- Gigablast?
- traffic focused in a few sites
- Yahoo!, MSN, Google, AOL
- consumer use driven by Brand marketing
8Web Search Fundamentals
9Web Fundamentals
User Browser
Web Server
URL
HTTP Request
HTML Page
Hyper Links
Page Rendering
Page Serving
10Definitions
- URLs refer to WWW content
- referential integrity is not guaranteed
- roughly 10% of URLs go 404 every month
- HTTP requests fetch content from a server
- stateless protocol
- cookies provide partial state
- Web servers generate HTML pages
- can be static or dynamic (output of a program)
- markup tags determine page rendering
- HTML pages contain hyperlinks
- link consists of a url and anchor text
11Urls
- URL Definition
- http://host:port/path;params?query#fragment
- fragment is not considered part of the URL
- params are considered part of the path
- params are not frequently used
- Examples
- http://www.cnn.com/
- http://ad.doubleclick.net/jump;sz=120x60;ptile=6;ord=6981062172
- http://us.imdb.com/Title?0068646
- http://www.sky.com/skynews/article/0,,30000-12261027,00.html
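The URL components defined above can be inspected with Python's standard `urllib.parse` module. This is a small sketch only. The example URL, including its host, port, and parameter values, is made up for illustration and does not come from the slides.

```python
from urllib.parse import urlparse, urldefrag

# Hypothetical URL exercising every component named on the slide.
url = "http://www.example.com:8080/dir/page.html;sz=120x60?q=cobra#top"

parts = urlparse(url)
print(parts.scheme)    # http
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /dir/page.html
print(parts.params)    # sz=120x60
print(parts.query)     # q=cobra
print(parts.fragment)  # top

# The fragment is handled by the browser, so a crawler would typically
# drop it before comparing two URLs.
canonical, fragment = urldefrag(url)
print(canonical)
```

Note that `urlparse` reports params separately from the path, even though the slide treats them as part of the path for the purpose of deciding what is dynamic.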
12Dynamic Urls
- Urls with Dynamic Components
- Path (including params) and host are not dynamic
- If you change the path and/or host you will get a 404 or similar error
- Query is dynamic
- If you change the query part, you will get a valid page back
- source of potentially infinite number of pages
- Examples
- http://www.cnn.com/index.html?test
- Returns a valid 200 page, even if test is not a valid query term
- http://www.cnn.com/index.html;test
- Returns a 404 error page
- Not all Urls Follow this Convention
- http://www.internetnews.com/xSP/article.php/1378731
13Dynamic Content
- Content Depends on External (to URL) Factors
- Cookies
- IP
- Referrer
- User-Agent
- Examples
- http://my.yahoo.com/
- http://forum.doom9.org/forumdisplay.php?s=af9ddb31710c7b314b75262c1031d8af&forumid=65
- Dynamic Urls and Dynamic Content are Orthogonal
- static urls can refer to dynamic content
- dynamic urls can refer to static content
14HTML Sample
(Figure: HTML source of a Web page. The markup is mostly nested font tags, e.g. face="Verdana, Arial, Helvetica, sans-serif" size="2", wrapping the following text.)
Andreas S. WEIGEND, Ph.D.
Chief Scientist, Amazon.com
&quot;Sophisticated algorithms have always been a big part of creating the Amazon.com customer experience.&quot; (Jeff Bezos, Founder and CEO of Amazon.com)
Amazon.com might be the world's largest laboratory to study human behavior and decision making. It for sure is a place with very smart people, with a healthy attitude towards data, measurement, and modeling. I am responsible for research in machine learning and computational marketing. Applications range from real-time predictions of customer intent and satisfaction, to personalization and long-term optimization of pricing and promotions.
href="http://www.weigend.com/amazonjobs.html" onclick="window.open(this.href); return false": Job openings.
I'm also the point person for academic relations.
Schedule Summer 2003
15Rendered Page
16WWW Size
- How many pages are in the WWW?
- Lawrence and Giles, 1999: 800M pages with most pages not indexed
- Dynamically generated pages imply effective size is infinite
- How many sites are registered?
- Churn due to SPAM
17Crawling
- Search Engine robot
- visits every page that will be indexed
- traversal behavior depends on crawl policy
- Index parameterized by size and freshness
- freshness is time since last revisit if page has changed
- Batch vs Incremental
- Batch crawl has several, distinct, batch processing stages
- discover, grab, index
- AV discovery phase takes 10 days, grab another 10, etc.
- sharp freshness curve
- Incremental crawl
- crawler constantly operates, intermixing discovery with grab
- mild drop-off in freshness
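To make "intermixing discovery with grab" concrete, here is a minimal sketch of a breadth-first crawl loop. The toy in-memory `TOY_WEB` dictionary stands in for fetching pages over HTTP and extracting their links. It, the page names, and the `max_pages` policy knob are assumptions for illustration, not part of any crawler described in the slides.

```python
from collections import deque

# Toy "web": each URL maps to the URLs it links to
# (a stand-in for an HTTP fetch followed by link extraction).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def crawl(seeds, max_pages):
    """Breadth-first crawl that intermixes discovery with grabbing."""
    frontier = deque(seeds)   # discovered but not yet fetched
    seen = set(seeds)         # everything discovered so far
    fetched = []              # pages grabbed, in fetch order
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        outlinks = TOY_WEB.get(url, [])   # "grab" the page
        fetched.append(url)
        for link in outlinks:             # "discover" new URLs immediately
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

print(crawl(["a"], max_pages=10))  # ['a', 'b', 'c', 'd']
```

A batch crawl would instead finish the whole discovery pass before starting the grab pass, which is why its pages are older on average by the time the index is built.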
18Typical Crawl/Build Architecture
19Relative Size
Source: Search Engine Showdown. Google claims 3B, FAST claims 2.5B, AV claims 1B.
20Freshness
Source: Search Engine Showdown. Note hybrid indices: subindices with differing update rates.
21Query Language
- Free text with implicit AND and implicit proximity
- Syntax-free input
- Explicit Boolean
- AND (+)
- OR ()
- AND NOT (-)
- Explicit Phrasing ("")
- Filters
- domain:, filetype:
- host:, title:
- link:, image:
- url:, anchor:
22Query Serving Architecture
- Index divided into segments each served by a node
- Each row of nodes replicated for query load
- Query integrator distributes query and merges results
- Front end creates an HTML page with the query results
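The distribute-and-merge step can be sketched as follows. The segment contents, document IDs, and scores are invented, and the "nodes" are plain function calls rather than separate machines, so this only illustrates the merge logic.

```python
import heapq

# Hypothetical index segments: each node holds (doc_id, score) hits for its shard.
SEGMENTS = [
    [("d1", 0.9), ("d4", 0.2)],
    [("d2", 0.7), ("d5", 0.6)],
    [("d3", 0.8)],
]

def search_segment(segment, k):
    """Each node returns its own top-k hits, best first."""
    return heapq.nlargest(k, segment, key=lambda hit: hit[1])

def integrate(segments, k):
    """Query integrator: send the query to every segment, then merge the results."""
    partial = [search_segment(seg, k) for seg in segments]
    all_hits = [hit for hits in partial for hit in hits]
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])

print(integrate(SEGMENTS, k=3))  # [('d1', 0.9), ('d3', 0.8), ('d2', 0.7)]
```

Because every node returns its own top k, the merged top k is exact as long as scores are comparable across segments.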
23Query Evaluation
- Index has two tables
- term to posting
- document ID to document data
- Postings record term occurrences
- may include positions
- Ranking employs posting
- to score documents
- Display employs document info
- fetched for top scoring documents
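A minimal sketch of the two tables, assuming whitespace tokenization and using raw term frequency as a stand-in for a real rank function. The documents and URLs are invented for illustration.

```python
from collections import defaultdict

# Document table: document ID -> document data used for display.
DOCS = {
    1: {"url": "http://www.example.com/a", "text": "cobra snake facts"},
    2: {"url": "http://www.example.com/b", "text": "cobra car cobra engine"},
}

def build_postings(docs):
    """Term table: term -> {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, doc in docs.items():
        for pos, term in enumerate(doc["text"].split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search(index, docs, term, k=10):
    """Rank from the postings alone, then fetch document data
    only for the top-scoring documents."""
    postings = index.get(term, {})
    ranked = sorted(postings, key=lambda d: len(postings[d]), reverse=True)[:k]
    return [docs[d]["url"] for d in ranked]

index = build_postings(DOCS)
print(index["cobra"])                # {1: [0], 2: [0, 2]}
print(search(index, DOCS, "cobra"))  # ['http://www.example.com/b', 'http://www.example.com/a']
```

Keeping display data out of the postings means the ranking pass touches only compact term data, and the larger document records are read for at most k results.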
24Scale
- Indices typically cover billions of pages
- terabytes of data
- Tens of millions of queries served every day
- translates to hundreds of queries per second
- Users require rapid response
- query must be evaluated in under 300 msecs
- Data Centers typically employ thousands of machines
- Individual component failures are common
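The step from daily volume to queries per second is simple arithmetic. The two daily volumes below are assumed values within the "tens of millions" range, not figures from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def average_qps(queries_per_day):
    """Average queries per second over a full day."""
    return queries_per_day / SECONDS_PER_DAY

# Assumed daily volumes in the "tens of millions" range.
for daily in (10_000_000, 50_000_000):
    print(daily, round(average_qps(daily)))
# 10000000 116
# 50000000 579
```

These are daily averages. Traffic is not uniform over the day, so peak rates would be higher.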
25Search Results Page
- Blended results
- multiple sources
- Relevance ranked
- Assisted search
- Spell correction
- Specialized indices
- via Tabs
- Sponsored listing
- monetization
- Localization
- Country language experience
26Relevance Evaluation
27Relevance is Everything
- The Search Paradigm: 2.4 words, a few clicks, and you're done
- only possible if results are very relevant
- Relevance is speed
- time from task initiation to resolution
- important factors
- Location of useful result
- UI Clutter
- latency
- Relevance is relative
- context dependent
- e.g. football in the UK vs the US
- task dependent
- e.g. mafia when shopping vs researching
28Relevance is Hard to Measure
- Poorly defined, subjective notion
- depends on task, user context, etc.
- Analysts have Focused on Easier-to-Measure Surrogates
- index size, traffic, speed
- anecdotal relevance tests
- e.g. Vanity queries
- Requires Survey Methodology
- averaged over queries
- averaged over users
29Survey Methodologies
- Internal expert assessments
- assessments typically not replicated
- models absolute notion of relevance
- External consumer assessments
- assessments heavily replicated
- models statistical notion of relevance
- A/B surveys
- compare whole result sets
- visual relevance plays a large role
- Url surveys
- judge relevance of particular url for query
30A/B Test Design
- Strategy
- Compare two ranking algorithms by asking panelists to compare pairs of search results
- Queries
- 1000 semi-random queries, filtered for family-friendly, understandability
- Users can select from a list of 20 queries
- URLS
- Top 10 search results from 2 algorithms
- Voting
- 5 point scale, 7 replications
- Each user rates 6 queries, one of which is a control query
- Control query has AV results on one side, random URLs on the other
- Reject voters who take less than 10 seconds to vote
31Query selection screen
32Rating screen
33A/B Test Scoring
- Test ran until we had 400 decisive votes
- Margin of error 5%
- Compute
- Majority Vote: count of queries where more than half of the users said one engine was somewhat better or much better
- Total Vote: count of users that rated a result set somewhat better or much better for each engine
- Compare percentages
- test if one system outvotes the other
- determine if the difference is statistically significant
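The two counts can be sketched as follows. The encoding of the 5-point scale as integers from -2 to +2, the engine labels A and B, and the ratings themselves are assumptions for illustration.

```python
# Hypothetical ratings per query on a 5-point scale:
# -2/-1 = engine A much/somewhat better, 0 = same, +1/+2 = engine B somewhat/much better.
RATINGS = {
    "q1": [-2, -1, -1, 0, 1, -1, -2],
    "q2": [1, 2, 0, 1, 1, -1, 2],
    "q3": [0, 0, 1, -1, 0, 0, 0],
}

def majority_vote(ratings):
    """Count queries where more than half of the raters preferred one engine."""
    wins = {"A": 0, "B": 0}
    for votes in ratings.values():
        for_a = sum(v < 0 for v in votes)
        for_b = sum(v > 0 for v in votes)
        if for_a > len(votes) / 2:
            wins["A"] += 1
        elif for_b > len(votes) / 2:
            wins["B"] += 1
    return wins

def total_vote(ratings):
    """Count individual votes that preferred each engine."""
    votes = [v for vs in ratings.values() for v in vs]
    return {"A": sum(v < 0 for v in votes), "B": sum(v > 0 for v in votes)}

print(majority_vote(RATINGS))  # {'A': 1, 'B': 1}
print(total_vote(RATINGS))     # {'A': 7, 'B': 7}
```

Query q3 has no majority either way, so it contributes to neither engine's Majority Vote count.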
34Results
- Control Votes (error bar = 1/sqrt(160) = 7.9%)
- Test One: AV vs SE1 (error bar = 1/sqrt(400) = 5%)
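The error bars quoted above follow the 1/sqrt(n) rule of thumb, which can be checked directly. The function name is illustrative.

```python
import math

def error_bar(n_votes):
    """Rule-of-thumb margin of error 1/sqrt(n), returned as a fraction."""
    return 1 / math.sqrt(n_votes)

print(f"{error_bar(160):.1%}")  # 7.9%
print(f"{error_bar(400):.1%}")  # 5.0%
```

The rule approximates the half-width of a 95% confidence interval for a proportion near 0.5, which is 1.96 * sqrt(0.25 / n), or about 0.98 / sqrt(n).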
35Results
- Test Two AV Vs SE2 (with UI issue)
36Ranking
- Given 2.4 query terms, search 2B documents and return 10 highly relevant in 300 msecs
- Problem queries
- Travel (matches 32M documents)
- John Ellis (which one)
- Cobra (medical or animal)
- Query types
- Navigational (known item retrieval)
- Informational
- Ingredients
- Keyword match (title, abstract, body)
- Anchor Text (referring text)
- Quality (link connectivity)
- User Feedback (clickrate analysis)
37The Components of Relevance
- First Generation
- Keyword matching
- Title and abstract worth more
- Second Generation
- Computed document authority
- Based on link analysis
- Anchor text matching
- Webmaster voting
- Development Cycle
- Tune Ranking Evaluate Metrics
38Connectivity
39Connectivity Goals
- An indicator of authority
- As measured by static links
- Each link is a vote in favor of a site
- Webmasters are the voters
- Not all links are equal
- Links from authoritative sites are worth more
- Introduces an interesting circularity
- Votes from sites with many links are discounted
- Use your vote wisely
- Discount navigational links
- Not all links are editorial
- Account for link SPAM
40Connectivity Network
- What is authority score for nodes A and B?
- Inlink computes
- A = 3
- B = 2
- Page Rank Computes
- A = .225
- B = .295
(Figure: link graph containing nodes A and B)
41Definitions
- Connectivity Graph
- Nodes are pages (or hosts)
- Directed edges are links
- Graph edges can be represented as a transition matrix, A
- The ith row of A represents the links out from node i
- Authority score
- Score associated with each node
- Some function of inlinks to node and outlinks from node
- Simplest authority score is inlink count
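As a sketch of these definitions, the snippet below builds a transition matrix and the inlink-count score for a hypothetical four-page graph. The graph is invented. Dividing each row by the node's outlink count is one common normalization and is assumed here.

```python
# Hypothetical four-page link graph: page -> pages it links to.
LINKS = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N = len(LINKS)

def transition_matrix(links, n):
    """Row i spreads node i's vote evenly over its outlinks,
    so every row with at least one outlink sums to 1."""
    A = [[0.0] * n for _ in range(n)]
    for i, outs in links.items():
        for j in outs:
            A[i][j] = 1.0 / len(outs)
    return A

def inlink_counts(links, n):
    """Simplest authority score: number of links pointing at each node."""
    counts = [0] * n
    for outs in links.values():
        for j in outs:
            counts[j] += 1
    return counts

A = transition_matrix(LINKS, N)
print(A[0])                     # [0.0, 0.5, 0.5, 0.0]
print(inlink_counts(LINKS, N))  # [1, 1, 3, 0]
```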
42Page Rank (Without Random Jump)
- Contribution averaged over all outlinks
- Node score is the sum of contributions
- Fixed point equation
- If A is normalized
- Each row sums to 1.0
(Figure: example link graph with labels .1, .1, .1, 1/2, 1/2 and node scores A = .25, B = .3)
43Page Rank Implications
- A is a stochastic matrix
- r(i) can be interpreted as a probability
- Suppose a surfer takes an outlink at random
- r(i) is the long run probability of landing at a particular node
- Solution to fixed point equation is the principal eigenvector
- principal eigenvalue is 1.0
- Solution can be found by iteration
- If then
- Start with random initial value for r
- Iterate multiplication by A
- Contribution of smaller eigenvalues will drop out
- Final value is a good estimate of the fixed point solution
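The iteration can be sketched in a few lines. This assumes the fixed point takes the form r = A^T r, with A row-normalized as described earlier. The three-page graph is hypothetical, and a uniform start replaces the random initial value so that the output is reproducible.

```python
def pagerank_no_jump(A, iterations=100):
    """Repeatedly apply r <- A^T r, starting from a uniform vector."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(iterations):
        # Node j collects r(i) * A[i][j] from every node i that links to it.
        r = [sum(A[i][j] * r[i] for i in range(n)) for j in range(n)]
    return r

# Hypothetical row-normalized graph: 0 -> 1 and 2; 1 -> 2; 2 -> 0.
A = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
r = pagerank_no_jump(A)
print([round(x, 3) for x in r])  # [0.4, 0.2, 0.4]
```

The result can be checked by hand: r(1) is half of r(0), r(2) equals r(0), and the three scores sum to 1.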
44Page Rank (with random jump)
- What's the score for a node with no in-links?
- Revised equation
- Fixed point equation
- Probability interpretation
- As before, with ε chance of jumping randomly
(Figure: example link graph with ε = 0.1, labels .1, .1, .1, 1/2, 1/2 and node scores A = .225, B = .293)
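The slide's revised equation did not survive, so the sketch below uses the commonly cited form r = eps/N + (1 - eps) * A^T r. That form, the name `eps`, and the four-page graph are assumptions. It shows that a node with no in-links still receives the baseline score eps/N.

```python
def pagerank(A, eps=0.1, iterations=100):
    """Iterate r <- eps/n + (1 - eps) * A^T r from a uniform start."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [eps / n + (1 - eps) * sum(A[i][j] * r[i] for i in range(n))
             for j in range(n)]
    return r

# Hypothetical graph: page 3 links to page 2 but has no in-links itself.
A = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
r = pagerank(A, eps=0.1)
print(round(r[3], 3))    # 0.025
print(round(sum(r), 3))  # 1.0
```

With eps = 0.1 and four nodes, page 3 scores 0.1 / 4 = 0.025, which comes entirely from the random jump.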
45Eigenrank
- Separates internal from external links
- Internal transition matrix I
- External transition matrix E
- Introduces a new parameter
- ε is the random jump probability
- β is the probability of taking an internal link
- (1 - ε - β) is the probability of taking an external link
external link
46Eigenrank
- Revised equation
- Fixed point equation
- Probability interpretation
- ε chance of random jump
- β chance of internal link
- (1 - ε - β) chance of external link
(Figure: example link graph with labels .1, .1, .1, 1/2, 1/2 and node scores A = .2, B = .202)
47Computational Issues
- Nodes with no outlinks
- Transition matrix with zero row
- Internal or external
- Leave out of computation(?)
- Redistribute mass to random jump(?)
- Currently mass is redistributed
- Complex formula that prefers external links
48Kleinberg
- Two scores
- Authority score, a
- Hub score, h
- Fixed Point equations
- Authority
-
- Hub
- Principal Eigen vectors are solutions
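The two fixed point equations are not legible above, so this sketch uses the standard HITS updates. A page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The graph, the unit-length normalization, and the iteration count are assumptions for illustration.

```python
import math

def hits(adj, iterations=50):
    """adj[i][j] = 1 if page i links to page j. Returns (authority, hub) scores."""
    n = len(adj)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        a = [sum(adj[i][j] * h[i] for i in range(n)) for j in range(n)]
        # Hub: sum of authority scores of the pages linked to.
        h = [sum(adj[i][j] * a[j] for j in range(n)) for i in range(n)]
        # Normalize to unit length so the scores stay bounded.
        a_norm = math.sqrt(sum(x * x for x in a)) or 1.0
        h_norm = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / a_norm for x in a]
        h = [x / h_norm for x in h]
    return a, h

# Hypothetical graph: pages 0 and 1 both link to pages 2 and 3.
ADJ = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
a, h = hits(ADJ)
print([round(x, 3) for x in a])  # [0.0, 0.0, 0.707, 0.707]
print([round(x, 3) for x in h])  # [0.707, 0.707, 0.0, 0.0]
```

Pages 2 and 3 come out as pure authorities and pages 0 and 1 as pure hubs, as the link pattern suggests.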
49SPAM
- Manipulation of content purely to influence ranking
- Dictionary SPAM
- Link sharing
- Domain hi-jacking
- Link farms
- Robotic use of search results
- Meta-search engines
- Search Engine optimizers
- Fraud
50Third Generation Technologies
51Handling Ambiguity
Results for query Cobra
52Impression Tracking
Incoherent urls are those that receive high rank
for a large diversity of queries. Many
incoherent urls indicate SPAM or a bug (as in
this case).
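One crude way to surface incoherent URLs is to count the distinct queries for which each URL was shown at high rank. The log format, the example data, and the threshold are assumptions. A distinct-query count is only a rough proxy for query diversity.

```python
from collections import defaultdict

# Hypothetical impression log: (query, url) pairs where the url was shown at high rank.
IMPRESSIONS = [
    ("cobra", "http://snakes.example.com/"),
    ("cobra snake", "http://snakes.example.com/"),
    ("travel", "http://spam.example.com/"),
    ("john ellis", "http://spam.example.com/"),
    ("cobra", "http://spam.example.com/"),
    ("mafia", "http://spam.example.com/"),
]

def incoherent_urls(impressions, min_distinct_queries):
    """Flag URLs that rank highly for many distinct queries."""
    queries_by_url = defaultdict(set)
    for query, url in impressions:
        queries_by_url[url].add(query)
    return sorted(url for url, queries in queries_by_url.items()
                  if len(queries) >= min_distinct_queries)

print(incoherent_urls(IMPRESSIONS, min_distinct_queries=4))
# ['http://spam.example.com/']
```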
53Clickrate Relevance Metric
Average highest rank clicked perceptibly
increased with the release of a new rank function.
54User Interface
- Ranked result lists
- Document summaries are critical
- Hit highlighting
- Dynamic abstracts
- url
- No recent innovation
- Graphical presentations not well fit to the task
- Blending
- Predefined segmentation
- e.g. Paid listing
- Intermixed with results from other sources
- e.g. News
55Future Trends
- Question Answering
- WWW as language model
- Enables simple methods
- e.g. Dumais et al. (SIGIR 2002)
- New contexts
- Ubiquitous Searching
- Toolbars, desktop, phone
- Implicit Searching
- Computed links
- New Tasks
- E.g. Local/ Country Search
56Bibliography
- Modeling the Internet and the Web: Probabilistic Methods and Algorithms, by Pierre Baldi, Paolo Frasconi, and Padhraic Smyth. John Wiley & Sons, May 28, 2003.
- Mining the Web: Analysis of Hypertext and Semi Structured Data, by Soumen Chakrabarti. Morgan Kaufmann, August 15, 2002.
- The Anatomy of a Large-scale Hypertextual Web Search Engine, by S. Brin and L. Page. 7th International WWW Conference, Brisbane, Australia, April 1998.
- Websites
- http://www.searchenginewatch.com/
- http://www.searchengineshowdown.com/
- Presentations
- http://infonortics.com/searchengines/sh03/slides/evans.pdf