Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo Inc. Search and Marketplace


Provided by: Yah97

Learn more at: http://people.ischool.berkeley.edu

1
Web Search Tutorial
Jan Pedersen and Knut Magne Risvik
Yahoo! Inc. Search and Marketplace
2
Agenda

A Short History
Internet Search Fundamentals
Web Pages
Indexing
Ranking and Evaluation
Third Generation Technologies

3
A Short History
4
Precursors

Information Retrieval (IR) Systems
online catalogs, and News
Limited scale, homogeneous text
recall focus
empirical
Driven by results on evaluation collections
free text queries shown to win over Boolean
Specialized Internet access
Gopher, WAIS, Archie
FTP archives and special databases
Never achieved critical mass

5
First Generation Systems

1993 Mosaic opens the WWW
1993 Architext/Excite (Stanford/Kleiner Perkins)
1994 Webcrawler (full text Indexing)
1994 Yahoo! (human edited Directory)
1994 Lycos (400K indexed pages)
1994 Infoseek (subscription service)
Power systems
1995 AltaVista (DEC Labs, advanced query syntax, large index)
1996 Inktomi (massively distributed solution)

6
Second Generation Systems

Relevance matters
1998 Direct Hit (clickthrough based re-ranking)
1998 Google (link authority based re-ranking)
Size matters
1999 FAST/AllTheWeb (scalable architecture)
The user matters
1996 Ask Jeeves (question answering)
Money matters
1997 Goto/Overture (pay-for-performance search)

7
Third Generation Systems

Market consolidation
2002 Yahoo! purchases Inktomi
2003 Overture purchases AV and FAST/AllTheWeb
2003 MSN announces intention to build a Search Engine
Search matures
$2B market projected to grow to $6B by 2005
required capital investment limits new players
Gigablast?
traffic focused in a few sites
Yahoo!, MSN, Google, AOL
consumer use driven by Brand marketing

8
Web Search Fundamentals
9
Web Fundamentals
User Browser
Web Server
URL
HTTP Request
HTML Page
Hyper Links
Page Rendering
Page Serving
10
Definitions

URLs refer to WWW content
referential integrity is not guaranteed
roughly 10% of URLs go 404 every month
HTTP requests fetch content from a server
stateless protocol
cookies provide partial state
Web servers generate HTML pages
can be static or dynamic (output of a program)
markup tags determine page rendering
HTML pages contain hyperlinks
link consists of a url and anchor text

11
Urls

URL Definition
http://host:port/path;params?query#fragment
fragment is not considered part of the URL
params are considered part of the path
params are not frequently used
Examples
http://www.cnn.com/
http://ad.doubleclick.net/jump;sz=120x60;ptile=6;ord=6981062172
http://us.imdb.com/Title?0068646
http://www.sky.com/skynews/article/0,,30000-12261027,00.html
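
The URL anatomy on this slide can be checked with a short sketch. It assumes the generic syntax http://host:port/path;params?query#fragment and uses Python's standard `urllib.parse`; the host `example.com` and every component value are made up for illustration.

```python
from urllib.parse import urlparse

def split_url(url):
    """Split a URL into the components named in the URL definition."""
    parts = urlparse(url)
    return {
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "params": parts.params,      # the ';params' suffix of the last path segment
        "query": parts.query,        # everything between '?' and '#'
        "fragment": parts.fragment,  # handled by the browser, never sent to the server
    }

# Hypothetical URL, chosen only to exercise every component.
parts = split_url("http://example.com:8080/dir/page;type=a?q=cobra#top")
for name, value in parts.items():
    print(f"{name:9s}{value}")
```

A crawler would typically drop the fragment before deduplicating URLs, since two URLs that differ only in the fragment fetch the same content.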

12
Dynamic Urls

Urls with Dynamic Components
Path (including params) and host are not dynamic
If you change the path and/or host you will get a 404 or similar error
Query is dynamic
If you change the query part, you will get a valid page back
source of a potentially infinite number of pages
Examples
http://www.cnn.com/index.html?test
Returns a valid 200 page, even if test is not a valid query term
http://www.cnn.com/index.html;test
Returns a 404 error page
Not all URLs follow this convention
http://www.internetnews.com/xSP/article.php/1378731

13
Dynamic Content

Content Depends on External (to URL) Factors
Cookies
IP
Referrer
User-Agent
Examples
http://my.yahoo.com/
http://forum.doom9.org/forumdisplay.php?s=af9ddb31710c7b314b75262c1031d8af&forumid=65
Dynamic Urls and Dynamic Content are Orthogonal
static urls can refer to dynamic content
dynamic urls can refer to static content

14
HTML Sample
[Figure: HTML source of Andreas S. Weigend's home page. Recoverable content:]
font attributes: face="Verdana, Arial, Helvetica, sans-serif", size="2" to size="4"
text: Andreas S. WEIGEND, Ph.D., Chief Scientist, Amazon.com
text: "Sophisticated algorithms have always been a big part of creating the Amazon.com customer experience." (Jeff Bezos, Founder and CEO of Amazon.com)
text: "Amazon.com might be the world's largest laboratory to study human behavior and decision making. It for sure is a place with very smart people, with a healthy attitude towards data, measurement, and modeling. I am responsible for research in machine learning and computational marketing. Applications range from real-time predictions of customer intent and satisfaction, to personalization and long-term optimization of pricing and promotions."
link: href="http://www.weigend.com/amazonjobs.html" onclick="window.open(this.href); return false" (Job openings)
text: I'm also the point person for academic relations.
text: Schedule Summer 2003
15
Rendered Page
16
WWW Size

How many pages are in the WWW?
Lawrence and Giles, 1999: 800M pages, with most pages not indexed
Dynamically generated pages imply effective size is infinite
How many sites are registered?
Churn due to SPAM

17
Crawling

Search Engine robot
visits every page that will be indexed
traversal behavior depends on crawl policy
Index parameterized by size and freshness
freshness is time since last revisit if page has
changed
Batch vs Incremental
Batch crawl has several distinct batch processing stages
discover, grab, index
AV discovery phase takes 10 days, grab another 10, etc.
sharp freshness curve
Incremental crawl
crawler constantly operates, intermixing
discovery with grab
mild drop-off in freshness
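
To make the incremental style concrete, here is a minimal sketch of a crawl loop that intermixes discovery with grab. It is not the crawler of any particular engine: the function name, the breadth-first crawl policy, and the in-memory `toy_web` that stands in for real HTTP fetches are all assumptions.

```python
from collections import deque

def incremental_crawl(seeds, fetch_links, max_pages):
    """Breadth-first crawl that interleaves discovery with grabbing."""
    frontier = deque(seeds)   # URLs discovered but not yet grabbed
    seen = set(seeds)         # every URL ever discovered
    grabbed = []
    while frontier and len(grabbed) < max_pages:
        url = frontier.popleft()
        grabbed.append(url)               # "grab": fetch the page
        for link in fetch_links(url):     # "discover": extract its outlinks
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return grabbed

# Toy in-memory web standing in for real HTTP fetches.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = incremental_crawl(["a"], lambda u: toy_web.get(u, []), max_pages=10)
print(order)  # ['a', 'b', 'c', 'd']
```

A batch crawler would instead finish the whole discovery pass before grabbing anything, which is why its freshness curve is sharper.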

18
Typical Crawl/Build Architecture
19
Relative Size
From Search Engine Showdown: Google claims 3B, FAST claims 2.5B, AV claims 1B
20
Freshness
From Search Engine Showdown. Note hybrid indices: subindices with differing update rates
21
Query Language

Free text with implicit AND and implicit
proximity
Syntax-free input
Explicit Boolean
AND ()
OR ()
AND NOT (-)
Explicit Phrasing ("")
Filters
domain: filetype:
host: title:
link: image:
url: anchor:

22
Query Serving Architecture

Index divided into segments each served by a node
Each row of nodes replicated for query load
Query integrator distributes query and merges
results
Front end creates a HTML page with the query
results
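
The fan-out and merge performed by the query integrator can be sketched as below. This is an illustration under stated assumptions, not the architecture's actual code: segments are plain dictionaries, the score is a toy term count, and the nodes are called one after another where a real system would query them in parallel.

```python
import heapq

def search_segment(segment, query_terms, k):
    """Score every document in one index segment; return its local top k."""
    scored = []
    for doc_id, text in segment.items():
        words = text.lower().split()
        if all(term in words for term in query_terms):   # implicit AND
            score = sum(words.count(term) for term in query_terms)
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)

def integrate(segments, query, k=10):
    """Query integrator: send the query to each segment, merge the partial results."""
    terms = query.lower().split()
    partial = [hit for seg in segments for hit in search_segment(seg, terms, k)]
    return heapq.nlargest(k, partial)

# Two made-up segments, each of which would live on its own node.
segments = [
    {1: "cobra snake venom", 2: "travel deals"},
    {3: "cobra cobra insurance", 4: "cobra car club"},
]
top = integrate(segments, "cobra", k=2)
print(top)  # [(2, 3), (1, 4)]
```

Each segment only needs to return its own top k, because no document outside a segment's local top k can appear in the global top k.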

23
Query Evaluation

Index has two tables
term to posting
document ID to document data
Postings record term occurrences
may include positions
Ranking employs posting
to score documents
Display employs document info
fetched for top scoring documents
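
A small sketch of the two tables and how they divide the work: postings drive ranking, and the document table is consulted only for the top-scoring documents. The scoring rule (total term occurrences), the field names, and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Build the two tables: term -> postings, and doc ID -> document data."""
    postings = defaultdict(dict)   # term -> {doc_id: [positions]}
    doc_table = {}                 # doc_id -> data used for display
    for doc_id, (url, text) in docs.items():
        doc_table[doc_id] = {"url": url, "abstract": text[:40]}
        for pos, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)
    return postings, doc_table

def evaluate(query, postings, doc_table, k=10):
    """Rank with postings only, then fetch document data for the top k."""
    terms = query.lower().split()
    if not terms:
        return []
    # Implicit AND: keep documents that contain every query term.
    candidates = set(postings.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(postings.get(term, {}))
    # Toy score: total occurrences of the query terms; ties broken by doc ID.
    ranked = sorted(
        candidates,
        key=lambda d: (-sum(len(postings[t][d]) for t in terms), d),
    )
    return [doc_table[d]["url"] for d in ranked[:k]]

# Made-up documents: doc ID -> (url, body text).
docs = {
    1: ("http://example.com/a", "cobra snake facts"),
    2: ("http://example.com/b", "cobra car and cobra club"),
    3: ("http://example.com/c", "travel deals"),
}
postings, doc_table = build_index(docs)
print(evaluate("cobra", postings, doc_table))
```

Because positions are stored in the postings, the same structure could support phrase and proximity matching without touching the document table.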

24
Scale

Indices typically cover billions of pages
terabytes of data
Tens of millions of queries served every day
translates to hundreds of queries per second
Users require rapid response
query must be evaluated in under 300 msecs
Data centers typically employ thousands of machines
Individual component failures are common
25
Search Results Page

Blended results
multiple sources
Relevance ranked
Assisted search
Spell correction
Specialized indices
via Tabs
Sponsored listing
monetization
Localization
Country language experience

26
Relevance Evaluation
27
Relevance is Everything

The Search Paradigm: 2.4 words, a few clicks, and you're done
only possible if results are very relevant
Relevance is speed
time from task initiation to resolution
important factors
Location of useful result
UI Clutter
latency
Relevance is relative
context dependent
e.g. football in the UK vs the US
task dependent
e.g. mafia when shopping vs researching

28
Relevance is Hard to Measure

Poorly defined, subjective notion
depends on task, user context, etc.
Analysts have Focused on Easier-to-Measure
Surrogates
index size, traffic, speed
anecdotal relevance tests
e.g. Vanity queries
Requires Survey Methodology
averaged over queries
averaged over users

29
Survey Methodologies

Internal expert assessments
assessments typically not replicated
models absolute notion of relevance
External consumer assessments
assessments heavily replicated
models statistical notion of relevance
A/B surveys
compare whole result sets
visual relevance plays a large role
Url surveys
judge relevance of particular url for query

30
A/B Test Design

Strategy
Compare two ranking algorithms by asking
panelists to compare pairs of search results
Queries
1000 semi-random queries, filtered for
family-friendly, understandability
Users can select from a list of 20 queries
URLS
Top 10 search results from 2 algorithms
Voting
5 point scale, 7 replications
Each user rates 6 queries, one of which is a
control query
Control query has AV results on one side, random
URLs on the other
Reject voters who take less than 10 seconds to
vote

31
Query selection screen
32
Rating screen
33
A/B Test Scoring

Test ran until we had 400 decisive votes
Margin of error 5%
Compute
Majority Vote: count of queries where more than half of the users said one engine was somewhat better or much better
Total Vote: count of users that rated a result set somewhat better or much better for each engine
Compare percentages
test if one system outvotes the other
determine if the difference is statistically significant
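
The error bars used in this test follow the rule of thumb 1/sqrt(n), which approximates the 95% margin for a vote share near 50%. The sketch below applies that rule to decide whether one engine out-votes the other; the function names and the vote counts are hypothetical.

```python
from math import sqrt

def margin_of_error(n_votes):
    """Rule-of-thumb 95% margin for a proportion near 50%: 1/sqrt(n)."""
    return 1.0 / sqrt(n_votes)

def compare_engines(votes_a, votes_b):
    """Compare vote shares among decisive votes (ties already excluded)."""
    n = votes_a + votes_b
    share_a = votes_a / n
    moe = margin_of_error(n)
    # A 50% share means no preference; the gap must exceed the margin.
    significant = abs(share_a - 0.5) > moe
    return share_a, moe, significant

# Made-up counts: 400 decisive votes split 230 to 170.
share, moe, significant = compare_engines(230, 170)
print(f"share={share:.3f} margin={moe:.3f} significant={significant}")
```

With 400 decisive votes the margin is 5%, so a 57.5% share counts as a significant preference while a 52.5% share would not.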

34
Results

Control Votes (error bar 1/sqrt(160) = 7.9%)

Test One: AV vs SE1 (error bar 1/sqrt(400) = 5%)

35
Results

Test Two: AV vs SE2 (with UI issue)

Test Three: SE1 vs SE2

36
Ranking

Given 2.4 query terms, search 2B documents and
return 10 highly relevant in 300 msecs
Problem queries
Travel (matches 32M documents)
John Ellis (which one)
Cobra (medical or animal)
Query types
Navigational (known item retrieval)
Informational
Ingredients
Keyword match (title, abstract, body)
Anchor Text (referring text)
Quality (link connectivity)
User Feedback (clickrate analysis)

37
The Components of Relevance

First Generation
Keyword matching
Title and abstract worth more
Second Generation
Computed document authority
Based on link analysis
Anchor text matching
Webmaster voting
Development Cycle
Tune Ranking Evaluate Metrics

38
Connectivity
39
Connectivity Goals

An indicator of authority
As measured by static links
Each link is a vote in favor of a site
Webmasters are the voters
Not all links are equal
Links from authoritative sites are worth more
Introduces an interesting circularity
Votes from sites with many links are discounted
Use your vote wisely
Discount navigational links
Not all links are editorial
Account for link SPAM

40
Connectivity Network

What is the authority score for nodes A and B?
Inlink count computes
A = 3
B = 2
Page Rank computes
A = .225
B = .295
41
Definitions

Connectivity Graph
Nodes are pages (or hosts)
Directed edges are links
Graph edges can be represented as a transition
matrix, A
The ith row of A represents the links out from
node i
Authority score
Score associated with each node
Some function of inlinks to node and outlinks
from node
Simplest authority score is inlink count

42
Page Rank (Without Random Jump)

Contribution averaged over all outlinks: node i passes r(i) / outdegree(i) along each outlink
Node score is the sum of contributions: r(j) = sum over i of A(i,j) r(i)
Fixed point equation: r = A^T r
If A is normalized
Each row sums to 1.0

[Figure: example network, labeled with values .1 and 1/2; resulting scores A = .25, B = .3]
43
Page Rank Implications

A is a stochastic matrix
r(i) can be interpreted as a probability
Suppose a surfer takes an outlink at random
r(i) is the long run probability of landing at a particular node
Solution to fixed point equation is the principal eigenvector
principal eigenvalue is 1.0
Solution can be found by iteration
If then
Start with random initial value for r
Iterate multiplication by A
Contribution of smaller eigenvalues will drop out
Final value is a good estimate of the fixed point solution
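
The iteration can be sketched in a few lines of plain Python. This assumes the fixed point r = A^T r with a row-normalized A, requires every node to have at least one outlink, and uses a hypothetical three-page graph rather than the network shown on the slides.

```python
def pagerank_no_jump(links, iterations=100):
    """Power iteration for r = A^T r, where row i of A spreads r(i)
    evenly over the outlinks of node i."""
    nodes = sorted(links)
    r = {n: 1.0 / len(nodes) for n in nodes}   # uniform starting vector
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for src in nodes:
            share = r[src] / len(links[src])   # contribution averaged over outlinks
            for dst in links[src]:
                nxt[dst] += share              # node score is the sum of contributions
        r = nxt
    return r

# Hypothetical 3-page graph; every node has at least one outlink.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
r = pagerank_no_jump(links)
print({n: round(v, 3) for n, v in r.items()})  # {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

The scores sum to 1 at every step, which matches the reading of r(i) as the long-run probability that the random surfer is at node i.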

44
Page Rank (with random jump)

What's the score for a node with no in-links?
Revised equation
Fixed point equation
Probability interpretation
As before, with ε chance of jumping randomly

[Figure: example network with ε = 0.1; resulting scores A = .225, B = .293]
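
The revised equation itself did not survive in this transcript, so the sketch below uses the standard formulation, r = (1 - jump) * A^T r + jump / N, as an assumption. The parameter name `jump`, the uniform treatment of nodes without outlinks, and the example graph are likewise illustrative, and the output is not meant to reproduce the scores on the slide.

```python
def pagerank(links, jump=0.1, iterations=200):
    """Power iteration for r = (1 - jump) * A^T r + jump / N."""
    nodes = sorted(links)
    n = len(nodes)
    r = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        nxt = {node: jump / n for node in nodes}   # mass arriving by random jump
        for src in nodes:
            outs = links[src]
            if outs:
                share = (1.0 - jump) * r[src] / len(outs)
                for dst in outs:
                    nxt[dst] += share
            else:
                # Assumption: a node without outlinks jumps uniformly at random.
                for dst in nodes:
                    nxt[dst] += (1.0 - jump) * r[src] / n
        r = nxt
    return r

# Hypothetical graph: D has no in-links, yet still receives a nonzero score.
links = {"A": ["B"], "B": ["A"], "D": ["A"]}
r = pagerank(links, jump=0.1)
print({n: round(v, 4) for n, v in r.items()})
```

A node with no in-links ends up with exactly jump / N, which answers the question the slide opens with.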
45
Eigenrank

Separates internal from external links
Internal transition matrix I
External transition matrix E
Introduces a new parameter
ε is the random jump probability
β is the probability of taking an internal link
(1 - ε - β) is the probability of taking an external link

46
Eigenrank

Revised equation
Fixed point equation
Probability interpretation
ε chance of random jump
β chance of internal link
(1 - ε - β) chance of external link

[Figure: example network with parameters set to 0.1; resulting scores A = .2, B = .202]
47
Computational Issues

Nodes with no outlinks
Transition matrix with zero row
Internal or external
Leave out of computation(?)
Redistribute mass to random jump(?)
Currently mass is redistributed
Complex formula that prefers external links

48
Kleinberg

Two scores
Authority score, a
Hub score, h
Fixed Point equations
Authority: a = A^T h
Hub: h = A a
Principal eigenvectors are solutions
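
A minimal sketch of the two coupled updates, assuming the usual form of Kleinberg's method: a page's authority is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to, with both vectors rescaled to unit length each round. The graph and node names are made up.

```python
from math import sqrt

def hits(links, iterations=50):
    """Iterate a = A^T h and h = A a, rescaling to unit length each round."""
    nodes = sorted(set(links) | {d for outs in links.values() for d in outs})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to the node.
        auth = {n: 0.0 for n in nodes}
        for src, outs in links.items():
            for dst in outs:
                auth[dst] += hub[src]
        # Hub: sum of authority scores of the pages the node links to.
        hub = {n: sum(auth[d] for d in links.get(n, [])) for n in nodes}
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub

# Made-up graph: h1 and h2 act as hubs, x and y as authorities.
links = {"h1": ["x", "y"], "h2": ["x"], "x": [], "y": []}
auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # x h1
```

Rescaling each round keeps the values bounded without changing the direction the vectors converge to.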

49
SPAM

Manipulation of content purely to influence
ranking
Dictionary SPAM
Link sharing
Domain hi-jacking
Link farms
Robotic use of search results
Meta-search engines
Search Engine optimizers
Fraud

50
Third Generation Technologies
51
Handling Ambiguity
Results for query Cobra
52
Impression Tracking
Incoherent urls are those that receive high rank
for a large diversity of queries. Many
incoherent urls indicate SPAM or a bug (as in
this case).
53
Clickrate Relevance Metric
Average highest rank clicked perceptibly
increased with the release of a new rank function.
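
One plausible way to compute such a metric is sketched below. The log format and the function name are assumptions, and the click data is invented. Rank 1 is the top result, so a higher rank corresponds to a smaller number.

```python
def average_highest_rank_clicked(sessions):
    """Mean, over queries with at least one click, of the best rank clicked
    (rank 1 is the top result, so smaller is better)."""
    best = [min(clicked) for clicked in sessions if clicked]
    return sum(best) / len(best) if best else None

# Made-up click logs: each inner list holds the ranks clicked for one query.
old_ranker = [[3, 5], [2], [], [4]]
new_ranker = [[1, 5], [2], [1], []]
before = average_highest_rank_clicked(old_ranker)
after = average_highest_rank_clicked(new_ranker)
print(before, round(after, 2))  # 3.0 1.33
```

Queries with no clicks are excluded here; counting them separately, for example as an abandonment rate, would be a reasonable companion metric.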
54
User Interface

Ranked result lists
Document summaries are critical
Hit highlighting
Dynamic abstracts
url
No recent innovation
Graphical presentations not well fit to the task
Blending
Predefined segmentation
e.g. Paid listing
Intermixed with results from other sources
e.g. News

55
Future Trends

Question Answering
WWW as language model
Enables simple methods
e.g. Dumais et al. (SIGIR 2002)
New contexts
Ubiquitous Searching
Toolbars, desktop, phone
Implicit Searching
Computed links
New Tasks
e.g. Local/Country Search

56
Bibliography

Modeling the Internet and the Web: Probabilistic Methods and Algorithms, by Pierre Baldi, Paolo Frasconi, and Padhraic Smyth. John Wiley & Sons, May 28, 2003.
Mining the Web: Analysis of Hypertext and Semi Structured Data, by Soumen Chakrabarti. Morgan Kaufmann, August 15, 2002.
The Anatomy of a Large-scale Hypertextual Web Search Engine, by S. Brin and L. Page. 7th International WWW Conference, Brisbane, Australia, April 1998.
Websites
http://www.searchenginewatch.com/
http://www.searchengineshowdown.com/
Presentations
http://infonortics.com/searchengines/sh03/slides/evans.pdf