Most of the IR portion of this material is take from th - PowerPoint PPT Presentation

Slides: 39

Provided by: loisdel

Learn more at: http://web.cecs.pdx.edu

1
Lecture 9: Unstructured Data

Information Retrieval
Types of Systems, Documents, Tasks
Evaluation: Precision, Recall
Search Engines (Google)
Architecture
Web Crawling
Query Processing
Inverted Indexes
PageRank (!)
Most of the IR portion of this material is taken
from the course "Information retrieval on the
Internet" by Maier and Price, taught at PSU in
alternate years.

2
Learning Objectives

LO9.1 Given a Transition matrix draw a transition
graph, and vice versa.
LO9.2 Given a Transition matrix, and a residence
vector, decide if it is the PageRank for that
matrix.

3
Information Retrieval (IR)

The study of Unstructured Data is called
Information Retrieval (IR)
A Database refers to Structured Data

4
General types of IR systems

Web Pages
Full text documents
Bibliographies
Distributed variations
Metasearch
Virtual document collections

5
Types of Documents in IR Systems

Hyperlinked or not
Format
HTML
PDF
Word Processed
Scanned OCR
Type
Text
Multimedia
Semistructured, e.g., XML
Static or Dynamic

6
Types of tasks in IR systems

Find
an overview
a fact/answer a question
comprehensive information
a known item (document, page or site)
a site to execute a transaction (e.g., buy a
book, download a file)

7
Evaluation

How can we evaluate performance of an IR system?
System perspective
User perspective
User perspective: Relevance
(How well) does a document satisfy a user's need?
Ideally, an IR system will retrieve exactly those
items that satisfy the user's needs, no more, no
less.
More: wastes the user's time
Less: the user misses valuable information

8
Notation

In response to a user's query:
The IR system
reTrieves a set of documents T
The user
knows the set of reLevant documents L
|X| denotes the number of documents in X
Ideally, T = L: no more (no junk), no less (no
missing)

9
The big picture
[Venn diagram: the retrieved set T and the relevant set L overlap in T ∩ L. Retrieved, not relevant = junk. Relevant, not retrieved = missing.]

Precision = $\frac{|T \cap L|}{|T|}$
fraction of retrieved items that were relevant
= 1 if all retrieved items were relevant (no junk)

Recall = $\frac{|T \cap L|}{|L|}$
fraction of relevant items that were retrieved
= 1 if all the relevant items were retrieved (no missing)
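The two definitions can be sketched with Python sets. This is a minimal illustration, not part of the original slides: the document ids are hypothetical, and returning 0.0 for an empty set is a convention assumed here.

```python
def precision(retrieved: set, relevant: set) -> float:
    """|T ∩ L| / |T|: fraction of retrieved items that are relevant."""
    if not retrieved:
        return 0.0  # assumed convention for an empty result set
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved: set, relevant: set) -> float:
    """|T ∩ L| / |L|: fraction of relevant items that were retrieved."""
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


# Hypothetical document ids, for illustration only.
T = {"d1", "d2", "d3", "d4"}  # retrieved
L = {"d2", "d4", "d5"}        # relevant

print(precision(T, L))  # 0.5: two of the four retrieved items are junk
print(recall(T, L))     # two of the three relevant items were retrieved
```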

10
Context

Precision and Recall were created for IR systems
that retrieved from a small set of items.
In that case one could calculate T and L.
Web search engines do not fit this model well: T
and L are huge.
Recall does not make sense in this model, but we
can apply the definition of precision_at_10,
measuring the fraction of the first 10 displayed
items that are relevant.
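A possible sketch of precision_at_k for a ranked result list. The function name, the sample URLs and the relevance judgments are all hypothetical. Dividing by the number of results actually shown (rather than by k) when fewer than k results come back is an assumption made here.

```python
def precision_at_k(ranked_results: list, relevant: set, k: int) -> float:
    """Fraction of the first k displayed results that are relevant."""
    top_k = ranked_results[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for result in top_k if result in relevant)
    return hits / len(top_k)  # assumption: divide by results shown, not by k


# Hypothetical ranked results and relevance judgments.
ranked = ["u1", "u2", "u3", "u4", "u5"]
judged_relevant = {"u1", "u3", "u9"}

print(precision_at_k(ranked, judged_relevant, 5))  # 0.4
print(precision_at_k(ranked, judged_relevant, 2))  # 0.5
```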

11
Experiment

Compute Precision_at_10 and Precision_at_20 for Google, Bing and
Yahoo for this query:
Paris Hilton Hotel
Precision = fraction of retrieved items that are
relevant

12
Search Engine Architecture

How often do you google?
What happens when you google?
http://www.google.com/corporate/tech.html
Average time: half a second
We need a crawler to create the indexes and docs.
Notice that the web crawler creates the docs.
From the docs, the indexes are created and the
docs are given ranks (cf. later slides).
Let's study the Web Crawler Algorithm (WCA)
Page 1143 of the handout

13
Web Crawler Algorithm

Input: Set of popular URLs S
Output: Repository of visited web pages R
Method:
(1) If S is empty, end
(2) Select URL p from S to crawl, delete p from S
(3) Get the page that p points to
(4) If that page is in R, return to (1)
(5) Else add the page to R, and add to S all outlinks from
the page unless they are already in R or S
(6) Return to step (1)
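The method above can be sketched in Python. This is an illustration under stated assumptions, not the handout's code: the "web" is a hypothetical in-memory dictionary (TOY_WEB), fetch is a stand-in for a real HTTP request, and the queue makes the crawl breadth first. The max_pages limit is one of the termination rules discussed on the next slide.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, list of outlink URLs).
TOY_WEB = {
    "a.example": ("page A", ["b.example", "c.example"]),
    "b.example": ("page B", ["a.example", "c.example"]),
    "c.example": ("page C", ["d.example"]),
    "d.example": ("page D", []),
}


def fetch(url):
    """Stand-in for an HTTP fetch; returns (text, outlinks) or None."""
    return TOY_WEB.get(url)


def crawl(seed_urls, max_pages=1000):
    frontier = deque(seed_urls)  # S: URLs still to crawl
    queued = set(seed_urls)      # every URL that is, or ever was, in S
    repository = {}              # R: URL -> page text

    # Stop when S is empty or the page limit is reached.
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()               # select p and delete it from S
        page = fetch(url)                      # get the page p points to
        if page is None or url in repository:  # unreachable or already in R
            continue
        text, outlinks = page
        repository[url] = text                 # add the page to R
        for link in outlinks:
            # Every URL in R was once in S, so this one check covers both.
            if link not in queued:
                queued.add(link)
                frontier.append(link)
    return repository


print(list(crawl(["a.example"])))
# ['a.example', 'b.example', 'c.example', 'd.example']
print(list(crawl(["a.example"], max_pages=2)))
# ['a.example', 'b.example']
```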

14
WCA Terminating Search

Limit the number of pages crawled
Total number of pages, or
Pages per site
Limit the depth of the crawl

15
WCA Managing the Repository

Don't add duplicates to S
Need an index on S, probably hash
Don't add duplicates to R
Cannot happen since we search each URL only once?
A page can come from more than one URL (mirror sites)
So use a hash table of pages in R
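One way to keep mirrored pages out of R is to key the repository by a hash of the page content instead of by URL. The sketch below is illustrative only: the Repository class, the URLs and the choice of SHA-256 are assumptions, and it detects exact duplicates only.

```python
import hashlib


def content_key(page_text: str) -> str:
    """Hash of the page body, used as the duplicate-detection key for R."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


class Repository:
    """R, keyed by content hash so a mirrored page is stored once."""

    def __init__(self):
        self.pages = {}  # content hash -> (first URL seen, page text)

    def add(self, url: str, text: str) -> bool:
        key = content_key(text)
        if key in self.pages:
            return False  # same content already stored under another URL
        self.pages[key] = (url, text)
        return True


repo = Repository()
print(repo.add("site.example/index.html", "hello"))           # True
print(repo.add("mirror.example/index.html", "hello"))         # False
print(repo.add("site.example/other.html", "something else"))  # True
```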

16
WCA Select Next Page in S?

Can use Random Search
Better: Most Important First
Can consider first set of pages to be most
important
As pages are added, make them less important
Breadth first search
Can do a simplified PageRank (cf. later)
calculation

17
WCA Faster, Faster

Multiprogramming, Multiprocessing
Must manage locks on S
With billions of URLs, this becomes a bottleneck
So assign each process to a host/site, not a URL
This can become a denial-of-service attack, so
throttle down and take on several sites,
organized by hash buckets
R also has bottleneck problems, and can be
handled with locks
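Assigning work by host rather than by URL can be sketched with a stable hash of the host name. The function name worker_for and the sample URLs are hypothetical. CRC32 is used only because Python's built-in hash() of strings changes between runs.

```python
import zlib
from urllib.parse import urlsplit


def worker_for(url: str, num_workers: int) -> int:
    """Map a URL to a hash bucket (crawler process) based on its host only."""
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_workers


# All URLs of one host land in the same bucket, so a single process
# controls (and can throttle) the request rate to that host.
print(worker_for("http://a.example/page1", 4) == worker_for("http://a.example/page2", 4))  # True
```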

18
On to Query Processing

Very different from structured data: no SQL,
parser, optimizer
Input is a boolean combination of keywords
data AND base
data OR base
Google's goal is an engine that "understands
exactly what you mean and gives you back exactly
what you want."

19
Inverted Indexes

When the crawl is complete, the search engine
builds, for each and every word, an inverted
index.
An inverted index is a list of all documents
containing that word
The index may be a bit vector
It may also contain the location(s) of the word
in the document
Word: any word in any language, plus misspellings,
plus any sequence of characters surrounded by
punctuation!
→ Hundreds of millions of words
→ Farms of PCs, e.g. near Bonneville Dam, to hold
all this data
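A small sketch of building inverted indexes that also record word positions. The three documents are hypothetical, and the tokenizer (lowercased runs of word characters) is a simplification of the very broad definition of "word" given above.

```python
import re
from collections import defaultdict

# Hypothetical documents: id -> text.
DOCS = {
    1: "Data base systems store structured data",
    2: "Information retrieval handles unstructured data",
    3: "A base camp",
}


def tokenize(text):
    """Simplified tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """word -> {doc id -> list of positions of the word in that doc}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word][doc_id].append(position)
    return index


index = build_inverted_index(DOCS)
print(sorted(index["data"]))  # [1, 2]
print(index["data"][1])       # [0, 5]
print(sorted(index["base"]))  # [1, 3]
```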

20
Mechanics of Query Processing

Relevant inverted indexes are found
Typically the indexes are in memory, otherwise
this could take a full half second
If they are bit vectors, they are ANDed or ORed,
then materialized, then lists are handled
Result is many URLs.
Next step is to determine their rank so the
highest ranked URLs can be delivered to the user.
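The AND/OR step on bit vectors can be sketched with Python integers used as bit sets. The document ids and the posting sets for "data" and "base" are hypothetical. Bit i stands for the i-th document in DOC_IDS.

```python
DOC_IDS = [1, 2, 3]  # hypothetical collection; bit i <-> DOC_IDS[i]


def bit_vector(docs_with_word):
    """Encode the set of documents containing a word as an integer bit set."""
    bits = 0
    for i, doc_id in enumerate(DOC_IDS):
        if doc_id in docs_with_word:
            bits |= 1 << i
    return bits


def materialize(bits):
    """Turn a bit vector back into a list of document ids."""
    return [doc_id for i, doc_id in enumerate(DOC_IDS) if (bits >> i) & 1]


data_bits = bit_vector({1, 2})  # documents containing "data"
base_bits = bit_vector({1, 3})  # documents containing "base"

print(materialize(data_bits & base_bits))  # data AND base -> [1]
print(materialize(data_bits | base_bits))  # data OR base  -> [1, 2, 3]
```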

21
Ranking Pages

Indexes have returned pages. Which ones are most
relevant to you?
There are many criteria for ranking pages; here
are some no-brainers (except the one marked !)
Presence of all words
All words close together
Words in important locations and formats on the
page
! Words near anchor text of links in reference
pages
But the killer criterion is PageRank

22
PageRank Intuition

You need to find a plumber. How do you do it?
1. Call plumbers and talk to them
2. (!) Call friends and ask for plumber references
Then choose plumbers who have the most references
3. (!!) Call friends who know a lot about plumbers
(important friends) and ask them for plumber
references
Then choose plumbers who have the most references
from important people.
Technique 1 was used before Google.
Google introduced technique 2 to search engines
Google also introduced technique 3
Techniques 2, and especially 3, wiped out the
competition.
The big challenge: determine which pages are
important

23
What does this mean for pages?

Most search engines look for pages containing the
word "plumber"
Google searches for pages that are linked to by
pages containing "plumber".
Google searches for pages that are linked to by
important pages containing "plumber".
A web page is important if many important pages
link to it.
This is a recursive equation.
Google solves it by imagining a web walker.

24
The Web Walker

From page p, the walker follows a random link in
p
Note that all links in p have equal weight
The walker walks for a very, very, long time.
A residence vector (y, a, m) describes the
fraction of time that the walker spends on each
page
What does the vector (1/3, 1/3, 1/3) mean?
In steady state, the residence vector will be
(1st draft of) the PageRank
Observe: pages with many in-links are visited
often
Observe: important pages are visited most often
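The walker can be simulated directly. The sketch below uses the three-page web (Y, A, M) from the slides that follow, written as a hypothetical dictionary of outlinks. The step count, the starting page and the fixed random seed are arbitrary choices.

```python
import random

# Three-page web: page -> pages it links to (all links weighted equally).
LINKS = {
    "Y": ["Y", "A"],
    "A": ["Y", "M"],
    "M": ["A"],
}


def simulate_walker(links, steps, start, seed=0):
    """Return the fraction of steps the walker spends on each page."""
    rng = random.Random(seed)
    visits = {page: 0 for page in links}
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(links[page])  # follow a random link from the page
    return {p: count / steps for p, count in visits.items()}


residence = simulate_walker(LINKS, steps=200_000, start="Y")
print(residence)  # roughly Y: 0.4, A: 0.4, M: 0.2
```

The estimate is noisy, which is one reason the slides solve for the residence vector with linear algebra instead of simulation.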

25
Stochastic Transition Matrix

To describe the page walker's moves, we use a
stochastic transition matrix.
Stochastic: each column sums to 1
There are 3 web pages: Yahoo, Amazon and
Microsoft
This matrix means that the Yahoo page has 2
outlinks, to Yahoo (a self-link) and to Amazon,
etc.

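$$
\begin{array}{c|ccc}
  & Y & A & M \\ \hline
Y & 1/2 & 1/2 & 0 \\
A & 1/2 & 0 & 1 \\
M & 0 & 1/2 & 0
\end{array}
$$
(column = page the walker leaves, row = page the walker arrives at)

26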
Transition Graph

Each Transition Matrix corresponds to a
Transition Graph, e.g.

[Transition graph: Y→Y (1/2), Y→A (1/2), A→Y (1/2), A→M (1/2), M→A (1)]
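Converting between the two representations (LO9.1) can be sketched as follows, using the Yahoo/Amazon/Microsoft matrix. The helper names are hypothetical. The convention assumed, consistent with "each column sums to 1", is that column j lists where page j's outlinks go.

```python
from fractions import Fraction as F

PAGES = ["Y", "A", "M"]
# MATRIX[i][j] = probability of stepping from page j to page i.
MATRIX = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]


def is_column_stochastic(matrix):
    """True if every column sums to exactly 1."""
    n = len(matrix)
    return all(sum(matrix[i][j] for i in range(n)) == 1 for j in range(n))


def matrix_to_edges(matrix, pages):
    """Edges (source, destination, probability) of the transition graph."""
    n = len(pages)
    return [
        (pages[j], pages[i], matrix[i][j])
        for j in range(n)
        for i in range(n)
        if matrix[i][j] != 0
    ]


def edges_to_matrix(edges, pages):
    """Rebuild the transition matrix from a list of weighted edges."""
    pos = {page: k for k, page in enumerate(pages)}
    matrix = [[F(0)] * len(pages) for _ in pages]
    for src, dst, prob in edges:
        matrix[pos[dst]][pos[src]] = prob
    return matrix


print(is_column_stochastic(MATRIX))  # True
for src, dst, prob in matrix_to_edges(MATRIX, PAGES):
    print(f"{src} -> {dst} ({prob})")
```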
27
LO9.1 Transition Graph

What is the Transition Graph for this Matrix?

$$
\begin{array}{c|ccc}
  & Y & A & M \\ \hline
Y & 0 & 1/2 & ? \\
A & ? & 0 & ? \\
M & ? & 1/2 & 0
\end{array}
$$

28
Solving for Page Rank

For small dimension matrices it is simple to
calculate the PageRank using Gaussian
Elimination.
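Remember that (y, a, m) is the fraction of time the walker spends at
each site. Since it is a probability
distribution, y + a + m = 1. Since the walker has
reached steady state,

$$
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
=
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
$$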
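For a matrix this small the steady-state equation can be solved exactly. The sketch below is one possible approach, not the lecture's code: it writes the system as (M - I)v = 0, replaces the last (redundant) equation with y + a + m = 1, and runs Gauss-Jordan elimination with exact fractions. It assumes the solution is unique.

```python
from fractions import Fraction as F

M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination (assumes a is invertible)."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def pagerank_exact(m):
    n = len(m)
    # (M - I) v = 0. The columns of M sum to 1, so one row is redundant;
    # replace the last row with the normalization y + a + m = 1.
    a = [[m[i][j] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    a[-1] = [F(1)] * n
    b = [F(0)] * (n - 1) + [F(1)]
    return solve(a, b)


print(pagerank_exact(M))  # [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
```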

29
Solving, ctd

Solving such small equations is easy, but in
reality the matrix dimension is the number of
pages in the web, so it is in the billions.
There is a simpler way, called relaxation.
Start with a distribution, typically equal
values, and transform it by the matrix.

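$$
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}
=
\begin{pmatrix} 2/6 \\ 3/6 \\ 1/6 \end{pmatrix}
$$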
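Relaxation is repeated matrix-vector multiplication. A minimal sketch in plain Python, with floating-point arithmetic and hypothetical function names:

```python
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]


def step(matrix, v):
    """One relaxation step: multiply the matrix by the current vector."""
    n = len(v)
    return [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]


def relax(matrix, iterations):
    """Start from the uniform distribution and apply the matrix repeatedly."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = step(matrix, v)
    return v


print(relax(M, 1))   # one step from (1/3, 1/3, 1/3): about (0.333, 0.5, 0.167)
print(relax(M, 50))  # approaches (0.4, 0.4, 0.2)
```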

30
Solving, ctd

If we repeat this only 5-10 times the vectors
converge to values very close to 2/5,2/5,1/5.
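Check that this is a solution:

$$
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} 2/5 \\ 2/5 \\ 1/5 \end{pmatrix}
=
\begin{pmatrix} 2/5 \\ 2/5 \\ 1/5 \end{pmatrix}
$$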

This solution gives the PageRank of each page on
the Web.
It is also called the eigenvector of the matrix
with eigenvalue one.
Does this agree with our intuition about Page
Rank?
For real web values, at most 100 iterations
suffice
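Checking a candidate residence vector (the task in LO9.2) needs only one multiplication. The sketch below uses exact fractions so that equality tests are safe. The function name is hypothetical.

```python
from fractions import Fraction as F

M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]


def is_pagerank(matrix, v):
    """True if v is a probability distribution and matrix * v == v."""
    n = len(v)
    if sum(v) != 1 or any(x < 0 for x in v):
        return False
    return all(
        sum(matrix[i][j] * v[j] for j in range(n)) == v[i] for i in range(n)
    )


print(is_pagerank(M, [F(2, 5), F(2, 5), F(1, 5)]))  # True
print(is_pagerank(M, [F(1, 3), F(1, 3), F(1, 3)]))  # False
```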

31
LO9.2 Identify Solution

Is (3/8, 1/4, 3/8) a solution for this
transition matrix?

$$
\begin{pmatrix} 0 & 1/2 & ? \\ ? & 0 & ? \\ ? & 1/2 & 0 \end{pmatrix}
$$

32
A Spider Trap

Let's look at a more realistic example called a
spider trap.

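$$
\begin{array}{c|ccc}
  & Y & A & M \\ \hline
Y & 1/2 & 1/2 & 0 \\
A & 1/2 & 0 & 0 \\
M & 0 & 1/2 & 1
\end{array}
$$

The Transition Graph is:
[Transition graph: Y→Y (1/2), Y→A (1/2), A→Y (1/2), A→M (1/2), M→M (1)]
M represents any set of web pages that does not
have a link outside the set.

33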
A Spider Trap

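The Page Rank is (0, 0, 1):

$$
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}
\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
$$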

Relaxation arrives at this vector because a
random walker arrives at M and stays there in a
loop.
This Page Rank vector violates the Page Rank
principle that inlinks should determine
importance.
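Running relaxation on the spider trap matrix shows the probability draining into M. A short sketch, assuming a uniform starting vector; the iteration counts are arbitrary.

```python
SPIDER_TRAP = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]


def relax(matrix, iterations):
    """Apply the matrix repeatedly, starting from the uniform distribution."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v


# Print (y, a, m) after a growing number of steps; m climbs toward 1.
for k in (1, 5, 20, 100):
    y, a, m = relax(SPIDER_TRAP, k)
    print(k, round(y, 4), round(a, 4), round(m, 4))
```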

34
A Dead End

A similar example, called a dead end, is

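$$
\begin{array}{c|ccc}
  & Y & A & M \\ \hline
Y & 1/2 & 1/2 & 0 \\
A & 1/2 & 0 & 0 \\
M & 0 & 1/2 & 0
\end{array}
$$

The Transition Graph is:
[Transition graph: Y→Y (1/2), Y→A (1/2), A→Y (1/2), A→M (1/2), no edges out of M]
M represents any set of web pages that does not
have out-links.

35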
A Dead End, ctd

A dead end matrix is not stochastic, because column M
sums to 0 instead of 1.
The only vector that satisfies the steady-state
equation for a dead end matrix is the zero vector.
Relaxation arrives at the zero vector because a
random walker arrives at M and then has nowhere
to go.
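With a dead end the total probability is not preserved: whatever reaches M is lost on the next step. A brief sketch that tracks the sum of the vector during relaxation. The function name and iteration counts are arbitrary.

```python
DEAD_END = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]


def remaining_probability(matrix, iterations):
    """Sum of the residence vector after the given number of steps."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v)


# The total shrinks toward 0 as the number of steps grows.
for k in (0, 1, 10, 100):
    print(k, remaining_probability(DEAD_END, k))
```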

36
What to do?

In these cases, which happen all the time on the
web, the web walker algorithm does not identify
which pages are truly important.
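But we can tweak the algorithm to do so: every
5th step, or so, the walker jumps to a random
page on the web.
Then the walk (spider trap example) becomes

$$
P_{new} = 0.8
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}
P_{old}
+ 0.2
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}
$$

37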
Teleporter

Now our tweaked random walker is a teleporter.
With probability 80% s/he follows a random link
from the current page, as before.
But with probability 20% s/he teleports to a
random page with uniform probability.
It could be anywhere on the web, even the current
page
If s/he is at a dead end, with 100% probability
s/he teleports to a random page with uniform
probability.
80-20 are tunable parameters

38
Solving the Teleporter Equation

The equation on slide 36 describes the
teleporter's walk. It can be solved using
relaxation or Gaussian elimination.
The solution is (7/33, 5/33, 21/33).
It gives unreasonably high importance to M, but
does recognize that Y is more important than A.
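The stated solution can be checked exactly, and relaxation reaches it from a uniform start. The sketch below applies the teleporter update to the spider trap matrix with the 80/20 split. The function name is hypothetical, and the code assumes the vector sums to 1 and the matrix has no dead ends (the spider trap has none).

```python
from fractions import Fraction as F

SPIDER_TRAP = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(0)],
    [F(0),    F(1, 2), F(1)],
]
FOLLOW = F(4, 5)  # probability of following a link (the tunable 80%)


def teleport_step(matrix, v, follow=FOLLOW):
    """P_new = follow * M * P_old + (1 - follow) * uniform vector."""
    n = len(v)
    jump = (1 - follow) / n  # share of the teleport probability per page
    return [
        follow * sum(matrix[i][j] * v[j] for j in range(n)) + jump
        for i in range(n)
    ]


claimed = [F(7, 33), F(5, 33), F(21, 33)]
print(teleport_step(SPIDER_TRAP, claimed) == claimed)  # True: a fixed point

v = [F(1, 3)] * 3
for _ in range(50):
    v = teleport_step(SPIDER_TRAP, v)
print([round(float(x), 4) for x in v])  # close to 7/33, 5/33, 21/33
```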
