1 Lecture 9: Unstructured Data
- Information Retrieval
  - Types of Systems, Documents, Tasks
  - Evaluation: Precision, Recall
- Search Engines (Google)
  - Architecture
  - Web Crawling
  - Query Processing
  - Inverted Indexes
  - PageRank (!)
- Most of the IR portion of this material is taken from the course "Information retrieval on the Internet" by Maier and Price, taught at PSU in alternate years.
2 Learning Objectives
- LO9.1 Given a transition matrix, draw the transition graph, and vice versa.
- LO9.2 Given a transition matrix and a residence vector, decide if the vector is the PageRank for that matrix.
3 Information Retrieval (IR)
- The study of Unstructured Data is called Information Retrieval (IR)
- A Database refers to Structured Data
4 General types of IR systems
- Web Pages
- Full text documents
- Bibliographies
- Distributed variations
  - Metasearch
  - Virtual document collections
5 Types of Documents in IR Systems
- Hyperlinked or not
- Format
  - HTML
  - PDF
  - Word Processed
  - Scanned OCR
- Type
  - Text
  - Multimedia
  - Semistructured, e.g., XML
- Static or Dynamic
6 Types of tasks in IR systems
- Find
  - an overview
  - a fact/answer a question
  - comprehensive information
  - a known item (document, page or site)
  - a site to execute a transaction (e.g., buy a book, download a file)
7 Evaluation
- How can we evaluate performance of an IR system?
  - System perspective
  - User perspective
- User perspective: Relevance
  - (How well) does a document satisfy a user's need?
- Ideally, an IR system will retrieve exactly those items that satisfy the user's needs, no more, no less.
  - More: wastes user's time
  - Less: user misses valuable information
8 Notation
- In response to a user's query
  - The IR system reTrieves a set of documents T
  - The user knows the set of reLevant documents L
- |X| denotes the number of documents in X
- Ideally, T = L: no more (no junk), no less (no missing)
9 The big picture
- [Figure: Venn diagram of the retrieved set T and the relevant set L, overlapping in T ∩ L]
  - Retrieved, not relevant: junk
  - Relevant, not retrieved: missing
- Precision: $|T \cap L| \,/\, |T|$
  - fraction of retrieved items that were relevant
  - 1 if all retrieved items were relevant (no junk)
- Recall: $|T \cap L| \,/\, |L|$
  - fraction of relevant items that were retrieved
  - 1 if all the relevant items were retrieved (no missing)
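The two definitions above can be sketched in a few lines of Python. This is only an illustration: the document IDs are made up, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant: |T ∩ L| / |T|."""
    if not retrieved:
        return 0.0  # convention assumed here for an empty result set
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant items that were retrieved: |T ∩ L| / |L|."""
    if not relevant:
        return 0.0  # convention assumed here when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


# Hypothetical document IDs, for illustration only.
T = {"d1", "d2", "d3", "d4"}  # retrieved
L = {"d2", "d4", "d5"}        # relevant

print(precision(T, L))  # 0.5: d1 and d3 are junk
print(recall(T, L))     # 2/3: d5 is missing
```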
10 Context
- Precision and Recall were created for IR systems that retrieved from a small set of items.
  - In that case one could calculate T and L.
- Web search engines do not fit this model well: T and L are huge.
- Recall does not make sense in this model, but we can apply the definition of precision_at_10: the fraction of the first 10 displayed items that are relevant.
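A minimal sketch of precision_at_k, assuming the engine's output is available as a ranked list and that relevance judgments come from a human assessor. The function name, the result IDs and the choice to divide by k (even if fewer than k results come back) are assumptions made for this example.

```python
def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the first k displayed results that are relevant."""
    top_k = ranked_results[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k  # assumption: always divide by k, not by len(top_k)


# Hypothetical ranked results and relevance judgments.
ranked = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11"]
judged_relevant = {"r1", "r3", "r4", "r8", "r11"}

print(precision_at_k(ranked, judged_relevant, k=10))  # 0.4
```

Note that r11 is relevant but does not count, because it is not among the first 10 displayed.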
11 Experiment
- Compute precision_at_10 and precision_at_20 for Google, Bing and Yahoo for this query:
  - Paris Hilton Hotel
- Precision: fraction of retrieved items that are relevant
12 Search Engine Architecture
- How often do you google?
- What happens when you google?
  - http://www.google.com/corporate/tech.html
- Average time: half a second
- We need a crawler to create the indexes and docs.
  - Notice that the web crawler creates the docs.
  - From the docs, the indexes are created and the docs are given ranks (cf. later slides).
- Let's study the Web Crawler Algorithm (WCA)
  - Page 1143 of the handout
13 Web Crawler Algorithm
- Input: set of popular URLs S
- Output: repository of visited web pages R
- Method:
  1. If S is empty, end.
  2. Select a URL u from S to crawl, and delete u from S.
  3. Get p, the page that u points to.
  4. If p is in R, return to (1).
  5. Else add p to R, and add to S all outlinks from p unless they are already in R or S.
  6. Return to step (1).
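The loop above can be sketched in Python against a toy in-memory "web". The `fetch` callback, the `TOY_WEB` dictionary and the choice to key the repository by URL are assumptions for this example; a real crawler would fetch over HTTP and parse out the links.

```python
from collections import deque


def crawl(seed_urls, fetch):
    """Crawl from seed_urls; fetch(url) -> (content, outlinks) is hypothetical."""
    S = deque(seed_urls)  # URLs still to crawl
    in_S = set(S)         # hash index on S for fast membership tests
    R = {}                # repository: url -> page content
    while S:                              # step 1: stop when S is empty
        url = S.popleft()                 # step 2: select and delete from S
        in_S.discard(url)
        if url in R:                      # step 4: already in the repository
            continue
        content, outlinks = fetch(url)    # step 3: get the page
        R[url] = content                  # step 5: add the page to R ...
        for link in outlinks:             # ... and its new outlinks to S
            if link not in R and link not in in_S:
                S.append(link)
                in_S.add(link)
    return R


# Toy web for illustration: url -> (content, outlinks).
TOY_WEB = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["a", "c"]),
    "c": ("page C", ["a"]),
    "d": ("page D", ["a"]),  # nothing links to d, so it is never reached
}

repo = crawl(["a"], lambda url: TOY_WEB[url])
print(sorted(repo))  # ['a', 'b', 'c']
```

Taking URLs from the front of the queue makes this a breadth-first crawl; other selection policies only change how the next URL is picked.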
14 WCA: Terminating Search
- Limit the number of pages crawled
  - Total number of pages, or
  - Pages per site
- Limit the depth of the crawl
15 WCA: Managing the Repository
- Don't add duplicates to S
  - Need an index on S, probably hash
- Don't add duplicates to R
  - Cannot happen since we search each URL only once?
  - A page can come from more than one URL: mirror sites
  - So use a hash table of pages in R
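One way to sketch the hash table of pages is to key it on a digest of the page content, so that a mirror URL serving identical content is detected. Using SHA-256 and exact byte-for-byte matching are assumptions of this example; the slides do not say which hash is used.

```python
import hashlib


def add_page(repository, seen_hashes, url, content):
    """Add the page to the repository unless identical content is already stored."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False  # same page reached through another URL (e.g. a mirror)
    seen_hashes.add(digest)
    repository[url] = content
    return True


R = {}
hashes = set()
# Hypothetical URLs: the second is a mirror of the first.
print(add_page(R, hashes, "http://site.example/index", "hello web"))    # True
print(add_page(R, hashes, "http://mirror.example/index", "hello web"))  # False
```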
16 WCA: Select Next Page in S?
- Can use Random Search
- Better: Most Important First
  - Can consider first set of pages to be most important
  - As pages are added, make them less important
  - Breadth first search
- Can do a simplified PageRank (cf. later) calculation
17 WCA: Faster, Faster
- Multiprogramming, Multiprocessing
- Must manage locks on S
  - With billions of URLs, this becomes a bottleneck
  - So assign each process to a host/site, not a URL
  - This can become a denial-of-service attack, so throttle down and take on several sites, organized by hash buckets
- R also has bottleneck problems, and can be handled with locks
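A possible sketch of the hash-bucket idea: every URL of a given host maps to the same worker, so one worker owns (and can throttle) each site. The function name, the number of workers and the use of CRC32 as the bucket hash are assumptions for illustration.

```python
import zlib
from urllib.parse import urlsplit


def worker_for_url(url, num_workers):
    """Map a URL to a worker by hashing its host, so one worker owns each site."""
    host = urlsplit(url).hostname or ""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(host.encode("utf-8")) % num_workers


# Hypothetical URLs on the same host land in the same bucket.
w1 = worker_for_url("http://example.com/a", 8)
w2 = worker_for_url("http://example.com/b?x=1", 8)
print(w1 == w2)  # True
```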
18 On to Query Processing
- Very different from structured data: no SQL, parser, optimizer
- Input is a boolean combination of keywords
  - data AND base
  - data OR base
- Google's goal is an engine that "understands exactly what you mean and gives you back exactly what you want"
19 Inverted Indexes
- When the crawl is complete, the search engine builds, for each and every word, an inverted index.
- An inverted index is a list of all documents containing that word
  - The index may be a bit vector
  - It may also contain the location(s) of the word in the document
- Word: any word in any language, plus misspellings, plus any sequence of characters surrounded by punctuation!
  - ⇒ Hundreds of millions of words
  - ⇒ Farms of PCs, e.g. near Bonneville Dam, to hold all this data
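A small sketch of building inverted indexes that also record word locations. The three documents are invented, and splitting on `\w+` after lowercasing is a simplification of the slide's much broader notion of a "word".

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to a list of (doc_id, [positions]) entries."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        words = re.findall(r"\w+", docs[doc_id].lower())
        for pos, word in enumerate(words):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            index[word].append((doc_id, pos_list))
    return dict(index)


# Hypothetical document collection: doc_id -> text.
docs = {
    1: "data base systems",
    2: "base camp",
    3: "data mining of web data",
}

index = build_inverted_index(docs)
print(index["data"])  # [(1, [0]), (3, [0, 4])]
print(index["base"])  # [(1, [1]), (2, [0])]
```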
20 Mechanics of Query Processing
- Relevant inverted indexes are found
  - Typically the indexes are in memory, otherwise this could take a full half second
- If they are bit vectors, they are ANDed or ORed, then materialized, then lists are handled
- Result is many URLs.
- Next step is to determine their rank so the highest ranked URLs can be delivered to the user.
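The bit-vector step can be sketched with Python integers standing in for bit vectors: bit d is set when document d contains the word. The document numbers below are made up, and a production engine would use compressed bitmaps rather than plain integers.

```python
def to_bit_vector(doc_ids):
    """Set bit d for every document id d that contains the word."""
    bits = 0
    for d in doc_ids:
        bits |= 1 << d
    return bits


def materialize(bits):
    """Turn a bit vector back into a sorted list of document ids."""
    result = []
    d = 0
    while bits:
        if bits & 1:
            result.append(d)
        bits >>= 1
        d += 1
    return result


data = to_bit_vector([1, 3])  # documents containing "data" (hypothetical)
base = to_bit_vector([1, 2])  # documents containing "base" (hypothetical)

print(materialize(data & base))  # data AND base -> [1]
print(materialize(data | base))  # data OR base  -> [1, 2, 3]
```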
21 Ranking Pages
- Indexes have returned pages. Which ones are most relevant to you?
- There are many criteria for ranking pages; here are some no-brainers (except the one marked !)
  - Presence of all words
  - All words close together
  - Words in important locations and formats on the page
  - ! Words near anchor text of links in reference pages
- But the killer criterion is PageRank
22 PageRank Intuition
- You need to find a plumber. How do you do it?
  1. Call plumbers and talk to them
  2. ! Call friends and ask for plumber references
     - Then choose plumbers who have the most references
  3. !! Call friends who know a lot about plumbers (important friends) and ask them for plumber references
     - Then choose plumbers who have the most references from important people.
- Technique 1 was used before Google.
- Google introduced technique 2 to search engines
- Google also introduced technique 3
- Techniques 2, and especially 3, wiped out the competition.
- The big challenge: determine which pages are important
23 What does this mean for pages?
- Most search engines look for pages containing the word "plumber"
- Google searches for pages that are linked to by pages containing "plumber".
- Google searches for pages that are linked to by important pages containing "plumber".
- A web page is important if many important pages link to it.
  - This is a recursive equation.
  - Google solves it by imagining a web walker.
24 The Web Walker
- From page p, the walker follows a random link in p
  - Note that all links in p have equal weight
- The walker walks for a very, very, long time.
- A residence vector (y, a, m) describes the percentage of time that the walker spends on each page
  - What does the vector (1/3, 1/3, 1/3) mean?
- In steady state, the residence vector will be (1st draft of) the PageRank
  - Observe: pages with many in-links are visited often
  - Observe: important pages are visited most often
25 Stochastic Transition Matrix
- To describe the page walker's moves, we use a stochastic transition matrix.
  - Stochastic: each column sums to 1
- There are 3 web pages: Yahoo, Amazon and Microsoft
- This matrix means that the Yahoo page has 2 outlinks, to Yahoo (a self-link) and to Amazon, etc. (column = source page, row = destination page)

$$
\begin{array}{c|ccc}
  & Y & A & M \\
\hline
Y & 1/2 & 1/2 & 0 \\
A & 1/2 & 0 & 1 \\
M & 0 & 1/2 & 0
\end{array}
$$
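As an illustration of where such a matrix comes from, the sketch below builds the column-stochastic matrix from a table of outlinks, giving every link on a page equal weight. The `outlinks` dictionary encodes the Yahoo/Amazon/Microsoft example; the function name and data layout are assumptions.

```python
from fractions import Fraction


def transition_matrix(pages, outlinks):
    """Entry [i][j] is the probability of stepping from page j to page i."""
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    M = [[Fraction(0)] * n for _ in range(n)]
    for src in pages:
        targets = outlinks[src]
        for dst in targets:
            # every link on a page has equal weight
            M[idx[dst]][idx[src]] += Fraction(1, len(targets))
    return M


pages = ["Y", "A", "M"]
outlinks = {"Y": ["Y", "A"], "A": ["Y", "M"], "M": ["A"]}

M = transition_matrix(pages, outlinks)
for row in M:
    print([str(x) for x in row])

# Stochastic check: each column sums to 1.
print([sum(M[i][j] for i in range(3)) == 1 for j in range(3)])  # [True, True, True]
```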
26 Transition Graph
- Each Transition Matrix corresponds to a Transition Graph, e.g.
- [Figure: transition graph with edges Y→Y (1/2), Y→A (1/2), A→Y (1/2), A→M (1/2), M→A (1)]
27 LO9.1 Transition Graph
- What is the Transition Graph for this Matrix?

$$
\begin{array}{ccc}
Y & A & M \\
\hline
0 & 1/2 & ? \\
? & 0 & ? \\
? & 1/2 & 0
\end{array}
$$
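28 Solving for Page Rank
- For small dimension matrices it is simple to calculate the PageRank using Gaussian Elimination.
- Remember (y, a, m) is the fraction of time the walker spends at each site. Since it is a probability distribution, y + a + m = 1. Since the walker has reached steady state,

$$
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
=
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
$$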
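As a worked sketch of the elimination, write the steady-state condition for the Yahoo/Amazon/Microsoft matrix row by row and add the normalization y + a + m = 1:

$$
\begin{aligned}
y &= \tfrac12 y + \tfrac12 a &&\Rightarrow\; y = a \\
m &= \tfrac12 a \\
y + a + m &= 1 &&\Rightarrow\; a + a + \tfrac12 a = 1 \;\Rightarrow\; a = \tfrac25
\end{aligned}
$$

So (y, a, m) = (2/5, 2/5, 1/5). The remaining row, a = y/2 + m, is satisfied as well: 1/5 + 1/5 = 2/5.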
29 Solving, ctd
- Solving such small equations is easy, but in reality the matrix dimension is the number of pages in the web, so it is in the billions.
- There is a simpler way, called relaxation.
- Start with a distribution, typically equal values, and transform it by the matrix.

$$
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}
=
\begin{pmatrix} 2/6 \\ 3/6 \\ 1/6 \end{pmatrix}
$$
30 Solving, ctd
- If we repeat this only 5-10 times the vectors converge to values very close to (2/5, 2/5, 1/5). Check that this is a solution:

$$
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} 2/5 \\ 2/5 \\ 1/5 \end{pmatrix}
=
\begin{pmatrix} 2/5 \\ 2/5 \\ 1/5 \end{pmatrix}
$$

- This solution gives the PageRank of each page on the Web.
  - It is also called the eigenvector of the matrix with eigenvalue one.
  - Does this agree with our intuition about Page Rank?
- For real web values, at most 100 iterations suffice
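Relaxation as described on these slides can be sketched as repeated matrix-vector multiplication. The function name and the fixed step count are choices made for this example; the matrix is the Yahoo/Amazon/Microsoft one from the slides.

```python
def relax(M, p, steps):
    """Repeatedly replace the residence vector p by M p."""
    n = len(p)
    for _ in range(steps):
        p = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p


M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
start = [1 / 3, 1 / 3, 1 / 3]

one_step = relax(M, start, 1)     # approximately (2/6, 3/6, 1/6)
steady = relax(M, start, 100)     # approximately (2/5, 2/5, 1/5)
print([round(x, 4) for x in steady])  # [0.4, 0.4, 0.2]
```

Each step costs one pass over the non-zero entries of the matrix, which is why this scales to web-sized matrices where elimination does not.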
31 LO9.2 Identify Solution
- Is (3/8, 1/4, 3/8) a solution for this transition matrix?

$$
\begin{pmatrix}
0 & 1/2 & ? \\
? & 0 & ? \\
? & 1/2 & 0
\end{pmatrix}
$$
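32 A Spider Trap
- Let's look at a more realistic example called a spider trap.

$$
\begin{array}{c|ccc}
  & Y & A & M \\
\hline
Y & 1/2 & 1/2 & 0 \\
A & 1/2 & 0 & 0 \\
M & 0 & 1/2 & 1
\end{array}
$$

- M represents any set of web pages that does not have a link outside the set.
- The Transition Graph is
- [Figure: transition graph with edges Y→Y (1/2), Y→A (1/2), A→Y (1/2), A→M (1/2), M→M (1)]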
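33 A Spider Trap

$$
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}
\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
$$

- Relaxation arrives at this vector because a random walker arrives at M and stays there in a loop.
- This Page Rank vector violates the Page Rank principle that inlinks should determine importance.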
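34 A Dead End
- A similar example, called a dead end, is

$$
\begin{array}{c|ccc}
  & Y & A & M \\
\hline
Y & 1/2 & 1/2 & 0 \\
A & 1/2 & 0 & 0 \\
M & 0 & 1/2 & 0
\end{array}
$$

- M represents any set of web pages that does not have out-links.
- The Transition Graph is
- [Figure: transition graph with edges Y→Y (1/2), Y→A (1/2), A→Y (1/2), A→M (1/2); M has no out-links]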
35 A Dead End, ctd
- A dead end matrix is not stochastic, because the column for M sums to 0 instead of 1.
- The only vector that satisfies the steady-state equation for a dead end matrix is the zero vector.
- Relaxation arrives at the zero vector because a random walker arrives at M and then has nowhere to go.
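Both failure modes can be reproduced with plain relaxation. This sketch uses the spider trap and dead end matrices from the slides; the function name and the number of steps (200) are arbitrary choices for the example.

```python
def relax(M, p, steps):
    """Apply p <- M p repeatedly (plain relaxation, no teleporting)."""
    n = len(p)
    for _ in range(steps):
        p = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p


SPIDER_TRAP = [[0.5, 0.5, 0.0],
               [0.5, 0.0, 0.0],
               [0.0, 0.5, 1.0]]
DEAD_END = [[0.5, 0.5, 0.0],
            [0.5, 0.0, 0.0],
            [0.0, 0.5, 0.0]]
start = [1 / 3, 1 / 3, 1 / 3]

trap = relax(SPIDER_TRAP, start, 200)
dead = relax(DEAD_END, start, 200)
print([round(x, 6) for x in trap])  # [0.0, 0.0, 1.0]: everything ends up in M
print([round(x, 6) for x in dead])  # [0.0, 0.0, 0.0]: the walker leaks away
```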
36 What to do?
- In these cases, which happen all the time on the web, the web walker algorithm does not identify which pages are truly important.
- But we can tweak the algorithm to do so: every 5th step, or so, the walker steps to a random page on the web.
- Then the walk (spider trap example) becomes

$$
P_{new} = 0.8
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}
P_{old}
+ 0.2
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}
$$
37 Teleporter
- Now our tweaked random walker is a teleporter.
- With probability 80% s/he follows a random link from the current page, as before.
- But with probability 20% s/he teleports to a random page with uniform probability.
  - It could be anywhere on the web, even the current page
- If s/he is at a dead end, with 100% probability s/he teleports to a random page with uniform probability.
- 80-20 are tunable parameters
38 Solving the Teleporter Equation
- The equation on slide 36 describes the teleporter's walk. It can be solved using relaxation or Gaussian elimination.
- The solution is (7/33, 5/33, 21/33).
- It gives unreasonably high importance to M, but does recognize that Y is more important than A.
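The stated solution can be checked by relaxing the slide 36 equation for the spider trap. In this sketch the function name, the `follow` parameter and the step count are assumptions, and the special 100% teleport rule for dead ends is not modeled.

```python
def teleport_relax(M, follow=0.8, steps=100):
    """Iterate P_new = follow * M * P_old + (1 - follow) * uniform."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(steps):
        p = [follow * sum(M[i][j] * p[j] for j in range(n)) + (1 - follow) / n
             for i in range(n)]
    return p


SPIDER_TRAP = [[0.5, 0.5, 0.0],
               [0.5, 0.0, 0.0],
               [0.0, 0.5, 1.0]]

p = teleport_relax(SPIDER_TRAP)
print([round(x * 33, 3) for x in p])  # [7.0, 5.0, 21.0], i.e. (7/33, 5/33, 21/33)
```

Lowering `follow` moves the result toward the uniform vector, which is one way to see why the 80-20 split is a tunable parameter.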