1
Lecture 9: Unstructured Data
  • Information Retrieval
  • Types of Systems, Documents, Tasks
  • Evaluation: Precision, Recall
  • Search Engines (Google)
  • Architecture
  • Web Crawling
  • Query Processing
  • Inverted Indexes
  • PageRank (!)
  • Most of the IR portion of this material is taken
    from the course "Information Retrieval on the
    Internet" by Maier and Price, taught at PSU in
    alternate years.

2
Learning Objectives
  • LO9.1 Given a transition matrix, draw the
    transition graph, and vice versa.
  • LO9.2 Given a transition matrix and a residence
    vector, decide whether the vector is the PageRank
    for that matrix.

3
Information Retrieval (IR)
  • The study of unstructured data is called
    Information Retrieval (IR)
  • A database, by contrast, holds structured data

4
General types of IR systems
  • Web pages
  • Full-text documents
  • Bibliographies
  • Distributed variations
    • Metasearch
    • Virtual document collections

5
Types of Documents in IR Systems
  • Hyperlinked or not
  • Format
    • HTML
    • PDF
    • Word processed
    • Scanned (OCR)
  • Type
    • Text
    • Multimedia
    • Semistructured, e.g., XML
  • Static or dynamic

6
Types of tasks in IR systems
  • Find
    • an overview
    • a fact / an answer to a question
    • comprehensive information
    • a known item (document, page, or site)
    • a site to execute a transaction (e.g., buy a
      book, download a file)

7
Evaluation
  • How can we evaluate the performance of an IR
    system?
    • System perspective
    • User perspective
  • User perspective: Relevance
    • (How well) does a document satisfy a user's
      need?
  • Ideally, an IR system will retrieve exactly those
    items that satisfy the user's needs, no more, no
    less.
    • More: wastes the user's time
    • Less: the user misses valuable information

8
Notation
  • In response to a user's query
    • the IR system reTrieves a set of documents T
    • the user knows the set of reLevant documents L
  • |X| denotes the number of documents in X
  • Ideally, T = L: no more (no junk), no less (no
    missing)

9
The big picture
[Venn diagram: the retrieved set T and the relevant
set L overlap in T∩L. Retrieved but not relevant =
Junk; relevant but not retrieved = Missing.]
  • Precision = |T∩L| / |T|
    • fraction of retrieved items that were relevant
    • = 1 if no junk (all retrieved items were
      relevant)
  • Recall = |T∩L| / |L|
    • fraction of relevant items that were retrieved
    • = 1 if no missing (all the relevant items were
      retrieved)
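A minimal sketch of these two measures in Python, assuming
documents are represented by IDs and both T and L are known
as sets:

    # Sketch: precision and recall from the retrieved set T
    # and the relevant set L (documents represented by IDs).
    def precision(T, L):
        return len(T & L) / len(T)      # |T ∩ L| / |T|

    def recall(T, L):
        return len(T & L) / len(L)      # |T ∩ L| / |L|

    T = {1, 2, 3, 4}        # retrieved: doc 4 is junk
    L = {1, 2, 3, 5}        # relevant: doc 5 is missing
    print(precision(T, L))  # 0.75 -- one retrieved item was junk
    print(recall(T, L))     # 0.75 -- one relevant item was missed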

10
Context
  • Precision, Recall were created for IR systems
    that retrieved from a small set of items.
  • In that case one could calculate T and L.
  • Web search engines do not fit this model well T
    and L are huge.
  • Recall does not make sense in this model, but we
    can apply the definition of precision_at_10,
    measuring the fraction of relevant items that
    were retrieved among the first 10 displayed.
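A hedged sketch of precision_at_k in Python; the ranked
result list and the relevance judgments are hypothetical:

    # Sketch: of the first k results displayed, what fraction
    # is relevant? Recall is not computable, since L is unknown.
    def precision_at_k(ranked_results, relevant, k):
        return sum(1 for doc in ranked_results[:k] if doc in relevant) / k

    # Hypothetical ranked list and relevance judgments:
    results = ["d1", "d2", "d3", "d4", "d5",
               "d6", "d7", "d8", "d9", "d10"]
    judged_relevant = {"d1", "d2", "d4", "d7"}
    print(precision_at_k(results, judged_relevant, 10))  # 0.4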

11
Experiment
  • Compute precision_at_10 and precision_at_20 for
    Google, Bing, and Yahoo for this query:
  • Paris Hilton Hotel
  • Precision = fraction of retrieved items that are
    relevant

12
Search Engine Architecture
  • How often do you google?
  • What happens when you google?
  • http://www.google.com/corporate/tech.html
  • Average time: half a second
  • We need a crawler to create the indexes and docs.
  • Notice that the web crawler creates the docs.
  • From the docs, the indexes are created and the
    docs are given ranks; cf. later slides.
  • Let's study the Web Crawler Algorithm (WCA)
  • Page 1143 of the handout

13
Web Crawler Algorithm
  • Input: a set of popular URLs S
  • Output: a repository of visited web pages R
  • Method (see the Python sketch below):
  1. If S is empty, end.
  2. Select a URL p from S to crawl; delete p from S.
  3. Get p′ (the page that p points to).
  4. If p′ is in R, return to step (1); else add p′
     to R, and add to S all outlinks from p′ unless
     they are already in R or S.
  5. Return to step (1).
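A minimal Python sketch of the WCA; fetch_page() and
extract_outlinks() are hypothetical stand-ins for real HTTP
fetching and HTML link extraction, and a hash of the page
content guards against mirrors (cf. slide 15):

    # Sketch of the Web Crawler Algorithm.
    def crawl(S, max_pages=1000):
        R = {}                            # repository: URL -> page
        seen = set()                      # hashes of visited pages
        while S and len(R) < max_pages:   # step 1: stop when S is empty
            p = S.pop()                   # step 2: select p, delete from S
            page = fetch_page(p)          # step 3: get the page p points to
            if hash(page) in seen:        # step 4: already in R (a mirror)?
                continue                  #         then back to step 1
            R[p] = page                   # else add the page to R...
            seen.add(hash(page))
            for url in extract_outlinks(page):
                if url not in R and url not in S:
                    S.add(url)            # ...and queue its new outlinks
        return R                          # step 5: loop back to step 1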

14
WCA: Terminating the Search
  • Limit the number of pages crawled
    • total number of pages, or
    • pages per site
  • Limit the depth of the crawl

15
WCA: Managing the Repository
  • Don't add duplicates to S
    • Need an index on S, probably a hash table
  • Don't add duplicates to R
    • Cannot happen, since we visit each URL only
      once?
    • No: a page can come from >1 URL (mirror sites)
    • So use a hash table of the pages in R

16
WCA: Selecting the Next Page in S
  • Can use random search
  • Better: most important first
    • Can consider the first set of pages to be most
      important
    • As pages are added, make them less important
    • Breadth-first search
  • Can do a simplified PageRank (cf. later)
    calculation

17
WCA: Faster, Faster
  • Multiprogramming, multiprocessing
  • Must manage locks on S
  • With billions of URLs, this becomes a bottleneck
  • So assign each process to a host/site, not a URL
  • This can become a denial-of-service attack, so
    throttle down and take on several sites,
    organized by hash buckets
  • R also has bottleneck problems, which can be
    handled with locks

18
On to Query Processing
  • Very different from structured data: no SQL,
    parser, or optimizer
  • Input is a Boolean combination of keywords
    • data AND base
    • data OR base
  • Google's goal is an engine that "understands
    exactly what you mean and gives you back exactly
    what you want"

19
Inverted Indexes
  • When the crawl is complete, the search engine
    builds, for each and every word, an inverted
    index.
  • An inverted index is a list of all documents
    containing that word
  • The index may be a bit vector
  • It may also contain the location(s) of the word
    in the document
  • Word = any word in any language, plus
    misspellings, plus any sequence of characters
    surrounded by punctuation!
  • ⇒ Hundreds of millions of words
  • ⇒ Farms of PCs, e.g., near Bonneville Dam, to
    hold all this data
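A minimal sketch of building such an index in Python; here
a "word" is just a maximal run of letters and digits, and
the index records the location(s) of each word in each
document:

    import re
    from collections import defaultdict

    # Sketch: inverted index mapping word -> doc -> position(s).
    def build_inverted_index(docs):
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, word in enumerate(re.findall(r"\w+", text.lower())):
                index[word][doc_id].append(pos)
        return index

    docs = {"d1": "data base systems", "d2": "base jumping data"}
    index = build_inverted_index(docs)
    print(sorted(index["data"]))  # ['d1', 'd2']: docs containing "data"
    print(index["base"]["d2"])    # [0]: where "base" occurs in d2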

20
Mechanics of Query Processing
  • The relevant inverted indexes are found
  • Typically the indexes are in memory; otherwise
    this step alone could take a full half second
  • If they are bit vectors, they are ANDed or ORed,
    then materialized, then the lists are handled
    (see the sketch below)
  • The result is many URLs.
  • The next step is to determine their rank so the
    highest-ranked URLs can be delivered to the user.
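A sketch of Boolean query evaluation over bit vectors,
using a Python int as the vector (bit i set = document i
contains the word); the example vectors are hypothetical:

    # Sketch: Boolean query processing, one bit per document.
    data_vec = 0b1011   # "data" occurs in docs 0, 1, 3
    base_vec = 0b1110   # "base" occurs in docs 1, 2, 3

    and_result = data_vec & base_vec   # data AND base -> 0b1010
    or_result  = data_vec | base_vec   # data OR base  -> 0b1111

    # Materialize the matching document IDs from the vector:
    print([i for i in range(4) if and_result >> i & 1])  # [1, 3]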

21
Ranking Pages
  • The indexes have returned pages. Which ones are
    most relevant to you?
  • There are many criteria for ranking pages; here
    are some no-brainers (except the one marked !)
    • presence of all the words
    • all the words close together
    • words in important locations and formats on
      the page
    • (!) words near the anchor text of links in
      referencing pages
  • But the killer criterion is PageRank

22
PageRank Intuition
  • You need to find a plumber. How do you do it?
  1. Call plumbers and talk to them.
  2. (!) Call friends and ask for plumber
     references; then choose the plumbers who have
     the most references.
  3. (!!) Call friends who know a lot about plumbers
     (important friends) and ask them for plumber
     references; then choose the plumbers who have
     the most references from important people.
  • Technique 1 was used before Google.
  • Google introduced technique 2 to search engines.
  • Google also introduced technique 3.
  • Techniques 2, and especially 3, wiped out the
    competition.
  • The big challenge: determining which pages are
    important.

23
What does this mean for pages?
  • Most search engines look for pages containing the
    word "plumber"
  • Google searches for pages that are linked to by
    pages containing "plumber".
  • Google searches for pages that are linked to by
    important pages containing "plumber".
  • A web page is important if many important pages
    link to it.
  • This is a recursive equation.
  • Google solves it by imagining a web walker.

24
The Web Walker
  • From page p, the walker follows a random link on
    p
  • Note that all links on p have equal weight
  • The walker walks for a very, very long time.
  • A residence vector (y, a, m) describes the
    percentage of time that the walker spends on
    each page
  • What does the vector (1/3, 1/3, 1/3) mean?
  • In steady state, the residence vector will be
    (the 1st draft of) the PageRank
  • Observe: pages with many in-links are visited
    often
  • Observe: important pages are visited most often

25
Stochastic Transition Matrix
  • To describe the page walker's moves, we use a
    stochastic transition matrix.
  • Stochastic: each column sums to 1
  • There are 3 web pages: Yahoo, Amazon, and
    Microsoft
  • This matrix means that the Yahoo page has 2
    outlinks, to Yahoo (a self-link) and to Amazon,
    etc.

        Y    A    M
  Y  [  ½    ½    0  ]
  A  [  ½    0    1  ]
  M  [  0    ½    0  ]
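A sketch of how this column-stochastic matrix arises from
the link structure: each page's outlinks get equal weight
1/(number of outlinks). The Python below rebuilds the
matrix above from the three pages' outlink lists:

    # Sketch: column j spreads weight 1/len(outlinks[j])
    # over the pages that page j links to.
    pages = ["Y", "A", "M"]
    outlinks = {"Y": ["Y", "A"], "A": ["Y", "M"], "M": ["A"]}

    n = len(pages)
    M = [[0.0] * n for _ in range(n)]
    for j, src in enumerate(pages):
        for dst in outlinks[src]:
            M[pages.index(dst)][j] = 1 / len(outlinks[src])

    for row in M:
        print(row)
    # [0.5, 0.5, 0.0]
    # [0.5, 0.0, 1.0]
    # [0.0, 0.5, 0.0]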
26
Transition Graph
  • Each Transition Matrix corresponds to a
    Transition Graph, e.g.

[Transition graph: Y → Y (½), Y → A (½), A → Y (½),
A → M (½), M → A (1)]
27
LO9.1: Transition Graph
  • What is the transition graph for this matrix?

        Y    A    M
  Y  [  0    ½    ?  ]
  A  [  ?    0    ?  ]
  M  [  ?    ½    0  ]
28
Solving for PageRank
  • For small-dimension matrices it is simple to
    calculate the PageRank using Gaussian
    elimination.
  • Remember, (y, a, m) is the time the walker
    spends at each site. Since it is a probability
    distribution, y + a + m = 1. Since the walker
    has reached steady state,

  [ ½  ½  0 ]   [ y ]   [ y ]
  [ ½  0  1 ] × [ a ] = [ a ]
  [ 0  ½  0 ]   [ m ]   [ m ]
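Written out, the steady-state condition and the
distribution constraint give a small linear system:

    y = ½y + ½a
    a = ½y + m
    m = ½a
    y + a + m = 1

The first equation gives y = a, the third gives m = ½a, and
substituting both into y + a + m = 1 gives (5/2)a = 1, so
(y, a, m) = (2/5, 2/5, 1/5), the solution confirmed on the
next slides.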

29
Solving, ctd
  • Solving such small systems is easy, but in
    reality the matrix dimension is the number of
    pages on the web, so it is in the billions.
  • There is a simpler way, called relaxation.
  • Start with a distribution, typically equal
    values, and transform it by the matrix (see the
    sketch below):

  [ ½  ½  0 ]   [ 1/3 ]   [ 2/6 ]
  [ ½  0  1 ] × [ 1/3 ] = [ 3/6 ]
  [ 0  ½  0 ]   [ 1/3 ]   [ 1/6 ]
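A sketch of relaxation (power iteration) in Python on this
matrix, starting from the uniform distribution:

    # Sketch: repeatedly transform the residence vector by M.
    M = [[0.5, 0.5, 0.0],
         [0.5, 0.0, 1.0],
         [0.0, 0.5, 0.0]]

    v = [1/3, 1/3, 1/3]              # uniform starting distribution
    for step in range(10):
        v = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]
        print(step, [round(x, 4) for x in v])
    # Step 0 gives [2/6, 3/6, 1/6]; within 5-10 steps the vector
    # is very close to the PageRank [2/5, 2/5, 1/5].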

30
Solving, ctd
  • If we repeat this only 5-10 times, the vectors
    converge to values very close to (2/5, 2/5,
    1/5). Check that this is a solution:

  [ ½  ½  0 ]   [ 2/5 ]   [ 2/5 ]
  [ ½  0  1 ] × [ 2/5 ] = [ 2/5 ]
  [ 0  ½  0 ]   [ 1/5 ]   [ 1/5 ]
  • This solution gives the PageRank of each page on
    the Web.
  • It is also called an eigenvector of the matrix,
    with eigenvalue one.
  • Does this agree with our intuition about
    PageRank?
  • For real web-scale matrices, at most 100
    iterations suffice

31
LO9.2: Identify a Solution
  • Is (3/8, 1/4, 3/8) a solution for this
    transition matrix?

        Y    A    M
  Y  [  0    ½    ?  ]
  A  [  ?    0    ?  ]
  M  [  ?    ½    0  ]
32
A Spider Trap
  • Let's look at a more realistic example, called a
    spider trap.

        Y    A    M
  Y  [  ½    ½    0  ]
  A  [  ½    0    0  ]
  M  [  0    ½    1  ]
  • The Transition Graph is
  • M represents any set of web pages that does not
    have a link outside the set.

[Transition graph: Y → Y (½), Y → A (½), A → Y (½),
A → M (½), M → M (1)]
33
A Spider Trap, ctd
  • The PageRank is

  [ ½  ½  0 ]   [ 0 ]   [ 0 ]
  [ ½  0  0 ] × [ 0 ] = [ 0 ]
  [ 0  ½  1 ]   [ 1 ]   [ 1 ]
  • Relaxation arrives at this vector because the
    random walker eventually arrives at M and stays
    there in a loop.
  • This PageRank vector violates the PageRank
    principle that inlinks should determine
    importance.

34
A Dead End
  • A similar example, called a dead end, is

        Y    A    M
  Y  [  ½    ½    0  ]
  A  [  ½    0    0  ]
  M  [  0    ½    0  ]
  • The Transition Graph is
  • M represents any set of web pages that does not
    have out-links.

[Transition graph: Y → Y (½), Y → A (½), A → Y (½),
A → M (½); M has no outgoing edges]
35
A Dead End, ctd
  • A dead-end matrix is not stochastic, because
    column M does not obey the stochastic rule: it
    sums to 0, not 1.
  • The only eigenvector for a dead-end matrix is
    the zero vector.
  • Relaxation arrives at the zero vector because
    the random walker arrives at M and then has
    nowhere to go.

36
What to do?
  • In these cases, which happen all the time on the
    web, the web walker algorithm does not identify
    which pages are truly important.
  • But we can tweak the algorithm to do so: every
    5th step, or so, the walker jumps to a random
    page on the web.
  • Then the walk (spider trap example) becomes

               [ ½  ½  0 ]                [ 1/3 ]
  Pnew = 0.8 × [ ½  0  0 ] × Pold + 0.2 × [ 1/3 ]
               [ 0  ½  1 ]                [ 1/3 ]
37
Teleporter
  • Now our tweaked random walker is a teleporter.
  • With probability 80%, s/he follows a random link
    from the current page, as before.
  • But with probability 20%, s/he teleports to a
    random page chosen with uniform probability.
  • It could be anywhere on the web, even the
    current page
  • If s/he is at a dead end, s/he teleports with
    100% probability to a random page chosen with
    uniform probability.
  • The 80-20 split is a tunable parameter

38
Solving the Teleporter Equation
  • The equation on slide 36 describes the
    teleporter's walk. It can be solved using
    relaxation or Gaussian elimination.
  • The solution is (7/33, 5/33, 21/33) .
  • It gives unreasonably high importance to M, but
    does recognize that Y is more important than A.
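A sketch of relaxation on the teleporter equation in
Python, using the spider-trap matrix and the 80-20 split:

    # Sketch: p_new = 0.8*(M x p_old) + 0.2*(1/3, 1/3, 1/3).
    M = [[0.5, 0.5, 0.0],
         [0.5, 0.0, 0.0],
         [0.0, 0.5, 1.0]]
    beta = 0.8                      # probability of following a link

    p = [1/3, 1/3, 1/3]
    for _ in range(100):
        p = [beta * sum(M[i][j] * p[j] for j in range(3))
             + (1 - beta) / 3 for i in range(3)]
    print([round(x, 4) for x in p])
    # [0.2121, 0.1515, 0.6364] = (7/33, 5/33, 21/33):
    # M is still overweighted, but Y now ranks above A.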