1
Under The Hood (Part II): Web-Based Information Architectures
  • MSEC 20-760, Mini II
  • Jaime Carbonell

2
Today's Topics
  • Term weighting in detail
  • Generalized Vector Space Model (GVSM)
  • Maximal Marginal Relevance
  • Summarization as Passage Retrieval

3
Term Weighting Revisited (1)
  • Definitions
  • wi = "ith Term": a word, stemmed word, or indexed phrase
  • Dj = "jth Document": a unit of indexed text, e.g. a web page, a news report, an article, a patent, a legal case, a book, a chapter of a book, etc.

4
Term Weighting Revisited (2)
  • Definitions
  • C = "The Collection": the full set of indexed documents (e.g. the New York Times archive, the Web, ...)
  • Tf(wi, Dj) = "Term Frequency": the number of times wi occurs in document Dj. Tf is sometimes normalized by dividing by the frequency of the most-frequent non-stop term in the document: Tfnorm = Tf / max_Tf.

5
Term Weighting Revisited (3)
  • Definitions
  • Df(wi, C) = "Document Frequency": the number of documents from C in which wi occurs. Df may be normalized by dividing it by the total number of documents in C.
  • IDf(wi, C) = "Inverse Document Frequency": [Df(wi, C)/size(C)]^(-1), i.e. size(C)/Df(wi, C). Most often log2(IDf) is used, rather than IDf directly.

6
Term Weighting Revisited (4)
  • TfIDf Term Weights
  • In general: TfIDf(wi, Dj, C) = F1(Tf(wi, Dj)) × F2(IDf(wi, C))
  • Usually F1 = 0.5 + log2(Tf), or Tf/Tfmax, or 0.5 + 0.5·Tf/Tfmax
  • Usually F2 = log2(IDf)
  • In the SMART IR system: TfIDf(wi, Dj, C) = [0.5 + 0.5·Tf(wi, Dj)/Tfmax(Dj)] · log2(IDf(wi, C))
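A minimal sketch of the SMART-style weight defined above, in Python. It assumes documents are already tokenized, stemmed, and stop-word filtered, and that Dj is one of the indexed documents in C (so Df is never zero); the names are illustrative only.

```python
import math
from collections import Counter

def smart_tfidf(term, doc_tokens, collection):
    """TfIDf as on the slide: [0.5 + 0.5*Tf/Tfmax] * log2(IDf).

    doc_tokens: token list for one document Dj (already stemmed, stop words removed).
    collection: list of token lists, one per indexed document in C.
    """
    counts = Counter(doc_tokens)
    tf = counts[term]
    if tf == 0:
        return 0.0
    tf_max = max(counts.values())                 # frequency of the most-frequent term in Dj
    df = sum(1 for d in collection if term in d)  # Df(wi, C): documents containing the term
    idf = len(collection) / df                    # IDf = [Df/size(C)]^(-1)
    return (0.5 + 0.5 * tf / tf_max) * math.log2(idf)

# Example: weight of "stroke" in the first document of a toy collection
docs = [["heart", "disease", "stroke", "stroke"],
        ["ventricular", "stroke"],
        ["patent", "law"]]
print(smart_tfidf("stroke", docs[0], docs))
```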

7
Term Weighting beyond TfIDf (1)
  • Probabilistic Models
  • Old style (see textbooks)
  • Improves precision-recall slightly
  • Full statistical language modeling (CMU)
  • Improves precision-recall more significantly
  • Difficult to compute efficiently.

8
Term Weighting beyond TfIDf (2)
  • Neural Networks
  • Theoretically attractive
  • Do not scale up at all, unfortunately
  • Fuzzy Sets
  • Not deeply researched, scaling difficulties

9
Term Weighting beyond TfIDf (3)
  • Natural Language Analysis
  • Analyze and understand the documents and the query first
  • Ultimate IR method, in theory
  • Generally NL understanding is an unsolved problem
  • Scale up challenges, even if we could do it
  • But, shown to improve IR for very limited domains

10
Generalized Vector Space Model (1)
  • Principles
  • Define terms by their occurrence patterns in
    documents
  • Define query terms in the same way
  • Compute similarity by document-pattern overlap
    for terms in D and Q
  • Use standard Cos similarity and either binary or
    TfIDf weights

11
Generalized Vector Space Model (2)
  • Advantages
  • Automatically calculates partial similarity
  • If "heart disease" and "stroke" and
    "ventricular" co-occur in many documents, then if
    the query contains only one of these terms,
    documents containing the other will receive
    partial credit proportional to their document
    co-occurrence ratio.
  • No need to do query expansion or relevance
    feedback

12
Generalized Vector Space Model (3)
  • Disadvantages
  • Computationally expensive
  • Performance ≈ vector space + query expansion

13
GVSM, How it Works (1)
  • Represent the collection as a vector of documents
  • Let C = [D1, D2, ..., Dm]
  • Represent each term by its distributional frequency
  • Let ti = [Tf(ti, D1), Tf(ti, D2), ..., Tf(ti, Dm)]
  • Term-to-term similarity is computed as
  • Sim(ti, tj) = cos(vec(ti), vec(tj))
  • Hence, highly co-occurring terms like "Arafat" and "PLO"
  • will be treated as near-synonyms for retrieval

14
GVSM, How it Works (2)
  • And query-document similarity is computed as before: Sim(Q, D) = cos(vec(Q), vec(D)), except that instead of the dot-product calculation we use a function of the term-to-term similarity computation above. For instance:
  • Sim(Q, D) = Σi maxj sim(qi, dj)
  • or, normalizing for document and query length:
  • Simnorm(Q, D): the same sum, normalized by the query and document lengths
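A minimal sketch of the GVSM mechanics just described: each term is represented by its row of Tf values across the collection, term-to-term similarity is the cosine of those vectors, and the query-document score uses the Σi maxj form above. All function names are illustrative.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def term_vectors(collection):
    """Map each term to its distributional vector [Tf(t, D1), ..., Tf(t, Dm)]."""
    counts = [Counter(doc) for doc in collection]
    vocab = {t for doc in collection for t in doc}
    return {t: [c[t] for c in counts] for t in vocab}

def gvsm_sim(query_terms, doc_terms, tvecs):
    """Sim(Q, D) = sum over query terms of the best term-to-term similarity in D."""
    score = 0.0
    for q in query_terms:
        if q not in tvecs:
            continue
        score += max((cosine(tvecs[q], tvecs[d]) for d in doc_terms if d in tvecs),
                     default=0.0)
    return score

# Toy collection: "stroke" and "ventricular" co-occur, so a query on one
# gives partial credit to documents containing the other.
docs = [["heart", "disease", "stroke", "ventricular"],
        ["stroke", "ventricular"],
        ["patent", "law"]]
tvecs = term_vectors(docs)
print(gvsm_sim(["stroke"], ["ventricular", "heart"], tvecs))
```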

15
GVSM, How it Works (3)
  • Primary problem
  • More computation (sparse -> dense)
  • Primary benefit
  • Automatic term expansion by corpus

16
A Critique of Pure Relevance (1)
  • IR Maximizes Relevance
  • Precision and recall are relevance measures
  • Quality of documents retrieved is ignored

17
A Critique of Pure Relevance (2)
  • Other Important Factors
  • What about information novelty, timeliness,
    appropriateness, validity, comprehensibility,
    density, medium,...??
  • In IR, we really want to maximize
  • P(U(fi, ..., fn) | Q, C, U, H)
  • where Q = query, C = collection set,
  • U = user profile, H = interaction history
  • ...but we don't yet know how. Darn.

18
Maximal Marginal Relevance (1)
  • A crude first approximation:
  • novelty -> minimal redundancy
  • Weighted linear combination
  • (redundancy cost, relevance benefit)
  • Free parameters: k and λ

19
Maximal Marginal Relevance (2)
  • MMR(Q, C, R) =
  • Argmax(k; di in C) [ λ·S(Q, di) - (1-λ)·max(dj in R) S(di, dj) ]
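A minimal sketch of the MMR score just defined. The similarity function sim() is left abstract (e.g. cosine over TfIDf vectors, or the GVSM score above); the default λ of 0.7 is only an illustrative choice, not a value from the slides.

```python
def mmr_score(query, candidate, ranked, sim, lam=0.7):
    """λ·S(Q, di) - (1-λ)·max over dj in Ranked of S(di, dj).

    sim(a, b): any query/document similarity function.
    lam: the free parameter λ trading relevance against redundancy (illustrative default).
    """
    relevance = sim(query, candidate)
    redundancy = max((sim(candidate, d) for d in ranked), default=0.0)
    return lam * relevance - (1 - lam) * redundancy
```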

20
Maximal Marginal Relevance (MMR) (3)
  • COMPUTATION OF MMR RERANKING
  • 1. Standard IR retrieval of top-N docs:
  • Let Dr = IR(D, Q, N)
  • 2. Rank the di in Dr with max sim(di, Q) as the top doc, i.e. let
  • Ranked = [di]
  • 3. Let Dr = Dr \ {di}
  • 4. While Dr is not empty, do
  • a. Find the di with max MMR(Dr, Q, Ranked)
  • b. Let Ranked = Ranked.di
  • c. Let Dr = Dr \ {di}
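The reranking loop above, as a short sketch that reuses the hypothetical mmr_score() from the previous block; the initial standard IR retrieval is assumed to have already produced the top-N candidate list.

```python
def mmr_rerank(query, candidates, sim, lam=0.7):
    """Greedy MMR reranking of an already-retrieved top-N candidate list."""
    remaining = list(candidates)                             # Dr = IR(D, Q, N)
    ranked = [max(remaining, key=lambda d: sim(query, d))]   # most relevant doc first
    remaining.remove(ranked[0])                              # Dr = Dr \ {di}
    while remaining:                                         # while Dr is not empty
        best = max(remaining,
                   key=lambda d: mmr_score(query, d, ranked, sim, lam))
        ranked.append(best)                                  # Ranked = Ranked.di
        remaining.remove(best)                                # Dr = Dr \ {di}
    return ranked
```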

21
MMR Ranking vs Standard IR
[Figure: documents plotted around the query; arrows contrast the standard IR ranking with the MMR ranking; λ controls the spiral curl]
22
Maximal Marginal Relevance (MMR) (4)
  • Applications
  • Ranking retrieved documents from IR Engine
  • Ranking passages for inclusion in Summaries

23
Document Summarization in a Nutshell (1)
  • Types of Summaries

24
Document Summarization in a Nutshell (2)
  • Other Dimensions
  • Single- vs. multi-document summarization
  • Genre-adaptive vs. one-size-fits-all
  • Single-language vs. translingual
  • Flat summary vs. hyperlinked pyramid
  • Text-only vs. multi-media
  • ...

25
Summarization as Passage Retrieval (1)
  • For Query-Driven Summaries
  • 1. Divide the document into passages
  • (e.g., sentences, paragraphs, FAQ-pairs, ...)
  • 2. Use the query to retrieve the most relevant passages, or better, use MMR to avoid redundancy (sketched below).
  • 3. Assemble the retrieved passages into a summary.
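A minimal sketch of the query-driven recipe above, reusing the hypothetical mmr_rerank() from the previous block; the sentence splitter is deliberately naive, and sim() is assumed to compare two token lists.

```python
def summarize(document_text, query_terms, sim, k=3, lam=0.7):
    """Query-driven summary: split into passages, MMR-rank them, keep the top k."""
    # 1. Divide the document into passages (here: a naive sentence split).
    passages = [p.strip().split() for p in document_text.split(".") if p.strip()]
    # 2. Retrieve the most relevant, non-redundant passages with MMR.
    ranked = mmr_rerank(query_terms, passages, sim, lam)
    # 3. Assemble the retrieved passages into a summary.
    return " ".join(" ".join(p) + "." for p in ranked[:k])
```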

26
Summarization as Passage Retrieval (2)
  • For Generic Summaries
  • 1. Use title or top-k Tf-IDF terms as query.
  • 2. Proceed as for query-driven summarization (see the sketch below).
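A generic summary can reuse the same machinery by choosing the query automatically; a minimal sketch, assuming the hypothetical smart_tfidf() and summarize() helpers from the earlier blocks.

```python
def generic_summary(document_text, collection, sim, k_terms=5, k_passages=3):
    """Generic summary: use the document's top-k TfIDf terms as the query."""
    tokens = document_text.lower().split()
    query = sorted(set(tokens),
                   key=lambda t: smart_tfidf(t, tokens, collection),
                   reverse=True)[:k_terms]
    return summarize(document_text, query, sim, k=k_passages)
```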

27
Summarization as Passage Retrieval (3)
  • For Multidocument Summaries
  • 1. Cluster documents into topically-related
    groups.
  • 2. For each group, divide each document into passages and keep track of the source of each passage.
  • 3. Use MMR to retrieve the most relevant non-redundant passages (MMR is necessary for multiple docs).
  • 4. Assemble a summary for each cluster.