Information Retrieval (1)
Transcript and Presenter's Notes

Title: Information Retrieval (1)


1
Information Retrieval (1)
  • Prof. Dragomir R. Radev
  • radev@umich.edu

2
IR WINTER 2010
  • Introduction

7
Examples of search engines
  • Conventional (library catalog). Search by
    keyword, title, author, etc.
  • Text-based (Lexis-Nexis, Google, Yahoo!). Search
    by keywords. Limited search using queries in
    natural language.
  • Multimedia (QBIC, WebSeek, SaFe). Search by visual
    appearance (shapes, colors, ...).
  • Question answering systems (Ask, NSIR,
    Answerbus). Search in (restricted) natural
    language.
  • Clustering systems (Vivísimo, Clusty)
  • Research systems (Lemur, Nutch)

8
What does it take to build a search engine?
  • Decide what to index
  • Collect it
  • Index it (efficiently)
  • Keep the index up to date
  • Provide user-friendly query facilities

9
What else?
  • Understand the structure of the web for efficient
    crawling
  • Understand user information needs
  • Preprocess text and other unstructured data
  • Cluster data
  • Classify data
  • Evaluate performance

10
Goals of the course
  • Understand how search engines work
  • Understand the limits of existing search
    technology
  • Learn to appreciate the sheer size of the Web
  • Learn to write code for text indexing and
    retrieval
  • Learn about the state of the art in IR research
  • Learn to analyze textual and semi-structured data
    sets
  • Learn to appreciate the diversity of texts on the
    Web
  • Learn to evaluate information retrieval
  • Learn about standardized document collections
  • Learn about text similarity measures
  • Learn about semantic dimensionality reduction
  • Learn about the idiosyncrasies of hyperlinked
    document collections
  • Learn about web crawling
  • Learn to use existing software
  • Understand the dynamics of the Web by building
    appropriate mathematical models
  • Build working systems that assist users in
    finding useful information on the Web

11
Course logistics
  • Fridays 2:10-4:55 PM
  • Office hours: TBA
  • URL: http://clair.si.umich.edu/si650
  • Instructor: Dragomir Radev
  • Email: radev@umich.edu
  • Instructor: Qiaozhu Mei
  • Email: qmei@umich.edu

12
Course outline
  • Classic document retrieval: storing, indexing,
    retrieval.
  • Web retrieval: crawling, query processing.
  • Text and web mining: classification, clustering.
  • Network analysis: random graph models,
    centrality, diameter and clustering coefficient.

13
Syllabus
  • Introduction.
  • Queries and Documents. Models of Information
    retrieval. The Boolean model. The Vector model.
  • Document preprocessing. Tokenization. Stemming.
    The Porter algorithm. Storing, indexing and
    searching text. Inverted indexes.
  • Word distributions. The Zipf distribution. The
    Benford distribution. Heaps' law. TF-IDF. Vector
    space similarity and ranking.
  • Retrieval evaluation. Precision and Recall.
    F-measure. Reference collections. The TREC
    conferences.
  • Automated indexing/labeling. Compression and
    coding. Optimal codes.
  • String matching. Approximate matching.
  • Query expansion. Relevance feedback.
  • Text classification. Naive Bayes. Feature
    selection. Decision trees.

14
Syllabus
  • Linear classifiers. k-nearest neighbors.
    Perceptron. Kernel methods. Maximum-margin
    classifiers. Support vector machines.
    Semi-supervised learning.
  • Lexical semantics and Wordnet.
  • Latent semantic indexing. Singular value
    decomposition.
  • Vector space clustering. k-means clustering. EM
    clustering.
  • Random graph models. Properties of random graphs:
    clustering coefficient, betweenness, diameter,
    giant connected component, degree distribution.
  • Social network analysis. Small worlds and
    scale-free networks. Power law distributions.
    Centrality.
  • Graph-based methods. Harmonic functions. Random
    walks.
  • PageRank. Hubs and authorities. Bipartite graphs.
    HITS.
  • Models of the Web.

15
Syllabus
  • Crawling the web. Webometrics. Measuring the size
    of the web. The bow-tie model.
  • Hypertext retrieval. Web-based IR. Document
    closures. Focused crawling.
  • Question answering
  • Burstiness. Self-triggerability
  • Information extraction
  • Adversarial IR. Human behavior on the web.
  • Text summarization
  • POSSIBLE TOPICS
  • Discovering communities, spectral clustering
  • Semi-supervised retrieval
  • Natural language processing. XML retrieval. Text
    tiling. Human behavior on the web.

16
Readings
  • Required: Introduction to Information Retrieval by
    Manning, Raghavan, and Schütze
    (http://www-csli.stanford.edu/schuetze/information-retrieval-book.html),
    freely available online, hard copy for sale
  • Optional: Modeling the Internet and the Web:
    Probabilistic Methods and Algorithms by Pierre
    Baldi, Paolo Frasconi, Padhraic Smyth, Wiley,
    2003, ISBN 0-470-84906-1 (http://ibook.ics.uci.edu).
  • Papers from SIGIR, WWW, and journals (to be
    announced in class).

17
Prerequisites
  • Linear algebra: vectors and matrices.
  • Calculus: finding extrema of functions.
  • Probabilities: random variables, discrete and
    continuous distributions, Bayes' theorem.
  • Programming: experience with at least one
    web-aware programming language such as Perl
    (highly recommended) or Java in a UNIX
    environment.
  • Required: CS account

18
Course requirements
  • Three assignments (30%)
  • Some of them will be in Perl. The rest can be
    done in any appropriate language. All will
    involve some data analysis and evaluation.
  • Final project (30%)
  • Research paper or software system.
  • Class participation (10%)
  • Final exam (30%)

19
Final project format
  • Research paper - using the SIGIR format. Students
    will be in charge of problem formulation,
    literature survey, hypothesis formulation,
    experimental design, implementation, and possibly
    submission to a conference like SIGIR or WWW.
  • Software system - develop a working system or
    API. Students will be responsible for identifying
    a niche problem, implementing it and deploying
    it, either on the Web or as an open-source
    downloadable tool. The system can be either
    stand-alone or an extension to an existing one.

20
Project ideas
  • Build a question answering system.
  • Build a language identification system.
  • Social network analysis from the Web.
  • Participate in the Netflix challenge.
  • Query log analysis.
  • Build models of Web evolution.
  • Information diffusion in blogs or web.
  • Author-topic models of web pages.
  • Using the web for machine translation.
  • Building evolving models of web documents.
  • News recommendation system.
  • Compress the text of Wikipedia (losslessly).
  • Spelling correction using query logs.
  • Automatic query expansion.

21
List of projects from the past
  • Document Closures for Indexing
  • Tibet - Table Structure Recognition Library
  • Ruby Blog Memetracker
  • Sentence decomposition for more accurate
    information retrieval
  • Extracting Social Networks from LiveJournal
  • Google Suggest Programming Project (Java Swing
    Client and Lucene Back-End)
  • Leveraging Social Networks for Organizing and
    Browsing Shared Photographs
  • Media Bias and the Political Blogosphere
  • Measuring Similarity between search queries
  • Extracting Social Networks and Information about
    the people within them from Text
  • LSI dependency trees

22
Available corpora
  • Netflix challenge
  • AOL query logs
  • Blogs
  • Bio papers
  • AAN
  • Email
  • Generifs
  • Web pages
  • Political science corpus
  • VAST
  • del.icio.us
  • SMS
  • News data: aquaint, tdt, nantc, reuters, setimes,
    trec, tipster
  • Europarl (multilingual)
  • US congressional data
  • DMOZ
  • Pubmedcentral
  • DUC/TAC
  • Timebank
  • Wikipedia
  • wt2g/wt10g/wt100g
  • dotgov
  • RTE
  • Paraphrases
  • GENIA
  • Generifs
  • Hansards
  • IMDB
  • MTA/MTC
  • nie
  • cnnsumm
  • Poliblog
  • Sentiment
  • xml
  • epinions
  • Enron

23
Related courses elsewhere
  • Stanford (Chris Manning, Prabhakar Raghavan, and
    Hinrich Schuetze)
  • Cornell (Jon Kleinberg)
  • CMU (Yiming Yang and Jamie Callan)
  • UMass (James Allan)
  • UTexas (Ray Mooney)
  • Illinois (Chengxiang Zhai)
  • Johns Hopkins (David Yarowsky)
  • For a long list of courses related to Search
    Engines, Natural Language Processing, and Machine
    Learning, look here:
    http://tangra.si.umich.edu/clair/clair/courses.html

24
IR WINTER 2010
2. Models of information retrieval: the
vector model and the Boolean model
25
The web is really large
  • 100 B pages
  • Dynamically generated content
  • New pages get added all the time
  • Technorati has 50M blogs
  • The size of the blogosphere doubles every 6
    months
  • Yahoo deals with 12TB of data per day (according
    to Ron Brachman)

26
Sample queries (from Excite)
  • In what year did baseball become an offical
    sport?
  • play station codes . com
  • birth control and depression
  • government
  • "WorkAbility I"conference
  • kitchen appliances
  • where can I find a chines rosewood
  • tiger electronics
  • 58 Plymouth Fury
  • How does the character Seyavash in Ferdowsi's
    Shahnameh exhibit characteristics of a hero?
  • emeril Lagasse
  • Hubble
  • M.S Subalaksmi
  • running

27
Fun things to do with search engines
  • Googlewhack
  • Reduce document set size to 1
  • Find query that will bring given URL in the top
    10

28
Key Terms Used in IR
  • QUERY: a representation of what the user is
    looking for - can be a list of words or a phrase.
  • DOCUMENT: an information entity that the user
    wants to retrieve.
  • COLLECTION: a set of documents.
  • INDEX: a representation of information that makes
    querying easier.
  • TERM: a word or concept that appears in a document
    or a query.

29
Mappings and abstractions
(Diagram: Reality is modeled as Data; an Information need is expressed as a Query. From Robert Korfhage's book.)
30
Documents
  • Not just printed paper
  • Can be records, pages, sites, images, people,
    movies
  • Document encoding (Unicode)
  • Document representation
  • Document preprocessing

31
Sample query sessions (from AOL)
  • toley spies grames → tolley spies games →
    totally spies games
  • tajmahal restaurant brooklyn ny → taj mahal
    restaurant brooklyn ny → taj mahal restaurant
    brooklyn ny 11209
  • do you love me like you say → do you love me like
    you say lyrics → do you love me like you say
    lyrics marvin gaye

/data4/corpora/AOL-user-ct-collection
32
Characteristics of user queries
  • Sessions: users revisit their queries.
  • Very short queries: typically 2 words long.
  • A large number of typos.
  • A small number of popular queries. A long tail of
    infrequent ones.
  • Almost no use of advanced query operators, with
    the exception of double quotes.

33
Queries as documents
  • Advantages:
  • Mathematically easier to manage
  • Problems:
  • Different lengths
  • Syntactic differences
  • Repetitions of words (or lack thereof)

34
Document representations
  • Term-document matrix (m x n)
  • Document-document matrix (n x n)
  • Typical example in a medium-sized collection:
    3,000,000 documents (n) with 50,000 terms (m)
  • Typical example on the Web: n = 30,000,000,000,
    m = 1,000,000
  • Boolean vs. integer-valued matrices

35
Storage issues
  • Imagine a medium-sized collection with
    n = 3,000,000 and m = 50,000
  • How large a term-document matrix will be needed?
  • Is there any way to do better? Any heuristic?
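A quick back-of-the-envelope estimate (my addition, not from the slides): a full
Boolean term-document matrix for this collection would have

    m × n = 50,000 × 3,000,000 = 1.5 × 10^11 cells,

or roughly 19 GB even at one bit per cell, and almost every cell would be zero.
That sparsity is what the inverted index on the next slide exploits.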

36
Inverted index
  • Instead of an incidence vector, use a posting
    table:
  • CLEVELAND → D1, D2, D6
  • OHIO → D1, D5, D6, D7
  • Use linked lists to be able to insert new
    document postings in order and to remove existing
    postings.
  • Keep everything sorted! This gives you a
    logarithmic improvement in access.
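A minimal sketch (my own illustration, not course code) of building such a posting
table in Python, keeping each postings list sorted; the toy documents simply mirror
the CLEVELAND/OHIO postings shown above:

    from bisect import insort

    def build_inverted_index(docs):
        # docs: dict mapping document id -> text
        # returns: dict mapping term -> sorted list of document ids
        index = {}
        for doc_id, text in docs.items():
            for term in set(text.lower().split()):           # naive whitespace tokenization
                insort(index.setdefault(term, []), doc_id)   # insert keeping postings sorted
        return index

    docs = {1: "cleveland ohio", 2: "cleveland", 5: "ohio",
            6: "cleveland ohio", 7: "ohio"}
    index = build_inverted_index(docs)
    print(index["cleveland"])   # [1, 2, 6]
    print(index["ohio"])        # [1, 5, 6, 7]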

37
Basic operations on inverted indexes
  • Conjunction (AND): iterative merge of the two
    postings lists, O(x+y)
  • Disjunction (OR): very similar
  • Negation (NOT): can we still do it in O(x+y)?
  • Example: MICHIGAN AND NOT OHIO
  • Example: MICHIGAN OR NOT OHIO
  • Recursive operations
  • Optimization: start with the smallest sets
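A sketch (mine, not from the slides) of the linear-time merge behind AND and
AND NOT over two sorted postings lists; x and y are the list lengths, and the
MICHIGAN/OHIO postings are hypothetical:

    def intersect(p1, p2):
        # AND: advance through both sorted lists once, O(x + y)
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    def and_not(p1, p2):
        # p1 AND NOT p2: keep postings of p1 that are absent from p2, still O(x + y)
        answer, i, j = [], 0, 0
        while i < len(p1):
            if j < len(p2) and p2[j] < p1[i]:
                j += 1
            elif j < len(p2) and p2[j] == p1[i]:
                i += 1; j += 1
            else:
                answer.append(p1[i]); i += 1
        return answer

    michigan = [2, 3, 4, 6]   # hypothetical postings lists
    ohio     = [1, 5, 6, 7]
    print(intersect(michigan, ohio))   # [6]
    print(and_not(michigan, ohio))     # [2, 3, 4]

OR NOT is the problematic case: its result must include every document that is not
in the OHIO list, so it cannot be computed from the two postings lists alone.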

38
Major IR models
  • Boolean
  • Vector
  • Probabilistic
  • Language modeling
  • Fuzzy retrieval
  • Latent semantic indexing

39
The Boolean model
(Venn diagram with sets w, x, y, z and documents D1, D2)
40
Boolean queries
  • Operators: AND, OR, NOT, parentheses
  • Examples:
  • CLEVELAND AND NOT OHIO
  • (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)
  • Ambiguous uses of AND and OR in human language
  • Exclusive vs. inclusive OR
  • Which is the restrictive operator: AND or OR?

41
Canonical forms of queries
  • De Morgan's laws:

NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
  • Normal forms:
  • Conjunctive normal form (CNF)
  • Disjunctive normal form (DNF)
  • Reference librarians prefer CNF - why?
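A small worked example (my addition): by distributing OR over AND, the query
(MICHIGAN AND INDIANA) OR TEXAS, which is already in DNF, becomes the CNF
(MICHIGAN OR TEXAS) AND (INDIANA OR TEXAS). Each CNF conjunct acts as a filter that
can be applied one at a time, progressively narrowing the result set, which fits the
stepwise way a reference librarian refines a search.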

42
Evaluating Boolean queries
  • Incidence vectors:
  • CLEVELAND: 1100010
  • OHIO: 1000111
  • Examples:
  • CLEVELAND AND OHIO
  • CLEVELAND AND NOT OHIO
  • CLEVELAND OR OHIO
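A quick sketch (mine, not from the slides) evaluating the three example queries with
bitwise operations on the 7-document incidence vectors above:

    cleveland = int("1100010", 2)
    ohio      = int("1000111", 2)
    ALL       = int("1111111", 2)               # mask covering all 7 documents

    as_bits = lambda v: format(v, "07b")

    print(as_bits(cleveland & ohio))             # 1000010  CLEVELAND AND OHIO
    print(as_bits(cleveland & (ALL & ~ohio)))    # 0100000  CLEVELAND AND NOT OHIO
    print(as_bits(cleveland | ohio))             # 1100111  CLEVELAND OR OHIO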

43
Exercise
  • D1: computer information retrieval
  • D2: computer retrieval
  • D3: information
  • D4: computer information
  • Q1: information AND retrieval
  • Q2: information AND NOT computer

44
Exercise
0: (none)
1: Swift
2: Shakespeare
3: Shakespeare, Swift
4: Milton
5: Milton, Swift
6: Milton, Shakespeare
7: Milton, Shakespeare, Swift
8: Chaucer
9: Chaucer, Swift
10: Chaucer, Shakespeare
11: Chaucer, Shakespeare, Swift
12: Chaucer, Milton
13: Chaucer, Milton, Swift
14: Chaucer, Milton, Shakespeare
15: Chaucer, Milton, Shakespeare, Swift
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
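A sketch (my own, one possible way to check the exercise) that enumerates the
regions satisfying this query, reading each region number as a 4-bit code with
bit 0 = Swift, bit 1 = Shakespeare, bit 2 = Milton, bit 3 = Chaucer, matching the
numbering above:

    matching = []
    for r in range(16):
        swift, shakespeare = bool(r & 1), bool(r & 2)
        milton, chaucer = bool(r & 4), bool(r & 8)
        if ((chaucer or milton) and not swift) or \
           ((not chaucer) and (swift or shakespeare)):
            matching.append(r)
    print(matching)   # [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14]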
45
How do we deal with...
  • Multi-word phrases?
  • Document ranking?

46
The Vector model
(Diagram: documents Doc 1, Doc 2, Doc 3 plotted as vectors in a space whose axes are Term 1, Term 2, Term 3)
47
Vector queries
  • Each document is represented as a vector
  • Inefficient representation
  • Dimensional compatibility

48
The matching process
  • Document space
  • Matching is done between a document and a query
    (or between two documents)
  • Distance vs. similarity measures.
  • Euclidean distance, Manhattan distance, Word
    overlap, Jaccard coefficient, etc.

49
Miscellaneous similarity measures
  • The Cosine measure (normalized dot product):

    σ(D,Q) = Σ (d_i × q_i) / ( √Σ(d_i)² × √Σ(q_i)² )

    or, in set notation, |X ∩ Y| / ( √|X| × √|Y| )

  • The Jaccard coefficient:

    σ(D,Q) = |X ∩ Y| / |X ∪ Y|
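A small sketch (mine, not from the slides) of these measures, plus the Euclidean and
Manhattan distances mentioned on the previous slide, for short term-weight vectors
(the example vectors and term sets are made up):

    import math

    def cosine(d, q):
        dot = sum(di * qi for di, qi in zip(d, q))
        return dot / (math.sqrt(sum(di * di for di in d)) *
                      math.sqrt(sum(qi * qi for qi in q)))

    def euclidean(d, q):
        return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

    def manhattan(d, q):
        return sum(abs(di - qi) for di, qi in zip(d, q))

    def jaccard(x, y):
        x, y = set(x), set(y)
        return len(x & y) / len(x | y)

    print(cosine((1, 2), (2, 4)))      # ≈ 1.0: same direction, different lengths
    print(cosine((1, 2), (2, 1)))      # ≈ 0.8
    print(euclidean((1, 2), (2, 4)))   # ≈ 2.24
    print(manhattan((1, 2), (2, 4)))   # 3
    print(jaccard({"michigan", "ohio"}, {"ohio", "indiana"}))   # ≈ 0.33

Note how the cosine ignores vector length (useful for comparing documents of
different sizes), while the distance measures do not.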
50
Exercise
  • Compute the cosine scores σ(D1,D2) and σ(D1,D3)
    for the documents D1 = <1,3>, D2 = <100,300> and
    D3 = <3,1>
  • Compute the corresponding Euclidean distances,
    Manhattan distances, and Jaccard coefficients.

51
Readings
  • (1) MRS1, MRS2, MRS5 (Zipf)
  • (2) MRS7, MRS8
  • (MRS = chapters of the required Manning, Raghavan, and Schütze text)