Search Engine Technology

1
Search Engine Technology (1)
Prof. Dragomir R. Radev
radev@cs.columbia.edu
2
SET FALL 2009
  • Introduction

7
Examples of search engines
  • Conventional (library catalog). Search by
    keyword, title, author, etc.
  • Text-based (Lexis-Nexis, Google, Yahoo!). Search
    by keywords. Limited search using queries in
    natural language.
  • Multimedia (QBIC, WebSeek, SaFe). Search by
    visual appearance (shapes, colors, ...).
  • Question answering systems (Ask, NSIR,
    Answerbus). Search in (restricted) natural
    language.
  • Clustering systems (Vivísimo, Clusty)
  • Research systems (Lemur, Nutch)

8
What does it take to build a search engine?
  • Decide what to index
  • Collect it
  • Index it (efficiently)
  • Keep the index up to date
  • Provide user-friendly query facilities
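As a rough end-to-end illustration of these steps, here is a toy Python sketch of my own (not from the slides), reusing the small example documents from the exercise later in the deck; a real engine would add crawling, persistent storage, index maintenance, and ranking:

  # Toy pipeline: "collect" a few documents, index them, answer a keyword query.
  docs = {
      "D1": "computer information retrieval",
      "D2": "computer retrieval",
      "D3": "information",
  }

  # Index: map each term to the set of documents that contain it.
  index = {}
  for doc_id, text in docs.items():
      for term in text.lower().split():
          index.setdefault(term, set()).add(doc_id)

  # A very simple query facility with AND semantics over the query terms.
  def search(query):
      postings = [index.get(t, set()) for t in query.lower().split()]
      return sorted(set.intersection(*postings)) if postings else []

  print(search("computer retrieval"))  # ['D1', 'D2']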

9
What else?
  • Understand the structure of the web for efficient
    crawling
  • Understand user information needs
  • Preprocess text and other unstructured data
  • Cluster data
  • Classify data
  • Evaluate performance

10
Goals of the course
  • Understand how search engines work
  • Understand the limits of existing search
    technology
  • Learn to appreciate the sheer size of the Web
  • Learn to write code for text indexing and
    retrieval
  • Learn about the state of the art in IR research
  • Learn to analyze textual and semi-structured data
    sets
  • Learn to appreciate the diversity of texts on the
    Web
  • Learn to evaluate information retrieval
  • Learn about standardized document collections
  • Learn about text similarity measures
  • Learn about semantic dimensionality reduction
  • Learn about the idiosyncrasies of hyperlinked
    document collections
  • Learn about web crawling
  • Learn to use existing software
  • Understand the dynamics of the Web by building
    appropriate mathematical models
  • Build working systems that assist users in
    finding useful information on the Web

11
Course logistics
  • Thursdays 6:10-8:00
  • Office hours: TBA
  • URL: http://www.cs.columbia.edu/cs6998
  • Instructor: Dragomir Radev
  • Email: radev@cs.columbia.edu
  • TAs:
  • Yves Petinot (ypetinot@cs.columbia.edu)
  • Kaushal Lahankar (knl2102@columbia.edu)

12
Course outline
  • Classic document retrieval: storing, indexing,
    retrieval.
  • Web retrieval: crawling, query processing.
  • Text and web mining: classification, clustering.
  • Network analysis: random graph models,
    centrality, diameter and clustering coefficient.

13
Syllabus
  • Introduction.
  • Queries and Documents. Models of Information
    retrieval. The Boolean model. The Vector model.
  • Document preprocessing. Tokenization. Stemming.
    The Porter algorithm. Storing, indexing and
    searching text. Inverted indexes.
  • Word distributions. The Zipf distribution. The
    Benford distribution. Heaps' law. TF-IDF. Vector
    space similarity and ranking.
  • Retrieval evaluation. Precision and Recall.
    F-measure. Reference collections. The TREC
    conferences.
  • Automated indexing/labeling. Compression and
    coding. Optimal codes.
  • String matching. Approximate matching.
  • Query expansion. Relevance feedback.
  • Text classification. Naive Bayes. Feature
    selection. Decision trees.

14
Syllabus
  • Linear classifiers. k-nearest neighbors.
    Perceptron. Kernel methods. Maximum-margin
    classifiers. Support vector machines.
    Semi-supervised learning.
  • Lexical semantics and Wordnet.
  • Latent semantic indexing. Singular value
    decomposition.
  • Vector space clustering. k-means clustering. EM
    clustering.
  • Random graph models. Properties of random graphs
    clustering coefficient, betweenness, diameter,
    giant connected component, degree distribution.
  • Social network analysis. Small worlds and
    scale-free networks. Power law distributions.
    Centrality.
  • Graph-based methods. Harmonic functions. Random
    walks.
  • PageRank. Hubs and authorities. Bipartite graphs.
    HITS.
  • Models of the Web.

15
Syllabus
  • Crawling the web. Webometrics. Measuring the size
    of the web. The bow-tie model.
  • Hypertext retrieval. Web-based IR. Document
    closures. Focused crawling.
  • Question answering
  • Burstiness. Self-triggerability
  • Information extraction
  • Adversarial IR. Human behavior on the web.
  • Text summarization
  • POSSIBLE TOPICS
  • Discovering communities, spectral clustering
  • Semi-supervised retrieval
  • Natural language processing. XML retrieval. Text
    tiling. Human behavior on the web.

16
Readings
  • Required: Information Retrieval by Manning,
    Schuetze, and Raghavan
    (http://www-csli.stanford.edu/schuetze/information-retrieval-book.html),
    freely available, hard copy for sale
  • Optional: Modeling the Internet and the Web:
    Probabilistic Methods and Algorithms by Pierre
    Baldi, Paolo Frasconi, Padhraic Smyth, Wiley,
    2003, ISBN 0-470-84906-1 (http://ibook.ics.uci.edu).
  • Papers from SIGIR, WWW and journals (to be
    announced in class).

17
Prerequisites
  • Linear algebra: vectors and matrices.
  • Calculus: finding extrema of functions.
  • Probabilities: random variables, discrete and
    continuous distributions, Bayes' theorem.
  • Programming: experience with at least one
    web-aware programming language such as Perl
    (highly recommended) or Java in a UNIX
    environment.
  • Required: CS account

18
Course requirements
  • Three assignments (30%)
  • Some of them will be in Perl. The rest can be
    done in any appropriate language. All will
    involve some data analysis and evaluation.
  • Final project (30%)
  • Research paper or software system.
  • Class participation (10%)
  • Final exam (30%)

19
Final project format
  • Research paper - using the SIGIR format. Students
    will be in charge of problem formulation,
    literature survey, hypothesis formulation,
    experimental design, implementation, and possibly
    submission to a conference like SIGIR or WWW.
  • Software system - develop a working system or
    API. Students will be responsible for identifying
    a niche problem, implementing a solution, and
    deploying it, either on the Web or as an
    open-source downloadable tool. The system can be
    either stand-alone or an extension to an existing
    one.

20
Project ideas
  • Build a question answering system.
  • Build a language identification system.
  • Social network analysis from the Web.
  • Participate in the Netflix challenge.
  • Query log analysis.
  • Build models of Web evolution.
  • Information diffusion in blogs or web.
  • Author-topic models of web pages.
  • Using the web for machine translation.
  • Building evolving models of web documents.
  • News recommendation system.
  • Compress the text of Wikipedia (losslessly).
  • Spelling correction using query logs.
  • Automatic query expansion.

21
List of projects from the past
  • Document Closures for Indexing
  • Tibet - Table Structure Recognition Library
  • Ruby Blog Memetracker
  • Sentence decomposition for more accurate
    information retrieval
  • Extracting Social Networks from LiveJournal
  • Google Suggest Programming Project (Java Swing
    Client and Lucene Back-End)
  • Leveraging Social Networks for Organizing and
    Browsing Shared Photographs
  • Media Bias and the Political Blogosphere
  • Measuring Similarity between search queries
  • Extracting Social Networks and Information about
    the people within them from Text
  • LSI dependency trees

22
Available corpora
  • Netflix challenge
  • AOL query logs
  • Blogs
  • Bio papers
  • AAN
  • Email
  • Generifs
  • Web pages
  • Political science corpus
  • VAST
  • del.icio.us
  • SMS
  • News data: aquaint, tdt, nantc, reuters, setimes,
    trec, tipster
  • Europarl (multilingual)
  • US congressional data
  • DMOZ
  • Pubmedcentral
  • DUC/TAC
  • Timebank
  • Wikipedia
  • wt2g/wt10g/wt100g
  • dotgov
  • RTE
  • Paraphrases
  • GENIA
  • Generifs
  • Hansards
  • IMDB
  • MTA/MTC
  • nie
  • cnnsumm
  • Poliblog
  • Sentiment
  • xml
  • epinions
  • Enron

23
Related courses elsewhere
  • Stanford (Chris Manning, Prabhakar Raghavan, and
    Hinrich Schuetze)
  • Cornell (Jon Kleinberg)
  • CMU (Yiming Yang and Jamie Callan)
  • UMass (James Allan)
  • UTexas (Ray Mooney)
  • Illinois (Chengxiang Zhai)
  • Johns Hopkins (David Yarowsky)
  • For a long list of courses related to Search
    Engines, Natural Language Processing, and Machine
    Learning, look here:
    http://tangra.si.umich.edu/clair/clair/courses.html

24
SET FALL 2009
2. Models of information retrieval: the vector
model and the Boolean model
25
The web is really large
  • 100 B pages
  • Dynamically generated content
  • New pages get added all the time
  • Technorati has 50M blogs
  • The size of the blogosphere doubles every 6
    months
  • Yahoo deals with 12TB of data per day (according
    to Ron Brachman)

26
Sample queries (from Excite)
In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government "WorkAbility I" conference
kitchen appliances
where can I find a chines rosewood tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's
Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running
27
Fun things to do with search engines
  • Googlewhack
  • Reduce document set size to 1
  • Find a query that will bring a given URL into
    the top 10

28
Key Terms Used in IR
  • QUERY: a representation of what the user is
    looking for - can be a list of words or a phrase.
  • DOCUMENT: an information entity that the user
    wants to retrieve
  • COLLECTION: a set of documents
  • INDEX: a representation of information that makes
    querying easier
  • TERM: word or concept that appears in a document
    or a query

29
Mappings and abstractions
[Diagram: "Reality" is mapped to "Data"; an "Information need" is mapped to a
"Query" (from Robert Korfhage's book)]
30
Documents
  • Not just printed paper
  • Can be records, pages, sites, images, people,
    movies
  • Document encoding (Unicode)
  • Document representation
  • Document preprocessing

31
Sample query sessions (from AOL)
  • toley spies grames -> tolley spies games ->
    totally spies games
  • tajmahal restaurant brooklyn ny -> taj mahal
    restaurant brooklyn ny -> taj mahal restaurant
    brooklyn ny 11209
  • do you love me like you say -> do you love me
    like you say lyrics -> do you love me like you
    say lyrics marvin gaye

M /data4/corpora/AOL-user-ct-collection
32
Characteristics of user queries
  • Sessions: users revisit their queries.
  • Very short queries: typically 2 words long.
  • A large number of typos.
  • A small number of popular queries. A long tail of
    infrequent ones.
  • Almost no use of advanced query operators, with
    the exception of double quotes.

33
Queries as documents
  • Advantages
  • Mathematically easier to manage
  • Problems
  • Different lengths
  • Syntactic differences
  • Repetitions of words (or lack thereof)

34
Document representations
  • Term-document matrix (m x n)
  • Document-document matrix (n x n)
  • Typical example in a medium-sized collection:
    3,000,000 documents (n) with 50,000 terms (m)
  • Typical example on the Web: n = 30,000,000,000,
    m = 1,000,000
  • Boolean vs. integer-valued matrices (a toy
    example follows below)
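A toy example of my own (not from the slides, reusing the exercise documents from later in the deck) of a Boolean term-document incidence matrix:

  # Boolean term-document matrix: rows = terms (m), columns = documents (n).
  docs = ["computer information retrieval", "computer retrieval", "information"]
  terms = sorted({t for d in docs for t in d.split()})

  matrix = [[1 if term in doc.split() else 0 for doc in docs] for term in terms]

  for term, row in zip(terms, matrix):
      print(f"{term:12s} {row}")
  # computer     [1, 1, 0]
  # information  [1, 0, 1]
  # retrieval    [1, 1, 0]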

35
Storage issues
  • Imagine a medium-sized collection with
    n = 3,000,000 and m = 50,000
  • How large a term-document matrix will be needed?
    (a back-of-the-envelope estimate follows below)
  • Is there any way to do better? Any heuristic?
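A back-of-the-envelope estimate (my own calculation, not from the slides): the full Boolean matrix has n x m cells, almost all of them zero, which is what motivates the inverted index on the next slide:

  # Size of a full Boolean term-document matrix for the collection above.
  n_docs, n_terms = 3_000_000, 50_000
  cells = n_docs * n_terms        # 150,000,000,000 matrix cells
  gib = cells / 8 / 2**30         # bytes at 1 bit per cell, expressed in GiB
  print(cells, round(gib, 1))     # 150000000000 17.5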

36
Inverted index
  • Instead of an incidence vector, use a posting
    table
  • CLEVELAND: D1, D2, D6
  • OHIO: D1, D5, D6, D7
  • Use linked lists to be able to insert new
    document postings in order and to remove existing
    postings.
  • Keep everything sorted! This gives you a
    logarithmic improvement in access (see the sketch
    below).
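A minimal sketch of my own (plain Python lists stand in for the linked lists mentioned above) of an inverted index with sorted posting lists, where bisect provides the logarithmic access the slide refers to:

  import bisect

  # Toy inverted index: term -> sorted list of document IDs (postings).
  index = {
      "CLEVELAND": ["D1", "D2", "D6"],
      "OHIO": ["D1", "D5", "D6", "D7"],
  }

  def add_posting(term, doc_id):
      # Insert doc_id into the term's postings, keeping the list sorted.
      postings = index.setdefault(term, [])
      i = bisect.bisect_left(postings, doc_id)
      if i == len(postings) or postings[i] != doc_id:
          postings.insert(i, doc_id)

  def contains(term, doc_id):
      # Binary search: logarithmic membership test in a sorted posting list.
      postings = index.get(term, [])
      i = bisect.bisect_left(postings, doc_id)
      return i < len(postings) and postings[i] == doc_id

  add_posting("OHIO", "D3")
  print(index["OHIO"])                # ['D1', 'D3', 'D5', 'D6', 'D7']
  print(contains("CLEVELAND", "D2"))  # True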

37
Basic operations on inverted indexes
  • Conjunction (AND): iterative merge of the two
    postings, O(x + y) (see the sketch below)
  • Disjunction (OR): very similar
  • Negation (NOT): can we still do it in O(x + y)?
  • Example: MICHIGAN AND NOT OHIO
  • Example: MICHIGAN OR NOT OHIO
  • Recursive operations
  • Optimization: start with the smallest sets
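A sketch of my own, assuming sorted posting lists (OHIO's postings are from the previous slide, MICHIGAN's are made up): AND walks both lists, AND NOT keeps only postings absent from the second list, both in O(x + y):

  def intersect(p1, p2):
      # AND: merge two sorted posting lists in O(x + y).
      i = j = 0
      out = []
      while i < len(p1) and j < len(p2):
          if p1[i] == p2[j]:
              out.append(p1[i]); i += 1; j += 1
          elif p1[i] < p2[j]:
              i += 1
          else:
              j += 1
      return out

  def and_not(p1, p2):
      # AND NOT: keep postings of p1 that do not occur in p2, still O(x + y).
      i = j = 0
      out = []
      while i < len(p1):
          if j == len(p2) or p1[i] < p2[j]:
              out.append(p1[i]); i += 1
          elif p1[i] == p2[j]:
              i += 1; j += 1
          else:
              j += 1
      return out

  michigan = ["D1", "D2", "D4", "D6"]
  ohio     = ["D1", "D5", "D6", "D7"]
  print(intersect(michigan, ohio))  # ['D1', 'D6']        MICHIGAN AND OHIO
  print(and_not(michigan, ohio))    # ['D2', 'D4']        MICHIGAN AND NOT OHIO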

38
Major IR models
  • Boolean
  • Vector
  • Probabilistic
  • Language modeling
  • Fuzzy retrieval
  • Latent semantic indexing

39
The Boolean model
[Venn diagram: two overlapping document sets D1 and D2, with the regions
labeled w, x, y, z]
40
Boolean queries
  • Operators: AND, OR, NOT, parentheses
  • Examples:
  • CLEVELAND AND NOT OHIO
  • (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA)
  • Ambiguous uses of AND and OR in human language
  • Exclusive vs. inclusive OR
  • Restrictive operator: AND or OR?

41
Canonical forms of queries
  • De Morgan's laws (checked on toy sets below)

NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
  • Normal forms
  • Conjunctive normal form (CNF)
  • Disjunctive normal form (DNF)
  • Reference librarians prefer CNF - why?
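A quick sanity check of De Morgan's laws on toy posting sets (my own example, not from the slides; NOT is taken as the complement with respect to the whole collection):

  # De Morgan's laws over document sets, with NOT = complement w.r.t. all_docs.
  all_docs = {"D1", "D2", "D3", "D4", "D5"}
  A = {"D1", "D2", "D4"}   # e.g. postings of some term A
  B = {"D2", "D3"}         # e.g. postings of some term B

  print((all_docs - (A & B)) == ((all_docs - A) | (all_docs - B)))  # True
  print((all_docs - (A | B)) == ((all_docs - A) & (all_docs - B)))  # True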

42
Evaluating Boolean queries
  • Incidence vectors
  • CLEVELAND: 1100010
  • OHIO: 1000111
  • Examples (evaluated with bitwise operations
    below):
  • CLEVELAND AND OHIO
  • CLEVELAND AND NOT OHIO
  • CLEVELAND OR OHIO
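A sketch of my own evaluating the three example queries with bitwise operations on the slide's two incidence vectors (7 documents, one bit each):

  # Boolean retrieval over incidence vectors stored as 7-bit integers.
  CLEVELAND = int("1100010", 2)
  OHIO      = int("1000111", 2)
  MASK      = (1 << 7) - 1            # keeps NOT within the 7 document bits

  def show(bits):
      return format(bits, "07b")

  print(show(CLEVELAND & OHIO))            # 1000010  CLEVELAND AND OHIO
  print(show(CLEVELAND & (~OHIO & MASK)))  # 0100000  CLEVELAND AND NOT OHIO
  print(show(CLEVELAND | OHIO))            # 1100111  CLEVELAND OR OHIO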

43
Exercise
  • D1: computer information retrieval
  • D2: computer retrieval
  • D3: information
  • D4: computer information
  • Q1: information AND retrieval
  • Q2: information AND NOT computer

44
Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT
chaucer) AND (swift OR shakespeare))
45
How to deal with?
  • Multi-word phrases?
  • Document ranking?

46
The Vector model
[Diagram: documents Doc 1, Doc 2, and Doc 3 plotted as vectors in a
three-dimensional space with axes Term 1, Term 2, and Term 3]
47
Vector queries
  • Each document is represented as a vector
  • Inefficient representation
  • Dimensional compatibility

48
The matching process
  • Document space
  • Matching is done between a document and a query
    (or between two documents)
  • Distance vs. similarity measures.
  • Euclidean distance, Manhattan distance, Word
    overlap, Jaccard coefficient, etc.

49
Miscellaneous similarity measures
  • The Cosine measure (normalized dot product)

    σ(D,Q) = Σ (d_i × q_i) / ( √(Σ d_i²) × √(Σ q_i²) )

    (for binary vectors this equals |X ∩ Y| / ( √|X| × √|Y| ), where X and Y
    are the term sets of D and Q)

  • The Jaccard coefficient

    σ(D,Q) = |X ∩ Y| / |X ∪ Y|
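A small sketch of my own (with made-up vectors, not the exercise data) computing the cosine and Jaccard measures, plus the Euclidean and Manhattan distances mentioned on the previous slide:

  import math

  def cosine(d, q):
      # Normalized dot product of two equal-length term-weight vectors.
      dot = sum(di * qi for di, qi in zip(d, q))
      return dot / (math.sqrt(sum(x * x for x in d)) *
                    math.sqrt(sum(x * x for x in q)))

  def jaccard(d, q):
      # |X intersect Y| / |X union Y| over the sets of non-zero terms.
      x = {i for i, v in enumerate(d) if v}
      y = {i for i, v in enumerate(q) if v}
      return len(x & y) / len(x | y)

  def euclidean(d, q):
      return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

  def manhattan(d, q):
      return sum(abs(di - qi) for di, qi in zip(d, q))

  D, Q = [2, 0, 1], [1, 1, 0]
  print(round(cosine(D, Q), 3))     # 0.632
  print(round(jaccard(D, Q), 3))    # 0.333
  print(round(euclidean(D, Q), 3))  # 1.732
  print(manhattan(D, Q))            # 3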
50
Exercise
  • Compute the cosine scores σ(D1,D2) and σ(D1,D3)
    for the documents D1 = <1,3>, D2 = <100,300>, and
    D3 = <3,1>
  • Compute the corresponding Euclidean distances,
    Manhattan distances, and Jaccard coefficients.

51
Readings
  • (1) MRS1, MRS2, MRS5 (Zipf)
  • (2) MRS7, MRS8