Title: Web Information Retrieval
Provided by: Claud229
Learn more at: http://www.kbs.uni-hannover.de
Web Information Retrieval
  • Web Science Course

What to Expect
  • Information Retrieval Basics
  • IR Systems
  • History of IR
  • Retrieval Models
  • Vector Space Model
  • Information Retrieval on the Web
  • Differences to traditional IR
  • Selected Papers

Information Retrieval Basics
Information Retrieval (IR)
  • The indexing and retrieval of textual documents.
  • Concerned firstly with retrieving documents
    relevant to a query.
  • Concerned secondly with retrieving efficiently
    from large sets of documents.

Typical IR Task
  • Given
  • A corpus of textual natural-language documents.
  • A user query in the form of a textual string.
  • Find
  • A ranked set of documents that are relevant to
    the query.

IR System
  • Relevance is a subjective judgment and may involve
  • Being on the proper subject.
  • Being timely (recent information).
  • Being authoritative (from a trusted source).
  • Satisfying the goals of the user and his/her
    intended use of the information (information need).

Keyword Search
  • Simplest notion of relevance is that the query
    string appears verbatim in the document.
  • Slightly less strict notion is that the words in
    the query appear frequently in the document, in
    any order (bag of words).
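The two notions of relevance above can be sketched as simple predicates; a minimal illustration in Python (function names are my own, not from the slides):

```python
def verbatim_match(query, document):
    """Strictest notion: the query string appears verbatim."""
    return query.lower() in document.lower()

def bag_of_words_match(query, document):
    """Looser notion: every query word appears somewhere, in any order."""
    doc_words = set(document.lower().split())
    return all(w in doc_words for w in query.lower().split())

doc = "the big brown bat flew over the baseball field"
# "brown bat" matches verbatim; "bat brown" matches only as a bag of words.
```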

Problems with Keywords
  • May not retrieve relevant documents that include
    synonymous terms.
  • restaurant vs. café
  • PRC vs. China
  • May retrieve irrelevant documents that include
    ambiguous terms.
  • bat (baseball vs. mammal)
  • Apple (company vs. fruit)
  • bit (unit of data vs. act of eating)

Intelligent IR
  • Taking into account the meaning of the words used.
  • Taking into account the order of words in the
    query.
  • Adapting to the user based on direct or indirect
    feedback.
  • Taking into account the authority of the source.
IR System Architecture
  (Diagram, labels: User Interface, User Need, Text Operations,
  Logical View, Database Manager, Query Operations, User Feedback,
  Inverted File, Text Database, Ranked Docs, Retrieved Docs)
IR System Components
  • Text Operations forms index words (tokens).
  • Stopword removal
  • Stemming
  • Indexing constructs an inverted index of word to
    document pointers.
  • Searching retrieves documents that contain a
    given query token from the inverted index.
  • Ranking scores all retrieved documents according
    to a relevance metric.

IR System Components (continued)
  • User Interface manages interaction with the user
  • Query input and document output.
  • Relevance feedback.
  • Visualization of results.
  • Query Operations transform the query to improve
    retrieval
  • Query expansion using a thesaurus.
  • Query transformation using relevance feedback.

History of IR
  • 1960-70s
  • Initial exploration of text retrieval systems
    for small corpora of scientific abstracts, and
    law and business documents.
  • 1980s
  • Large document database systems, many run by
    companies.
  • 1990s
  • Searching FTPable documents on the Internet
  • Searching the World Wide Web

Recent IR History
  • 2000s
  • Link analysis for Web Search
  • Automated Information Extraction
  • Question Answering
  • Multimedia IR
  • Cross-Language IR
  • Document Summarization

Vector Space Retrieval Model
Retrieval Models
  • A retrieval model specifies the details of
  • Document representation
  • Query representation
  • Retrieval function
  • Determines a notion of relevance.
  • Notion of relevance can be binary or continuous
    (i.e. ranked retrieval).

Preprocessing Steps
  • Strip unwanted characters/markup (e.g. HTML
    tags, punctuation, numbers, etc.).
  • Break into tokens (keywords) on whitespace.
  • Stem tokens to root words
  • computational → comput
  • Remove common stopwords (e.g. a, the, it, etc.).
  • Build inverted index (keyword → list of docs
    containing it).
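The steps above can be sketched as a small pipeline. A minimal sketch, assuming English text; the crude suffix-chopping stands in for a real stemmer such as Porter's:

```python
import re

STOPWORDS = {"a", "an", "the", "it", "of", "and", "to", "in"}

def preprocess(text):
    """Strip markup/punctuation, tokenize, drop stopwords, stem."""
    text = re.sub(r"<[^>]+>", " ", text)           # strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters only
    tokens = text.split()                          # break on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Toy stemmer: chop common suffixes (a real system would use Porter).
    return [re.sub(r"(ational|ation|ing|ed|s)$", "", t) for t in tokens]

# preprocess("<b>The computational</b> indexing of documents")
# → ['comput', 'index', 'document']
```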

The Vector-Space Model
  • Assume t distinct terms after preprocessing; call
    them index terms or the vocabulary.
  • These orthogonal terms form a vector space.
  • Dimension = t = |vocabulary|
  • Each term, i, in a document or query, j, is given
    a real-valued weight, wij.
  • Both documents and queries are expressed as
    t-dimensional vectors
  • dj = (w1j, w2j, ..., wtj)

Graphic Representation
  • Example
  • D1 = 2T1 + 3T2 + 5T3
  • D2 = 3T1 + 7T2 + T3
  • Q = 0T1 + 0T2 + 2T3
  • Is D1 or D2 more similar to Q?
  • How to measure the degree of similarity?
    Distance? Angle? Projection?

Term Weights: Term Frequency
  • More frequent terms in a document are more
    important, i.e. more indicative of the topic.
  • fij = frequency of term i in document j
  • May want to normalize term frequency (tf) by
    dividing by the frequency of the most common term
    in the document
  • tfij = fij / maxi{fij}

Term Weights: Inverse Document Frequency
  • Terms that appear in many different documents are
    less indicative of overall topic.
  • dfi = document frequency of term i
    = number of documents containing term i
  • idfi = inverse document frequency of term i
    = log2 (N / dfi)
  • (N = total number of documents)
  • An indication of a term's discrimination power.
  • Log used to dampen the effect relative to tf.

TF-IDF Weighting
  • A typical combined term importance indicator is
    tf-idf weighting
  • wij = tfij · idfi = tfij · log2 (N / dfi)
  • A term occurring frequently in the document but
    rarely in the rest of the collection is given
    high weight.
  • Many other ways of determining term weights have
    been proposed.
  • Experimentally, tf-idf has been found to work well.
Computing TF-IDF -- An Example
  • Given a document containing terms with given
    frequencies
  • A(3), B(2), C(1)
  • Assume collection contains 10,000 documents and
    document frequencies of these terms are
  • A(50), B(1300), C(250)
  • Then
  • A: tf = 3/3; idf = log2(10000/50) = 7.6;
    tf-idf = 7.6
  • B: tf = 2/3; idf = log2(10000/1300) = 2.9;
    tf-idf = 2.0
  • C: tf = 1/3; idf = log2(10000/250) = 5.3;
    tf-idf = 1.8
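The worked example above can be reproduced with a few lines of Python (variable names are my own):

```python
import math

def tf_idf(freq, max_freq, df, n_docs):
    """Normalized term frequency times inverse document frequency."""
    tf = freq / max_freq
    idf = math.log2(n_docs / df)
    return tf * idf

# From the slide: term frequencies A(3), B(2), C(1); N = 10,000 docs;
# document frequencies A(50), B(1300), C(250).
N, max_f = 10_000, 3
weights = {t: tf_idf(f, max_f, df, N)
           for t, (f, df) in {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}.items()}
# weights ≈ {A: 7.6, B: 2.0, C: 1.8}, matching the slide.
```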

Query Vector
  • Query vector is typically treated as a document
    and also tf-idf weighted.
  • Alternative is for the user to supply weights for
    the given query terms.

Similarity Measure
  • A similarity measure is a function that computes
    the degree of similarity between two vectors.
  • Using a similarity measure between the query and
    each document
  • It is possible to rank the retrieved documents in
    the order of presumed relevance.
  • It is possible to enforce a certain threshold so
    that the size of the retrieved set can be
    controlled.
Cosine Similarity Measure
  • Cosine similarity measures the cosine of the
    angle between two vectors.
  • Inner product normalized by the vector lengths.
  • CosSim(dj, q) = (dj · q) / (|dj| · |q|)
  • D1 = 2T1 + 3T2 + 5T3
    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
  • D2 = 3T1 + 7T2 + 1T3
    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
  • Q = 0T1 + 0T2 + 2T3
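The D1/D2 numbers above can be checked directly; a minimal sketch:

```python
import math

def cos_sim(d, q):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    return dot / (math.sqrt(sum(w * w for w in d)) *
                  math.sqrt(sum(w * w for w in q)))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
# cos_sim(D1, Q) ≈ 0.81 and cos_sim(D2, Q) ≈ 0.13, so D1 ranks higher.
```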
Naïve Implementation
  • Convert all documents in collection D to tf-idf
    weighted vectors, dj, for keyword vocabulary V.
  • Convert query to a tf-idf-weighted vector q.
  • For each dj in D do
  • Compute score sj = cosSim(dj, q)
  • Sort documents by decreasing score.
  • Present top ranked documents to the user.
  • Time complexity O(|V|·|D|). Bad for large |V| and
    |D|!
  • |V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000
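The loop above can be sketched as follows (for brevity the vectors here are the raw D1/D2 term counts from the earlier example rather than tf-idf weights):

```python
import math

def cos_sim(d, q):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = (math.sqrt(sum(a * a for a in d)) *
            math.sqrt(sum(b * b for b in q)))
    return dot / norm if norm else 0.0

def naive_retrieve(docs, q, top_k=10):
    """Score every document against the query, then sort: O(|V|·|D|)."""
    scored = [(name, cos_sim(vec, q)) for name, vec in docs.items()]
    scored.sort(key=lambda s: s[1], reverse=True)
    return scored[:top_k]

docs = {"D1": [2, 3, 5], "D2": [3, 7, 1]}
ranked = naive_retrieve(docs, [0, 0, 2])
# D1 (≈0.81) ranks above D2 (≈0.13).
```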

Inverted Index
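The slide's figure is not reproduced here. The idea: map each term to the list of documents containing it, so the scoring loop only visits documents that share at least one query term. A minimal sketch:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for t in tokens:
            index[t].add(doc_id)
    return index

docs = {"d1": ["web", "search", "engine"],
        "d2": ["web", "crawler"],
        "d3": ["database", "engine"]}
index = build_inverted_index(docs)
# For the query "engine", only d1 and d3 need to be scored.
```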
Comments on Vector Space Models
  • Simple, mathematically based approach.
  • Considers both local (tf) and global (idf) word
    occurrence frequencies.
  • Provides partial matching and ranked results.
  • Tends to work quite well in practice despite
    obvious weaknesses.
  • Allows efficient implementation for large
    document collections.
  • Does not require all terms in the query to appear
    in a document.

Web Search
Web Search
  • Application of IR to HTML documents on the World
    Wide Web.
  • Differences
  • Must assemble document corpus by spidering the Web.
  • Can exploit the structural layout information in
    HTML (XML).
  • Documents change uncontrollably.
  • Can exploit the link structure of the web.

Web Search Using IR
  (Diagram: documents spidered from the Web form the corpus indexed
  by an IR System.)
The World Wide Web
  • Developed by Tim Berners-Lee in 1990 at CERN to
    organize research documents available on the
    Internet.
  • Combined idea of documents available by FTP with
    the idea of hypertext to link documents.
  • Developed initial HTTP network protocol, URLs,
    HTML, and first web server.

Web Search Recent History
  • In 1998, Larry Page and Sergey Brin, Ph.D.
    students at Stanford, started Google. Main
    advance is use of link analysis to rank results
    partially based on authority.

Web Challenges for IR
  • Distributed Data: Documents spread over millions
    of different web servers.
  • Volatile Data: Many documents change or
    disappear rapidly (e.g. dead links).
  • Large Volume: Billions of separate documents.
  • Unstructured and Redundant Data: No uniform
    structure, HTML errors, up to 30% (near-)
    duplicate documents.
  • Quality of Data: No editorial control, false
    information, poor quality writing, typos, etc.
  • Heterogeneous Data: Multiple media types (images,
    video, VRML), languages, character sets, etc.

Growth of Web Pages Indexed
  (Chart: billions of pages indexed by Google, Inktomi, AllTheWeb,
  Teoma, and Altavista; see note from Jan 2004.)
  • Assuming 20KB per page, 1 billion pages is about
    20 terabytes of data.
Graph Structure in the Web
Selected Papers
1. A Taxonomy of Web Search
  • Andrei Broder, 2002
  • Query log analysis and a user survey
  • Classify web queries according to their intent
    into 3 classes
  • Navigational
  • Informational
  • Transactional
  • How global search engines evolved to deal with
    web-specific needs

2. Personalizing Search via Automated Analysis of
Interests and Activities
  • Jaime Teevan, Susan Dumais, Eric Horvitz, 2005
  • Formulate and study search personalization
  • Relevance feedback framework
  • Rich models of user interests built from
  • Previously issued queries
  • Previously visited Web pages
  • Documents and emails the user has read and created

3. Personalized Query Expansion for the Web
  • Paul Chirita, Claudiu Firan, Wolfgang Nejdl, 2007
  • Improve Web queries by expanding them
  • Five broad techniques for generating the
    additional query keywords
  • Term and compound level analysis
  • Global co-occurrence statistics
  • Use external thesauri

4. Boilerplate Detection using Shallow Text Features
  • Christian Kohlschütter, Peter Fankhauser,
    Wolfgang Nejdl, 2010
  • Boilerplate text typically is not related to the
    main content
  • Analyze a small set of shallow text features for
    classifying the individual text elements in a Web
    page
  • Test the impact of boilerplate removal on retrieval
For You to Choose
  • A Taxonomy of Web Search
  • Personalizing Search via Automated Analysis of
    Interests and Activities
  • Personalized Query Expansion for the Web
  • Boilerplate Detection using Shallow Text Features