Web Information Retrieval - PowerPoint PPT Presentation

About This Presentation
Title:

Web Information Retrieval

Slides: 46
Provided by: Claud229
Learn more at: http://www.kbs.uni-hannover.de
Transcript and Presenter's Notes



1
Web Information Retrieval
  • Web Science Course

2
(No Transcript)
3
What to Expect
  • Information Retrieval Basics
  • IR Systems
  • History of IR
  • Retrieval Models
  • Vector Space Model
  • Information Retrieval on the Web
  • Differences to traditional IR
  • Selected Papers

4
Information Retrieval Basics
5
Information Retrieval (IR)
  • The indexing and retrieval of textual documents.
  • Concerned firstly with retrieving relevant
    documents to a query.
  • Concerned secondly with retrieving from large
    sets of documents efficiently.

6
Typical IR Task
  • Given
  • A corpus of textual natural-language documents.
  • A user query in the form of a textual string.
  • Find
  • A ranked set of documents that are relevant to
    the query.

7
IR System
8
Relevance
  • Relevance is a subjective judgment and may
    include
  • Being on the proper subject.
  • Being timely (recent information).
  • Being authoritative (from a trusted source).
  • Satisfying the goals of the user and his/her
    intended use of the information (information
    need).

9
Keyword Search
  • Simplest notion of relevance is that the query
    string appears verbatim in the document.
  • Slightly less strict notion is that the words in
    the query appear frequently in the document, in
    any order (bag of words).
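
The two notions of relevance above can be sketched in a few lines of Python. This is a minimal illustration, not a production matcher: the function names `verbatim_match` and `bag_of_words_score` and the sample document are assumptions made for this example.

```python
def verbatim_match(query: str, document: str) -> bool:
    """Strict notion: the query string appears verbatim (case-insensitive)."""
    return query.lower() in document.lower()


def bag_of_words_score(query: str, document: str) -> int:
    """Looser notion: total occurrences of the query words, in any order."""
    doc_words = document.lower().split()
    return sum(doc_words.count(word) for word in query.lower().split())


# Hypothetical sample document, used only for illustration.
doc = "the retrieval of information from the web is called web information retrieval"

print(verbatim_match("information retrieval", doc))      # True
print(verbatim_match("retrieval information", doc))      # False
print(bag_of_words_score("retrieval information", doc))  # 4
```

The reordered query fails the verbatim test but still scores under the bag-of-words view, because word order is ignored there.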

10
Problems with Keywords
  • May not retrieve relevant documents that include
    synonymous terms.
  • restaurant vs. café
  • PRC vs. China
  • May retrieve irrelevant documents that include
    ambiguous terms.
  • bat (baseball vs. mammal)
  • Apple (company vs. fruit)
  • bit (unit of data vs. act of eating)

11
Intelligent IR
  • Taking into account the meaning of the words
    used.
  • Taking into account the order of words in the
    query.
  • Adapting to the user based on direct or indirect
    feedback.
  • Taking into account the authority of the source.

12
IR System Architecture
[Diagram: IR system architecture. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager. Data flowing between them: User Need, Text, Logical View, User Feedback, Query, Inverted file, Index, Text Database, Retrieved Docs, Ranked Docs.]
13
IR System Components
  • Text Operations forms index words (tokens).
  • Stopword removal
  • Stemming
  • Indexing constructs an inverted index of word to
    document pointers.
  • Searching retrieves documents that contain a
    given query token from the inverted index.
  • Ranking scores all retrieved documents according
    to a relevance metric.

14
IR System Components (continued)
  • User Interface manages interaction with the user
  • Query input and document output.
  • Relevance feedback.
  • Visualization of results.
  • Query Operations transform the query to improve
    retrieval
  • Query expansion using a thesaurus.
  • Query transformation using relevance feedback.
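
Query expansion with a thesaurus might look like the following sketch. The `THESAURUS` dictionary is a hypothetical toy resource built from the synonym pairs on the "Problems with Keywords" slide, and `expand_query` is a name chosen for this example.

```python
# Hypothetical toy thesaurus; a real system would use a curated resource.
THESAURUS = {
    "restaurant": ["café"],
    "prc": ["china"],
}


def expand_query(query: str, thesaurus: dict[str, list[str]]) -> list[str]:
    """Return the query terms followed by any synonyms found in the thesaurus."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for synonym in thesaurus.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


print(expand_query("PRC restaurant", THESAURUS))
# ['prc', 'restaurant', 'china', 'café']
```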

15
History of IR
  • 1960-70s
  • Initial exploration of text retrieval systems
    for small corpora of scientific abstracts, and
    law and business documents.
  • 1980s
  • Large document database systems, many run by
    companies
  • 1990s
  • Searching FTPable documents on the Internet
  • Searching the World Wide Web

16
Recent IR History
  • 2000s
  • Link analysis for Web Search
  • Automated Information Extraction
  • Question Answering
  • Multimedia IR
  • Cross-Language IR
  • Document Summarization

17
Vector Space Retrieval Model
18
Retrieval Models
  • A retrieval model specifies the details of
  • Document representation
  • Query representation
  • Retrieval function
  • Determines a notion of relevance.
  • Notion of relevance can be binary or continuous
    (i.e. ranked retrieval).

19
Preprocessing Steps
  • Strip unwanted characters/markup (e.g. HTML
    tags, punctuation, numbers, etc.).
  • Break into tokens (keywords) on whitespace.
  • Stem tokens to root words
  • computational → comput
  • Remove common stopwords (e.g. a, the, it, etc.).
  • Build inverted index (keyword → list of docs
    containing it).
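
The preprocessing steps can be sketched as a small pipeline. This is a rough illustration under stated assumptions: the stopword list is a tiny sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemming algorithm such as Porter's.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "it", "of", "and", "is"}  # tiny illustrative list
SUFFIXES = ("ational", "ation", "ing", "es", "s")  # crude stand-in for a real stemmer


def crude_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # strip punctuation and numbers
    tokens = text.lower().split()             # break into tokens on whitespace
    stems = [crude_stem(t) for t in tokens]   # stem tokens to root words
    return [s for s in stems if s not in STOPWORDS]  # remove stopwords


def build_inverted_index(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each keyword to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical two-document corpus.
docs = {
    "d1": "<p>Computational models of retrieval.</p>",
    "d2": "The retrieval of 42 documents",
}
index = build_inverted_index(docs)
print(preprocess(docs["d1"]))  # ['comput', 'model', 'retrieval']
print(index["retrieval"])      # ['d1', 'd2']
```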

20
The Vector-Space Model
  • Assume t distinct terms remain after preprocessing;
    call them index terms or the vocabulary.
  • These orthogonal terms form a vector space.
  • Dimension = t = |vocabulary|
  • Each term, i, in a document or query, j, is given
    a real-valued weight, $w_{ij}$.
  • Both documents and queries are expressed as
    t-dimensional vectors
  • $d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$

21
Graphic Representation
  • Example
  • $D_1 = 2T_1 + 3T_2 + 5T_3$
  • $D_2 = 3T_1 + 7T_2 + T_3$
  • $Q = 0T_1 + 0T_2 + 2T_3$
  • Is D1 or D2 more similar to Q?
  • How to measure the degree of similarity?
    Distance? Angle? Projection?

22
Term Weights: Term Frequency
  • More frequent terms in a document are more
    important, i.e. more indicative of the topic.
  • $f_{ij}$ = frequency of term i in document j
  • May want to normalize term frequency (tf) by
    dividing by the frequency of the most common term
    in the document
  • $tf_{ij} = f_{ij} / \max_i \{f_{ij}\}$
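
A minimal sketch of this normalization, assuming the document has already been tokenized; `normalized_tf` is a name chosen for illustration, and the function expects a non-empty token list.

```python
from collections import Counter


def normalized_tf(tokens: list[str]) -> dict[str, float]:
    """Divide each term's count by the count of the most common term."""
    counts = Counter(tokens)
    max_count = max(counts.values())  # raises ValueError on an empty document
    return {term: count / max_count for term, count in counts.items()}


# Hypothetical token list with term counts A(3), B(2), C(1).
tf = normalized_tf(["A", "A", "A", "B", "B", "C"])
print(tf["A"])  # 1.0
```

The most frequent term always gets tf = 1, so long and short documents are put on a comparable scale.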

23
Term Weights: Inverse Document Frequency
  • Terms that appear in many different documents are
    less indicative of overall topic.
  • $df_i$ = document frequency of term i
    = number of documents containing term i
  • $idf_i$ = inverse document frequency of term i
    $= \log_2(N / df_i)$
    (N = total number of documents)
  • An indication of a term's discrimination power.
  • Log used to dampen the effect relative to tf.
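
The dampening effect of the logarithm can be seen by comparing the raw ratio N/df with its base-2 log. This is a small sketch; the function name `idf` and the sample counts are assumptions for illustration.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: log2(N / df)."""
    return math.log2(num_docs / doc_freq)


N = 10_000  # assumed collection size
print(N / 50, round(idf(N, 50), 1))  # 200.0 7.6
print(N / N, round(idf(N, N), 1))    # 1.0 0.0
```

A term found in 50 of 10,000 documents has a raw ratio of 200 but an idf of only about 7.6, which keeps idf on a scale comparable to tf. A term that occurs in every document gets idf 0 and so carries no discriminating power.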

24
TF-IDF Weighting
  • A typical combined term importance indicator is
    tf-idf weighting
  • $w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \cdot \log_2(N / df_i)$
  • A term occurring frequently in the document but
    rarely in the rest of the collection is given
    high weight.
  • Many other ways of determining term weights have
    been proposed.
  • Experimentally, tf-idf has been found to work
    well.

25
Computing TF-IDF -- An Example
  • Given a document containing terms with given
    frequencies
  • A(3), B(2), C(1)
  • Assume the collection contains 10,000 documents and
    the document frequencies of these terms are
  • A(50), B(1300), C(250)
  • Then
  • A: tf = 3/3; idf = $\log_2(10000/50) = 7.6$;
    tf-idf = 7.6
  • B: tf = 2/3; idf = $\log_2(10000/1300) = 2.9$;
    tf-idf = 2.0
  • C: tf = 1/3; idf = $\log_2(10000/250) = 5.3$;
    tf-idf = 1.8
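
The numbers in this example can be reproduced with a short script. The variable names are chosen for illustration; the term and document frequencies are those given on the slide.

```python
import math

N = 10_000                                     # documents in the collection
term_freq = {"A": 3, "B": 2, "C": 1}           # frequencies within the document
doc_freq = {"A": 50, "B": 1300, "C": 250}      # documents containing each term

max_freq = max(term_freq.values())
weights = {}
for term, f in term_freq.items():
    tf = f / max_freq                          # normalized term frequency
    idf = math.log2(N / doc_freq[term])        # inverse document frequency
    weights[term] = tf * idf
    print(f"{term}: tf={tf:.2f} idf={idf:.1f} tf-idf={weights[term]:.1f}")

# A: tf=1.00 idf=7.6 tf-idf=7.6
# B: tf=0.67 idf=2.9 tf-idf=2.0
# C: tf=0.33 idf=5.3 tf-idf=1.8
```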

26
Query Vector
  • Query vector is typically treated as a document
    and also tf-idf weighted.
  • Alternative is for the user to supply weights for
    the given query terms.

27
Similarity Measure
  • A similarity measure is a function that computes
    the degree of similarity between two vectors.
  • Using a similarity measure between the query and
    each document
  • It is possible to rank the retrieved documents in
    the order of presumed relevance.
  • It is possible to enforce a certain threshold so
    that the size of the retrieved set can be
    controlled.

28
Cosine Similarity Measure
  • Cosine similarity measures the cosine of the
    angle between two vectors.
  • Inner product normalized by the vector lengths.

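  • $\mathrm{CosSim}(d_j, q) = \dfrac{d_j \cdot q}{|d_j|\,|q|} = \dfrac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\,\sqrt{\sum_{i=1}^{t} w_{iq}^2}}$
  • $D_1 = 2T_1 + 3T_2 + 5T_3$:
    $\mathrm{CosSim}(D_1, Q) = 10 / \sqrt{(4+9+25)(0+0+4)} = 0.81$
  • $D_2 = 3T_1 + 7T_2 + 1T_3$:
    $\mathrm{CosSim}(D_2, Q) = 2 / \sqrt{(9+49+1)(0+0+4)} = 0.13$
  • $Q = 0T_1 + 0T_2 + 2T_3$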
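
The two similarity values for D1 and D2 can be checked with a short function. This is a sketch using plain Python lists as dense vectors; `cos_sim` is a name chosen for this example, and returning 0.0 for a zero-length vector is an assumption of the sketch.

```python
import math


def cos_sim(a: list[float], b: list[float]) -> float:
    """Inner product of a and b, normalized by the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13
```

D1 is far more similar to Q because most of its weight lies on T3, the only term in the query.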
29
Naïve Implementation
  • Convert all documents in collection D to tf-idf
    weighted vectors, $d_j$, for keyword vocabulary V.
  • Convert query to a tf-idf-weighted vector q.
  • For each $d_j$ in D do
  • Compute score $s_j = \mathrm{CosSim}(d_j, q)$
  • Sort documents by decreasing score.
  • Present top ranked documents to the user.
  • Time complexity: $O(|V| \cdot |D|)$. Bad for large V
    and D!
  • $|V| = 10{,}000$; $|D| = 100{,}000$;
    $|V| \cdot |D| = 1{,}000{,}000{,}000$
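
The naive procedure can be sketched as a loop over every document vector. The function names and the `top_k` parameter are assumptions for illustration; the vectors are the D1, D2, and Q from the earlier example rather than real tf-idf weights.

```python
import math


def cos_sim(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_naive(doc_vectors: dict[str, list[float]],
               query_vector: list[float],
               top_k: int = 10) -> list[tuple[str, float]]:
    """Score every document against the query, then sort by decreasing score."""
    scores = [(doc_id, cos_sim(vec, query_vector))  # one O(|V|) pass per document
              for doc_id, vec in doc_vectors.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


doc_vectors = {"D2": [3, 7, 1], "D1": [2, 3, 5]}
ranking = rank_naive(doc_vectors, [0, 0, 2])
print([doc_id for doc_id, _ in ranking])  # ['D1', 'D2']
```

Every document is touched and every vector has |V| entries, which is where the O(|V|·|D|) cost comes from.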

30
Inverted Index
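
The transcript preserves only the title of this slide. As a hedged sketch of how an inverted index is commonly used to avoid the naive full scan, the code below accumulates scores only for documents that share at least one term with the query. All names and weights here are assumptions for illustration, not content recovered from the slide.

```python
import math
from collections import defaultdict

# Sparse document vectors: only non-zero weights are stored (toy values).
doc_weights = {
    "D1": {"T1": 2.0, "T2": 3.0, "T3": 5.0},
    "D2": {"T1": 3.0, "T2": 7.0, "T3": 1.0},
    "D3": {"T1": 4.0},
}

# Inverted index: term -> list of (doc_id, weight) postings.
index = defaultdict(list)
for doc_id, weights in doc_weights.items():
    for term, w in weights.items():
        index[term].append((doc_id, w))

# Document vector lengths, precomputed once at indexing time.
doc_norms = {d: math.sqrt(sum(w * w for w in ws.values()))
             for d, ws in doc_weights.items()}


def rank_with_index(query: dict[str, float]) -> list[tuple[str, float]]:
    """Cosine ranking that touches only the postings of the query terms."""
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    if q_norm == 0:
        return []
    accumulators = defaultdict(float)
    for term, q_w in query.items():
        for doc_id, d_w in index.get(term, []):
            accumulators[doc_id] += q_w * d_w  # partial inner product
    scored = [(doc_id, dot / (doc_norms[doc_id] * q_norm))
              for doc_id, dot in accumulators.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


result = rank_with_index({"T3": 2.0})
print([doc_id for doc_id, _ in result])  # ['D1', 'D2']
```

D3 never enters the computation because it has no posting for T3, so the work is proportional to the postings of the query terms rather than to |V|·|D|.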
31
Comments on Vector Space Models
  • Simple, mathematically based approach.
  • Considers both local (tf) and global (idf) word
    occurrence frequencies.
  • Provides partial matching and ranked results.
  • Tends to work quite well in practice despite
    obvious weaknesses.
  • Allows efficient implementation for large
    document collections.
  • Does not require all terms in the query

32
Web Search
33
Web Search
  • Application of IR to HTML documents on the World
    Wide Web.
  • Differences
  • Must assemble document corpus by spidering the
    web.
  • Can exploit the structural layout information in
    HTML (XML).
  • Documents change uncontrollably.
  • Can exploit the link structure of the web.
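
Both spidering and link analysis start from extracting the hyperlinks of a page. The sketch below does this with Python's standard `html.parser` and `urllib.parse` modules; the class name `LinkExtractor`, the sample page, and the example.com URLs are assumptions for illustration, and no network access is performed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute target URLs of all <a href="..."> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical page content.
page = ('<html><body><a href="/about.html">About</a> '
        '<a href="http://example.org/">Ext</a></body></html>')

extractor = LinkExtractor("http://example.com/index.html")
extractor.feed(page)
print(extractor.links)
# ['http://example.com/about.html', 'http://example.org/']
```

A spider would add the extracted URLs to its queue of pages to fetch, while a link-analysis step would record them as edges in the web graph.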

34
Web Search Using IR
IR System
35
The World Wide Web
  • Developed by Tim Berners-Lee in 1990 at CERN to
    organize research documents available on the
    Internet.
  • Combined idea of documents available by FTP with
    the idea of hypertext to link documents.
  • Developed initial HTTP network protocol, URLs,
    HTML, and first web server.

36
Web Search: Recent History
  • In 1998, Larry Page and Sergey Brin, Ph.D.
    students at Stanford, started Google. Main
    advance is use of link analysis to rank results
    partially based on authority.

37
Web Challenges for IR
  • Distributed Data: Documents spread over millions
    of different web servers.
  • Volatile Data: Many documents change or
    disappear rapidly (e.g. dead links).
  • Large Volume: Billions of separate documents.
  • Unstructured and Redundant Data: No uniform
    structure, HTML errors, up to 30% (near)
    duplicate documents.
  • Quality of Data: No editorial control, false
    information, poor quality writing, typos, etc.
  • Heterogeneous Data: Multiple media types (images,
    video, VRML), languages, character sets, etc.

38
Growth of Web Pages Indexed
[Figure: billions of pages indexed by Google, Inktomi, AllTheWeb, Teoma, and Altavista. Source: SearchEngineWatch, note from Jan 2004.]
Assuming 20KB per page, 1 billion pages is about
20 terabytes of data.
39
Graph Structure in the Web
http://www9.org/w9cdrom/160/160.html
40
Selected Papers
41
1. A Taxonomy of Web Search
  • Andrei Broder, 2002
  • Query log analysis and user survey
  • Classify web queries according to their intent
    into 3 classes
  • Navigational
  • Informational
  • Transactional
  • How global search engines evolved to deal with
    web-specific needs

42
2. Personalizing Search via Automated Analysis of
Interests and Activities
  • Jaime Teevan, Susan Dumais, Eric Horvitz, 2005
  • Formulate and study search personalization
    algorithms
  • Relevance feedback framework
  • Rich models of user interests built from
  • Previously issued queries
  • Previously visited Web pages
  • Documents and emails the user has read and created

43
3. Personalized Query Expansion for the Web
  • Paul Chirita, Claudiu Firan, Wolfgang Nejdl, 2007
  • Improve Web queries by expanding them
  • Five broad techniques for generating the
    additional query keywords
  • Term and compound level analysis
  • Global co-occurrence statistics
  • Use external thesauri

44
4. Boilerplate Detection using Shallow Text
Features
  • Christian Kohlschütter, Peter Fankhauser,
    Wolfgang Nejdl, 2010
  • Boilerplate text typically is not related to the
    main content
  • Analyze a small set of shallow text features for
    classifying the individual text elements in a Web
    page
  • Test impact of boilerplate removal to retrieval
    performance

45
For You to Choose
  • A Taxonomy of Web Search
  • Personalizing Search via Automated Analysis of
    Interests and Activities
  • Personalized Query Expansion for the Web
  • Boilerplate Detection using Shallow Text Features