Web Scale Information Discovery - PowerPoint PPT Presentation

1
Web Scale Information Discovery
  • CS 431, 20040329
  • Carl Lagoze, Cornell University

Acknowledgements: Luis Gravano, Andreas Paepcke,
Bill Arms
2
Search Strategies
  • Metadata harvesting
  • Automated discovery
  • Federated searching
  • Meta-searching

3
Web Search Strategies - Metadata Harvesting
(Figure: metadata)
4
Web Search Strategies - Metadata Harvesting
(Figure: metadata)
5
Searching via metadata harvesting - examples
  • NSDL - http://www.nsdl.org
  • OAIster - http://oaister.umdl.umich.edu/o/oaister/

6
Web Search Strategies - Crawling and Automated
Indexing
(Figure: central index)
7
Automated Information Discovery
Creating metadata records manually is
labor-intensive and hence expensive. The aim of
automated information discovery is for users to
discover information without using skilled human
effort to build indexes.
8
Resources for Automated Information Discovery
Computer power
  • brute force computing
  • ranking methods
  • automatic generation of metadata
The intelligence of the user
  • browsing
  • relevance feedback
  • information visualization
9
Similarity Ranking
  • Ranking methods using similarity
  • Measure the degree of similarity between a query
    and a document (or between two documents).
  • Basic technique is the vector space model with
    term weighting.

(Figure: requests and documents. How similar is a
document to a request?)
10
Vector Space Methods Concept
n-dimensional space, where n is the total number
of different terms (words) in a set of
documents. Each document is represented by a
vector, with magnitude in each dimension equal to
the (weighted) number of times that the
corresponding term appears in the
document. Similarity between two documents is
measured by the angle between their vectors: the
smaller the angle, the more similar the
documents. Much of this work was carried out by
Gerard Salton and colleagues in Cornell's
computer science department.
11
Example 1 Incidence Array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox

terms  ant  bee  cat  dog  eel  fox  gnu  hog
d1      1    1    0    0    0    0    0    0
d2      1    1    0    1    0    0    0    1
d3      0    0    1    1    1    1    1    0

Weights: t_ij = 1 if document i contains term j,
and zero otherwise
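
As a minimal sketch of how the incidence array can be used for similarity, the following Python builds the 0/1 vectors for d1, d2 and d3 and compares them by the cosine of the angle between them. The function names are illustrative, and cosine is one common way to turn the angle into a score, not necessarily the exact measure used in the lecture.

```python
import math

# Documents from the example, as lists of terms.
docs = {
    "d1": ["ant", "ant", "bee"],
    "d2": ["bee", "hog", "ant", "dog"],
    "d3": ["cat", "gnu", "dog", "eel", "fox"],
}

# One dimension per distinct term, in alphabetical order.
terms = sorted({t for words in docs.values() for t in words})


def incidence_vector(words):
    """Return t_ij = 1 if the document contains term j, else 0."""
    present = set(words)
    return [1 if t in present else 0 for t in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = {name: incidence_vector(words) for name, words in docs.items()}
sim_d1_d2 = cosine(vectors["d1"], vectors["d2"])
sim_d1_d3 = cosine(vectors["d1"], vectors["d3"])
sim_d2_d3 = cosine(vectors["d2"], vectors["d3"])
```

With these vectors, d1 and d2 share two terms and come out most similar, while d1 and d3 share none and score zero.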
12
Reasons for Term Weighting
  • Similarity using an incidence matrix measures the
    occurrences of terms, but no other
    characteristics of the documents.
  • Terms are more useful for information retrieval
    if they
  • appear several times in one document (weighting
    by term frequency)
  • only appear in some documents (weighting by
    document frequency)
  • appear in short documents (weighting by document
    length)

13
Inverse Document Frequency
Concept: A term that occurs in a few documents is
likely to be a better discriminator than a term
that appears in most or all documents.
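
A small sketch of the idea, reusing the three example documents. The slide does not give a formula, so the logarithmic form log(N / df) used here is an assumption (it is one widely used variant), and the helper names are illustrative.

```python
import math

docs = {
    "d1": ["ant", "ant", "bee"],
    "d2": ["bee", "hog", "ant", "dog"],
    "d3": ["cat", "gnu", "dog", "eel", "fox"],
}


def idf(term):
    """Inverse document frequency, assuming the form log(N / df)."""
    df = sum(1 for words in docs.values() if term in words)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(docs) / df)


def tf_idf(term, doc_name):
    """Term frequency in one document multiplied by the term's idf."""
    return docs[doc_name].count(term) * idf(term)


idf_ant = idf("ant")        # occurs in 2 of 3 documents
idf_cat = idf("cat")        # occurs in 1 of 3 documents
weight_ant_d1 = tf_idf("ant", "d1")
```

Here "cat", which occurs in only one document, receives a higher idf than "ant", which occurs in two.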
14
Issues in extending traditional IR for the Web
  • Traditional TREC benchmarks are relatively small
    scale (large is about 20GB)
  • Web queries are very short
  • Quality is an issue in ranking
  • Sex, lies, and the hidden web
  • Polysemy due to domain overlap
  • Web has context and hints
  • Structure of pages (e.g. html title might be
    rated higher)
  • Implicit metadata of link context
  • Anchor text of citing pages
  • Weighting influenced by citation structure
    (recall Garfield)

15
PageRank Algorithm (Google)
Concept: The rank of a web page is higher if many
pages link to it. Links from highly ranked pages
are given greater weight than links from less
highly ranked pages.
16
Google Example
17
Adjacency Matrix
18
Normalize by Number of Links from Page
19
Iterate until convergence
20
Motivating the Damping Factor
21
PageRank with Damping Factor Intuitive Model
A user:
1. Starts at a random page on the web
2a. With probability p, selects any random page
and jumps to it
2b. With probability 1-p, selects a random
hyperlink from the current page and jumps to the
corresponding page
3. Repeats Steps 2a and 2b a very large number of
times
Pages are ranked according to the relative
frequency with which they are visited.
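
The intuitive model can be simulated directly. The sketch below is illustrative only: the four-page link graph, the value p = 0.15, the step count and the fixed random seed are all assumptions, not values from the lecture. Pages with no outgoing links are handled by always jumping to a random page.

```python
import random

# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def random_surfer(links, p=0.15, steps=100_000, seed=0):
    """Estimate visit frequencies by simulating the random-surfer model."""
    rng = random.Random(seed)
    pages = list(links)
    visits = dict.fromkeys(pages, 0)
    current = rng.choice(pages)            # step 1: start at a random page
    for _ in range(steps):                 # step 3: repeat many times
        if rng.random() < p or not links[current]:
            current = rng.choice(pages)    # step 2a: jump to any random page
        else:
            current = rng.choice(links[current])  # step 2b: follow a random link
        visits[current] += 1
    return {page: count / steps for page, count in visits.items()}


frequencies = random_surfer(links)
```

In this graph, page D has no incoming links, so it is reached only through random jumps and ends up with the lowest visit frequency.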
22
The PageRank Iteration
The basic method iterates using the normalized
link matrix, B:
$w_k = B w_{k-1}$
This w is the high order eigenvector of B.
Google iterates using a damping factor. The
method iterates using a matrix B', where
$B' = dN + (1 - d)B$
N is the matrix with every element equal to 1/n.
d is a constant found by experiment.
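
A minimal sketch of this iteration in plain Python follows. The link graph, the value d = 0.15 and the convergence tolerance are assumptions chosen for illustration, and the code assumes every page has at least one outgoing link so that each column of B sums to 1.

```python
# Hypothetical link graph: page -> pages it links to (no page is a dead end).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(links, d=0.15, tol=1e-10, max_iter=1000):
    """Iterate w_k = B' w_{k-1} with B' = d*N + (1 - d)*B until w stops changing."""
    pages = sorted(links)
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}

    # Normalized link matrix: B[i][j] = 1 / outdegree(j) if page j links to page i.
    B = [[0.0] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            B[index[dst]][index[src]] += 1.0 / len(targets)

    # N has every element equal to 1/n, so d*N contributes d/n to every entry.
    B_prime = [[d / n + (1 - d) * B[i][j] for j in range(n)] for i in range(n)]

    w = [1.0 / n] * n
    for _ in range(max_iter):
        new_w = [sum(B_prime[i][j] * w[j] for j in range(n)) for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_w, w))
        w = new_w
        if change < tol:
            break
    return dict(zip(pages, w))


ranks = pagerank(links)
```

Because the columns of B' sum to 1, the weights keep summing to 1 across iterations, so they can be read as the long-run visit frequencies of the random-surfer model.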
23
Information Retrieval Using PageRank
Simple Method: Consider all hits (i.e., all
document vectors that share at least one term
with the query vector) as equal. Display the hits
ranked by PageRank. The disadvantage of this
method is that it gives no attention to how
closely a document matches a query.
With dynamic document sets, reference patterns
are calculated for a set of documents that are
selected based on each individual query.
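
The simple method can be sketched in a few lines. The page names, term lists and PageRank values below are made up for illustration.

```python
def simple_pagerank_search(query_terms, doc_terms, pagerank):
    """Treat every document sharing a term with the query as a hit,
    then order the hits by PageRank alone."""
    query = set(query_terms)
    hits = [doc for doc, terms in doc_terms.items() if query & set(terms)]
    return sorted(hits, key=lambda doc: pagerank[doc], reverse=True)


# Hypothetical data.
doc_terms = {
    "page1": ["web", "search"],
    "page2": ["web", "crawler"],
    "page3": ["library", "catalog"],
}
pagerank = {"page1": 0.2, "page2": 0.5, "page3": 0.3}

ranked_hits = simple_pagerank_search(["web", "search"], doc_terms, pagerank)
```

Here page2 is listed before page1 even though page1 matches both query terms, which illustrates the disadvantage noted above.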
24
Web Search Strategies - Federated Searching
(Figure: search client)
25
Z39.50
  • http://www.loc.gov/z3950/agency/

26
Aims of Z39.50
  • Permits one computer, the client, to search and
    retrieve information on another, the database
    server
  • Important both technically and for its wide use
    in library systems
  • Most development has concentrated on
    bibliographic data
  • Most implementations emphasize searches that use
    a bibliographic set of attributes to search
    databases of MARC records
  • Built on notion of common protocol as
    interoperability paradigm

27
Technical history
  • Z39.50
  • Developed for X.25 networks (connection
    oriented); support for running over TCP was
    fitted later
  • Original concept dates from the days when
    repeating a search was computationally
    expensive (about 1980)

28
Z39.50 principles
  • Abstract view of database searching.
  • Server stores a set of databases with searchable
    indexes
  • Interactions are based on a session
  • The client opens a connection with the server,
    carries out a sequence of interactions and then
    closes the connection.
  • During the course of the session, both the server
    and the client remember the state of their
    interaction.

29
State
  • Z39.50
  • The server carries out the search and builds a
    result set
  • Server saves the result set.
  • Subsequent messages from the client can
    reference the result set.
  • Thus the client can modify a large set by
    increasingly precise requests, or can request a
    presentation of any record in the set, without
    searching the entire database.
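
The role of server-side state can be illustrated with a toy model. This is not the Z39.50 protocol or any real Z39.50 library: the class, its method names and the record fields are all hypothetical, and positions are 0-based for simplicity.

```python
class StatefulSearchServer:
    """Hypothetical server that keeps named result sets for one session."""

    def __init__(self, records):
        self.records = records
        self.result_sets = {}  # result-set name -> list of record indexes

    def search(self, name, predicate, within=None):
        """Run a search, optionally restricted to an earlier result set."""
        if within is None:
            candidates = range(len(self.records))
        else:
            candidates = self.result_sets[within]
        self.result_sets[name] = [
            i for i in candidates if predicate(self.records[i])
        ]
        return len(self.result_sets[name])

    def present(self, name, start, count):
        """Return records from a saved result set without searching again."""
        indexes = self.result_sets[name][start:start + count]
        return [self.records[i] for i in indexes]


records = [
    {"title": "Distributed Databases", "year": 1999},
    {"title": "Web Search", "year": 2003},
    {"title": "Distributed Systems", "year": 2003},
]
server = StatefulSearchServer(records)
broad = server.search("rs1", lambda r: "Distributed" in r["title"])
narrow = server.search("rs2", lambda r: r["year"] == 2003, within="rs1")
first = server.present("rs2", 0, 1)
```

The second search only examines the two records saved in "rs1", which is the saving that motivated the stateful design.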

30
Z39.50 services
init -- client connects to the server and
exchanges initial information, e.g., preferred
message size
explain -- client inquires of the server what
databases are available for searching, the fields
that are available, the syntax and formats
supported, and other options
search -- client presents a query to a database
  • choices of syntax for specifying searches
  • only Boolean queries widely implemented
  • one or more records may be returned to the
    client
31
Z39.50 services
manipulation of result sets -- e.g., sort or
delete
present -- requests the server to send specified
records from the result set to the client in a
specified format
  • options for controlling content and formats
  • options for managing large records or large
    result sets
32
Problems with Z39.50
  • Very difficult to implement
  • There are freely available implementations, but
    they are complex
  • Outdated assumptions
  • Searching is expensive computationally
  • Bandwidth is limited (ASN.1 compression)
  • Originally designed for bibliographic record
    retrieval, and not full documents or other
    objects
  • Overspecified
  • Assumes questionable user model (stateful)

33
ZING update of Z39.50 concepts
  • http://www.loc.gov/z3950/agency/zing/zing-home.html
  • SRW (Search and Retrieval for the Web)
  • Retain basic Z39.50 concepts (state, explain,
    flexible access points or metadata formats)
  • Simplifications and modernizations (statefulness,
    use of XML/SOAP)
  • CQL (Common Query Language)
  • Expressive common query language with XML
    definition

34
Simple Digital Library Interoperability Protocol
  • http://www-diglib.stanford.edu/testbed/doc2/SDLIP/

35
SDLIP
  • Compromise between a full-scale, all-encompassing
    search middleware design such as Z39.50 and the
    "anything goes" approach typical of ad hoc
    search interface design on the web
  • Support for stateful and stateless operation by
    the server
  • Build on notion of mediators as interoperability
    paradigm
  • Support for thin clients, such as handheld
    devices
  • Developed jointly by Stanford, Berkeley, and UC
    Santa Barbara

36
SDLIP search middleware
37
SDLIP Interfaces
  • Search Interface: defines a simple query
    language; the protocol can then include other
    languages
  • Result Interface: "parking meter" metaphor
    supports varying notions of result sets
  • Source Metadata Interface: provides an extension
    mechanism through discovery of server
    capabilities

38
Result Access Interface
  • This interface allows client applications to
    access the set of result documents, wherever that
    set is maintained
  • Four services
  • getSessionInfo
  • getDocs
  • extendStateTimeout
  • cancelRequest

39
Source Metadata Interface
  • Provides information about the service and server
    itself, such as
  • Collections served
  • Collection metadata/content information
  • Searchable properties
  • Three operations
  • getInterface
  • getSubcollectionInfo
  • getPropertyInfo

40
Web Search Strategies - Metasearching
(Figure: metasearch engine)
41
What is Metasearching?
  • Given many document sources and a query, a
    metasearcher
  • Finds the good sources for the query
  • Evaluates the query at these sources
  • Merges the results from these sources

(Figure: a metasearcher in front of an existing web
application, unindexed documents, and a legacy
database / WAIS / etc.)
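
The three steps (find good sources, evaluate the query, merge the results) can be sketched as below. Everything here is a simplifying assumption for illustration: sources are modelled as in-memory dictionaries, source selection counts how many query terms appear in a source's vocabulary, and merging is plain round-robin interleaving, which avoids comparing scores across sources.

```python
from itertools import zip_longest


def make_source(name, docs):
    """Build a hypothetical source from {doc_id: list of terms}."""
    vocabulary = {t for words in docs.values() for t in words}

    def search(query_terms):
        scored = [
            (sum(words.count(t) for t in query_terms), doc)
            for doc, words in docs.items()
        ]
        return [doc for score, doc in sorted(scored, reverse=True) if score > 0]

    return {"name": name, "vocabulary": vocabulary, "search": search}


def metasearch(query_terms, sources, max_sources=2):
    # 1. Find the good sources: those covering the most query terms.
    def coverage(source):
        return sum(1 for t in query_terms if t in source["vocabulary"])

    ranked = sorted(sources, key=coverage, reverse=True)
    chosen = [s for s in ranked[:max_sources] if coverage(s) > 0]

    # 2. Evaluate the query at each chosen source.
    result_lists = [s["search"](query_terms) for s in chosen]

    # 3. Merge the ranked lists by taking one result from each in turn.
    merged = []
    for tier in zip_longest(*result_lists):
        merged.extend(doc for doc in tier if doc is not None)
    return merged


s1 = make_source("s1", {"a1": ["web", "search"], "a2": ["web"]})
s2 = make_source("s2", {"b1": ["search", "search"], "b2": ["cooking"]})
s3 = make_source("s3", {"c1": ["gardening"]})
merged_results = metasearch(["web", "search"], [s1, s2, s3])
```

Source s3 is never queried because none of the query terms appear in its vocabulary.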
42
Metasearching Issues
  • How to query different types of sources?
  • How to combine results and rankings from multiple
    data sources?

(Figure: a metasearcher sending the same query to
three kinds of sources)
  • http///getTitle? titlebiomedical
  • SELECT title FROM articles . . .
  • grep biomedical .txt
43
Metasearching Issues . . . Contd
  • How to choose among multiple data sources?
  • How to get metadata about multiple data sources?

(Figure: a metasearcher asking sources for
metadata)
  • Best: http//.?getMetaData
  • Worst: "Hi. What do you have?"
  • cat .txt
  • SELECT SCHEMA .
44
STARTS/SDARTS
  • http://sdarts.cs.columbia.edu/default.html

45
STARTS
  • Stanford Protocol Proposal for Internet Retrieval
    and Search
  • Joint work of Stanford Digital Library Project
    and Cornell Digital Library Research Group
  • SDARTS: current work at Columbia to extend
    STARTS
  • Integrate with SDLIP and metadata harvesting
    (OAI-PMH)
  • Include deep web automated content summary
    concepts

46
Different text search engines are largely
incompatible
  • Different query languages (the query-language
    problem)
  • Different ranking algorithms (the rank-merging
    problem)
  • No exported information about sources (the
    metadata problem)

47
SDLIP search middleware
48
Rank Merging
  • Return information in query result to allow rank
    merging
  • unnormalized score of the document
  • statistics about each query term

49
We cannot merge document ranks from different
sources directly
  • Search engines use different ranking algorithms
  • DB1: (doc1, 0.7), (doc2, 0.3)
  • DB2: (doc3, 1000), (doc4, 400)
  • Merged rank?
  • Some algorithms depend on the source
    characteristics

50
Extra information helps merge document ranks
meaningfully
  • Sources return query results and statistics
  • Query "distributed databases"
  • DB1: (doc1, 0.7)
  • "distributed" appears 3 times in doc1
  • "databases" appears 5 times in doc1
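
One way a metasearcher could use such statistics is to ignore the incomparable source scores and recompute a single score from the returned term counts. The sketch below assumes a simple score (the sum of log(1 + count) over the query terms). The scores and the counts for doc1 come from the slides; the counts for doc2, doc3 and doc4 are hypothetical.

```python
import math

results = [
    {"doc": "doc1", "source": "DB1", "score": 0.7,
     "term_counts": {"distributed": 3, "databases": 5}},
    # Term counts below are hypothetical, added for illustration.
    {"doc": "doc2", "source": "DB1", "score": 0.3,
     "term_counts": {"distributed": 1, "databases": 0}},
    {"doc": "doc3", "source": "DB2", "score": 1000,
     "term_counts": {"distributed": 2, "databases": 2}},
    {"doc": "doc4", "source": "DB2", "score": 400,
     "term_counts": {"distributed": 0, "databases": 2}},
]


def merged_score(result, query_terms):
    """Score a result from its term statistics, ignoring the source's own score."""
    counts = result["term_counts"]
    return sum(math.log(1 + counts.get(t, 0)) for t in query_terms)


def merge(results, query_terms):
    """Rank results from all sources on one common scale."""
    return sorted(results, key=lambda r: merged_score(r, query_terms), reverse=True)


query = ["distributed", "databases"]
merged_order = [r["doc"] for r in merge(results, query)]
```

Sorting on the raw source scores instead would put doc3 and doc4 first simply because DB2 uses a larger score range.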

51
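Motivating Source Metadata: Routing Problem -
Disjoint Search Sources
(Figure: authors and the sources that hold their
documents, with a content summary for each of the
sources I1, I2, I3; result labels: doc1, doc2,
doc8)
  • Hopcroft: I1, I3
  • Hartmanis: I3
  • Tarjan: I1, I2
  • Wilensky: I2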
52
Source Metadata
  • Data to help select the right sources for a query
  • source metadata attributes - what the source
    engine can do
  • source content summary - what the source engine
    can search
  • Simplified form of Z39.50 explain service

53
Source metadata attributes
  • Fields Supported
  • Modifiers Supported
  • Score Range
  • Ranking Algorithm ID

54
Source Content Summary
  • For each source
  • Vocabulary
  • Document frequency for each word
  • Total number of postings for each word
  • Number of documents
  • Implementation of GlOSS work
  • L. Gravano, H. Garcia-Molina, A. Tomasic,
    "GlOSS: Text-Source Discovery over the
    Internet", ACM Transactions on Database
    Systems, vol. 24, no. 2, Jun. 1999
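
As an illustration of how a content summary can guide source selection, the sketch below estimates how many documents at each source match all query terms, using only the number of documents and the per-word document frequencies. It assumes that terms occur independently of one another, which is a simplification in the spirit of GlOSS rather than a reproduction of the estimators in the cited paper. The source names and statistics are hypothetical.

```python
def estimated_matches(summary, query_terms):
    """Estimate documents containing every query term, assuming independence:
    N * product of (df(term) / N)."""
    n = summary["num_docs"]
    if n == 0:
        return 0.0
    estimate = float(n)
    for term in query_terms:
        estimate *= summary["doc_freq"].get(term, 0) / n
    return estimate


def rank_sources(summaries, query_terms):
    """Order source names by estimated number of matching documents."""
    return sorted(
        summaries,
        key=lambda name: estimated_matches(summaries[name], query_terms),
        reverse=True,
    )


# Hypothetical content summaries.
summaries = {
    "cs_library": {"num_docs": 1000,
                   "doc_freq": {"distributed": 200, "databases": 300}},
    "medical": {"num_docs": 5000,
                "doc_freq": {"distributed": 50, "databases": 100}},
    "cooking": {"num_docs": 800, "doc_freq": {"recipes": 700}},
}

query = ["distributed", "databases"]
source_order = rank_sources(summaries, query)
cs_estimate = estimated_matches(summaries["cs_library"], query)
```

A metasearcher could then send the query only to the top-ranked sources instead of to every source.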

55
Deep Web Content Summary Extraction via Focused
Query Probing