Data Integration and Information Retrieval (Transcript and Presenter's Notes)

1
Data Integration and Information Retrieval
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 455 / 555 Internet and Web Systems
  • February 18, 2008

Some slides by Berthier Ribeiro-Neto
2
Reminders / Announcements
  • Homework 3 handed out
  • Midterm on Thu 3/20, 80 minutes, closed-book
  • Past midterm available on the web site

3
Where We Left Off
  • We could use XQueries to translate data from one
    XML schema to another

4
Translating Values with a Concordance Table
Variable legend: $pid = PennID, $n = name, $s = ssn, $tr = treatment,
$f = PennID (from), $t = ssn (to)
  • for $p in doc("student.xml")/db/student,
        $pid in $p/pennid/text(), $n in $p/name/text(),
        $d in doc("dental.xml")/db/patient,
        $s in $d/ssn/text(), $tr in $d/treatment/text(),
        $m in doc("concord.xml")/db/mapping,
        $f in $m/from/text(), $t in $m/to/text()
    where $pid = $f and $s = $t
  • return <student> <name> {$n} </name>
        <treatment> {$tr} </treatment> </student>

student.xml:
  <student><pennid>12346</pennid>
    <name>Mary McDonald</name>
    <taking><sem>F03</sem>
      <class>cse330</class></taking>
  </student>

dental.xml:
  <patient><ssn>323-468-1212</ssn>
    <treatment>Dental sealant</treatment>
  </patient>

concord.xml:
  <mapping>
    <from>12346</from>
    <to>323-468-1212</to>
  </mapping>
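The same three-way concordance join can be sketched in Python; the dict records are hypothetical stand-ins for the three XML files above, and `join_via_concordance` is an illustrative name, not part of the slides:

```python
# Hypothetical records mirroring student.xml, dental.xml, and concord.xml.
students = [{"pennid": "12346", "name": "Mary McDonald"}]
patients = [{"ssn": "323-468-1212", "treatment": "Dental sealant"}]
concord = [{"from": "12346", "to": "323-468-1212"}]

def join_via_concordance(students, patients, concord):
    """Three-way join: student.pennid = mapping.from and patient.ssn = mapping.to."""
    by_from = {m["from"]: m["to"] for m in concord}
    by_ssn = {p["ssn"]: p for p in patients}
    results = []
    for s in students:
        ssn = by_from.get(s["pennid"])
        if ssn is not None and ssn in by_ssn:
            results.append({"name": s["name"],
                            "treatment": by_ssn[ssn]["treatment"]})
    return results

print(join_via_concordance(students, patients, concord))
# [{'name': 'Mary McDonald', 'treatment': 'Dental sealant'}]
```

The hash maps play the role of the equality predicates in the XQuery's `where` clause, so the join runs in one pass over the students.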
5
Drawbacks to Point-to-Point Mappings
  • They can get data from one source to another, but
    what if you want to see elements that aren't
    shared?
  • Painful to create n² mappings
  • Sometimes we don't actually want to ship the data
    from one source to another, but to see both
  • We don't want to put Barnes & Noble's inventory
    INTO Amazon's, but we want to see books from both
  • Two alternate strategies:
  • Hierarchy: map everything to a mediator
  • Peer-to-peer: map data across a web of mappings
    (PDMS; see CIS 650)

6
Data Integration and Warehousing
  • Create a middleware mediator or data
    integration system over the sources
  • All sources are mapped to a common mediated
    schema
  • Warehouse approach: actually has a central
    database, and loads data from the sources into it
  • Virtual approach: has just a schema; it consults
    the sources to answer each query
  • The mediator accepts queries over the central
    schema and returns all relevant answers
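A minimal sketch of the virtual approach in Python; the sources, mapping functions, and field names are invented for illustration. Each wrapper-like function maps its source into the mediated schema, and a query is answered against the union:

```python
# Two hypothetical sources with different native schemas.
source_a = [{"t": "Distributed Systems", "a": "Tanenbaum"}]
source_b = [{"title": "Networks", "writer": "Peterson"}]

def map_a(rows):
    """Schema mapping: source A's (t, a) -> mediated (title, author)."""
    return [{"title": r["t"], "author": r["a"]} for r in rows]

def map_b(rows):
    """Schema mapping: source B's (title, writer) -> mediated (title, author)."""
    return [{"title": r["title"], "author": r["writer"]} for r in rows]

def mediated_books():
    # The mediated relation is the union of the per-source mappings;
    # sources are consulted only when a query arrives (virtual approach).
    return map_a(source_a) + map_b(source_b)

def query(pred):
    return [b for b in mediated_books() if pred(b)]

print(query(lambda b: b["author"] == "Tanenbaum"))
```

A warehouse approach would instead materialize `mediated_books()` once into a central store and query that copy.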

7
Typical Data Integration Components
(Diagram: a query goes into, and results come out of, the Data
Integration System / Mediator, which holds the Mediated Schema and a
Source Catalog of query-based schema mappings; a wrapper sits between
the mediator and each source's data.)
8
Mediator / Virtual Integration Systems
  • The subject of much research since the '80s and
    especially the '90s
  • Examples: TSIMMIS, Information Manifold, MIX,
    Garlic, …
  • Original focus was on the Web
  • Real-world integration companies (IBM,
    BEA/Oracle, Actuate, …) are focusing on the
    enterprise more!
  • A common model:
  • Take the source data
  • Define a schema mapping that produces content for
    the mediated schema, based on the source data
  • The data for the mediated schema is the union
    of all of the mappings

9
Answering Queries
  • Based on view unfolding: composing a query and a
    view
  • The query is posed over the mediated schema:
  • for $b in document("dblp.xml")/root/book
    where $b/title/text() = "Distributed Systems" and
          $b/author/text() = "Tanenbaum"
    return $b
  • Wrappers are responsible for converting data from
    the source into a subset of the mediated schema:
  • for $c in sql("select author,year,title from
        CISbook")
    return <book> {$c/*} </book>

10
The Mediated Schema as a Union of Views from
Wrappers
  • Wrappers have names and some sort of output
    schema:
  • define function GetCISBooks() as book* {
      for $c in sql("select author,year,title from
          CISbook")
      return <book> {$c/*} </book>
    }
  • This gets unioned with the output from the other
    wrappers:
  • return <root> {
      GetCISBooks(),
      GetEEBooks()
    } </root>

(Output schema: book(author, year, title))
11
How to Answer the Query
  • Given our query:
  • for $b in document("dblp.xml")/root/book
    where $b/title/text() = "Distributed Systems" and
          $b/author/text() = "Tanenbaum"
    return $b
  • We want to find all wrapper definitions that
    output the right structure to match our query:
  • book elements with titles and authors (and any
    other attributes)

12
Query Composition with Views
  • We find all views that define book with author
    and title, and we compose the query with each of
    these
  • In our example, we find one wrapper definition
    that matches:
  • define function GetCISBooks() as book* {
      for $b in sql("select author,year,title from
          CISbook")
      return <book> {$b/*} </book>
    }
  • for $b in document("mediated-schema")/root/book
    where $b/title/text() = "Distributed Systems" and
          $b/author/text() = "Tanenbaum"
    return $b

return <root> { GetCISBooks() } </root>
13
Making It Work
  • for $b in doc(…)/root/book
    where $b/title/text() = "Dist. Systems" and
          $b/author/text() = "Tanenbaum"
    return $b

(Diagram: the query's pattern root/book(author, year, title) is matched
against the wrapper's output tree, which binds $c and emits $c/author,
$c/year, $c/title under book.)
14
The Final Step Unfolded View
  • The query and the view definition are merged (the
    view is unfolded), yielding, e.g.:
  • for $b in sql("select author,title,year from
        CISbook where author='Tanenbaum'")
    where $b/title/text() = "Distributed Systems"
    return $b
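The point of unfolding is that predicates can be pushed down into the source query, so the source ships fewer rows. A Python sketch under assumed data (`CISBOOK` and `sql` are stand-ins for the real wrapper machinery, not part of the slides):

```python
# Simulated source table behind the wrapper.
CISBOOK = [
    {"author": "Tanenbaum", "title": "Distributed Systems", "year": 2002},
    {"author": "Tanenbaum", "title": "Modern OS", "year": 2001},
    {"author": "Stevens", "title": "TCP/IP Illustrated", "year": 1994},
]

def sql(pred=None):
    """Simulated source access; `pred` is a predicate pushed into the source."""
    return [r for r in CISBOOK if pred is None or pred(r)]

# Before unfolding: the mediator filters the full view output.
naive = [b for b in sql()
         if b["title"] == "Distributed Systems" and b["author"] == "Tanenbaum"]

# After unfolding: the author predicate is pushed into the source query;
# only the title filter remains at the mediator.
unfolded = [b for b in sql(lambda r: r["author"] == "Tanenbaum")
            if b["title"] == "Distributed Systems"]

assert naive == unfolded  # same answers, less data shipped
```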

15
Summary Mapping, Integrating, and Sharing Data
  • Based on XQuery rather than XSLT
  • Views (in XQuery, functions) as the bridge
    between schemas
  • Joins and nesting are important in creating these
    views
  • Can do point-to-point mappings to exchange data
  • Very common approach: mediated schema or
    warehouse
  • Create a central schema (may be virtual)
  • Map sources to it
  • Pose queries over this
  • UDDI versus this approach?
  • What about search and its relationship to
    integration? In particular, search over Amazon,
    Google Maps, Google, Yahoo, …

16
Web Search
  • Goal is to find information relevant to a user's
    interests
  • Challenge 1: a significant amount of content on
    the web is not quality information
  • Many pages contain nonsensical rants, etc.
  • The web is full of misspellings, multiple
    languages, etc.
  • Many pages are designed not to convey information
    but to get a high ranking (e.g., search engine
    optimization)
  • Challenge 2: billions of documents
  • Challenge 3: hyperlinks encode information

17
Our Discussion of Web Search
  • Begin with traditional information retrieval
  • Document models
  • Stemming and stop words
  • Web-specific issues
  • Crawlers and robots.txt
  • Scalability
  • Models for exploiting hyperlinks in ranking
  • Google and PageRank
  • Latent Semantic Indexing

18
Information Retrieval
  • Traditional information retrieval is basically
    text search
  • A corpus or body of text documents, e.g., in a
    document collection in a library or on a CD
  • Documents are generally high-quality and designed
    to convey information
  • Documents are assumed to have no structure beyond
    words
  • Searches are generally based on meaningful
    phrases, perhaps including predicates over
    categories, dates, etc.
  • The goal is to find the document(s) that best
    match the search phrase, according to a search
    model
  • Assumptions are typically different from the Web:
    quality text, limited-size corpus, no hyperlinks

19
Motivation for Information Retrieval
  • Information Retrieval (IR) is about:
  • Representation
  • Storage
  • Organization of
  • And access to information items
  • Focus is on the user's information need rather
    than a precise query
  • "March Madness": find information on college
    basketball teams which (1) are maintained by a
    US university and (2) participate in the NCAA
    tournament
  • Emphasis is on the retrieval of information (not
    data)

20
Data vs. Information Retrieval
  • Data retrieval, analogous to database querying:
    which docs contain a set of keywords?
  • Well-defined, precise logical semantics
  • A single erroneous object implies failure!
  • Information retrieval:
  • Information about a subject or topic
  • Semantics is frequently loose; we want
    approximate matches
  • Small errors are tolerated (and in fact
    inevitable)
  • IR system:
  • Interprets contents of information items
  • Generates a ranking which reflects relevance
  • Notion of relevance is most important: it needs a
    model

21
Basic Model
(Diagram: docs are reduced to index terms; the user's information need
is expressed as a query; matching the query against each doc's index
terms yields a ranking.)
22
Information Retrieval as a Field
  • IR has addressed many issues in the last 20
    years:
  • Classification and categorization of documents
  • Systems and languages for searching
  • User interfaces and visualization of results
  • The area was seen as of narrow interest:
    libraries, mainly
  • Sea-change event: the advent of the web
  • Universal library
  • Free (low-cost) universal access
  • No central editorial board
  • Many problems in finding information; IR seen as
    key to finding the solutions!

23
The Full Info Retrieval Process
(Diagram: the full IR process. A crawler / data-access component
fetches documents (Web or DB); text processing and modeling produces a
logical view of each document, which indexing turns into an inverted
index. The browser / UI captures the user's interest as text; the same
text processing yields its logical view, and query operations produce a
query. Searching runs the query against the index, the retrieved docs
are ranked, and user feedback can refine the query.)
24
Terminology
  • IR systems usually adopt index terms to process
    queries
  • Index term:
  • a keyword or group of selected words
  • any word (more general)
  • Stemming might be used:
  • connect ← connecting, connection, connections
  • An inverted index is built for the chosen index
    terms
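A sketch of an inverted index over stemmed index terms (Python; the `stem` function is a crude suffix-stripper for illustration only — real systems use something like the Porter stemmer):

```python
def stem(word):
    """Toy stemmer: strip a few common suffixes (illustrative only)."""
    for suffix in ("ions", "ing", "ion", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_index(docs):
    """Map each stemmed index term to the set of doc ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in {stem(w) for w in text.lower().split()}:
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {1: "connecting networks", 2: "network connections", 3: "databases"}
index = build_inverted_index(docs)
print(sorted(index["connect"]))  # [1, 2]: both docs map to the stem "connect"
```

Queries are then answered by looking up each query term's posting set and combining the sets, instead of scanning the documents.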

25
What's a Meaningful Result?
  • Matching at the index-term level is quite
    imprecise
  • Users are frequently dissatisfied
  • One problem: users are generally poor at posing
    queries
  • Frequent dissatisfaction of Web users (who often
    give single-keyword queries)
  • The issue of deciding relevance is critical for
    IR systems: ranking

26
Rankings
  • A ranking is an ordering of the documents
    retrieved that (hopefully) reflects the relevance
    of the documents to the user query
  • A ranking is based on fundamental premises
    regarding the notion of relevance, such as:
  • common sets of index terms
  • sharing of weighted terms
  • likelihood of relevance
  • Each set of premises leads to a distinct IR model

27
Types of IR Models
(Diagram: IR models organized by user task: retrieval, either ad hoc or
filtering, and browsing.)
28
Classic IR Models Basic Concepts
  • Each document is represented by a set of
    representative keywords or index terms
  • An index term is a document word useful for
    remembering the document's main themes
  • Traditionally, index terms were nouns, because
    nouns have meaning by themselves
  • However, search engines assume that all words are
    index terms (full-text representation)

29
Classic IR Models Ranking
  • Not all terms are equally useful for representing
    the document contents: less frequent terms allow
    identifying a narrower set of documents
  • The importance of the index terms is represented
    by weights associated with them
  • Let:
  • ki be an index term
  • dj be a document
  • wij be a weight associated with (ki, dj)
  • The weight wij quantifies the importance of the
    index term for describing the document contents

30
Classic IR Models Notation
  • ki is an index term (keyword)
  • dj is a document
  • t is the total number of index terms
  • K = (k1, k2, …, kt) is the set of all index
    terms
  • wij > 0 is a weight associated with (ki, dj)
  • wij = 0 indicates that the term does not belong
    to the doc
  • vec(dj) = (w1j, w2j, …, wtj) is a weighted
    vector associated with the document dj
  • gi(vec(dj)) = wij is a function which returns
    the weight associated with the pair (ki, dj)

31
Boolean Model
  • Simple model based on set theory
  • Queries specified as Boolean expressions:
  • precise semantics
  • neat formalism
  • Terms are either present or absent; thus
    wij ∈ {0, 1}
  • An example query:
  • q = ka ∧ (kb ∨ ¬kc)
  • Disjunctive normal form: vec(qdnf) = (1,1,1)
    ∨ (1,1,0) ∨ (1,0,0)
  • Conjunctive component: vec(qcc) = (1,1,0)

32
Boolean Model for Similarity
  • q = ka ∧ (kb ∨ ¬kc)
  • sim(q, dj) = 1 if ∃ vec(qcc) such that
    (vec(qcc) ∈ vec(qdnf)) ∧ (∀ki,
    gi(vec(dj)) = gi(vec(qcc)));
    0 otherwise
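The Boolean similarity above can be sketched over a three-term vocabulary (Python; the vectors follow the slide's (ka, kb, kc) convention, and the DNF is exactly the one given for q = ka ∧ (kb ∨ ¬kc)):

```python
# Query q = ka AND (kb OR NOT kc) in disjunctive normal form over
# (ka, kb, kc): (1,1,1) OR (1,1,0) OR (1,0,0).
QDNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_vec, qdnf):
    """1 if the doc's binary term vector equals some conjunctive component."""
    return 1 if tuple(doc_vec) in qdnf else 0

print(sim((1, 1, 0), QDNF))  # 1: matches the component (1,1,0)
print(sim((0, 1, 1), QDNF))  # 0: ka is absent, so no component matches
```

This makes the model's drawback concrete: the answer is all-or-nothing, with no grading between the 1s and the 0s.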


33
Drawbacks of Boolean Model
  • Retrieval based on binary decision criteria with
    no notion of partial matching
  • No ranking of the documents is provided (absence
    of a grading scale)
  • Information need has to be translated into a
    Boolean expression which most users find awkward
  • The Boolean queries formulated by the users are
    most often too simplistic
  • As a consequence, the Boolean model frequently
    returns either too few or too many documents in
    response to a user query

34
Vector Model
  • A refinement of the Boolean model, which focuses
    strictly on exact matching
  • Non-binary weights provide consideration for
    partial matches
  • These term weights are used to compute a degree
    of similarity between a query and each document
  • A ranked set of documents provides for better
    matching

35
Vector Model
  • Define:
  • wij > 0 whenever ki ∈ dj
  • wiq > 0 associated with the pair (ki, q)
  • vec(dj) = (w1j, w2j, …, wtj); vec(q) =
    (w1q, w2q, …, wtq)
  • With each term ki, associate a unit vector
    vec(i)
  • The unit vectors vec(i) and vec(j) are assumed
    to be orthonormal (i.e., index terms are assumed
    to occur independently within the documents)
  • The t unit vectors vec(i) form an orthonormal
    basis for a t-dimensional space
  • In this space, queries and documents are
    represented as weighted vectors

36
Vector Model
(Diagram: document vector dj and query vector q in term space,
separated by angle θ.)

  • sim(q, dj) = cos(θ) = (vec(dj) · vec(q)) /
    (|dj| × |q|) = Σi (wij × wiq) / (|dj| × |q|)
  • Since wij > 0 and wiq > 0, 0 ≤ sim(q, dj) ≤ 1
  • A document is retrieved even if it matches the
    query terms only partially
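The cosine computation can be sketched directly (Python; the weight vectors are made up for illustration):

```python
import math

def cosine(q, d):
    """Cosine of the angle between query and document weight vectors."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm = math.sqrt(sum(w * w for w in q)) * math.sqrt(sum(w * w for w in d))
    return dot / norm if norm else 0.0

q  = [1.0, 1.0, 0.0]
d1 = [2.0, 2.0, 0.0]   # same direction as q -> similarity 1.0
d2 = [1.0, 0.0, 1.0]   # partial match: shares only the first term
print(round(cosine(q, d1), 3))  # 1.0
print(round(cosine(q, d2), 3))  # 0.5
```

Note that d2 still gets a nonzero score despite missing one query term, which is exactly the partial matching the Boolean model lacks.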
37
Weights in the Vector Model
  • sim(q, dj) = Σi (wij × wiq) / (|dj| × |q|)
  • How do we compute the weights wij and wiq?
  • A good weight must take into account two effects:
  • quantification of intra-document contents
    (similarity)
  • the tf factor: the term frequency within a
    document
  • quantification of inter-document separation
    (dissimilarity)
  • the idf factor: the inverse document frequency
  • wij = tf(i, j) × idf(i)

38
TF and IDF Factors
  • Let:
  • N be the total number of docs in the collection
  • ni be the number of docs which contain ki
  • freq(i, j) be the raw frequency of ki within dj
  • A normalized tf factor is given by:
  • f(i, j) = freq(i, j) / maxl(freq(l, j))
  • where the maximum is computed over all terms
    which occur within the document dj
  • The idf factor is computed as:
  • idf(i) = log(N / ni)
  • the log is used to make the values of tf and
    idf comparable
  • It can also be interpreted as the amount of
    information associated with the term ki
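The tf and idf factors can be computed directly from these definitions (Python sketch; the toy corpus is invented):

```python
import math

# w_ij = f(i,j) * idf(i), with f(i,j) = freq(i,j) / max_l freq(l,j)
# and idf(i) = log(N / n_i), as defined above.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "cats and dogs".split(),
]
N = len(docs)

def weight(term, doc):
    freq = doc.count(term)
    if freq == 0:
        return 0.0
    f = freq / max(doc.count(t) for t in set(doc))      # normalized tf
    n_i = sum(1 for d in docs if term in d)             # doc frequency
    return f * math.log(N / n_i)                        # tf * idf

# "cat" occurs in 1 of 3 docs (high idf); "the" occurs in 2 of 3 (low idf),
# so despite "the" being more frequent in doc 0, "cat" can outweigh it.
print(weight("cat", docs[0]), weight("the", docs[0]))
```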

39
Vector Model: Example I
40
Vector Model: Example II
41
Vector Model: Example III
42
Vector Model, Summarized
  • The best term-weighting schemes use tf-idf
    weights:
  • wij = f(i, j) × log(N / ni)
  • For the query term weights, a suggestion is:
  • wiq = (0.5 + 0.5 × freq(i, q) /
    maxl(freq(l, q))) × log(N / ni)
  • This model is very good in practice:
  • tf-idf works well with general collections
  • Simple and fast to compute
  • The vector model is usually as good as the known
    ranking alternatives
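Putting the document weighting, the suggested query weighting, and cosine ranking together (Python sketch; the corpus and query are invented for illustration):

```python
import math

# w_ij = f(i,j) * log(N/n_i) for documents, and
# w_iq = (0.5 + 0.5 * freq(i,q)/max_l freq(l,q)) * log(N/n_i) for queries,
# combined with cosine similarity, as on the slide.
docs = {
    "d1": "information retrieval ranks information".split(),
    "d2": "database systems store data".split(),
    "d3": "retrieval of data".split(),
}
N = len(docs)
vocab = sorted({t for d in docs.values() for t in d})

def n(term):
    return sum(1 for d in docs.values() if term in d)

def doc_vec(d):
    mx = max(d.count(t) for t in set(d))
    return [(d.count(t) / mx) * math.log(N / n(t)) if t in d else 0.0
            for t in vocab]

def query_vec(q):
    mx = max(q.count(t) for t in set(q))
    return [(0.5 + 0.5 * q.count(t) / mx) * math.log(N / n(t))
            if t in q else 0.0 for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

q = "information retrieval".split()
ranked = sorted(docs, key=lambda k: cosine(query_vec(q), doc_vec(docs[k])),
                reverse=True)
print(ranked)  # d1 first: it contains both query terms
```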

43
Pros & Cons of the Vector Model
  • Advantages:
  • term-weighting improves quality of the answer set
  • partial matching allows retrieval of docs that
    approximate the query conditions
  • the cosine ranking formula sorts documents
    according to degree of similarity to the query
  • Disadvantages:
  • assumes independence of index terms; not clear
    whether this is a good or bad assumption

44
Comparison of Classic Models
  • Boolean model does not provide for partial
    matches and is considered to be the weakest
    classic model
  • Some experiments indicate that the vector model
    outperforms the third alternative, the
    probabilistic model, in general
  • Recent IR research has focused on improving
    probabilistic models, but these haven't made
    their way to Web search
  • Generally we use a variation of the vector model
    in most text search systems