CSE490i Advanced Internet Systems - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: CSE490i Advanced Internet Systems


1
Topic 3: Finding, Representing & Exploiting Structure
  • Getting Structure
  • Allow structure specification languages → XML?
    (More structured than text and less structured
    than databases)
  • If structure is not explicitly specified (or is
    obfuscated), can we extract it? → Wrapper
    generation/Information Extraction
  • Using Structure
  • For retrieval → Extend IR techniques to use the
    additional structure
  • For query processing (Joins/Aggregations etc)
    → Extend database techniques to use the partial
    structure
  • For reasoning with structured knowledge
    → Semantic web ideas..
  • Structure in the context of multiple sources
  • How to align structure
  • How to support integrated querying on
    pages/sources (after alignment)
2
Structure
An employee record
A generic web page containing text
A movie review
  • How will search and querying on these three types
    of data differ?

Semi-Structured
3
Structure helps querying
  • Expressive queries
  • Give me all pages that have the keywords "Get Rich
    Quick"
  • Give me the social security numbers of all the
    employees who have stayed with the company for
    more than 5 years, and whose yearly salaries are
    three standard deviations away from the average
    salary
  • Give me all mails from people from ASU written
    this year, which are relevant to "get rich quick"

4
Adapting old disciplines for Web-age
  • Information (text) retrieval
  • Scale of the web
  • Hyper text/ Link structure
  • Authority/hub computations
  • Databases
  • Multiple databases
  • Heterogeneous, access limited, partially
    overlapping
  • Network (un)reliability
  • Data mining: Machine Learning/Statistics/Databases
  • Learning patterns from large scale data

5
Why do we care about databases?
  • Three reasons
  • Deep web is all databases
  • We can do better with structured data
  • Exposing databases on web changes their
    clientele..

6
Deep Web is databases..
  • The crawlable web pages are just the tip of a
    huge iceberg that is the deep web
  • Many web sites have huge backend databases that
    generate pages dynamically in response to queries
  • Airline fare databases, newspaper classifieds,
    etc.
  • By some estimates, deep web is 2 orders of
    magnitude bigger than the shallow (html page)
    web
  • We need to exploit deep web
  • Crawl/index deep web
  • Select databases relevant to a query
  • Provide information aggregation/integration
    services over deep web databases
  • ..and all the big kids are trying to gobble up
    anyone who is even going through the motions of
    doing these..
  • which leads to several DB challenges not
    addressed in traditional DBs
  • Wrapper generation
  • Schema mapping/Structure alignment
  • (automated) form filling
  • Query optimization
  • Learning source profiles

7
Databases offer lessons on exploiting structure
  • We argued that structure (and semantics) help
    querying
  • If there is structure (as in databases) we can
    exploit it
  • Databases are an existing technology for
    exploiting some forms of structure
  • SQL may not look like much, but it is more
    expressive than keyword queries!
  • If not, we can extract structure and then exploit
    it
  • Challenges
  • Techniques for extracting information (NLP-lite)
  • Languages for representing/handling
    Semi-structured data
  • Standards for supporting/exploiting semantic
    tagging

8
Before we play havoc with databases, let's
quickly review the traditional art of db
management, so we know all that needs to change
9
This Day in History
  • 1867 US purchases Alaska from Russia for
    $7.2 million (2 cents/acre)
  • 1953 Einstein announces revised unified field
    theory
  • 1954 Test Cricket debut of Sir Garry Sobers
    vs. England
  • 1981 President Reagan shot and wounded by John
    W. Hinckley Jr.
  • 2004 The first ever regular class of Rao
    taught by someone other than Rao

10
Structured data..
  • Focus on text data till date.
  • However, a lot of the data available on the web
    is actually from (semi-)structured databases !!!!
  • They do their best to look like they are text
    sources
  • What are the issues and opportunities brought up
    by the presence of such sources on the web?

11
Databases !!!??? you may have used
12
Is the Web a DBMS?
Skeptics corner
  • Fairly sophisticated search available
  • crawler indexes pages on the web
  • Keyword-based search for pages
  • But, currently
  • data is mostly unstructured and untyped
  • search only
  • can't modify the data
  • can't get summaries, complex combinations of data
  • Web sites typically have a DBMS in the background
    to provide these functions.
  • They dynamically convert (wrap) the structured
    data into readable English
  • <India, New Delhi> => The capital of India is
    New Delhi.
  • So, if we can unwrap the text, we have
    structured data!
  • Note also that such dynamic pages cannot be
    crawled...
  • The (coming) Semi-structured web
  • Most pages are at least semi-structured
  • XML standard is expected to ease the
    presentation/on-the-wire transfer of such pages.
    (BUT..)
  • The Services
  • Travel services, mapping services
  • The Sensors

13
Structure
An employee record
A generic web page containing text
A movie review
  • How will search and querying on these three types
    of data differ?

Semi-Structured
14
Search vs. Query
  • What if you wanted to find out which actors
    donated to Al Gore's presidential campaign?
  • Try "actors donated to gore" in your favorite
    search engine.

15
Structure helps querying
  • Expressive queries
  • Give me all pages that have the keywords "Get Rich
    Quick"
  • Give me the social security numbers of all the
    employees who have stayed with the company for
    more than 5 years, and whose yearly salaries are
    three standard deviations away from the average
    salary
  • Give me all mails from people from ASU written
    this year, which are relevant to "get rich quick"
  • Efficient searching
  • equality vs. similarity
  • range-limited search

16
Why use a DBMS in your website?
  • Suppose we are building a web-based music
    distribution site.
  • Several questions arise
  • How do we store the data? (file organization,
    etc.)
  • How do we query the data? (write programs)
  • How do we make sure that updates don't mess things up?
  • Provide different views on the data? (registrar
    versus students)
  • How do we deal with crashes?
  • Way too complicated!
  • Buy a database system!

17
What Is a Database System?
  • Database: a very large, integrated collection of
    data.
  • Models a real-world enterprise
  • Entities (e.g., teams, games)
  • Relationships
  • (e.g., The Patriots are playing in The
    Superbowl)
  • More recently, also includes active components ,
    often called business logic. (e.g., the BCS
    ranking system)
  • A Database Management System (DBMS) is a software
    system designed to store, manage, and facilitate
    access to databases.

18
Functionality of a DBMS
  • Data Dictionary Management
  • Storage management
  • Data storage Definition Language (DDL)
  • High level query and data manipulation language
  • SQL/XQuery etc.
  • May tell us what we are missing in text-based
    search
  • Efficient query processing
  • May change in the internet scenario
  • Transaction processing
  • Resiliency: recovery from crashes
  • Different views of the data, security
  • May be useful to model a collection of databases
    together
  • Interface with programming languages

19
Traditional Database Architecture
20
Building an Application with a Database System
  • Requirements modeling (conceptual, pictures)
  • Decide what entities should be part of the
    application and how they should be linked.
  • Schema design and implementation
  • Decide on a set of tables, attributes.
  • Define the tables in the database system.
  • Populate database (insert tuples).
  • Write application programs using the DBMS
  • Now much easier, with data management API

21

Conceptual Modeling
ssn
22
Data Models
  • A data model is a collection of concepts for
    describing data.
  • A schema is a description of a particular
    collection of data, using a given data model.
  • The relational model of data is the most widely
    used model today.
  • Main concept: relation, basically a table with
    rows and columns.
  • Every relation has a schema, which describes the
    columns, or fields.

23
Levels of Abstraction
  • Views describe how users see the data.
  • Conceptual schema defines logical structure
  • Physical schema describes the files and indexes
    used.

24
Example: University Database
  • Conceptual schema
  • Students(sid: string, name: string,
    login: string, age: integer, gpa: real)
  • Courses(cid: string, cname: string,
    credits: integer)
  • External Schema (View)
  • Course_info(cid: string, enrollment: integer)
  • Physical schema
  • Relations stored as unordered files.
  • Index on first column of Students.

If five people are asked to come up with a schema
for the data, what are the odds that they will
come up with the same schema?
25
Data Independence
  • Applications insulated from
  • how data is structured and stored.
  • Logical data independence: Protection from
    changes in logical structure of data.
  • Physical data independence: Protection from
    changes in physical structure of data.
  • Q: Why are these particularly important for DBMS?

26
Schema Design & Implementation
  • Table Students
  • Separates the logical view from the physical view
    of the data.

27
Terminology
Attribute names
Students
tuples
(Arity = 3)
28
Querying a Database
  • Find all the students taking CSE594 in Q1, 2004
  • S(tructured) Q(uery) L(anguage)
  • select E.name
  • from Enroll E
  • where E.course = 'CS490i' and
  • E.quarter = 'Winter, 2000'
  • Query processor figures out how to answer the
    query efficiently.
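
The declarative query on this slide can be tried against an in-memory SQLite database. This is only a minimal sketch: the column layout of Enroll (name, course, quarter), the sample rows, and the helper name `students_in` are assumptions made for illustration; only the select/from/where shape comes from the slide.

```python
import sqlite3


def students_in(course, quarter):
    """Return names of students enrolled in `course` during `quarter`."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema and rows, invented for this sketch.
    conn.execute("CREATE TABLE Enroll (name TEXT, course TEXT, quarter TEXT)")
    conn.executemany(
        "INSERT INTO Enroll VALUES (?, ?, ?)",
        [
            ("Alice", "CS490i", "Winter, 2000"),
            ("Bob", "CS490i", "Spring, 2000"),
            ("Carol", "CS101", "Winter, 2000"),
        ],
    )
    # We state WHAT we want; the query processor decides HOW to find it.
    rows = conn.execute(
        "SELECT E.name FROM Enroll E WHERE E.course = ? AND E.quarter = ?",
        (course, quarter),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


print(students_in("CS490i", "Winter, 2000"))
```

Note that the program never says which file to scan or which index to use; that choice is left to the DBMS.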

29
Relational Algebra
  • Operators
  • tuple sets as input, new set as output
  • Basic Binary Set Operators
  • Result is table (set) with same attributes
  • Sets must be compatible!
  • R1(A1,A2,A3) and R2(B1,B2,B3) are compatible
    when Domain(Ai) = Domain(Bi) for each i
  • Union
  • All tuples in either R1 or in R2
  • Intersection
  • All tuples in both R1 and R2
  • Difference
  • All tuples in R1 but not in R2
  • Complement
  • All tuples not in R1
  • Selection, Projection, Cartesian Product, Join

What's the universe?
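
The set operators map directly onto Python sets of tuples. The sketch below uses two made-up relations over an assumed (name, dept) schema. For complement it assumes the universe is every combination of values that actually occurs in the data (often called the active domain); that is one possible answer to the question about the universe, not the only one.

```python
# Two compatible relations: same arity, same attribute domains.
R1 = {("John", 1), ("Tony", 1), ("Alice", 2)}
R2 = {("Alice", 2), ("Bob", 3)}

union = R1 | R2          # tuples in either R1 or R2
intersection = R1 & R2   # tuples in both R1 and R2
difference = R1 - R2     # tuples in R1 but not in R2

# Complement only makes sense relative to a universe.
# Assumption: the universe is the cross product of all observed values.
names = {t[0] for t in R1 | R2}
depts = {t[1] for t in R1 | R2}
universe = {(n, d) for n in names for d in depts}
complement_R1 = universe - R1
```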
30
Selection σ
  • Grab a subset of the tuples in a relation that
    satisfy a given condition
  • Use and, or, not, >, < to build condition
  • Unary operation: returns set with same
    attributes, but selects rows

31
Selection Example
Employee
SSN        Name   DepartmentID  Salary
999999999  John   1             30,000
777777777  Tony   1             32,000
888888888  Alice  2             45,000
32
Projection π
  • Unary operation, selects columns
  • Returned schema is different,
  • So returned tuples are not subset of original set
  • Contrast with selection
  • Eliminates duplicate tuples
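
Selection and projection can be sketched over the Employee rows from the earlier example slide. The helper names `select` and `project` and the sample condition (Salary > 31000) are assumptions for illustration; the condition used on the original slide did not survive in the transcript.

```python
EMPLOYEE = [
    {"SSN": "999999999", "Name": "John", "DepartmentID": 1, "Salary": 30000},
    {"SSN": "777777777", "Name": "Tony", "DepartmentID": 1, "Salary": 32000},
    {"SSN": "888888888", "Name": "Alice", "DepartmentID": 2, "Salary": 45000},
]


def select(rows, predicate):
    """Selection: keep whole rows that satisfy the condition."""
    return [row for row in rows if predicate(row)]


def project(rows, attrs):
    """Projection: keep only the named columns and drop duplicate tuples."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result


well_paid = select(EMPLOYEE, lambda r: r["Salary"] > 31000)
departments = project(EMPLOYEE, ["DepartmentID"])
```

Selection returns rows with the original schema, while projection returns a new schema and may return fewer tuples than it was given, because duplicates collapse.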

33
(No Transcript)
34
Cartesian Product ×
  • Binary Operation
  • Result is set of tuples combining all elements of
    R1 with all elements of R2, for R1 × R2
  • Schema is union of Schema(R1) and Schema(R2)
  • Notice we could do selection on result to get
    meaningful info!

35
Cartesian Product Example
36
Join
  • Most common (and exciting!) operator
  • Combines 2 relations
  • Selecting only related tuples
  • Result has all attributes of the two relations
  • Equivalent to
  • Cross product followed by selection followed by
    Projection
  • Equijoin
  • Join condition is equality between two attributes
  • Natural join
  • Equijoin on attributes of same name
  • result has only one copy of join condition
    attribute
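
The equivalence stated above (cross product, then selection, then projection) can be written out literally. This is a deliberately naive sketch, not how a real DBMS evaluates joins; the relations and the function name `natural_join` are made up for the example.

```python
from itertools import product


def natural_join(r1, r2):
    """Natural join as cross product + equality selection + projection."""
    if not r1 or not r2:
        return []
    # Attributes that appear in both schemas form the join condition.
    shared = [a for a in r1[0] if a in r2[0]]
    joined = []
    for t1, t2 in product(r1, r2):                 # cross product
        if all(t1[a] == t2[a] for a in shared):    # selection on equality
            joined.append({**t1, **t2})            # one copy of shared attrs
    return joined


employees = [
    {"Name": "John", "DepartmentID": 1},
    {"Name": "Alice", "DepartmentID": 2},
    {"Name": "Zed", "DepartmentID": 9},
]
departments = [
    {"DepartmentID": 1, "DName": "Toys"},
    {"DepartmentID": 2, "DName": "Admin"},
]
result = natural_join(employees, departments)
```

Zed disappears from the result because no department tuple is related to it, which is the "selecting only related tuples" behaviour described above.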

37
Example Natural Join
38
Complex Queries
Product (pname, price, category, maker)
Purchase (buyer, seller, store, prodname)
Company (cname, stock price, country)
Person (per-name, phone number, city)
Find phone numbers of people who bought gizmos
from Fred.
Find telephony products that somebody bought.
39
Exercises
Product (pname, price, category, maker)
Purchase (buyer, seller, store, prodname)
Company (cname, stock price, country)
Person (per-name, phone number, city)
Ex 1: Find people who bought telephony products.
Ex 2: Find names of people who bought American
products.
Ex 3: Find names of people who bought American
products and did not buy French products.
Ex 4: Find names of people who bought American
products and live in Seattle.
Ex 5: Find people who bought stuff from Joe or
bought products from a company whose stock price
is more than 50.
40
SQL Introduction
Standard language for querying and manipulating
data: Structured Query Language.
Many standards out there: SQL92, SQL2, SQL3,
SQL99.
Vendors support various subsets of these (but
we'll only discuss a subset of what they support).
Basic form, with syntax based on relational
algebra (but many other features too):
Select attributes
From relations (possibly multiple, joined)
Where conditions (selections)
41
Selections σ
SELECT *
FROM Company
WHERE country = 'USA' AND stockPrice > 50
You can use:
Attribute names of the relation(s) used in the
FROM.
Comparison operators: =, <>, <, >, <=, >=
Apply arithmetic operations: stockPrice * 2
Operations on strings (e.g., || for concatenation).
Lexicographic order on strings.
Pattern matching: s LIKE p
Special stuff for comparing dates and times.
42
Projection π
Select only a subset of the attributes:
SELECT name, stockPrice
FROM Company
WHERE country = 'USA' AND stockPrice > 50
Rename the attributes in the resulting table:
SELECT name AS company, stockPrice AS price
FROM Company
WHERE country = 'USA' AND stockPrice > 50
43
Ordering the Results
SELECT name, stockPrice
FROM Company
WHERE country = 'USA' AND stockPrice > 50
ORDER BY country, name
Ordering is ascending, unless you specify the
DESC keyword. Ties are broken by the second
attribute on the ORDER BY list, etc.
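
A small runnable sketch of renaming and ordering, using SQLite. The Company rows are invented, and the country filter from the slide is dropped here so that the two-level ordering (country ascending, then name descending) is actually visible in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Company (name TEXT, stockPrice REAL, country TEXT)")
conn.executemany(
    "INSERT INTO Company VALUES (?, ?, ?)",
    [
        ("GizmoWorks", 25, "USA"),   # hypothetical sample rows
        ("Canon", 65, "Japan"),
        ("Hitachi", 55, "Japan"),
        ("Acme", 75, "USA"),
    ],
)

# Ascending by country; ties broken by name, descending because of DESC.
rows = conn.execute(
    "SELECT name AS company, stockPrice AS price "
    "FROM Company WHERE stockPrice > 50 "
    "ORDER BY country, name DESC"
).fetchall()
companies = [company for company, _price in rows]
conn.close()
```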
44
Join
SELECT per-name, store
FROM Person, Purchase
WHERE per-name = buyer AND city = 'Seattle'
  AND product = 'gizmo'
Product (pname, price, category, maker)
Purchase (buyer, seller, store, product)
Company (cname, stock price, country)
Person (per-name, phone number, city)

45
Disambiguating Attributes
Find names of people buying telephony products:
SELECT Person.name
FROM Person, Purchase, Product
WHERE Person.name = buyer
  AND product = Product.name
  AND Product.category = 'telephony'
Product (name, price, category, maker)
Purchase (buyer, seller, store, product)
Person (name, phone number, city)
46
Tuple Variables
Find pairs of companies making products in the
same category:
SELECT product1.maker, product2.maker
FROM Product AS product1, Product AS product2
WHERE product1.category = product2.category
  AND product1.maker <> product2.maker
Product (name, price, category, maker)
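
The tuple-variable query can be run as a self-join in SQLite. The rows below are made up, and DISTINCT and ORDER BY are additions of this sketch (to get a small, deterministic result); they are not part of the slide's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (name TEXT, price REAL, category TEXT, maker TEXT)"
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [
        ("gizmo", 19.99, "gadgets", "GizmoWorks"),
        ("powergizmo", 29.99, "gadgets", "GizmoWorks"),
        ("widget", 9.99, "gadgets", "WidgetCo"),
        ("phone", 99.0, "telephony", "Hitachi"),
    ],
)

# product1 and product2 range independently over the same table.
pairs = conn.execute(
    "SELECT DISTINCT product1.maker, product2.maker "
    "FROM Product AS product1, Product AS product2 "
    "WHERE product1.category = product2.category "
    "AND product1.maker <> product2.maker "
    "ORDER BY 1, 2"
).fetchall()
conn.close()
```

Each pair shows up in both orders, because nothing in the condition breaks the symmetry between the two tuple variables.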
47
Exercises
Product (pname, price, category, maker)
Purchase (buyer, seller, store, product)
Company (cname, stock-price, country)
Person (per-name, phone number, city)
Ex 1: Find people who bought telephony products.
Ex 2: Find names of people who bought American
products.
Ex 3: Find names of people who bought American
products and did not buy French products.
Ex 4: Find names of people who live in Seattle
and who bought American products.
Ex 5: Find people who bought stuff from Joe or
bought products from a company whose stock price
is more than 50.
48
Views
49
Defining Views
(Virtual) views are "macro" relations defined
in terms of base relations (they may or may not
be physically stored). They are used mostly in
order to simplify complex queries and to define
conceptually different views of the database to
different classes of users.
View: purchases of telephony products
CREATE VIEW telephony-purchases AS
SELECT product, buyer, seller, store
FROM Purchase, Product
WHERE Purchase.product = Product.name
  AND Product.category = 'telephony'
50
A Different View
CREATE VIEW Seattle-view AS
SELECT buyer, seller, product, store
FROM Person, Purchase
WHERE Person.city = 'Seattle'
  AND Person.name = Purchase.buyer
We can later use the views:
SELECT name, store
FROM Seattle-view, Product
WHERE Seattle-view.product = Product.name
  AND Product.category = 'shoes'
What's really happening when we query a view??
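
One way to see what happens when a view is queried: the view name behaves like a macro that is replaced by its defining query. The sketch below, using SQLite, checks that a query over the view returns the same rows as the hand-expanded query over the base tables. The data is invented, and the view is named `Seattle_view` with an underscore because a hyphen is not valid in an unquoted SQL identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Person (name TEXT, phone TEXT, city TEXT);
    CREATE TABLE Purchase (buyer TEXT, seller TEXT, store TEXT, product TEXT);
    CREATE TABLE Product (name TEXT, price REAL, category TEXT, maker TEXT);
    INSERT INTO Person VALUES ('Ann', '555-0100', 'Seattle');
    INSERT INTO Person VALUES ('Bo', '555-0101', 'Tempe');
    INSERT INTO Purchase VALUES ('Ann', 'Joe', 'Nine West', 'loafer');
    INSERT INTO Purchase VALUES ('Bo', 'Joe', 'Nine West', 'loafer');
    INSERT INTO Purchase VALUES ('Ann', 'Fred', 'The Bon Marche', 'mug');
    INSERT INTO Product VALUES ('loafer', 60, 'shoes', 'Acme');
    INSERT INTO Product VALUES ('mug', 8, 'kitchen', 'Denby');
    CREATE VIEW Seattle_view AS
        SELECT buyer, seller, product, store
        FROM Person, Purchase
        WHERE Person.city = 'Seattle' AND Person.name = Purchase.buyer;
    """
)

# Query written against the view.
via_view = conn.execute(
    "SELECT buyer, store FROM Seattle_view, Product "
    "WHERE Seattle_view.product = Product.name "
    "AND Product.category = 'shoes'"
).fetchall()

# Same query with the view definition substituted in by hand.
expanded = conn.execute(
    "SELECT Purchase.buyer, Purchase.store "
    "FROM Person, Purchase, Product "
    "WHERE Person.city = 'Seattle' AND Person.name = Purchase.buyer "
    "AND Purchase.product = Product.name "
    "AND Product.category = 'shoes'"
).fetchall()
conn.close()
```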
51
Updating Views
How can I insert a tuple into a table that
doesn't exist?
CREATE VIEW bon-purchase AS
SELECT store, seller, product
FROM Purchase
WHERE store = 'The Bon Marche'
If we make the following insertion:
INSERT INTO bon-purchase
VALUES ('the Bon Marche', 'Joe', 'Denby Mug')
We can simply add a tuple
(NULL, 'Joe', 'the Bon Marche', 'Denby Mug')
to relation Purchase, leaving the buyer unknown.
52
Non-Updatable Views
Given
Purchase (buyer, seller, store, product)
Person (name, phone-num, city)
CREATE VIEW Seattle-view AS
SELECT seller, product, store
FROM Person, Purchase
WHERE Person.city = 'Seattle'
  AND Person.name = Purchase.buyer
Why non-updatable?
How can we add the following tuple to the view?
('Joe', 'Shoe Model 12345', 'Nine West')
53
Materialized Views
  • Views whose corresponding queries have been
    executed and the data is stored in a separate
    database
  • Uses: Caching
  • Issues
  • Using views in answering queries
  • Normally, the views are available in addition to
    database
  • (so, views are local caches)
  • In information integration, views may be the only
    things we have access to.
  • An internet source that specializes in Woody
    Allen movies can be seen as a view on a database
    of all movies. Except, there is no database out
    there which contains all movies..
  • Maintaining consistency of materialized views

54
Issues w.r.t. Databases on the Web
  • Information Extraction (invert the tuple to text
    transformation)
  • Support lay user queries
  • More flexible queries
  • Exact (SQL) vs Approximate/Similar (Text search?)
  • On semi-structured databases
  • Joins over text attributes?
  • Exact (SQL) vs Approximate/Similar !!!!!
  • Support integration/aggregation of multiple
    databases
  • Take a query from the user and send it to all
    relevant databases
  • TONS of challenges

55
Imprecise Queries
  • Increasing number of Web accessible databases
  • E.g. bibliographies, reservation systems,
    department catalogs etc
  • Support for precise queries only: exactly
    matching tuples
  • Difficulty in extracting desired information
  • Limited query capabilities provided by form-based
    query interface
  • Lack of schema/domain information
  • Increasing complexity of types of data, e.g.
    hypertext, images etc
  • Oftentimes the user wants "about the same" instead
    of "exact"
  • Bibliography search: find similar publications

Solution: Provide answers closely matching query
constraints
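
To make the contrast with exact matching concrete, here is a toy sketch that ranks titles by word overlap instead of testing equality. Jaccard similarity over lowercase tokens is an assumption chosen for brevity, not a technique named in the slides, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings, in [0, 1]."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similar_titles(query, titles, k=2):
    """Return the k titles that most closely match the query."""
    ranked = sorted(titles, key=lambda t: jaccard(query, t), reverse=True)
    return ranked[:k]


TITLES = [
    "Foundations of Databases",
    "Data on the Web",
    "Foundations of Statistics",
]

# An exact-match lookup for this string finds nothing...
exact = [t for t in TITLES if t == "Databases foundations"]
# ...while a similarity ranking still surfaces the likely target.
best = similar_titles("Databases foundations", TITLES, k=1)
```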
56
Query Optimization
57
Query Optimization
Goal: Declarative SQL query → Imperative query
execution plan
SELECT P.buyer FROM Purchase P, Person Q WHERE
P.buyer = Q.name AND Q.city = 'seattle' AND
Q.phone > 5430000
  • Inputs
  • the query
  • statistics about the data (indexes,
    cardinalities, selectivity factors)
  • available memory

Ideally: Want to find best plan. Practically:
Avoid worst plans!
58
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sid = S.sid AND R.bid = 100 AND S.rating > 5
  • Goal of optimization: To find more efficient
    plans that compute the same answer.

[Figure: query plan tree, top to bottom]
project sname (on-the-fly)
select rating > 5 (on-the-fly)
join sid = sid (with pipelining)
select bid = 100 on Reserves (use hash index; do
not write result to temp), joined with Sailors
59
Optimizing Joins
  • Q(u,x) :- R(u,v), S(v,w), T(w,x)
  • R ⋈ S ⋈ T
  • Many ways of doing a single join R ⋈ S
  • Symmetric vs. asymmetric join operations
  • Nested join, hash join, double pipe-lined hash
    join etc.
  • Processing costs alone vs. processing + transfer
    costs
  • Get R and S together vs. get R, get just the
    tuples of S that will join with R (semi-join)
  • Many orders in which to do the join
  • (R join S) join T
  • (S join R) join T
  • (T join S) join R etc.
  • All with different costs

60
Determining Join Order
  • In principle, we need to consider all possible
    join orderings
  • As the number of joins increases, the number of
    alternative plans grows rapidly; we need to
    restrict the search space.
  • System-R: consider only left-deep join trees.
  • Left-deep trees allow us to generate all fully
    pipelined plans: intermediate results are not
    written to temporary files.
  • Not all left-deep trees are fully pipelined
    (e.g., SM join).

61
Query Optimization Process (simplified a bit)
  • Parse the SQL query into a logical tree
  • identify distinct blocks (corresponding to nested
    sub-queries or views).
  • Query rewrite phase
  • apply algebraic transformations to yield a
    cheaper plan.
  • Merge blocks and move predicates between blocks.
  • Optimize each block: join ordering.
  • Complete the optimization: select scheduling
    (pipelining strategy).

62
Cost Estimation
  • For each plan considered, must estimate cost
  • Must estimate cost of each operation in plan
    tree.
  • Depends on input cardinalities.
  • Must estimate size of result for each operation
    in tree!
  • Use information about the input relations.
  • For selections and joins, assume independence of
    predicates.
  • System R cost estimation approach.
  • Very inexact, but works ok in practice.
  • More sophisticated techniques known now.

63
Key Lessons in Optimization
  • There are many approaches and many details to
    consider in query optimization
  • Classic search/optimization problem!
  • Not completely solved yet!
  • Main points to take away are
  • Algebraic rules and their use in transformations
    of queries.
  • Deciding on join ordering System-R style
    (Selinger style) optimization.
  • Estimating cost of plans and sizes of
    intermediate results.

64
Concurrency Control
  • Concurrent execution of user programs is key to
    good DBMS performance.
  • Disk accesses are frequent and pretty slow
  • Keep the CPU working on several programs
    concurrently.
  • Interleaving actions of different programs means
    trouble!
  • e.g., account-transfer and print statement at the
    same time
  • DBMS ensures such problems don't arise.
  • Users/programmers can pretend they are using a
    single-user system. (called "Isolation")
  • Thank goodness! Don't have to program "very,
    very carefully".

65
Transactions: ACID Properties
  • Key concept is a transaction: a sequence of
    database actions (reads/writes).
  • DBMS ensures atomicity (all-or-nothing property)
    even if system crashes in the middle of a Xact.
  • Each transaction, executed completely, must take
    the DB between consistent states or must not run
    at all.
  • DBMS ensures that concurrent transactions appear
    to run in isolation.
  • DBMS ensures durability of committed Xacts even
    if system crashes.
  • Note: can specify simple integrity constraints on
    the data. The DBMS enforces these.
  • Beyond this, the DBMS does not understand the
    semantics of the data.
  • Ensuring that a single transaction (run alone)
    preserves consistency is largely the user's
    responsibility!

66
Scheduling Concurrent Transactions
  • DBMS ensures that execution of T1, ... , Tn is
    equivalent to some serial execution T1 ... Tn.
  • Before reading/writing an object, a transaction
    requests a lock on the object, and waits till the
    DBMS gives it the lock. All locks are held
    until the end of the transaction. (Strict 2PL
    locking protocol.)
  • Idea: If an action of Ti (say, writing X) affects
    Tj (which perhaps reads X), say Ti obtains the
    lock on X first, so Tj is forced to wait until
    Ti completes. This effectively orders the
    transactions.
  • What if Tj already has a lock on Y and Ti
    later requests a lock on Y? (Deadlock!) Ti or Tj
    is aborted and restarted!
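
The waiting and deadlock behaviour can be sketched with a toy lock table. This is a heavy simplification under stated assumptions: exclusive locks only, no wake-up of waiters, and deadlock detected by following a waits-for chain at request time. The class and method names are hypothetical and do not correspond to any real DBMS API.

```python
class LockManager:
    """Toy strict-2PL lock table with exclusive locks only."""

    def __init__(self):
        self.owner = {}      # object -> transaction holding its lock
        self.waits_for = {}  # blocked transaction -> transaction it waits on

    def request(self, txn, obj):
        holder = self.owner.get(obj)
        if holder is None or holder == txn:
            self.owner[obj] = txn
            self.waits_for.pop(txn, None)
            return "granted"
        # Someone else holds the lock: txn would have to wait.
        self.waits_for[txn] = holder
        if self._waits_on_itself(txn):
            del self.waits_for[txn]
            return "deadlock"
        return "wait"

    def _waits_on_itself(self, start):
        """Follow the waits-for chain and report whether it loops back."""
        seen = set()
        current = self.waits_for.get(start)
        while current is not None and current not in seen:
            if current == start:
                return True
            seen.add(current)
            current = self.waits_for.get(current)
        return False

    def release_all(self, txn):
        """Strict 2PL: every lock is released together, at commit or abort."""
        for obj in [o for o, t in self.owner.items() if t == txn]:
            del self.owner[obj]
        self.waits_for.pop(txn, None)


lm = LockManager()
steps = [
    lm.request("Ti", "X"),   # Ti locks X
    lm.request("Tj", "Y"),   # Tj locks Y
    lm.request("Tj", "X"),   # Tj must wait for Ti
    lm.request("Ti", "Y"),   # Ti would wait for Tj: a cycle
]
lm.release_all("Ti")         # abort Ti, releasing X
retry = lm.request("Tj", "X")
```

Aborting one transaction in the cycle breaks the deadlock, after which the other can acquire the lock it was waiting for.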

67
Ensuring Transaction Properties
  • DBMS ensures atomicity (all-or-nothing property)
    even if system crashes in the middle of a Xact.
  • DBMS ensures durability of committed Xacts even
    if system crashes.
  • Idea: Keep a log (history) of all actions carried
    out by the DBMS while executing a set of Xacts
  • Before a change is made to the database, the
    corresponding log entry is forced to a safe
    location. (WAL protocol; OS support for this is
    often inadequate.)
  • After a crash, the effects of partially executed
    transactions are undone using the log. Effects of
    committed transactions are redone using the log.
  • trickier than it sounds!

68
Web brings unwashed masses, unreliable medium as
well as dirty data to databases..
  • Web accessibility changes the user/data/medium
    profile significantly
  • from SQL gurus supporting financial data on
    dedicated DBMS to 2.1 keyword query instant
    gratification seekers working with
    dirty/inconsistent data over unreliable web.
  • Challenges
  • How does one support keyword queries in
    databases?
  • How does one support imprecise queries in
    databases?
  • How do we handle incompleteness/inconsistency in
    databases?
  • Does it make sense to focus on total response
    time minimization
  • As against a multi-objective cost/benefit
    optimization?

The DB community has embraced these challenges
--see Lowell Report
69
Specifying Structure: The XML Standard
  • 11/18

70
Specifying Structured Text/Data: XML
  • XML is the confluence of several factors
  • The Web needed a more declarative format for
    data, trying to describe the meaning of the data
  • Documents needed a mechanism for extended tags to
    mark structure
  • Database people needed a more flexible
    interchange format
  • Original expectation
  • The whole web would go to XML instead of HTML
  • Today's reality
  • Not so. But XML is used all over, under the
    covers

Differing Expectations Based on which Side you
came from
71
An XML Document Example
  • <imdb>
  • <show year="1993">
  • <title>Fugitive, The</title>
  • <review>
  • <suntimes>
  • <reviewer>Roger Ebert</reviewer> gives
    <rating>two thumbs up</rating>! A fun action
    movie, Harrison Ford at his best.
  • </suntimes>
  • </review>
  • <review>
  • <nyt>The standard hollywood summer movie strikes
    back.</nyt>
  • </review>
  • <box_office>183,752,965</box_office>
  • </show>
  • <show year="1994">
  • <title>X Files,The</title>
  • <seasons>4</seasons>
  • </show>
  • </imdb>

Mixed Content
Attribute
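
The two annotations on this slide, attribute and mixed content, can be observed by parsing the document with Python's standard `xml.etree.ElementTree`. A minimal sketch: the document text follows the slide, while the variable names are made up.

```python
import xml.etree.ElementTree as ET

DOC = """<imdb>
  <show year="1993">
    <title>Fugitive, The</title>
    <review>
      <suntimes><reviewer>Roger Ebert</reviewer> gives <rating>two thumbs up</rating>! A fun action movie, Harrison Ford at his best.</suntimes>
    </review>
    <review>
      <nyt>The standard hollywood summer movie strikes back.</nyt>
    </review>
    <box_office>183,752,965</box_office>
  </show>
  <show year="1994">
    <title>X Files,The</title>
    <seasons>4</seasons>
  </show>
</imdb>"""

root = ET.fromstring(DOC)

# Attribute: "year" hangs off the <show> start tag, not a child element.
years = {s.findtext("title"): s.get("year") for s in root.findall("show")}

# Mixed content: <suntimes> interleaves child elements with free text.
suntimes = root.find("./show/review/suntimes")
full_text = "".join(suntimes.itertext())
rating = suntimes.findtext("rating")
```

The two shows have different children (box_office versus seasons), which is the semi-structured flavour a fixed relational schema would not allow directly.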
72
(No Transcript)
73
HTML vs. XML
  • <h1> Bibliography </h1>
  • <p> <i> Foundations of Databases </i>
  • Abiteboul, Hull, Vianu
  • <br> Addison Wesley, 1995
  • <p> <i> Data on the Web </i>
  • Abiteboul, Buneman, Suciu
  • <br> Morgan Kaufmann, 1999
  • <bibliography>
  • <book> <title> Foundations </title>
  • <author> Abiteboul </author>
  • <author> Hull </author>
  • <author> Vianu </author>
  • <publisher> Addison Wesley
    </publisher>
  • <year> 1995 </year>
  • </book>
  • </bibliography>

Self-describing: schema info is part of the
data. Good for data exchange (albeit
baroque for storage)
74
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
<bibliography>
<book> <title> Foundations </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
</bibliography>
HTML describes presentation
XML describes content
XSL (stylesheets) can be used to specify the
conversion
75
XML Terminology
  • tags: book, title, author, ...
  • start tag: <book>, end tag: </book>
  • elements: <book>...</book>, <author>...</author>
  • elements are nested
  • empty element: <red></red>, abbreviated <red/>
  • an XML document has a single root element

A well-formed XML document has matching tags
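
Well-formedness can be checked mechanically: a parser either accepts the text or rejects it. A small sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the text parses as XML (matching tags, single root)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


checks = {
    "matching tags": is_well_formed("<book><title>Foundations</title></book>"),
    "unclosed title": is_well_formed("<book><title>Foundations</book>"),
    "empty element": is_well_formed("<red/>"),
    "two roots": is_well_formed("<book/><book/>"),
}
```

Being well-formed says nothing about whether the tags make sense; that takes a DTD or schema, and agreement on what the tags mean.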
76
DOM Tree (Document-Object Model)
  • An XML document can be seen as a hierarchical tree

77
XML Order
  • If you see an XML file as a text file with tags,
    then order should matter
  • If you see an XML file as a self-describing
    version of (relational) data, then order
    shouldn't matter
  • Which should be the default?

78
More XML: Attributes
  • <book price="55" currency="USD">
  • <title> Foundations of Databases </title>
  • <author> Abiteboul </author>
  • <year> 1995 </year>
  • </book>

Attributes are single-valued. No
guidance on when to use them
79
More XML: Oids and References
Object identifiers
  • <person id="o555"> <name> Jane </name> </person>
  • <person id="o456"> <name> Mary </name>
  • <children idref="o123 o555"/>
  • </person>
  • <person id="o123" mother="o456"><name>John</name>
  • </person>

oids and references in XML are just syntax
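
Because oids and references are just syntax, following a reference is left to the application. The sketch below builds an id index by hand and dereferences `idref` and `mother` values. The wrapping `<people>` root element and the helper `children_of` are assumptions added so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

DOC = """<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name>
    <children idref="o123 o555"/>
  </person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>"""

root = ET.fromstring(DOC)

# The parser does not resolve references; we build the id index ourselves.
by_id = {p.get("id"): p for p in root.iter("person")}


def children_of(person_id):
    """Names of the persons listed in the idref attribute, if any."""
    refs = by_id[person_id].find("children")
    if refs is None:
        return []
    return [by_id[ref].findtext("name") for ref in refs.get("idref").split()]


marys_children = children_of("o456")
johns_mother = by_id[by_id["o123"].get("mother")].findtext("name")
```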
80
(No Transcript)
81
XML Meaning
82
XML ≠ machine accessible meaning
Jim Hendler
This is what a web-page in natural language
looks like for a machine (Unless it is in
Beijing..)
83
XML ≠ machine accessible meaning
Jim Hendler
XML allows meaningful tags to be added to parts
of the text
84
XML ≠ machine accessible meaning
Jim Hendler
But to your machine, the tags look like
this. (assuming it is not in Athens)
85
XML ≠ machine accessible meaning
Jim Hendler
Schemas help.
<CV>
by relating common terms between documents
private
86
But other people use other schemas
Jim Hendler
Someone else has one like this.
87
But other people use other schemas
Jim Hendler
<CV>
which don't fit in
private
Moral: There is still need for
ontology mapping.. → either by fiat → or by
learning
88
XML Meaning Summary
  • XML is a purely syntactic standard
  • Saying that something is in XML format is like
    saying something is in List or Table format
  • It is NOT like saying that something is in
    English/C etc (all of which have specific
    semantics)
  • Tags in XML do not up front have any meaning
  • Tags can be overloaded with specific meaning
    through prior agreement or standardization
  • Such agreements/standardization are possible for
    specific sub-tasks (e.g. HTML for rendering) or
    specific sub-communities (e.g. ebXML etc.; see next
    slide)
  • Tags' meaning can be expressed by relating them
    to other tags
  • This is the usual knowledge representation way
    (meaning comes from inter-predicate relations).
    Semantic Web pushes this view.
  • You can also learn the relations through
    context/practice/usage etc. This is the sort of
    view taken by (semi-automated) schema-mapping
    techniques

89
XML Dialect pot pourri
  • Extensible Financial Reporting Markup Language
    (XFRML),
  • eXtensible Business Reporting Language (XBRL),
  • MusicXML,
  • Spacecraft Markup Language (SML),
  • Bank Internet Payment System (BIPS),
  • Bioinformatic Sequence Markup Language (BSML),
  • Biopolymer Markup Language (BIOML),
  • Open Catalog Format (OCF),
  • Chemical Markup Language (CML),
  • Electronic Business XML Initiative (ebXML),
  • Open Trading Protocol (OTP),
  • FinXML, Financial Information eXchange protocol
    (FIX),
  • RecipeML, CVML,
  • XML Bookmark Exchange Language (XBEL),
  • Scalable Vector Graphics (SVG),
  • NewsML,
  • DocBook,
  • Real Estate Listing Markup Language (RELML), . . .

Examples of communities that Standardized their
tags
90
Who puts everything into XML?
  • To a certain extent, this is a vacuous question,
    once we realize that XML is just a syntactic
    standard
  • You can put things into XML by just putting a
    <body> tag (or any tag) at the beginning and end
    of the file
  • XML is not meant to be an imposition but rather a
    facilitator
  • XML facilitates marking up structure if someone
    wants to do this. That someone can be
  • creator of the page
  • secondary user who wants to tag the page
  • An extraction program that wants to remember the
    structure it extracted by tagging the page
  • The markup tags may or may not have any specific
    meaning based on prior agreements/standardization

91
Why are IR folks excited about XML?
  • XML files are text files with structure
  • Structure easily identifiable (the DOM structure)
  • We can improve Precision/Recall by taking
    structure into account..
  • We already did a bit, e.g. higher weight to words
    occurring in the header tags.
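A minimal sketch of the "higher weight to words in header tags" idea, in Python. The tag names, the weight of 3.0, and the function name weighted_term_counts are all assumptions made for illustration; the slides do not prescribe a weighting scheme.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical weights: a word inside <title> counts three times as much.
TAG_WEIGHTS = {"title": 3.0}

def weighted_term_counts(xml_text, tag_weights=TAG_WEIGHTS, default=1.0):
    """Return term -> weighted count, weighting each word by its enclosing tag."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        weight = tag_weights.get(elem.tag, default)
        # Only the text directly inside the element is counted here;
        # text that follows a child element (elem.tail) is ignored in this sketch.
        for word in (elem.text or "").lower().split():
            counts[word] += weight
    return counts

doc = "<page><title>XML retrieval</title><body>retrieval of text</body></page>"
scores = weighted_term_counts(doc)
```

The weighted counts could then replace raw term frequencies in an ordinary tf-idf ranking.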

92
Why are Database folks excited about XML?
  • XML is just a syntax for (self-describing) data
  • This is still exciting because
  • No standard syntax for relational data
  • With XML, we can
  • Translate any legacy data to XML
  • Can exchange data in XML format
  • Ship over the web, input to any application
  • Talk about querying on semi-structured data

93
XML viewed from a Database Point of View
94
XML vs. Relational Data
  • XML is meant as a language that supports both
    Text and Structured Data
  • Conflicting demands...
  • XML supports semi-structured data
  • In essence, the schema can be union of multiple
    schemas
  • Easy to represent books with or without prices,
    books with any number of authors etc.
  • XML supports free mixing of text and data
  • using the PCDATA type
  • XML is ordered (while relational data is
    unordered)
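To make the semi-structured point concrete, here is a small Python sketch that reads books with or without a price and with any number of authors. The sample document and the function name summarize are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample: the first book has a price, the second has two authors and no price.
BOOKS = """
<bib>
  <book><title>Book A</title><author>X</author><price>10</price></book>
  <book><title>Book B</title><author>Y</author><author>Z</author></book>
</bib>
"""

def summarize(xml_text):
    """Return one (title, authors, price) tuple per book, in document order."""
    rows = []
    for book in ET.fromstring(xml_text).findall("book"):
        title = book.findtext("title")
        authors = [a.text for a in book.findall("author")]  # order is preserved
        price = book.findtext("price")  # None when the element is absent
        rows.append((title, authors, price))
    return rows

rows = summarize(BOOKS)
```

A relational encoding of the same data would need a separate author table and a NULL price; here both variations live in one document.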

95
XML Data Model (DOM)
[Figure: DOM tree rooted at imdb. Node labels, in extraction order: show, title, review, review, @year, "Fugitive, The", 1993, suntimes, nyt, rating, reviewer, "two...", gives, Roger Ebert]
  • Check http://www.w3.org/XML/ for more details

96
DTDs
Notice that the DTD is not in XML syntax
<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title, section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>
Semi-structured
<paper>
  <section> <text> </text> </section>
  <section> <title> </title>
    <section> </section>
    <section> </section>
  </section>
</paper>
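A hand-written Python check of the paper content model, as a rough sketch. It assumes the DTD reads paper = section*, section = (title, section*) | text, and it does not check that title and text hold only character data. The function names are hypothetical; Python's standard library does not validate DTDs itself.

```python
import xml.etree.ElementTree as ET

def valid_section(sec):
    """True if sec matches (title, section*) | text."""
    tags = [child.tag for child in sec]
    if tags == ["text"]:
        return True
    if tags and tags[0] == "title" and all(t == "section" for t in tags[1:]):
        # Recurse into the nested sections that follow the title.
        return all(valid_section(child) for child in list(sec)[1:])
    return False

def valid_paper(xml_text):
    """True if the root is <paper> and every child is a valid <section>."""
    root = ET.fromstring(xml_text)
    return root.tag == "paper" and all(
        child.tag == "section" and valid_section(child) for child in root
    )

good = ("<paper><section><text>a</text></section>"
        "<section><title>t</title><section><text>b</text></section></section></paper>")
bad = "<paper><section><title>t</title><text>x</text></section></paper>"
```

The recursion in valid_section mirrors the recursion in the DTD: a section may contain sections to any depth.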
97
XML Schema
  • Supersedes DTD (and has XML syntax)
  • unifies previous schema proposals
  • generalizes DTDs
  • uses XML syntax
  • two documents: structure and datatypes
  • http://www.w3.org/TR/xmlschema-1
  • http://www.w3.org/TR/xmlschema-2

98
XML Schema
99
RDF Meta-data Standard for Web
  <rdf:Description about="www.mypage.com">
    <about> birds, butterflies, snakes </about>
    <author>
      <rdf:Description>
        <firstname> John </firstname>
        <lastname> Smith </lastname>
      </rdf:Description>
    </author>
  </rdf:Description>

Good ol' semantic networks..?
100
Xquery Resources
  • XQuery 1.0: An XML Query Language
  • W3C Working Draft 20 December 2001
  • XML Query Use Cases
  • W3C Working Draft 20 December 2001
  • Microsoft .Net Xquery Language Demo
  • http://131.107.228.20/
  • http://support.x-hive.com/xquery/index.html
  • Supports querying on the documents described in
    the W3C Use Cases
  • Xquery Tutorial by Fankhauser & Wadler
  • www.research.avayalabs.com/user/wadler/papers/xquery-tutorial/xquery-tutorial.pdf

101
http://support.x-hive.com/xquery/index.html
You will be asked to play with it in homework 3
102
FLoWeR Expressions
  • Xquery queries are made up of FLWR expressions
    that work on paths
  • For: binds variables to nodes
  • Let: computes aggregates
  • Where: applies a formula to find matching elements
  • Return: constructs the output elements
  • Path expressions are of the form
  • element//element/element[attrib=value]

103
Comparison to SQL
  • Look at the use case description on Xquery manual
  • Supports all (?) SQL style queries (with
    different syntax of course); see the default
    queries in the demo
  • Has support for
  • construction: outputting the answers in
    arbitrary XML formats (use case XMP)
  • path expressions: navigating the XML tree
    (use case SEQ)
  • Simple text queries (use case TEXT)
  • Allows queries on Tag elements
  • Removes the data/meta-data barrier in queries
  • For each book that has at least one author, list
    the title and first two authors, and an empty
    "et-al" element if the book has additional
    authors. (XMP use case 6)

104
11/20
Make-up Class: Wed 26th, 10:30 AM, Room TBD
(probably 210)
  • XQuery; IR-style search on XML; Semantic Web
    standards

105
DTD for http://www.bn.com/bib.xml
  <!ELEMENT bib (book* )>
  <!ELEMENT book (title, (author+ | editor+ ),
    publisher, price )>
  <!ATTLIST book year CDATA #REQUIRED >
  <!ELEMENT author (last, first )>
  <!ELEMENT editor (last, first, affiliation )>
  <!ELEMENT title (#PCDATA )>
  <!ELEMENT last (#PCDATA )>
  <!ELEMENT first (#PCDATA )>
  <!ELEMENT affiliation (#PCDATA )>
  <!ELEMENT publisher (#PCDATA )>
  <!ELEMENT price (#PCDATA )>

106
Example Query
Query
<bib>
{
  for $b in /bib/book
  where $b/publisher = "Addison-Wesley"
    and $b/@year > 1991
  return <book year={ $b/@year }>
           { $b/title }
         </book>
}
</bib>
  • For all books after 1991, return with Year
    changed from a tag to an attribute

Result
<bib>
  <book year="1994"> <title>TCP/IP Illustrated</title> </book>
  <book year="1992"> <title>Advanced Programming in the Unix environment</title> </book>
</bib>
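For comparison, the same for / where / return pattern can be sketched in Python with the standard library. This is an analogy and not how an XQuery engine works. The sample data is abbreviated (no authors or prices, so it would not validate against the bib DTD), and the third book and its publisher are invented.

```python
import xml.etree.ElementTree as ET

# Abbreviated, partly invented sample shaped like bib.xml.
BIB = """
<bib>
  <book year="1994"><title>TCP/IP Illustrated</title>
    <publisher>Addison-Wesley</publisher></book>
  <book year="1992"><title>Advanced Programming in the Unix environment</title>
    <publisher>Addison-Wesley</publisher></book>
  <book year="1999"><title>Some Other Book</title>
    <publisher>Other Press</publisher></book>
</bib>
"""

def addison_wesley_after_1991(xml_text):
    result = ET.Element("bib")
    for b in ET.fromstring(xml_text).findall("book"):        # for: bind b to each book
        if (b.findtext("publisher") == "Addison-Wesley"       # where: filter
                and int(b.get("year")) > 1991):
            out = ET.SubElement(result, "book", year=b.get("year"))  # return: build output
            ET.SubElement(out, "title").text = b.findtext("title")
    return result

answer = addison_wesley_after_1991(BIB)
pairs = [(b.get("year"), b.findtext("title")) for b in answer]
```

The loop variable plays the role of the For binding, the if is the Where clause, and the SubElement calls are the Return constructor.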
107
Example Query (2)
  • Return the books that cost more at amazon than
    fatbrain
Let $amazon := document("http://www.amazon.com/books.xml"),
Let $fatbrain := document("http://www.fatbrain.com/books.xml")
For $am in $amazon/books/book,
    $fat in $fatbrain/books/book
Where $am/isbn = $fat/isbn
  and $am/price > $fat/price
Return <book> { $am/title, $am/price, $fat/price } </book>

Join
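The join can be sketched in Python as follows. The two catalogs are tiny invented stand-ins for the amazon and fatbrain documents, and the sketch assumes each isbn appears at most once per catalog. It indexes one side by isbn (a hash join) instead of the nested loop the query text suggests.

```python
import xml.etree.ElementTree as ET

# Invented sample catalogs; real documents would be fetched from the two sites.
AMAZON = ("<books><book><isbn>1</isbn><title>A</title><price>30</price></book>"
          "<book><isbn>2</isbn><title>B</title><price>20</price></book></books>")
FATBRAIN = ("<books><book><isbn>1</isbn><title>A</title><price>25</price></book>"
            "<book><isbn>2</isbn><title>B</title><price>22</price></book></books>")

def pricier_at_amazon(amazon_xml, fatbrain_xml):
    """Return (title, amazon_price, fatbrain_price) for books costlier at amazon."""
    fat_price = {b.findtext("isbn"): float(b.findtext("price"))
                 for b in ET.fromstring(fatbrain_xml).findall("book")}
    out = []
    for am in ET.fromstring(amazon_xml).findall("book"):
        isbn, price = am.findtext("isbn"), float(am.findtext("price"))
        if isbn in fat_price and price > fat_price[isbn]:   # join + comparison
            out.append((am.findtext("title"), price, fat_price[isbn]))
    return out

pricier = pricier_at_amazon(AMAZON, FATBRAIN)
```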
108
XML frenzy in the DB Community
  • Now that XML is there, what can we do with it?
  • Convert all databases from Relational to XML?
  • Or provide XML views of relational databases?
  • Develop theory of native XML databases?
  • Or assume that XML data will be stored in
    relational databases..
  • Issues: What sort of storage mechanisms? What
    sort of indices?

109
XML middleware for Databases
RDBMS
On the internet, nobody needs to know that you
are a dog
  • XML adapters (middle-ware) received significant
    attention in DB community
  • SilkRoute (AT&T)
  • Xperanto (IBM)
  • Issues
  • Need to convert relational data into XML
  • Tagging (easy)
  • Need to convert Xquery queries into equivalent
    SQL queries
  • Trickier as Xquery supports schema querying
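The "tagging (easy)" step can be sketched in a few lines of Python with sqlite3. This is only an illustration of wrapping rows in tags, not how SilkRoute or Xperanto work. The table, its columns, and the function name table_to_xml are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(conn, table, row_tag):
    """Wrap every row of a table in <row_tag>, one child element per column."""
    cur = conn.execute(f"SELECT * FROM {table}")  # table name is trusted in this sketch
    columns = [d[0] for d in cur.description]
    root = ET.Element(table)
    for row in cur:
        elem = ET.SubElement(root, row_tag)
        for col, value in zip(columns, row):
            if value is not None:  # a NULL simply becomes a missing element
                ET.SubElement(elem, col).text = str(value)
    return root

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.executemany("INSERT INTO books VALUES (?, ?)",
                 [("Book A", 10.0), ("Book B", None)])
xml_text = ET.tostring(table_to_xml(conn, "books", "book"), encoding="unicode")
```

The reverse direction, turning an Xquery query over this view into SQL, is the hard part and is not attempted here.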

110
IR Style Querying of XML Documents
111
From Manning et al. IR Text
An XML document is represented as a vector in
the space of lexical trees. The query is an
extended lexical tree. Similarity between the
query and a lexical tree is defined as follows:
[Formula not preserved in the transcript]
Within the document, you return the snippet that
is closest.
Note that we are increasing the size of the index
(lexical trees rather than just words) to
exploit structure. This is normal (i.e., the index
becomes larger when structure is present).
112
Semantic Web Standards: RDF/RDF-Schema/OWL
113
Syntax vs. Semantics
  • Syntax provides the grammar for a language (all
    you can do is to see whether a sentence is
    grammatically correct and do parts-of-speech
    tagging)
  • XML
  • Semantics provides the set of worlds where a
    particular sentence (or a set of sentences) holds
  • Many formal languages have well-defined semantics
    (propositional logic, first-order logic, etc.)
  • Semantic Web involves providing an XML syntax for
    representing description logics, a fragment of
    first-order logic
  • Has two parts: base facts are represented by the RDF
    standard
  • Background knowledge (axioms etc.) is
    represented by RDF-Schema (which is superseded
    now by OWL)

114
XML isn't enough for Knowledge Exchange..
  • XML is a universal metalanguage for defining
    markup
  • It provides a uniform framework for interchange
    of data and metadata between applications
  • However, XML does not provide any means of
    talking about the semantics (meaning) of data
  • E.g., there is no intended meaning associated
    with the nesting of tags
  • It is up to each application to interpret the
    nesting.

115
Nesting of Tags in XML
  • David Billington is a lecturer of Discrete Maths
<course name="Discrete Maths">
  <lecturer>David Billington</lecturer>
</course>

<lecturer name="David Billington">
  <teaches>Discrete Maths</teaches>
</lecturer>
  • Opposite nesting, same information!
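A short Python sketch of why each application has to interpret the nesting itself: two different extraction functions, one per layout, are needed to recover the same fact. The function names and the "teaches" label in the output tuple are choices made for this illustration.

```python
import xml.etree.ElementTree as ET

COURSE_CENTRIC = ('<course name="Discrete Maths">'
                  '<lecturer>David Billington</lecturer></course>')
LECTURER_CENTRIC = ('<lecturer name="David Billington">'
                    '<teaches>Discrete Maths</teaches></lecturer>')

def fact_from_course(xml_text):
    """Layout 1: the course is the outer element, the lecturer is nested."""
    course = ET.fromstring(xml_text)
    return (course.findtext("lecturer"), "teaches", course.get("name"))

def fact_from_lecturer(xml_text):
    """Layout 2: the lecturer is the outer element, the course is nested."""
    lecturer = ET.fromstring(xml_text)
    return (lecturer.get("name"), "teaches", lecturer.findtext("teaches"))

fact1 = fact_from_course(COURSE_CENTRIC)
fact2 = fact_from_lecturer(LECTURER_CENTRIC)
```

Nothing in XML itself says the two documents agree; that knowledge lives in the two hand-written functions.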

116
What we want is a standard for representing
knowledge on the web..
  • A standard technique for KR is Logic
  • So how about we find a way of encoding Logical
    statements in XML?
  • A logical theory consists of
  • Base facts
  • Background theory
  • RDF is a standard for writing (binary predicate)
    base-facts
  • E.g. parent(Tom,Mary)
  • RDF-Schema is a standard for writing background
    theory..
  • E.g. Forall x,y: Parent(x,y) => Loves(x,y)
  • Recall that the complexity of inference depends
    on the form of background theory (e.g.
    semi-decidable for general FOPC and polynomial
    for Horn clauses). It is also tractable for
    description logics where all the background
    knowledge is of the form class, sub-class,
    instance. This is what RDF-Schema tries to
    capture.
  • RQL is (an emerging?) standard for querying
    RDF/RDF-S databases
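The split between base facts and background theory can be sketched in Python as naive forward chaining over binary predicates. The tuple encoding and the restriction to rules of the form P(x,y) => Q(x,y) are simplifications assumed for this example; they are not RDF or RDF-Schema syntax.

```python
def apply_rules(facts, rules):
    """Forward-chain rules of the form premise(x, y) => conclusion(x, y).

    facts: set of (predicate, x, y) tuples.
    rules: dict mapping a premise predicate to a conclusion predicate.
    """
    derived = set(facts)
    changed = True
    while changed:  # repeat until no new fact is produced
        changed = False
        for (pred, x, y) in list(derived):
            if pred in rules:
                new_fact = (rules[pred], x, y)
                if new_fact not in derived:
                    derived.add(new_fact)
                    changed = True
    return derived

facts = {("parent", "Tom", "Mary")}   # base fact
rules = {"parent": "loves"}           # Forall x,y: parent(x,y) => loves(x,y)
closure = apply_rules(facts, rules)
```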

117
Basic Ideas of RDF
  • Basic building block: object-attribute-value
    triple
  • It is called a statement
  • Sentence about Billington is such a statement
  • RDF has been given a syntax in XML
  • This syntax inherits the benefits of XML
  • Other syntactic representations of RDF possible

118
Web Schema Languages
  • Existing Web languages extended to facilitate
    content description
  • XML → XML Schema (XMLS)
  • RDF → RDF Schema (RDFS)
  • XMLS not an ontology language
  • Changes format of DTDs (document schemas) to be
    XML
  • Adds an extensible type hierarchy
  • Integers, Strings, etc.
  • Can define sub-types, e.g., positive integers
  • RDFS is recognisable as an ontology language
  • Classes and properties
  • Sub/super-classes (and properties)
  • Range and domain (of properties)

119
RDF and RDFS
  • RDF stands for Resource Description Framework
  • It is a W3C candidate recommendation
    (http://www.w3.org/RDF)
  • RDF is a graphical formalism (+ XML syntax +
    semantics)
  • for representing metadata
  • for describing the semantics of information in a
    machine- accessible way
  • RDFS extends RDF with schema vocabulary, e.g.
  • Class, Property
  • type, subClassOf, subPropertyOf
  • range, domain

120
The RDF Data Model
  • Statements are <subject, predicate, object>
    triples
  • Can be represented using XML serialisation, e.g.
  • <Ian, hasColleague, Uli>
  • Statements describe properties of resources
  • A resource is a URI representing a (class of)
    object(s)
  • a document, a picture, a paragraph on the Web
  • http://www.cs.man.ac.uk/index.html
  • a book in the library, a real person (?)
  • isbn://5031-4444-3333
  • Properties themselves are also resources (URIs)

121
URIs
  • URI = Uniform Resource Identifier
  • "The generic set of all names/addresses that are
    short strings that refer to resources"
  • URIs may or may not be dereferencable
  • URLs (Uniform Resource Locators) are a particular
    type of URI, used for resources that can be
    accessed on the WWW (e.g., web pages)
  • In RDF, URIs typically look like normal URLs,
    often with fragment identifiers to point at
    specific parts of a document
  • http://www.somedomain.com/some/path/to/file#fragmentID

122
RDF Syntax
  • RDF has an XML syntax that has a specific
    meaning
  • Every Description element describes a resource
  • Every attribute or nested element inside a
    Description is a property of that Resource with
    an associated object resource
  • Resources are referred to using URIs
<Description about="some.uri/person/ian_horrocks">
  <hasColleague resource="some.uri/person/uli_sattler"/>
</Description>
<Description about="some.uri/person/uli_sattler">
  <hasHomePage>http://www.cs.mam.ac.uk/sattler</hasHomePage>
</Description>
<Description about="some.uri/person/carole_goble">
  <hasColleague resource="some.uri/person/uli_sattler"/>
</Description>
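Reading these Description elements as triples can be sketched in Python. The sketch follows the simplified, namespace-free syntax used on the slide, not full RDF/XML, and the wrapping <rdf> root element is an assumption added so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# The slide's three descriptions, wrapped in a hypothetical <rdf> root.
RDF_XML = """<rdf>
<Description about="some.uri/person/ian_horrocks">
  <hasColleague resource="some.uri/person/uli_sattler"/>
</Description>
<Description about="some.uri/person/uli_sattler">
  <hasHomePage>http://www.cs.mam.ac.uk/sattler</hasHomePage>
</Description>
<Description about="some.uri/person/carole_goble">
  <hasColleague resource="some.uri/person/uli_sattler"/>
</Description>
</rdf>"""

def to_triples(xml_text):
    """Each nested element of a Description yields (subject, property, object)."""
    triples = set()
    for desc in ET.fromstring(xml_text).findall("Description"):
        subject = desc.get("about")
        for prop in desc:
            # The object is either a resource URI or a literal string.
            obj = prop.get("resource") or (prop.text or "").strip()
            triples.add((subject, prop.tag, obj))
    return triples

triples = to_triples(RDF_XML)
```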

123
Linking Statements
  • The subject of one statement can be the object of
    another
  • Such collections of statements form a directed,
    labeled graph
  • Note that the object of a triple can also be a
    literal (a string)
  • Note also that RDF triples don't by themselves
    give meaning
  • You know that (1) Ian and Carole are most likely
    colleagues (barring multiple jobs for Uli) and (2)
    (Uli hasColleague Ian) holds (colleagueness,
    unlike love, is symmetric). But DOES YOUR
    PROGRAM KNOW THIS?
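A program only "knows" that colleagueness is symmetric if someone tells it. In the Python sketch below that knowledge is passed in explicitly as a set of symmetric property names; the short names ian, uli and carole and the function name symmetric_closure are illustrative.

```python
def symmetric_closure(triples, symmetric_properties):
    """Add (o, p, s) for every (s, p, o) whose property is declared symmetric."""
    closed = set(triples)
    for (s, p, o) in triples:
        if p in symmetric_properties:
            closed.add((o, p, s))
    return closed

stated = {("ian", "hasColleague", "uli"),
          ("carole", "hasColleague", "uli")}

# Without the declaration nothing new is derived; with it, the reverse triples appear.
unchanged = symmetric_closure(stated, set())
closed = symmetric_closure(stated, {"hasColleague"})
```

Deriving that Ian and Carole are colleagues would need a further, riskier assumption (transitivity), which is exactly the "barring multiple jobs" caveat.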

124
A Critical View of RDF: Binary Predicates
  • RDF uses only binary properties
  • This is a restriction because often we use
    predicates with more than 2 arguments
  • But binary predicates can simulate these
  • Example: referee(X,Y,Z)
  • X is the referee in a chess game between players
    Y and Z

125
A Critical View of RDF: Binary Predicates (2)
  • We introduce
  • a new auxiliary resource chessGame
  • the binary predicates ref, player1, and player2
  • We can represent referee(X,Y,Z) as
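One way to write this encoding down, sketched in Python with tuples standing in for RDF triples. The game identifier "game1" and the player names are hypothetical; only the predicate names ref, player1 and player2 come from the slide.

```python
def referee_to_triples(game_id, x, y, z):
    """Encode referee(x, y, z) as three binary facts about an auxiliary resource."""
    return {
        (game_id, "ref", x),
        (game_id, "player1", y),
        (game_id, "player2", z),
    }

def triples_to_referee(triples, game_id):
    """Recover (x, y, z) from the binary facts attached to game_id."""
    props = {p: o for (s, p, o) in triples if s == game_id}
    return (props["ref"], props["player1"], props["player2"])

game = referee_to_triples("game1", "X", "Y", "Z")
recovered = triples_to_referee(game, "game1")
```

The auxiliary resource is what keeps the three arguments tied together when several games are stored in the same graph.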

126
A Critical View of RDF: Properties
  • Properties are special kinds of resources
  • Properties can be used as the object in an
    object-attribute-value triple (statement)
  • They are defined independent of resources
  • This possibility offers flexibility
  • But it is unusual for modelling languages and OO
    programming languages
  • It can be confusing for modellers

127
A Critical View of RDF: Reification
  • The reification mechanism is quite powerful
  • It appears misplaced in a simple language like
    RDF
  • Making statements about statements introduces a
    level of complexity that is not necessary for a
    basic layer of the Semantic Web
  • Instead, it would have appeared more natural to
    include it in more powerful layers, which provide
    richer representational capabilities

128
A Critical View of RDF: Summary
  • RDF has its idiosyncrasies and is not an optimal
    modeling language but
  • It is already a de facto standard
  • It has sufficient expressive power
  • At least enough for more layers to be built on top
  • Using RDF offers the benefit that information
    maps unambiguously to a model

129
RDF Schema (RDFS)
  • RDF gives a formalism for meta data annotation,
    and a way to write it down in XML, but it does
    not give any special meaning to vocabulary such
    as subClassOf or type
  • Interpretation is an arbitrary binary relation
  • I.e., ltPerson,subClassOf,Animalgt has no special
    meaning
  • RDF Schema defines schema vocabulary that
    supports definition of ontologies
  • gives extra meaning to particular RDF
    predicates and resources (such as subClassOf)
  • this extra meaning, or semantics, specifies how
    a term should be interpreted

NOTICE THAT RDF-SCHEMA is NOT to RDF WHAT
XML-Schema is to XML
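What the "extra meaning" of subClassOf and type buys can be sketched in Python: an instance of a class is also an instance of every superclass. The tuple encoding, the sample class names beyond Person and Animal, and the function name rdfs_types are assumptions for illustration; a real RDFS reasoner covers more vocabulary (subPropertyOf, range, domain).

```python
def rdfs_types(triples):
    """Return all (instance, class) pairs entailed by type and subClassOf."""
    sub = {(s, o) for (s, p, o) in triples if p == "subClassOf"}
    types = {(s, o) for (s, p, o) in triples if p == "type"}
    changed = True
    while changed:  # propagate up the class hierarchy until nothing new appears
        changed = False
        for (inst, cls) in list(types):
            for (child, parent) in sub:
                if child == cls and (inst, parent) not in types:
                    types.add((inst, parent))
                    changed = True
    return types

graph = {
    ("Person", "subClassOf", "Animal"),
    ("Animal", "subClassOf", "LivingThing"),
    ("ian", "type", "Person"),
}
types = rdfs_types(graph)
```

With plain RDF the three triples are just three unrelated edges; the inference above is what RDF Schema's semantics licenses.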
130
Background Theory
RDF Schema is really RDF background knowledge!
Instances
131
RDF/RDFS vs. General Knowledge Rep & Reasoning
  • We noted that RDF can be seen as base level
    facts and RDFS can be seen as background
    theory/facts/rules
  • At this level, inference with RDF/RDFS seems to
    be just a special case of Knowledge
    Representation & Reasoning
  • This is good (CSE471 Ahoy!) and bad (reasoning
    over most non-trivial logics is NP-hard or much
    much worse).
  • RDF/RDFS can be seen as an attempt to li