XML Research Issues in Database Perspective

Saturday, October 28, 2000. XML Research Issues in Database Perspective - KISS'00 Fall


1
XML Research Issues in Database Perspective
  • Kyuseok Shim
  • shim@cs.kaist.ac.kr
  • http://cs.kaist.ac.kr/shim
  • Korea Advanced Institute of Science and Technology

2
XML Working Groups
  • Core XML
  • XML, namespaces, XML Infoset
  • XML Linking
  • XPath, XPointer, XLink
  • XML Schema
  • XML Schema
  • XML Query
  • XML Query, XML Query Data Model
  • Document Object Model (DOM)
  • XSL

3
XML
  • A W3C standard to complement HTML
  • An instance of semistructured data [Abi97]
  • Document Type Definition (DTD)
  • Origin: SGML
  • Tags describe the semantics of the data
  • HTML simply specifies how the data item is to be
    displayed
  • An element can contain a sequence of nested
    sub-elements
  • Sub-elements may themselves be tagged elements or
    character data

4
Document Type Definition (DTD)
  • A part of XML specification
  • An XML document may have a DTD
  • Grammar for describing the structure of XML
    document
  • The structure of an element is specified by a
    regular expression
  • Terminology for XML
  • well-formed if tags are correctly closed
  • valid if it has a DTD and conforms to it
  • For exchanges of data, validation is useful
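
The well-formedness half of this terminology can be checked with any XML parser. The following is a minimal sketch using Python's standard library; the helper name is_well_formed is an assumption made for illustration. It does not check validity, which requires a validating parser that reads the DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as XML, i.e. all tags are correctly closed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested and closed tags.
good = "<book><booktitle>The Selfish Gene</booktitle></book>"
# The booktitle element is never closed.
bad = "<book><booktitle>The Selfish Gene</book>"

good_result = is_well_formed(good)
bad_result = is_well_formed(bad)
```

A well-formed document can still be invalid: it may parse correctly yet contain elements that its DTD does not allow.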

5
Document Type Definition (DTD)
  • Syntax
  • , (comma): sequence
  • |: or
  • (): grouping
  • ?, *, +: zero or one, zero or more, one or more
    occurrences
  • ANY: allows an arbitrary XML fragment to be
    nested within the element

6
A DTD Example
  • <!ENTITY USA "United States of America">
  • <!ELEMENT book (booktitle, author)>
  • <!ATTLIST book id ID #IMPLIED>
  • <!ELEMENT booktitle (#PCDATA)>
  • <!ELEMENT author (name, (address | affiliation))>
  • <!ELEMENT name (#PCDATA)>
  • <!ELEMENT address ANY>
  • <!ELEMENT affiliation (#PCDATA)>

7
An XML Document Example
  • <book id="123">
  • <booktitle> The Selfish Gene </booktitle>
  • <author id="dawkins">
  • <name> Richard Dawkins </name>
  • <address>
  • <city> Timbuktu </city>
  • <zip> 99999 </zip>
  • </address>
  • </author>
  • </book>
  • <book>
  • <booktitle> The C Programming Language</booktitle>
  • <author>
  • <name> Brian W. Kernighan </name>
  • <address> <country> USA </country> </address>
  • </author>
  • <author>
  • <name> Dennis M. Ritchie </name>
  • <affiliation> Bell Labs </affiliation>

8
An XML Namespace
  • Provides a simple method for qualifying element
    and attribute names used in Extensible Markup
    Language documents by associating them with
    namespaces identified by URI references.
  • Is a collection of names, identified by a URI
    reference, which are used in XML documents as
    element types and attribute names.
  • <x xmlns:edi='http://ecommerce.org/schema'>
  •   <!-- the 'price' element's namespace is
    http://ecommerce.org/schema -->
  •   <edi:price units='Euro'>32.18</edi:price>
  • </x>
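
To see how a parser resolves the edi prefix to its namespace URI, here is a small sketch using Python's standard xml.etree.ElementTree module. The variable names are illustrative only. ElementTree reports qualified names in the form {URI}localname.

```python
import xml.etree.ElementTree as ET

DOC = """<x xmlns:edi='http://ecommerce.org/schema'>
  <!-- the 'price' element's namespace is http://ecommerce.org/schema -->
  <edi:price units='Euro'>32.18</edi:price>
</x>"""

root = ET.fromstring(DOC)

# Look up the element through a prefix-to-URI mapping; the prefix chosen in
# the mapping is local to this query and need not match the document's.
price = root.find("edi:price", {"edi": "http://ecommerce.org/schema"})

qualified_tag = price.tag   # expanded name: namespace URI plus local name
units = price.get("units")  # the unprefixed attribute is in no namespace
amount = price.text
```

The element x itself carries no prefix and no default namespace is declared, so its tag stays unqualified.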

9
XML Schemas
  • Recently proposed
  • http://www.w3c.org/TR/xmlschema-1
  • http://www.w3c.org/TR/xmlschema-2
  • Unifies previous schema proposals
  • Generalizes DTDs
  • Use XML syntax

10
XML Schema
  • <elementType name="article">
  • <sequence>
  • <elementTypeRef name="title"/>
  • <elementTypeRef name="author" minOccurs="0"/>
  • </sequence>
  • </elementType>
  • DTD: <!ELEMENT article (title, author*)>

11
XTRACT Extracting DTD from XML Documents
  • Garofalakis, Gionis, Rastogi, Seshadri, Shim 99
  • DTDs contain valuable information on the
    structure of the documents
  • play a critical role in the storage as well as
    formulation and optimization of queries
  • DTDs are not mandatory
  • it is frequently the case that an XML database
    has no accompanying DTDs
  • XTRACT can infer concise and semantically
    meaningful DTDs for XML documents

12
XTRACT Motivation
  • DTD is very useful!
  • Plays a crucial role in efficient storage of XML
    data
  • [SHT99, DFS99] The DTD is exploited to generate an
    effective relational schema
  • Devise efficient plans for queries
  • [GW97, FS97] The DTD allows the search to be
    restricted to only the relevant portions of the data
  • Helps users form meaningful queries over the
    XML database
  • However, an XML document may not always have an
    accompanying DTD

13
XTRACT Related Work
  • Mining DTDs from a collection of XML documents
    has not been addressed in the literature
  • Extraction of schema from semistructured data
  • [NAM98, GW97, FS97]
  • attempt to find a typing for semistructured data
  • finding a typing is tantamount to grouping
    objects that have similar edges
  • In DTD, outgoing edges from a type can be
    described by an arbitrary regular expression
  • No ordering is imposed for edges

14
XTRACT Related Work
  • [Gol67, Gol78, Ang78]
  • Infer formal languages from examples
  • Purely theoretical and focus on investigating the
    computational complexity of the language
    inference problem
  • [KMU95]
  • Infers a pattern language from positive examples
  • MDL principle was used
  • Assume the set of simple patterns is available
  • Cannot find general regular expressions
  • Patterns are not known a priori

15
XTRACT Problem Formulation
  • Given a set I of N input sequences nested within
    element e
  • Compute a DTD for e such that every sequence in I
    conforms to the DTD

16
XTRACT Naive Approaches
  • Factor as much as possible
  • e.g. t, ta, taa, taaa, taaaa
  • t | t(a | a(a | a(a | aa)))
  • much more voluminous and a lot less intuitive
  • Find the automaton with the smallest number of
    states that accepts I and derive a regular
    expression from the automaton
  • may not be the shortest regular expression
17
XTRACT Desirable DTDs
  • The DTD should be concise (i.e. small in size)
  • easy to understand and succinct
  • The DTD should be precise
  • not cover too many sequences not contained in I
  • not too general, and captures the structure of
    the input sequences
  • Trade-off!

18
XTRACT Example
  • I = {ab, abab, ababab}
  • (a | b)*
  • a gross over-generalization of the input
  • completely fails to capture any structure
    inherent in the input
  • ab | abab | ababab, ab | ab(ab | abab)
  • accurately reflect the structure of the input
    sequences but do not generalize
  • (ab)*
  • succinct and generalizes the input sequences
    without losing too much structure information
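
The trade-off between these candidates can be checked mechanically. The sketch below treats each candidate DTD as an ordinary regular expression and tests it with Python's re module. The probe strings ba and abababab are assumptions chosen for illustration and do not come from the slides.

```python
import re

INPUT = ["ab", "abab", "ababab"]
CANDIDATES = ["(a|b)*", "ab|abab|ababab", "ab|ab(ab|abab)", "(ab)*"]

def accepts(dtd, seq):
    """True if the whole sequence matches the candidate expression."""
    return re.fullmatch(dtd, seq) is not None

# Every candidate must cover all input sequences.
covers_input = {dtd: all(accepts(dtd, s) for s in INPUT) for dtd in CANDIDATES}

# "ba" has none of the input's structure; only an over-general DTD accepts it.
accepts_ba = {dtd: accepts(dtd, "ba") for dtd in CANDIDATES}

# "abababab" continues the input's pattern; an enumeration cannot accept it.
accepts_longer = {dtd: accepts(dtd, "abababab") for dtd in CANDIDATES}
```

Only (ab)* rejects the unstructured probe while still accepting the longer sequence that follows the same pattern.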

19
XTRACT MDL Principle
  • An information-theoretic measure for quantifying
    and thereby resolving the tradeoff between the
    conciseness and preciseness
  • MDL principle has been successfully applied in a
    variety of situations
  • e.g. decision tree classifiers
  • Roughly speaking, the best theory to infer from a
    set of data is the one that minimizes the sum of
  • the length of the theory, in bits (conciseness)
  • the length of the data, in bits, when encoded
    with the help of the theory (preciseness)

20
XTRACT Example
  • I = {ab, abab, ababab}
  • (a | b)*
  • abab: cost of 5 (1 for the number of repetitions (4) +
    4 characters to represent each chosen character)
  • MDL cost: 6 (encoding the DTD) + 3 + 5 + 7 = 21
  • ab | abab | ababab
  • MDL cost: 14 + 3 = 17
  • ab | ab(ab | abab)
  • MDL cost: 14 + 1 + 2 + 2 = 19
  • (ab)*
  • MDL cost: 5 + 3 = 8
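
The arithmetic of this example can be reproduced with a short script. This is a simplified sketch, not XTRACT's actual encoding: it assumes one cost unit per DTD character, per repetition count and per choice, and the per-candidate encoding functions are written by hand for these four expressions only.

```python
# Input sequences from the example.
SEQUENCES = ["ab", "abab", "ababab"]

def cost_any_symbols(seq):
    # (a|b)*: 1 unit for the repetition count, then 1 unit per symbol chosen.
    return 1 + len(seq)

def cost_enumeration(seq):
    # ab|abab|ababab: 1 unit to say which alternative was taken.
    return 1

def cost_factored(seq):
    # ab|ab(ab|abab): 1 unit for the outer choice, plus 1 unit for the inner
    # choice whenever the second alternative is taken.
    return 1 if seq == "ab" else 2

def cost_repeated_ab(seq):
    # (ab)*: 1 unit for the repetition count.
    return 1

CANDIDATES = {
    "(a|b)*": cost_any_symbols,
    "ab|abab|ababab": cost_enumeration,
    "ab|ab(ab|abab)": cost_factored,
    "(ab)*": cost_repeated_ab,
}

def mdl_cost(dtd, encode_cost):
    # Length of the theory (the DTD) plus length of the data given the theory.
    return len(dtd) + sum(encode_cost(s) for s in SEQUENCES)

costs = {dtd: mdl_cost(dtd, fn) for dtd, fn in CANDIDATES.items()}
best = min(costs, key=costs.get)
```

Under these assumptions the cheapest candidate is (ab)*, the same DTD the example singles out as concise yet precise.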

21
XTRACT
  • Generalization
  • generates zero or more candidate DTDs by
    replacing patterns in the input sequences with
    meta-characters like *
  • e.g. abab => (ab)*, bbbe => b*e
  • Factorization
  • factors common subexpressions from the
    generalized candidate DTDs
  • e.g. bd | be => b(d | e)
  • Minimum Description Length (MDL) Principle
  • MDL ranks each candidate DTD and chooses the
    minimum cost DTD
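
The two rewriting steps can be illustrated with a toy sketch. It is far simpler than XTRACT's real heuristics: generalize only collapses immediately repeated substrings, and factor_common_prefix only factors a shared prefix. Both function names are hypothetical.

```python
import re
from os.path import commonprefix

def generalize(seq):
    """Replace each run of an immediately repeated substring with unit*."""
    def repl(match):
        unit = match.group(1)
        # A single symbol needs no parentheses: b*, but (ab)*.
        return (unit if len(unit) == 1 else f"({unit})") + "*"
    # (.+?) finds the shortest unit that is followed by one or more copies.
    return re.sub(r"(.+?)\1+", repl, seq)

def factor_common_prefix(alternatives):
    """Factor the longest common prefix out of a list of alternatives."""
    prefix = commonprefix(alternatives)
    if not prefix:
        return "|".join(alternatives)
    rests = [alt[len(prefix):] for alt in alternatives]
    return f"{prefix}({'|'.join(rests)})"

gen_abab = generalize("abab")
gen_bbbe = generalize("bbbe")
factored = factor_common_prefix(["bd", "be"])
```

The sketch ignores cases such as an alternative that equals the common prefix, which would leave an empty branch inside the parentheses.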

22
XTRACT Example
23
XML Storage
  • Existing approaches unnecessarily sacrifice either
    efficiency or flexibility
  • Traditional DBMSs (RDB or OODB) have rigid
    schemas.
  • Integrating a new site requires complex mapping
    and potential loss of information
  • Integrating a new site may require schema
    evolution.
  • Existing fully semi-structured data storage
    techniques sacrifice query efficiency and space.
  • they require excessive interpretation (harming
    query efficiency) and
  • redundant storage

24
XML Storage
  • Need to store and query XML data flexibly and
    efficiently
  • improve the tradeoffs for storage space and
    query efficiency for a given degree of
    flexibility.
  • allows user to choose the degree of storage
    flexibility

25
XML Storage
  • text file
  • relational DBMS
  • object-oriented DBMS
  • build special purpose repository

26
XML Storage Text File
  • To store the flat streams, file system or a BLOB
    manager in DBMS is used
  • e.g. Abiteboul, Cluet, Milo VLDB93
  • Pros
  • simple
  • fast for storing and retrieving whole documents
  • less space than one might think
  • reasonable clustering
  • Cons
  • incremental update is difficult
  • requires a special-purpose query processor
  • accessing the document structure is only possible
    through parsing

27
XML Storage Relational DBMS
  • Advantages
  • RDBMS products are mature and scale well
  • Traditional and semi-structured data can co-exist
  • RDBMS can process even complex queries on large
    databases within seconds
  • Disadvantages
  • expensive to reconstruct the original XML data
    from relational data
  • updates are both complicated and expensive in
    certain cases
  • extra effort is needed to translate XML queries and
    updates into SQL

28
XML Storage RDBMS (1)
  • Florescu, Kossmann IEEE Data Eng. Bulletin 99

29
XML Storage RDBMS (2)
  • Shanmugasundaram et al. 99
  • process DTD to generate a relational schema
  • Use DTD graph and element graph
  • three approaches
  • Basic
  • Shared
  • Hybrid

30
DTD
31
XML Document
32
The Basic Inline Technique
  • Creates relations for every element
  • an XML document can be rooted at any element in a
    DTD
  • element graph is used to decide the relations
  • Inlines as many descendants as possible
  • e.g. the author relation has attributes
    firstname, lastname, address and authorid
  • Creates a separate relation to handle * in the DTD
    graph, linked using a foreign key
  • Expresses the recursive relationship using the
    notion of relational keys

33
Building an Element Graph
  • Do a depth first traversal of the DTD graph
    starting at the element node
  • Each node is marked as visited the first time
    reached
  • Each node is unmarked once all of its children
    have been traversed
  • If an unmarked node in the DTD graph is reached, a
    new node with the same name is created in the
    element graph
  • If an attempt is made to traverse a marked DTD
    node, a backpointer edge is added
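
The traversal described above can be sketched as a recursive depth-first search. This is an illustrative reading of the bullets, not the paper's implementation. The DTD graph is assumed to be a dictionary from element name to child element names, and the sample graph is hypothetical.

```python
def build_element_graph(dtd_graph, root):
    """Expand a DTD graph into an element graph by depth-first traversal.

    Returns (nodes, edges, backpointers). Nodes are element names indexed by
    position; edges and backpointers are (from_index, to_index) pairs.
    """
    nodes, edges, backpointers = [], [], []
    marked = {}  # element name -> node index, for nodes on the current path

    def visit(name, parent):
        if name in marked:
            # Traversing a marked node: add a backpointer instead of a copy.
            backpointers.append((parent, marked[name]))
            return
        index = len(nodes)
        nodes.append(name)  # unmarked node reached: create a new node
        if parent is not None:
            edges.append((parent, index))
        marked[name] = index  # mark on first visit
        for child in dtd_graph.get(name, []):
            visit(child, index)
        del marked[name]  # unmark once all children are traversed

    visit(root, None)
    return nodes, edges, backpointers

# Hypothetical DTD graph: name is shared, and editor refers back to book.
DTD_GRAPH = {
    "book": ["author", "editor"],
    "author": ["name"],
    "editor": ["name", "book"],
    "name": [],
}
nodes, edges, backpointers = build_element_graph(DTD_GRAPH, "book")
```

Because nodes are unmarked after their children are done, a shared element such as name is copied once per path, while a recursive reference such as editor to book becomes a backpointer.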

34
DTD Graph
35
An Example Element Graph
36
Creation of Relations
  • Given an element graph, relations are created as
    follows
  • A relation is produced for the root element
  • All descendant elements are inlined into that
    relation except
  • children directly below a * node
  • each node having a backpointer edge pointing to
    it
  • A separate relation is created for each of the
    above exception nodes
  • Each relation has ID and parentID fields

37
Basic Inline Schema
38
Basic Inline Technique
  • Pros
  • List all authors of books
  • Cons
  • List all authors having first name Jack (5
    separate queries)
  • A large number of relations is created

39
Shared Inline Technique
  • Relations are created for all elements in the DTD
    graph whose nodes have in-degree greater than one
  • Nodes with an in-degree of one are inlined
  • Nodes with an in-degree of zero are made separate
    relations
  • Of mutually recursive elements all having
    in-degree one, one of them is made a separate
    relation
  • e.g. monograph and editor
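
The in-degree rules above can be sketched in a few lines. This simplified version covers only the first three rules. It does not handle mutually recursive elements that all have in-degree one, and the DTD graph used here is a hypothetical example.

```python
from collections import Counter

def shared_inline_relations(dtd_graph):
    """Return the elements that get their own relation under the in-degree rules.

    In-degree 0 or greater than 1: separate relation. In-degree 1: inlined
    into the parent. The recursion rule is deliberately left out.
    """
    in_degree = Counter({name: 0 for name in dtd_graph})
    for children in dtd_graph.values():
        for child in set(children):
            in_degree[child] += 1
    return {name for name, degree in in_degree.items() if degree != 1}

# Hypothetical DTD graph: author is shared by book and article.
DTD_GRAPH = {
    "book": ["booktitle", "author"],
    "article": ["title", "author"],
    "author": ["name"],
    "booktitle": [],
    "title": [],
    "name": [],
}
relations = shared_inline_relations(DTD_GRAPH)
inlined = set(DTD_GRAPH) - relations
```

Here author gets a single shared relation instead of being copied into both book and article, which is what keeps the number of relations small.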

40
Shared Inlining Technique
  • Small number of relations compared to Basic
    schema
  • Use isRoot field for inlining problems
  • Requires only one query for finding all authors
  • Still Basic is superior for reducing the number
    of joins

41
Hybrid Inlining Technique
  • Additionally inlines elements with in-degree
    greater than one that are not recursive or
    reached through a * node
  • e.g. author is inlined with book and monograph

42
XML Storage STORED
  • Deutsch, Fernandez, Suciu SIGMOD99
  • Maps semistructured data into relational data
  • Integrate both relational and overflow systems
  • Uses a data mining algorithm to find frequent
    subtrees
  • due to the fact that there is no notion of DTD in
    semistructured data
  • Overflow mapping is used to ensure the mapping is lossless
  • overflow objects or object parts are stored in a
    separate semistructured data object repository
  • Incremental updates and ordering of elements are
    not considered

43
XML Storage STORED
  • Derive schema from data with data mining algorithm

44
XML Storage OODBMS
  • Stores XML elements with the structured semantics
  • Flexible locking down to element level
  • In RDBMS, due to disassembly of XML data into
    various tables, implementing an effective locking
    scheme is hard
  • With flat files, no portion of a document
    being modified is available to other users
  • Use a separate record for each tree node
  • Systems available
  • POET (POET Content Management system)
  • Excelon (ObjectDesign)
  • LORE

45
XML Storage NATIX
  • Kanne, Moerkotte ICDE00
  • Native repository
  • Classical record manager
  • Accesses raw disk or file system files
  • Provides a memory space divided into segments
    (equal sized pages)
  • Tree storage manager
  • maps the trees used to model documents
  • Schema manager
  • maintains the system catalog data (e.g. DTD)
  • system catalog is stored in XML format

46
NATIX
  • Store whole document in one record, instead of
    storing each tree node in a separate record
  • Semantically split large tree based on underlying
    tree structure
  • Partition the data into subtrees and store each
    subtree in a record
  • Connected subtrees residing in other records are
    represented by proxy objects
  • proxy objects consist of a RID
  • substituting all proxies by the respective
    subtrees reconstructs the original data tree
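
The proxy mechanism can be pictured with a small in-memory model. This is only a sketch of the idea, not NATIX's record format: the Node and Proxy classes, the integer RIDs and the dictionary standing in for the record manager are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)

@dataclass
class Proxy:
    rid: int  # record ID of the record holding the missing subtree

def reconstruct(item, records):
    """Rebuild the full tree by replacing every proxy with its subtree."""
    if isinstance(item, Proxy):
        return reconstruct(records[item.rid], records)
    return Node(item.tag, [reconstruct(child, records) for child in item.children])

# A document split across two records; record 1 refers to record 2 by proxy.
records = {
    1: Node("book", [Node("booktitle"), Proxy(2)]),
    2: Node("author", [Node("name"), Node("address")]),
}
full_tree = reconstruct(records[1], records)
```

Only the record that is actually needed has to be read, and following proxies on demand yields the original tree.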

47
XML Query Processing
  • McHugh, Widom Workshop 99
  • Expand regular path expressions at compile time
    using structural summary
  • Guarantee to visit, at run-time, a subset of the
    objects visited with the original path expression
  • e.g. Library. -
  • Proceedings.Conference.Paper
  • Books.Book
  • Movies.Movie.BasedOn

48
XML Query Processing
  • Fernandez, Suciu ICDE 98
  • Optimize regular path expressions
  • Restrict navigation to only a fragment of the
    data
  • Use state extents to eliminate and reduce
    navigation
  • McHugh, Widom VLDB 99
  • Propose cost-based query optimizer
  • Transform a query into a logical query plan
  • Explore the space of possible physical plans
  • Introduce new types of indexes for efficient
    traversals through data graphs
  • Suggest an appropriate set of statistics and
    devise methods for computing and storing
    statistics

49
XML Query Processing
  • Christophides, Cluet, Simeon SIGMOD 00
  • Propose an XML algebra
  • Captures the expressive power of semistructured
    or XML query languages
  • Can wrap more structured languages such as SQL or
    OQL
  • New optimization techniques
  • Exploit type information
  • Push query evaluation to external source

50
XML View of Relational Data
  • Fernandez, Tan, Suciu WWW 00
  • Mediator system
  • Automatically convert the relational data into
    XML
  • An XML view of the relational database is defined
    using a declarative query language
  • Some other application formulates a query over
    the virtual view
  • Fully exploits the underlying RDBMS query engine

51
XML View of Relational Data
  • Shanmugasundaram et al. VLDB 00
  • Propose new scalar and aggregate functions in SQL to
    construct complex XML documents
  • Explore different execution plans for generating
    the contents of XML documents
  • Constructing the XML document inside the relational
    engine benefits performance the most
  • Outer union plan

52
Metadata Management
  • Generic data model
  • Not impossible, but unlikely
  • Proliferation of data models
  • No proof that any one is superior
  • Semantics aren't fully captured in any data model

53
Metadata Management
  • Philip Bernstein, VLDB '00 Panel
  • Generality - representation of metadata must
    apply to all application areas
  • Usefulness - exploit application-specific
    semantics
  • Is there an effective middle-ground?
  • Define generic high-level operations on models
    and mappings, e.g., Match, Merge, Select,
    Compose
  • Match(M1, M2, ?, map), Merge(M1, M2, map),
    Compose(map1, map2)
  • Implement operations on a DBMS

54
Metadata Management
55
Metadata Management Clio
  • Miller, Haas, Hernandez VLDB 00
  • Tool to support mapping between data
    representations
  • Mapping represented as SQL
  • Heterogeneous query middleware to examine data
    and schemas
  • Build database competencies in query and schema
    management, data mining
  • Exploit user knowledge of target semantics
  • Enhance user knowledge of source schema and data
  • Provide knowledge of query subtleties,
    alternative mappings

56
Metadata Management Clio
  • User indicates what schema and data values are
    needed for target
  • Tool enumerates and ranks mappings
  • Many possible subtle differences
  • Best mappings are simple, but lose least
    information possible
  • Allows immediate user feedback

57
Filtering XML Documents
  • Altinel, Franklin VLDB 00
  • XFilter provides highly efficient matching of
    XML documents to user profiles
  • Event filtering system
  • Highly scalable
  • Use XPath as a profile language

58
XML Data Compression
  • Liefke, Suciu SIGMOD 00
  • Structure, consisting of tags and attributes, is
    compressed separately
  • Group related data items and compress each
    related group separately
  • Apply semantic compression
  • Automatic data mining tools to cluster data need
    to be developed

59
Future Research Issues
  • XML views of traditional databases
  • Relational database
  • Object-relational database
  • XML Storage
  • Object-relational databases
  • Alternative storage methods
  • Indexes for XML data
  • XML query processing and optimization
  • Centralized and distributed processing

60
Future Research Issues
  • Schema mapping
  • Mixing structure search with full-text search
  • XML-based mediators
  • XML data compression

61
Summary
  • XML poses many challenges to the database
    community
  • XML Storage Issues
  • XML Indexes
  • DTD Extraction
  • Query language
  • Query processing
  • Metadata Management

62
Biography of Kyuseok Shim
  • Kyuseok Shim is an Assistant Professor in
    Computer Science Department at KAIST in Korea. He
    is also currently an Advisory Committee Member
    for ACM SIGKDD. Before joining KAIST, he was a
    member of technical staff (MTS) in the Database
    Systems Research Department at Bell Laboratories.
    While he was in Bell Laboratories, he started and
    worked for Serendip data mining project and
    eXcalibur XML storage project. Before joining
    Bell Laboratories, he worked for Rakesh Agrawal's
    Quest data mining project at IBM Almaden Research
    Center. He also worked with Surajit Chaudhuri as
    a summer intern for two summers at Hewlett
    Packard Laboratories. He received the B.S. degree in
    Electrical Engineering from Seoul National
    University, and the M.S. and Ph.D. degrees in
    Computer Science from the University of Maryland,
    College Park.
  • Kyuseok Shim has been working in the area
    of databases focusing on XML, data mining, data
    warehousing, query processing and query
    optimization. He has published more than 30
    research papers in prestigious international
    conferences and journals. He has also served as a
    program committee member on several international
    conferences including ICDE'97, SIGKDD'98,
    SIGMOD'99, SIGKDD'99, ICDE'00, VLDB'00 and
    SIGKDD'01.