SCHEMALESS APPROACH OF MAPPING XML DOCUMENTS INTO RELATIONAL DATABASE - PowerPoint PPT Presentation

1
SCHEMALESS APPROACH OF MAPPING XML
DOCUMENTS INTO RELATIONAL DATABASE
  • Ibrahim Dweib, Ayman Awadi,
  • Seif Elduola Fath Elrhman, Joan Lu

CIT 2008 Sydney, Australia 8-11 July 2008
2
Why schema-less
  • Many applications deal with highly flexible XML
    documents from different sources, which make it
    difficult to define their structure by a fixed
    schema or a DTD. Therefore, it is necessary for
    schema-less approaches to deal with such XML
    documents.

3
The method aims to overcome the challenges posed
by fixed shredding
  • No loss of information while shredding.
  • Easier and much faster reconstruction of the
    original XML documents.
  • Maintaining the XML document structure.
  • Preserving the ordering of XML data.

4
Theory guidance
  • The main mathematical concepts used in this
    method are
  • Definition 1
  • An XML tree T is composed of many sub-trees at
    different levels. It can be defined as follows
  • T = ⋃_{i=1}^{n} (E_i, A_i, X_i, r_{i-1})
  • i = 1, 2, ..., n represent the levels of the XML
    tree, and 0 represents the root
  • Where E_i is a finite set of elements in
    level i.
  • A_i is a finite set of attributes in level i.
  • X_i is a finite set of texts in level i.
  • r_{i-1} is the root of the sub-tree of level i.

5
Theory guidance (Cont)
  • Definition 2
  • A dynamic fragment (shred) df(i) is defined to be
    the attributes and texts (leaf children) of the
    sub-tree i of the XML tree plus its root r_{i-1},
    as follows
  • df(i) = (A_i, X_i, r_{i-1}),
  • Where A_i is a finite set of attributes in
    level i.
  • X_i is a finite set of texts in level i.
  • r_{i-1} is the root of the sub-tree of level i.
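
As a rough illustration of Definition 2, the sketch below collects a fragment with Python's xml.etree.ElementTree. It assumes that the "texts" of a sub-tree are the text nodes sitting directly under its root; the function name dynamic_fragment and the sample document are hypothetical and do not come from the slides.

```python
import xml.etree.ElementTree as ET


def dynamic_fragment(root):
    """Return (A, X, r): attributes, direct text nodes, and root tag of a sub-tree.

    Assumption: only text directly under `root` counts as its leaf text.
    """
    attributes = dict(root.attrib)
    texts = []
    # Text that appears before the first child element.
    if root.text and root.text.strip():
        texts.append(root.text.strip())
    # Text that appears after each child element, still inside `root`.
    for child in root:
        if child.tail and child.tail.strip():
            texts.append(child.tail.strip())
    return attributes, texts, root.tag


# Hypothetical sample document, for illustration only.
doc = ET.fromstring('<book id="b1" lang="en">Intro<title>XML</title>tail</book>')
fragment = dynamic_fragment(doc)
```

Element children such as title are not part of this fragment; under the definition they would be the roots of fragments at the next level.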

6
Design framework
  • A master table for documents, called the
    "documents" table, keeps information about the
    documents themselves,
  • documents(doc_id, doc_structure, ...),
  • Additional fields may be added to keep other
    information about the document itself, such as
    dates, statistics, types, etc.
  • The doc_id is a unique id generated per document
    to identify it.
  • The doc_structure is a large text field
    containing a coded string that describes the
    document structure. Any change to the document
    structure should be reflected in this field,
    such as adding a new tag or property, deleting
    an existing tag or property, or relocating a
    tag or property to a different location in the
    same document.

7
Design framework (Cont)
  • A second table stores the actual contents of
    all documents. Documents are shredded into
    pieces of data called tokens: each document
    element, tag, or property is considered a
    token. The tokens table has at minimum this
    structure,
  • tokens(doc_id, token_id, token_name,
    token_value).
  • The token_id is the generated primary key for
    each token.
  • The doc_id is the foreign key linking the tokens
    table to the documents table.
  • token_name is the tag name or the property name
    as found in the original XML document.
  • token_value is the text value of the XML tag or
    property.
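
The two tables can be sketched as follows. This is only an illustration: it uses Python's sqlite3 module with an in-memory database as a stand-in for the relational target, and the column types, the constraints, and the sample row are assumptions that the slides do not specify.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the relational target.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        doc_id        INTEGER PRIMARY KEY,          -- generated per document
        doc_structure TEXT NOT NULL DEFAULT ''      -- coded structure string
    );
    CREATE TABLE tokens (
        token_id    INTEGER PRIMARY KEY,            -- generated per token
        doc_id      INTEGER NOT NULL REFERENCES documents(doc_id),
        token_name  TEXT NOT NULL,                  -- tag or property name
        token_value TEXT                            -- text value, may be empty
    );
    """
)

# Register a document, store one hypothetical token, then record its key.
doc_id = conn.execute(
    "INSERT INTO documents (doc_structure) VALUES ('')"
).lastrowid
token_id = conn.execute(
    "INSERT INTO tokens (doc_id, token_name, token_value) VALUES (?, ?, ?)",
    (doc_id, "title", "XML shredding"),
).lastrowid
conn.execute(
    "UPDATE documents SET doc_structure = ? WHERE doc_id = ?",
    (f"T{token_id}", doc_id),
)
row = conn.execute(
    "SELECT doc_structure FROM documents WHERE doc_id = ?", (doc_id,)
).fetchone()
```

Keeping the structure in one text column means a structural change touches a single row in documents, while the token rows stay as they are.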

8
Design framework, (Cont) doc_structure field
construction rules
  • The doc_structure field is where the document
    structure is maintained.
  • It consists of a long series of related keys.
  • Each key starts with a letter,
  • The letter 'T' for an element (child), and the
    letter 'A' for an attribute,
  • These letters are necessary to delimit keys in
    the sequence.
  • The letter is followed by a number giving the
    token_id that this key refers to,
  • Example: T120 is a key referring to the token in
    the tokens table whose token_id is 120.

9
Design framework, doc_structure field
construction rules (Cont)
  • If the token has properties, then
  • the key representing this token in the
    doc_structure is followed by a set of keys
    defining these properties.
  • Example: T120A12A17A2 is a valid key string for
    token number 120, which has three properties
    defined by tokens number 12, 17, and 2.
  • These properties appear in the original document
    in this order.
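
A minimal sketch of how such a key string could be split back into its keys, using the example above. The helper name split_keys and the regular expression are illustrative assumptions; the slides do not show how keys are parsed.

```python
import re

# One key is a letter ('T' for element, 'A' for attribute) followed by digits.
KEY_PATTERN = re.compile(r"([TA])(\d+)")


def split_keys(key_string):
    """Split a bracket-free key string into (kind, token_id) pairs, in order."""
    if not re.fullmatch(r"(?:[TA]\d+)+", key_string):
        raise ValueError(f"not a valid key string: {key_string!r}")
    return [(kind, int(number)) for kind, number in KEY_PATTERN.findall(key_string)]


keys = split_keys("T120A12A17A2")
```

Because every key starts with a letter, the keys can be recovered without any separator character, which is the delimiting role the rules give to 'T' and 'A'.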

10
Design framework, doc_structure field
construction rules (Cont)
  • If the token has child tags, then
  • these children are represented as a key string
    surrounded by angle brackets.
  • Example: T120<T12T7<T2T1>T77> is a valid string
    that can be read as follows: token 120 has three
    sub-tags in this order, token 12, followed by
    token 7, then token 77, and token 7 itself has
    two sub-tags, 2 and 1, in the given order.
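
Taken together, the construction rules can be sketched as a small shredding routine. This is an illustration under stated assumptions, not the authors' implementation (which the slides say was written in Visual Basic 6): token ids are assigned sequentially in document order, element text is stored stripped of surrounding whitespace, and the function name shred and the sample document are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET


def shred(root):
    """Return (tokens, doc_structure) for an ElementTree element.

    tokens maps token_id -> (token_name, token_value).
    """
    tokens = {}
    next_id = itertools.count(1)  # assumed id scheme: 1, 2, 3, ...

    def encode(element):
        token_id = next(next_id)
        tokens[token_id] = (element.tag, (element.text or "").strip())
        key = f"T{token_id}"
        # Property keys follow the element key, in document order.
        for name, value in element.attrib.items():
            attr_id = next(next_id)
            tokens[attr_id] = (name, value)
            key += f"A{attr_id}"
        # Child keys are wrapped in angle brackets.
        children = "".join(encode(child) for child in element)
        if children:
            key += f"<{children}>"
        return key

    return tokens, encode(root)


# Hypothetical sample document, for illustration only.
tokens, structure = shred(ET.fromstring(
    '<book id="b1"><title>XML</title><author country="UK">Lu</author></book>'
))
```

For this sample the structure string comes out as T1A2<T3T4A5>: token 1 (book) has property 2 and two child tags, 3 and 4, the second of which carries property 5.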

11
Theory implementation on simple case study
12
Theory implementation on simple case study
Figure 2: A tree representation of the XML document
in Figure 1
13
Theory implementation on simple case study
14
Theory implementation on simple case study
15
EXPERIMENTAL ENVIRONMENT
  • An Intel Core 2 Duo computer with a 2 GHz CPU,
    1 GB RAM, 256 MB shared cache
  • OS: Windows Vista Home edition.
  • Visual Basic 6 is used as the development
    environment, with Microsoft Access 2003 as the
    target relational database.
  • Five XML documents of different sizes are used
    in the experiment.
  • The data is taken from the XML data repository
    available at the web site of the School of
    Computer Science and Engineering, University of
    Washington.
  • The performance metrics are the time spent
    mapping XML documents to the relational database
    and the time spent reconstructing these
    documents from the relational database.
  • The experiment is repeated five times and the
    mean of those times is reported, to obtain
    realistic and accurate results.
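
The measurement procedure (five repetitions, mean reported) can be sketched as below. The original experiment was run in Visual Basic 6; this Python version, the helper name mean_runtime, and the placeholder workload are assumptions used only to show the procedure.

```python
import statistics
import time


def mean_runtime(operation, repeats=5):
    """Run `operation` `repeats` times and return the mean wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


# Placeholder workload; a real run would call the mapping or the
# reconstruction routine for one XML document here.
elapsed = mean_runtime(lambda: sum(range(100_000)))
```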

16
EXPERIMENTAL RESULTS
Table 1: The time spent mapping XML documents to the
RDBMS, and the time spent reconstructing them
17
EXPERIMENTAL RESULTS
18
Conclusion (1)
  • By using this method
  • Document structure is maintained easily and at
    low cost,
  • Rebuilding the original document is
    straightforward,
  • First-level semantic search is also achievable,
    either on a single document or on all
    documents.
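
To illustrate why rebuilding is straightforward, the sketch below walks a doc_structure string once and recreates the elements from the stored tokens. It is a hedged example, not the authors' code: the function name rebuild, the in-memory tokens dictionary (standing in for rows of the tokens table), and the sample data are assumptions, and the string is assumed to be well-formed.

```python
import re
import xml.etree.ElementTree as ET

# A piece is either a key (T<n> or A<n>) or a bracket.
PIECE = re.compile(r"[TA]\d+|[<>]")


def rebuild(tokens, structure):
    """Rebuild an Element from tokens {token_id: (name, value)} and a structure string."""
    stack = []      # elements whose child list is currently open
    current = None  # most recently created element
    root = None
    for piece in PIECE.findall(structure):
        if piece == "<":
            stack.append(current)        # following keys are children of `current`
        elif piece == ">":
            current = stack.pop()        # child list of this element is closed
        else:
            name, value = tokens[int(piece[1:])]
            if piece[0] == "A":
                current.set(name, value)  # property of the latest element
            elif stack:
                current = ET.SubElement(stack[-1], name)
                current.text = value or None
            else:
                current = root = ET.Element(name)
                current.text = value or None
    return root


# Hypothetical tokens and structure for one small document.
sample_tokens = {
    1: ("book", ""), 2: ("id", "b1"), 3: ("title", "XML"),
    4: ("author", "Lu"), 5: ("country", "UK"),
}
xml_text = ET.tostring(rebuild(sample_tokens, "T1A2<T3T4A5>"), encoding="unicode")
```

A single left-to-right pass is enough because the key order in the string already encodes the document order of tags and properties.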

19
Conclusion (2)
  • Method limitations
  • Complex semantic search is not easily achievable
    with this structure.
  • Document size is limited by memory size, since
    we use DOM-based parsing.

20
Future Work
  • Improving this method to support complex
    semantic search and to differentiate between XML
    data types (i.e., strings, dates, integers), in
    order to apply less-than or greater-than queries.
  • Testing intensively and comparing our method
    with other methods in the literature to evaluate
    its performance.
  • Using SAX parsing for XML documents to remove
    the document size limitation.
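
A possible shape for the SAX-based variant is sketched below with Python's xml.sax module. It is an assumption about how the future work could look, not something the slides describe: the class name TokenHandler and the sample input are hypothetical, and the tokens are kept in a list only to keep the example short, where a real version would write each row to the database as it is produced.

```python
import xml.sax


class TokenHandler(xml.sax.ContentHandler):
    """Collect [token_name, token_value] pairs as the parser streams events."""

    def __init__(self):
        super().__init__()
        self.tokens = []  # tokens in document order
        self._open = []   # indexes into self.tokens of the open elements

    def startElement(self, name, attrs):
        self._open.append(len(self.tokens))
        self.tokens.append([name, ""])
        for attr_name in attrs.getNames():
            self.tokens.append([attr_name, attrs.getValue(attr_name)])

    def characters(self, content):
        # Text may arrive in several chunks; append to the innermost element.
        if self._open:
            self.tokens[self._open[-1]][1] += content

    def endElement(self, name):
        index = self._open.pop()
        self.tokens[index][1] = self.tokens[index][1].strip()


handler = TokenHandler()
xml.sax.parseString(b'<book id="b1"><title>XML</title></book>', handler)
```

Unlike a DOM parser, the handler never holds the whole tree, so memory use depends on the nesting depth rather than on the document size once rows are written out as they arrive.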

21
Thank You for Your Time