Some of my XML/Internet Research Projects

Description:

Title: XML: An Overview. Author: K Yue. Last modified by: K Yue. Created: 9/10/2005 3:51:59 PM. Presentation format: On-screen Show.

Slides: 56
Provided by: KYu59
Learn more at: http://dcm.uhcl.edu

1
Some of my XML/Internet Research Projects
  • CSCI 6530
  • October 5, 2005
  • Kwok-Bun Yue
  • University of Houston-Clear Lake

2
Content
  • Areas of My Research Interest
  • Some Current Projects
  • Storage of XML in Relational Database
  • Example Internet Computing Projects
  • Conclusions

3
Areas of My Research Interest
  • Internet Computing
  • XML
  • Databases
  • Concurrent Programming

4
Content
  • Areas of My Research Interest
  • Some Current Projects
  • Storage of XML in Relational Database
  • Example Internet Computing Projects
  • Conclusions

5
Some Current Projects
  • Storage of XML in relational database
  • Measuring Web bias using authorities and hubs
  • Measuring information quality of Web pages
  • Distributed computer security laboratory
  • Collaborative Open Community for developing
    educational resources
  • Generalized exchanges within organizations

6
Some Recent Student Work
  • McDowell, A., Schmidt, C., and Yue, K., Analysis
    and Metrics of XML Schema, Proceedings of the
    2004 International Conference on Software
    Engineering Research and Practice, pp. 538-544,
    Las Vegas, June 2004.
  • Yang, A., Yue, K., Liaw, K., Collins, G.,
    Venkatraman, J., Achar, S., Sadasivam, K., and
    Chen, P., Distributed Computer Security Lab and
    Projects, Journal of Computing Sciences in
    Colleges, Volume 20, Issue 1, October 2004.
  • Yue, K., Alakappan, S., and Cheung, W., A
    Framework of Inlining Algorithms for Mapping DTDs
    to Relational Schemas, Technical Report
    COMP-05-005, Computer Science Department, Hong
    Kong Baptist University, 2005,
    http://www.comp.hkbu.edu.hk/en/research/?content=tech-reports.

7
Content
  • Areas of My Research Interest
  • Some Current Projects
  • Storage of XML in Relational Database
  • Example Internet Computing Projects
  • Conclusions

8
Storing XML in RDB
  • Advantages
    • Mature database technologies.
    • May be queried by
      • XML technology, e.g. XPath, XQuery.
      • RDB technology, e.g. SQL.
  • Disadvantages
    • Impedance mismatch: XML and relations are
      different data models.

9
Related Issues
  • Effective mapping of XML DTDs (ordered tree
    model) to relational schemas.
  • Mapping of XML queries (e.g. XQuery) to RDB
    queries (e.g. SQL).
  • Mapping of RDB query results back to XML format.

10
Related Work and Context
  • Mapping
    • With or without schemas for XML.
    • With or without user input.
  • Schemas for XML
    • Document Type Definition (DTD)
    • XML Schema
  • We consider mapping with DTD and without user
    input.

11
Naïve Mapping
  • An XML element is mapped to a relation.
  • Example 1a
    • XML: <a><b><c><d>hello</d></c></b></a>
    • -> Relations a, b, c and d.

12
Problems of Naïve Mapping
  • Many relations.
  • Inefficient queries: each query needs multiple
    joins.
  • Example 1b
    • XPath query: //a
    • SQL query: needs to join the relations a, b, c
      and d.
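
The join cost of the naive mapping can be illustrated with a small sketch. It uses Python's built-in sqlite3 module; the table layout (an `id` key plus a `parent_id` foreign key per relation, and a `text` column on `d`) is an assumption made for illustration and is not taken from the talk.

```python
import sqlite3


def build_naive_db():
    """Create one relation per XML element for <a><b><c><d>hello</d></c></b></a>."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (id INTEGER PRIMARY KEY);
        CREATE TABLE b (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES a(id));
        CREATE TABLE c (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES b(id));
        CREATE TABLE d (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES c(id),
                        text TEXT);
        INSERT INTO a VALUES (1);
        INSERT INTO b VALUES (1, 1);
        INSERT INTO c VALUES (1, 1);
        INSERT INTO d VALUES (1, 1, 'hello');
    """)
    return conn


# Rebuilding the content of <a> touches all four relations: three joins.
NAIVE_QUERY = """
    SELECT a.id, d.text
    FROM a
    JOIN b ON b.parent_id = a.id
    JOIN c ON c.parent_id = b.id
    JOIN d ON d.parent_id = c.id
"""

if __name__ == "__main__":
    print(build_naive_db().execute(NAIVE_QUERY).fetchall())
```

Every additional nesting level in the document adds one more relation and one more join to the query.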

13
Inlining Algorithms
  • First proposed by Shanmugasundaram et al.
  • Expanded by Lu, Lee, Chu and others.
  • Extended in various directions by various
    researchers, e.g.,
    • Preserving XML element orders.
    • Preserving XML constraints.
  • We do not consider these extensions here.

14
Basic Idea of Inlining Algorithms
  • Inline child element into the relation for the
    parent element when appropriate.
  • Different inlining algorithms differ in inlining
    criteria.
  • Example 1c
    • XML: <a><b><c><d>hello</d></c></b></a>
    • -> Inlined: relation a.

15
Inlining Algorithms
  • Child elements' attributes may be inlined.
  • Child elements may not have their own relations.
  • Results in fewer relations.
  • In general, more inlining -> fewer joins.
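
As a rough illustration of the effect, the sketch below stores the same document, `<a><b><c><d>hello</d></c></b></a>`, in a single inlined relation using Python's sqlite3 module. The column name `b_c_d` (the path from `a` down to the text of `d`) is a hypothetical naming choice, not one prescribed by any particular inlining algorithm.

```python
import sqlite3


def build_inlined_db():
    """Store the whole a/b/c/d chain in the relation for the root element a."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (id INTEGER PRIMARY KEY, b_c_d TEXT);
        INSERT INTO a VALUES (1, 'hello');
    """)
    return conn


# The same information is now available without any join.
INLINED_QUERY = "SELECT id, b_c_d FROM a"

if __name__ == "__main__":
    print(build_inlined_db().execute(INLINED_QUERY).fetchall())
```

This works here because each child occurs at most once under its parent; a repeated child would still need a relation of its own.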

16
Inlining Algorithm Structure
  1. Simplification of DTD.
  2. Generation of DTD graphs
  3. Generation of Relational Schemas
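
Steps 2 and 3 can be sketched roughly as follows, assuming the DTD has already been simplified to comma-separated names with an optional `*`. The criterion used to pick relations (root elements, elements reached through a `*`, and elements with more than one parent) is an assumption loosely modelled on shared inlining; it ignores recursion and is not the generic DTD graph or the algorithm from the talk. The names `dtd_graph` and `relations_needed` are hypothetical.

```python
import re
from collections import defaultdict

# Matches declarations such as <!ELEMENT person (name,email*)>.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.-]+)\s+\(([^()]*)\)\s*>")


def dtd_graph(dtd):
    """Map each parent element to a list of (child, repeated) edges."""
    graph = {}
    for parent, model in ELEMENT_DECL.findall(dtd):
        edges = []
        for item in model.split(","):
            item = item.strip()
            if item and item != "#PCDATA":
                edges.append((item.rstrip("*"), item.endswith("*")))
        graph[parent] = edges
    return graph


def relations_needed(graph):
    """Pick the elements that get their own relation; the rest are inlined."""
    in_degree = defaultdict(int)
    starred = set()
    for edges in graph.values():
        for child, repeated in edges:
            in_degree[child] += 1
            if repeated:
                starred.add(child)
    return {
        element
        for element in graph
        if in_degree[element] == 0 or in_degree[element] > 1 or element in starred
    }


# Illustrative simplified DTD; the placement of * is an assumption.
SAMPLE_DTD = """
<!ELEMENT addressBook (person*)>
<!ELEMENT person (name,email*)>
<!ELEMENT name (last,first)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
"""

if __name__ == "__main__":
    print(sorted(relations_needed(dtd_graph(SAMPLE_DTD))))
```

Under these assumptions `name`, `first` and `last` would be inlined into the relation for `person`, while `person` and `email` keep their own relations because they can repeat.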

17
Our Preliminary Results
  1. A more complete and optimal DTD simplification
    algorithm.
  2. A generic DTD graph that can be used by inlining
    algorithms.
  3. Inlining Considerations: a framework for
    analyzing inlining algorithms.
  4. A new and aggressive inlining algorithm.

18
Examples of Our Work
  • Use DTD Simplification as an example of the
    flavor of our work.
  • Show the new Inlining Algorithm.

19
Brief Introduction to DTD
  • DTD: a simple language to describe an XML
    vocabulary.
    • Element declarations: contents of elements.
    • Attribute declarations: types and properties
      of attributes.
  • DTD is still very popular.

20
DTD Element Declarations
  • Define element contents
    • #PCDATA: string
    • ANY: anything goes
    • EMPTY: no content (attributes only)
    • Content models: child elements.
    • Mixed contents: child elements and strings.

21
DTD Example
  • Example 2: a complete DTD
    • <!ELEMENT addressBook (person)>
    • <!ELEMENT person (name,email)>
    • <!ELEMENT name (last,first)>
    • <!ELEMENT first (#PCDATA)>
    • <!ELEMENT last (#PCDATA)>
    • <!ELEMENT email (#PCDATA)>
    • <!ATTLIST person id ID #REQUIRED>

22
Operators for Element Declaration
  • , sequence
  • + 1 or more
  • * 0 or more
  • ? optional: 0 or 1
  • | choice
  • () parenthesis
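
These operators behave much like their regular-expression counterparts, which the following sketch exploits. It turns a content model into a Python regular expression over the sequence of child element names. It covers element content only (no #PCDATA, ANY or EMPTY), and the function names are hypothetical.

```python
import re


def content_model_to_regex(model):
    """Translate a DTD content model into a regex over 'name name ... ' strings."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a group that consumes the name plus one space,
    # so that a name can never match a prefix of a longer name.
    with_names = re.sub(
        r"[\w.-]+", lambda m: "(?:" + re.escape(m.group(0)) + " )", compact
    )
    # A DTD sequence is plain concatenation in a regex; ?, *, +, | and ()
    # already mean the same thing.
    return re.compile(with_names.replace(",", ""))


def children_match(model, children):
    """Return True if the list of child element names satisfies the model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


if __name__ == "__main__":
    print(children_match("(name,email*)", ["name", "email", "email"]))
    print(children_match("(last,first)", ["first", "last"]))
```

The second call fails because a sequence fixes the order of the children, which is exactly the ordering information that a relational mapping has to either preserve or give up.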

23
Simplification of DTD
  • Mapping of DTD to Relational Schemas
    • Input: DTDs
    • Output: Relational Schemas
  • DTD can be complicated => simplification.
  • Example 3
    • <!ELEMENT a (b,((b,c)|(d,b,c?)),(e,f)?)>

24
Simplification Principles
  • The relational schema needs to store all possible
    scenarios.
  • Some relations/columns may not be populated in
    some instances.
  • Example 3
    • <!ELEMENT a (b|c)> and
    • <!ELEMENT a (b,c)>
    • may be the same from the RDB's point of view.

25
Simplification Details
  • Comma-separated clauses: the only operators that
    remain are (), , and *.
  • + -> *, e.g. a+ -> a*.
  • Removal of | and ?, e.g. (a|b?) -> (a,b)
  • Removal of (), e.g. (a, (b)) -> (a,b)
  • Removal of repetition, e.g. (a, b, a) -> (a*, b)
  • Note that element orders are not preserved.
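
The effect of these transformations can be approximated with a short sketch. It does not apply rewrite rules one by one; instead it parses the content model and records, for each element name, whether it can occur at most once or more than once. A sequence (`,`) adds occurrences, a choice (`|`) takes the maximum, `?` changes nothing, and `*` or `+` marks everything inside as repeated. This counting approach is an assumption chosen for brevity, the function names are hypothetical, and only element content is handled.

```python
import re

TOKEN = re.compile(r"\s*(#PCDATA|[\w.-]+|[(),|?*+])")


def tokenize(model):
    """Split a content model into names, parentheses and operators."""
    tokens, pos = [], 0
    model = model.strip()
    while pos < len(model):
        match = TOKEN.match(model, pos)
        if match is None:
            raise ValueError(f"unexpected character at {pos}: {model[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def max_occurrences(tokens):
    """Map each element name to 1 (at most once) or 2 (possibly repeated)."""

    def particle(i):
        if tokens[i] == "(":
            counts, i = group(i + 1)
        else:
            counts, i = {tokens[i]: 1}, i + 1
        if i < len(tokens) and tokens[i] in {"?", "*", "+"}:
            if tokens[i] != "?":
                counts = {name: 2 for name in counts}
            i += 1
        return counts, i

    def group(i):
        counts, i = particle(i)
        separator = tokens[i] if tokens[i] in {",", "|"} else None
        while tokens[i] != ")":
            if tokens[i] != separator:
                raise ValueError("mixed or missing separators in a group")
            other, i = particle(i + 1)
            for name, n in other.items():
                if separator == ",":
                    counts[name] = min(2, counts.get(name, 0) + n)
                else:
                    counts[name] = max(counts.get(name, 0), n)
        return counts, i + 1

    counts, end = particle(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after the content model")
    return counts


def simplify(model):
    """Return a comma-separated model that uses * only where repetition is possible."""
    counts = max_occurrences(tokenize(model))
    return "(" + ",".join(n + ("*" if c > 1 else "") for n, c in counts.items()) + ")"
```

For example, `simplify("(a, b, a)")` yields `(a*,b)`, while `simplify("(b|(b,c))")` yields `(b,c)` because `b` can never occur twice in one instance. Element names appear in order of first occurrence, so the original ordering constraints are lost.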

26
Previous Simplification Results
  • Not complete, e.g.
  • Shanmugasundaram not specify how to handle .
  • Lu et al. do not specify how to remove ().
  • Not optimal (may generate * when it is not
    needed).
  • Example 4a: for Lu and Lee, 2 steps
    • (b|(b,c)) -> (b,b,c) -> (b*,c)

27
Our Simplification Algorithm
  • A set of definitions.
  • A set of 7 simplification rules.
  • An algorithm on how and when to use them.
  • Example 4b: for us, 1 step
    • (b|(b,c)) -> (b,c)

28
Simplification Rules

29
Simplification Algorithm

30
Complexity
  • Time complexity: O(N_op)
    • where N_op is the total number of operators
      (including parentheses) in the element
      declarations of the DTD.

31
Advantages
  • Complete: handles all DTDs.
  • Optimal in the sense that * will not be
    generated if not needed.
  • Example 5
    • <!ELEMENT a (b,((b,c)|(d,b,c?)),(e,f)?)>
    • => <!ELEMENT a (b,c,d,e,f)>

32
A New Inlining Algorithm (1)
  • Aggressive in inlining.
  • More complete.
  • Elaborated algorithms.
  • Handles more details, e.g. element types ANY and
    EMPTY, and mixed contents.

33
A New Inlining Algorithm (2)

34
A New Inlining Algorithm (3)

35
A New Inlining Algorithm (4)

36
Main Results
  • Yue, K., Alakappan, S., and Cheung, W., A
    Framework of Inlining Algorithms for Mapping DTDs
    to Relational Schemas, Technical Report
    COMP-05-005, Computer Science Department, Hong
    Kong Baptist University, 2005,
    http://www.comp.hkbu.edu.hk/en/research/?content=tech-reports.

37
Future Work
  • Implemented the algorithms and tested them with
    many DTDs.
  • Need to implement the XQuery/SQL bridge for a
    performance study.

38
Content
  • Areas of My Research Interest
  • Some Current Projects
  • Storage of XML in Relational Database
  • Example Internet Computing Projects
  • Conclusions

39
Measuring Web Bias
  • Search engines dominate how information is
    accessed.
  • Search results have major social, political and
    commercial consequences.
  • Are search engines biased?
  • How biased are they?

40
Previous Work
  • To measure bias, results should be compared to a
    norm.
  • The norm may be from human experts.
  • Mowshowitz and Kawaguchi: use the average search
    result of a collection of popular search engines
    as the norm.
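
A norm-based bias measure of this kind can be sketched as follows. The sketch pools the URLs returned by all engines into a norm vector and scores one engine as 1 minus its similarity to that norm. Using cosine similarity over URL counts is an assumption made here for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def url_vector(result_lists):
    """Count how often each URL appears across several result lists."""
    return Counter(url for results in result_lists for url in results)


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    length = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / length if length else 0.0


def bias(engine_results, all_engine_results):
    """Bias of one engine relative to the pooled results of all engines."""
    norm_vector = url_vector(all_engine_results)
    return 1.0 - cosine_similarity(Counter(engine_results), norm_vector)


if __name__ == "__main__":
    engines = [["u1", "u2", "u3"], ["u1", "u2", "u4"], ["u1", "u9", "u8"]]
    for results in engines:
        print(results, round(bias(results, engines), 3))
```

An engine whose results match the pooled results scores close to 0, and an engine that shares no URL with the others scores 1.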

41
Mowshowitz and Kawaguchi

42
Limitations
  • Based on URL vectors -> cannot measure bias
    quality.

43
Our Approach
  • Use Kleinberg's HITS algorithm to create
    clusters, authorities and hubs of the result norm
    URLs.
  • Use them as norm clusters, authorities and hubs.
  • Measure distances between norms and individual
    results as bias.

44
HITS
  • Obtain a directed graph G where
    • Node: page
    • Edge: URL link between pages.
  • Two indices: $x_{p,i}$ (authority) and
    $y_{p,i}$ (hub)
  • Iterate until steady state
    • $x_{p,i+1} \leftarrow \sum_{q:\, q \to p} y_{q,i}$
    • $y_{p,i+1} \leftarrow \sum_{q:\, p \to q} x_{q,i}$
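
The iteration can be sketched in a few lines of Python. Both scores are updated from the previous iteration's values, as in the update rules above. Rescaling each score vector to unit length after every step is an assumption added here so that the values settle instead of growing without bound, and a fixed iteration count stands in for a real steady-state test.

```python
import math


def normalize(scores):
    """Scale scores so that their squares sum to 1 (no-op for an all-zero vector)."""
    length = math.sqrt(sum(value * value for value in scores.values())) or 1.0
    return {node: value / length for node, value in scores.items()}


def hits(edges, iterations=50):
    """Return (authority, hub) scores for a list of (source, target) links."""
    nodes = sorted({node for edge in edges for node in edge})
    authority = {node: 1.0 for node in nodes}
    hub = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of pages q that link to p.
        new_authority = {
            p: sum(hub[q] for q, target in edges if target == p) for p in nodes
        }
        # Hub score of p: sum of authority scores of pages q that p links to.
        new_hub = {
            p: sum(authority[q] for source, q in edges if source == p) for p in nodes
        }
        authority, hub = normalize(new_authority), normalize(new_hub)
    return authority, hub


if __name__ == "__main__":
    links = [("h1", "a1"), ("h2", "a1"), ("h1", "a2")]
    authority, hub = hits(links)
    print(max(authority, key=authority.get), max(hub, key=hub.get))
```

In the small example, `a1` receives the most links and becomes the top authority, while `h1` links to both authorities and becomes the top hub.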

45
Our Approach

46
Current Progress
  • Implemented previous results.
  • Implemented vector analysis.
  • Implemented HITS algorithm, but it is not
    accurate enough.
    • Conglomerate effect.

47
Measuring Pages' Information Quality
  • People find information from Web pages.
  • How good is the content of a given page?

48
Previous Work
  • Measuring different kinds of quality
    • Web site design quality
    • Navigational quality
  • Many frameworks on how to measure information
    quality
  • Most result in surveys so that users can rank
    information quality.
  • Very few automated or semi-automated tools.

49
Our Objectives
  • Build automated and/or semi-automated tools that
    measure, or help users measure, the information
    quality of a Web page.

50
Approach
  • Hypothesis, measure, usage guidelines.
  • Example
    • Hypothesis: a Web page with many spelling
      mistakes is likely to have low information
      quality.
    • Measures
      • Show frequencies of word occurrences.
      • Show percentage of spelling mistakes.
    • Usage guideline
      • Spelling mistakes may not be actual mistakes
        (e.g. UHCL).
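
The two measures in this example can be sketched as below. The word list passed in as `dictionary` and the `whitelist` of known non-dictionary terms (such as UHCL) are assumptions supplied by the caller; a real tool would need a full dictionary and a better tokenizer.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z']+")


def word_frequencies(text):
    """Count case-insensitive word occurrences in the page text."""
    return Counter(WORD.findall(text.lower()))


def misspelling_rate(text, dictionary, whitelist=frozenset()):
    """Percentage of words found neither in the dictionary nor in the whitelist."""
    words = WORD.findall(text)
    if not words:
        return 0.0
    unknown = [
        word
        for word in words
        if word.lower() not in dictionary and word not in whitelist
    ]
    return 100.0 * len(unknown) / len(words)


if __name__ == "__main__":
    dictionary = {"the", "lab", "is", "at"}
    page_text = "The lab is at UHCL"
    print(misspelling_rate(page_text, dictionary))
    print(misspelling_rate(page_text, dictionary, whitelist={"UHCL"}))
```

The whitelist reflects the usage guideline: an acronym counts as a mistake until the user tells the tool otherwise, so the raw percentage should be read as an upper bound.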

51
Metrics
  • Many potential metrics. Some examples:
    • Broken links
    • HTML quality
    • Domain names
    • Page ranking and popularity
    • Appearance in directory structure
    • History (e.g. Wayback Machine)
    • Currency (e.g. last modified)
    • Author (e.g. meta tag)
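
As a starting point for metrics like these, the sketch below uses Python's standard html.parser module to pull the outgoing links and the author meta tag out of a page. It only collects the raw facts. Checking whether each link is actually broken would require network requests and is left out, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PageFacts(HTMLParser):
    """Collect link targets and the author meta tag while parsing a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.author = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "meta" and (attributes.get("name") or "").lower() == "author":
            self.author = attributes.get("content")


def page_facts(html):
    """Parse an HTML string and return the populated PageFacts object."""
    parser = PageFacts()
    parser.feed(html)
    parser.close()
    return parser


if __name__ == "__main__":
    sample = (
        '<html><head><meta name="Author" content="K Yue"></head>'
        '<body><a href="http://dcm.uhcl.edu">home</a><a>no target</a></body></html>'
    )
    facts = page_facts(sample)
    print(facts.author, facts.links)
```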

52
Current Progress
  • Pre-alpha prototype:
    http://dcm.cl.uh.edu/yue/util/pageInfo.pl
  • A capstone project

53
Content
  • Areas of My Research Interest
  • Some Current Projects
  • Storage of XML in Relational Database
  • Example Internet Computing Projects
  • Conclusions

54
Conclusions
  • Good time to do applied computing research in the
    Web and XML areas.
  • Style: hands-on supervision, publications.
  • Don't forget to donate a scholarship to the
    School if your future research leads to a
    windfall.

55
Questions?
  • Any Questions?
  • Thanks!