Some of my XML/Internet Research Projects

Title: XML: An Overview Author: K Yue Last modified by: K Yue Created Date: 9/10/2005 3:51:59 PM Document presentation format: On-screen Show Company – PowerPoint PPT presentation

Slides: 56

Provided by: KYu59

Learn more at: http://dcm.uhcl.edu

Transcript and Presenter's Notes

Title: Some of my XML/Internet Research Projects

1
Some of my XML/Internet Research Projects

CSCI 6530
October 5, 2005
Kwok-Bun Yue
University of Houston-Clear Lake

2
Content

Areas of My Research Interest
Some Current Projects
Storage of XML in Relational Database
Example Internet Computing Projects
Conclusions

3
Areas of My Research Interest

Internet Computing
XML
Databases
Concurrent Programming

4
Content

Areas of My Research Interest
Some Current Projects
Storage of XML in Relational Database
Example Internet Computing Projects
Conclusions

5
Some Current Projects

Storage of XML in relational database
Measuring Web bias using authorities and hubs
Measuring information quality of Web pages
Distributed computer security laboratory
Collaborative Open Community for developing
educational resources
Generalized exchanges within organizations

6
Some Recent Student Work

McDowell, A., Schmidt, C. and Yue, K., Analysis and
Metrics of XML Schema, Proceedings of the 2004
International Conference on Software Engineering
Research and Practice, pp 538-544, Las Vegas,
June 2004.
Yang A., Yue K., Liaw K., Collins G., Venkatraman
J., Achar S., Sadasivam K., and Chen P.,
Distributed Computer Security Lab and Projects,
Journal of Computing Sciences in Colleges. Volume
20, Issue 1. October 2004.
Yue, K., Alakappan, S. and Cheung, W., A
Framework of Inlining Algorithms for Mapping DTDs
to Relational Schemas, Technical Report
COMP-05-005, Computer Science Department, the
Hong Kong Baptist University, 2005,
http://www.comp.hkbu.edu.hk/en/research/?content=tech-reports.

7
Content

Areas of My Research Interest
Some Current Projects
Storage of XML in Relational Database
Example Internet Computing Projects
Conclusions

8
Storing XML in RDB

Advantages
Mature database technologies.
May be queried by
XML technology e.g. XPath, XQuery.
RDB technology e.g. SQL.
Disadvantages
Impedance mismatch: XML and relations are
different data models.

9
Related Issues

Effective mapping of XML DTDs (ordered tree model)
to relational schemas.
Mapping of XML queries (e.g. XQuery) to RDB
queries (e.g. SQL).
Mapping of RDB query results back to XML format.

10
Related Work and Context

Mapping
With or without schemas for XML.
With or without user input.
Schemas for XML
Document Type Definition (DTD)
XML Schema
We consider mapping with DTD and without user
input.

11
Naïve Mapping

An XML element is mapped to a relation.
Example 1a
XML: <a><b><c><d>hello</d></c></b></a>
-> Relations a, b, c and d.
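
As a rough illustration of Example 1a, the Python sketch below lists one relation name per distinct element name in a document. The function name naive_relations is made up for this example, and a real mapping would also need key and parent columns.

```python
import xml.etree.ElementTree as ET

def naive_relations(xml_text):
    """Return one relation name per distinct element name, in document order."""
    root = ET.fromstring(xml_text)
    relations = []
    for element in root.iter():  # depth-first, starting with the root
        if element.tag not in relations:
            relations.append(element.tag)
    return relations

relations = naive_relations("<a><b><c><d>hello</d></c></b></a>")
```

For the document of Example 1a this yields four relation names, one for each of a, b, c and d.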

12
Problems of Naïve Mapping

Many relations.
Inefficient queries: multiple joins per query.
Example 1b
XPath query: //a
SQL query: need to join the relations a, b, c and
d.
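
The join cost can be made concrete with a small sketch using Python's built-in sqlite3 module. The table layout (id, parent_id and value columns) is an assumption chosen for illustration, not a schema taken from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Naive mapping: one relation per element, linked by parent_id (assumed layout).
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY);
CREATE TABLE b (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES a(id));
CREATE TABLE c (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES b(id));
CREATE TABLE d (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES c(id),
                value TEXT);
INSERT INTO a VALUES (1);
INSERT INTO b VALUES (1, 1);
INSERT INTO c VALUES (1, 1);
INSERT INTO d VALUES (1, 1, 'hello');
""")

# Rebuilding the content of <a> needs three joins across four relations.
rows = conn.execute("""
SELECT a.id, d.value
FROM a
JOIN b ON b.parent_id = a.id
JOIN c ON c.parent_id = b.id
JOIN d ON d.parent_id = c.id
""").fetchall()
```

Every additional nesting level in the XML adds another join to this query.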

13
Inlining Algorithms

First proposed by Shanmugasundaram et al.
Expanded by Lu, Lee, Chu and others.
Extended in various directions by various
researchers, e.g.,
Preserving XML element orders.
Preserving XML constraints.
These extensions are not considered here.

14
Basic Idea of Inlining Algorithms

Inline child element into the relation for the
parent element when appropriate.
Different inlining algorithms differ in inlining
criteria.
Example 1c
XML: <a><b><c><d>hello</d></c></b></a>
-> Inlined: relation a.
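
A minimal sketch of the inlined form of Example 1c, again using sqlite3. The column name b_c_d is a hypothetical naming convention for the inlined path a/b/c/d.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Inlined mapping: the nested elements b, c and d collapse into one column of a.
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, b_c_d TEXT);
INSERT INTO a VALUES (1, 'hello');
""")

# The same content is now available from a single relation, with no join.
rows = conn.execute("SELECT id, b_c_d FROM a").fetchall()
```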

15
Inlining Algorithms

Child elements' attributes may be inlined.
Child elements may not have their own relations.
Results in fewer relations.
In general, more inlining -> fewer joins.
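
The idea can be sketched in Python under one simple, assumed inlining criterion: a child that occurs at most once is inlined into its parent, while a child marked with * or + gets its own relation. This is not the algorithm presented in the talk. The graph format, the function name inline_schema and the id, parent_id and value columns are all illustrative, and the sketch assumes a non-recursive DTD and ignores attributes.

```python
def inline_schema(graph, root):
    """Map a DTD graph {element: [(child, occurrence), ...]} to relations."""
    relations = {}

    def inlined_columns(element, prefix):
        cols = []
        for child, occurrence in graph[element]:
            if occurrence in ("*", "+"):
                build(child, has_parent=True)      # set-valued: own relation
            elif graph[child]:
                cols += inlined_columns(child, f"{prefix}{child}_")
            else:
                cols.append(f"{prefix}{child}")    # leaf: inline as a column
        return cols

    def build(element, has_parent):
        if element in relations:
            return
        relations[element] = []                    # reserve the name first
        cols = ["id"] + (["parent_id"] if has_parent else [])
        cols += inlined_columns(element, "") if graph[element] else ["value"]
        relations[element] = cols

    build(root, has_parent=False)
    return relations

# Hypothetical address book graph; the occurrence markers are assumptions.
graph = {
    "addressBook": [("person", "+")],
    "person": [("name", ""), ("email", "*")],
    "name": [("last", ""), ("first", "")],
    "first": [], "last": [], "email": [],
}
schema = inline_schema(graph, "addressBook")
```

Six element types produce only three relations here, because name, last and first are folded into person.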

16
Inlining Algorithm Structure

Simplification of DTD.
Generation of DTD graphs
Generation of Relational Schemas

17
Our Preliminary Results

A more complete and optimal DTD Simplification
Algorithm
A generic DTD Graph that can be used by inlining
algorithms.
Inlining Considerations: a framework for analyzing
inlining algorithms
A new and aggressive inlining algorithm

18
Examples of Our Work

Use DTD Simplification as an example of the
flavor of our work.
Show the new Inlining Algorithm.

19
Brief Introduction to DTD

DTD: a simple language to describe XML
vocabulary
Element declarations: contents of elements.
Attribute declarations: types and properties of
attributes.
DTD is still very popular.

20
DTD Element Declarations

Define element contents:
#PCDATA: string
ANY: anything goes
EMPTY: no content (attributes only)
Content models: child elements.
Mixed contents: child elements and strings.

21
DTD Example

Example 2: a complete DTD
<!ELEMENT addressBook (person)>
<!ELEMENT person (name,email)>
<!ELEMENT name (last,first)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ATTLIST person id ID #REQUIRED>
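
Element declarations like these can be turned into a simple DTD graph. The Python sketch below is one possible way to do it, not the generic DTD graph from the talk. The DTD text is modeled on Example 2, but the + and * occurrence operators in it are assumptions added for illustration. Operators attached to a parenthesized group are not attributed to the children.

```python
import re

# Illustrative DTD based on Example 2; "+" and "*" are assumed here.
DTD = """
<!ELEMENT addressBook (person+)>
<!ELEMENT person (name,email*)>
<!ELEMENT name (last,first)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
"""

ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.\-]+)\s+(.*?)>", re.S)
CHILD = re.compile(r"([A-Za-z_][\w.\-]*)([?*+]?)")

def dtd_graph(dtd_text):
    """Return {element: [(child, occurrence_operator), ...]}."""
    graph = {}
    for name, model in ELEMENT_DECL.findall(dtd_text):
        model = model.replace("#PCDATA", "").strip()
        if model in ("EMPTY", "ANY"):
            graph[name] = []          # no explicit child edges
        else:
            graph[name] = CHILD.findall(model)
    return graph

graph = dtd_graph(DTD)
```

Each node is an element type, and each edge carries the occurrence operator found directly after the child name.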

22
Operators for Element Declaration

, sequence
+ 1 or more
* 0 or more
? optional (0 or 1)
| choice
() parentheses

23
Simplification of DTD

Mapping of DTD to Relational Schemas
Input: DTDs
Output: relational schemas
DTDs can be complicated => simplification.
Example 3
<!ELEMENT a (b,((b,c)|(d,b,c?)),(e,f)?)>

24
Simplification Principles

The relational schema needs to store all possible
scenarios.
Some relations/columns may not be populated in
some instances.
Example 3
<!ELEMENT a (b|c)> and
<!ELEMENT a (b,c)>
May be the same from the RDB's point of view.

25
Simplification Details

Comma-separated clauses: the only operators that remain are
(), ',' and *.
+ -> *, e.g. a+ -> a*.
Removal of | and ?, e.g. (a|b?) -> (a,b)
Removal of (), e.g. (a, (b)) -> (a,b)
Removal of repetition, e.g. (a, b, a) -> (a, b)
Note that element orders are not preserved.

26
Previous Simplification Results

Not complete, e.g.:
Shanmugasundaram: does not specify how to handle .
Lu: does not specify how to remove ().
Not optimal (may generate * when it is not
needed).
Example 4a: for Lu and Lee, 2 steps
(b|(b,c)) -> (b,b,c) -> (b,c)

27
Our Simplification Algorithm

A set of definitions.
A set of 7 simplification rules.
An algorithm on how and when to use them.
Example 4b: for us, 1 step
(b|(b,c)) -> (b,c)
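
One way to see why (b|(b,c)) can simplify directly to (b,c) is to compute, for each element name, the largest number of times it can occur. The Python sketch below does this with a small recursive-descent parser. It is not the seven-rule algorithm of the talk: the approach, the function names and the convention of marking a name with * when it can occur more than once are all assumptions made for illustration. It handles element-only content models.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_][\w.\-]*|[(),|?*+])")

def tokenize(model):
    tokens, pos = [], 0
    model = model.strip()
    while pos < len(model):
        match = TOKEN.match(model, pos)
        if match is None:
            raise ValueError(f"unexpected character {model[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def combine(op, left, right):
    """Sequence adds occurrence counts; choice takes the larger one."""
    merged = {}
    for name in list(left) + list(right):
        l, r = left.get(name, 0), right.get(name, 0)
        merged[name] = min(2, l + r) if op == "," else max(l, r)
    return merged

def max_occurrences(model):
    """Return {name: 1 or 2}, where 2 stands for 'more than once'."""
    tokens, pos = tokenize(model), 0

    def term():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] in (")", ",", "|", "?", "*", "+"):
            raise ValueError("element name or '(' expected")
        token = tokens[pos]
        pos += 1
        if token == "(":
            counts = expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
        else:
            counts = {token: 1}
        if pos < len(tokens) and tokens[pos] in ("?", "*", "+"):
            if tokens[pos] != "?":           # * and + allow repetition
                counts = {name: 2 for name in counts}
            pos += 1
        return counts

    def expr():
        nonlocal pos
        counts = term()
        while pos < len(tokens) and tokens[pos] in (",", "|"):
            op = tokens[pos]
            pos += 1
            counts = combine(op, counts, term())
        return counts

    counts = expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return counts

def simplify(model):
    counts = max_occurrences(model)
    return "(" + ",".join(n + ("*" if c > 1 else "") for n, c in counts.items()) + ")"
```

Because the two branches of a choice can never both occur, b is counted once for (b|(b,c)), so no * is introduced. In a sequence such as (a,b,a) the counts add up and a is marked with *.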

28
Simplification Rules
29
Simplification Algorithm
30
Complexity

Time complexity: O(Nop),
where Nop is the total number of operators
(including parentheses) in the element
declarations of the DTD.

31
Advantages

Complete: handles all DTDs.
Optimal in the sense that * will not be
generated if not needed.
Example 5
<!ELEMENT a (b,((b,c)|(d,b,c?)),(e,f)?)>
=> <!ELEMENT a (b,c,d,e,f)>

32
A New Inlining Algorithm (1)

Aggressive in inlining.
More complete.
Elaborated algorithms.
Handles more details, e.g. element types ANY and
EMPTY, and mixed contents.

33
A New Inlining Algorithm (2)
34
A New Inlining Algorithm (3)
35
A New Inlining Algorithm (4)
36
Main Results

Yue, K., Alakappan, S. and Cheung, W., A
Framework of Inlining Algorithms for Mapping DTDs
to Relational Schemas, Technical Report
COMP-05-005, Computer Science Department, the
Hong Kong Baptist University, 2005,
http://www.comp.hkbu.edu.hk/en/research/?content=tech-reports.

37
Future Work

Implemented the algorithms and tested with many
DTDs.
Need to implement the XQuery/SQL bridge for
performance study.

38
Content

Areas of My Research Interest
Some Current Projects
Storage of XML in Relational Database
Example Internet Computing Projects
Conclusions

39
Measuring Web Bias

Search engines dominate how information is
accessed.
Search results have major social, political and
commercial consequences.
Are search engines biased?
How biased are they?

40
Previous Work

To measure bias, results should be compared to a
norm.
The norm may be from human experts.
Mowshowitz and Kawaguchi: use the average search
result of a collection of popular search engines
as the norm.
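
A norm-based bias measure of this kind can be sketched in Python. The sketch assumes that each engine's results are reduced to a URL frequency vector, that the norm is the sum of all engines' vectors, and that bias is one minus the cosine similarity to the norm. These choices and all names below are illustrative assumptions, not details given in the talk.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def bias(engine, results_by_engine):
    """Bias of one engine relative to the pooled results of all engines."""
    norm = Counter()
    for urls in results_by_engine.values():
        norm.update(urls)
    return 1.0 - cosine(Counter(results_by_engine[engine]), norm)

# Hypothetical result lists for three engines.
results = {
    "A": ["u1", "u2"],
    "B": ["u1", "u3"],
    "C": ["u1", "u2"],
}
bias_a = bias("A", results)
bias_b = bias("B", results)
```

Engine B returns a URL that no other engine returns, so its vector lies further from the norm and its bias value is larger than that of A.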

41
Mowshowitz and Kawaguchi
42
Limitations

Based on URL vectors -> cannot measure bias
quality.

43
Our Approach

Use Kleinberg's HITS algorithm to create
clusters, authorities and hubs of the result norm
URLs.
Use them as norm clusters, authorities and hubs.
Measure distances between norms and individual
results as bias.

44
HITS

Obtain a directed graph G where
Node: page
Edge: URL link between pages.
Two indices for page p at iteration i: $x_{p,i}$ (authority), $y_{p,i}$ (hub)
Iterate until steady state:
$x_{p,i+1} \leftarrow \sum_{q:\, q \to p} y_{q,i}$
$y_{p,i+1} \leftarrow \sum_{q:\, p \to q} x_{q,i}$
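
The iteration can be sketched in a few lines of Python. Both scores are updated from the previous iteration's values, as in the update rules above. The normalization step after each round, the fixed iteration count and the tiny example graph are assumptions added to keep the sketch short and the values bounded.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    length = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / length for k, v in scores.items()} if length else scores

def hits(edges, iterations=50):
    """edges: list of (source, target) links. Returns (authority, hub)."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of the pages linking to p.
        new_auth = {p: sum(hub[q] for q, t in edges if t == p) for p in nodes}
        # Hub of p: sum of authority scores of the pages p links to.
        new_hub = {p: sum(auth[t] for s, t in edges if s == p) for p in nodes}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical graph: h1 links to a1 and a2, h2 links to a1 only.
auth, hub = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
```

In this graph a1 receives the most links from hubs and h1 links to the most authorities, so they end up with the highest authority and hub scores respectively.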

45
Our Approach
46
Current Progress

Implemented previous results.
Implemented vector analysis
Implemented HITS algorithm, but it is not
accurate enough
Conglomerate effect.

47
Measuring Pages' Information Quality

People find information from Web pages.
How good is the content of a given page?

48
Previous Work

Measuring different kinds of quality
Web site design quality
Navigational quality
Many frameworks on how to measure information
quality
Most result in surveys so users can rank
information quality.
Very few automated or semi-automated tools.

49
Our Objectives

Build automated and/or semi-automated tools to
measure, and/or assist users in measuring, the information
quality of a Web page.

50
Approach

Hypothesis, measure, usage guidelines.
Example
Hypothesis: a Web page with many spelling
mistakes is likely to have low information
quality.
Measures:
Show frequencies of word occurrences.
Show percentage of spelling mistakes.
Usage guideline:
Spelling mistakes may not be actual mistakes
(e.g. UHCL).
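
The two measures in this example can be sketched in Python. The word list, the whitelist for terms such as UHCL and the function name spelling_report are hypothetical, and a real tool would use a full dictionary and proper HTML text extraction.

```python
import re
from collections import Counter

def spelling_report(text, dictionary, whitelist=frozenset()):
    """Return (word frequencies, percentage of words flagged as misspelled)."""
    words = re.findall(r"[A-Za-z']+", text)
    frequencies = Counter(word.lower() for word in words)
    flagged = [
        word for word in words
        if word.lower() not in dictionary and word not in whitelist
    ]
    percentage = 100.0 * len(flagged) / len(words) if words else 0.0
    return frequencies, percentage

# Tiny hypothetical dictionary; "grate" stands in for a misspelling here.
dictionary = {"offers", "a", "great", "xml", "course"}
text = "UHCL offers a grate XML course"

_, raw_pct = spelling_report(text, dictionary)
freq, adjusted_pct = spelling_report(text, dictionary, whitelist={"UHCL"})
```

Whitelisting UHCL reflects the usage guideline: the raw percentage counts it as a mistake, while the adjusted percentage does not.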

51
Metrics

Many potential metrics. Some examples
Broken links
HTML Quality
Domain names
Page ranking and popularity
Appearance in directory structure
History (e.g. Wayback Machine)
Currency (e.g. last modified)
Author (e.g. Meta tag)

52
Current Progress

Pre-alpha prototype: http://dcm.cl.uh.edu/yue/util/pageInfo.pl
A capstone project

53
Content

Areas of My Research Interest
Some Current Projects
Storage of XML in Relational Database
Example Internet Computing Projects
Conclusions

54
Conclusions

Good time to do applied computing research in the
Web and XML areas.
Style: hands-on supervision, publications.
Don't forget to donate a scholarship to the
School if your future research leads to a
windfall.
