Title: Some of my XML/Internet Research Projects
1. Some of my XML/Internet Research Projects
- CSCI 6530
- October 5, 2005
- Kwok-Bun Yue
- University of Houston-Clear Lake
2. Content
- Areas of My Research Interest
- Some Current Projects
- Storage of XML in Relational Database
- Example Internet Computing Projects
- Conclusions
3. Areas of My Research Interest
- Internet Computing
- XML
- Databases
- Concurrent Programming
5. Some Current Projects
- Storage of XML in relational database
- Measuring Web bias using authorities and hubs
- Measuring information quality of Web pages
- Distributed computer security laboratory
- Collaborative Open Community for developing educational resources
- Generalized exchanges within organizations
6. Some Recent Student Work
- McDowell, A., Schmidt, C., and Yue, K., Analysis and Metrics of XML Schema, Proceedings of the 2004 International Conference on Software Engineering Research and Practice, pp. 538-544, Las Vegas, June 2004.
- Yang, A., Yue, K., Liaw, K., Collins, G., Venkatraman, J., Achar, S., Sadasivam, K., and Chen, P., Distributed Computer Security Lab and Projects, Journal of Computing Sciences in Colleges, Volume 20, Issue 1, October 2004.
- Yue, K., Alakappan, S., and Cheung, W., A Framework of Inlining Algorithms for Mapping DTDs to Relational Schemas, Technical Report COMP-05-005, Computer Science Department, the Hong Kong Baptist University, 2005, http://www.comp.hkbu.edu.hk/en/research/?content=tech-reports.
8. Storing XML in RDB
- Advantages
  - Mature database technologies.
  - May be queried by
    - XML technology, e.g. XPath, XQuery.
    - RDB technology, e.g. SQL.
- Disadvantages
  - Impedance mismatch: XML and relations are different data models.
9. Related Issues
- Effective mapping of XML DTDs (ordered tree model) to relational schemas.
- Mapping of XML queries (e.g. XQuery) to RDB queries (e.g. SQL).
- Mapping of RDB query results back to XML format.
10. Related Work and Context
- Mapping
  - With or without schemas for XML.
  - With or without user input.
- Schemas for XML
  - Document Type Definition (DTD)
  - XML Schema
- We consider mapping with DTD and without user input.
11. Naïve Mapping
- An XML element is mapped to a relation.
- Example 1a
  - XML: <a><b><c><d>hello</d></c></b></a>
  - -> Relations a, b, c and d.
12. Problems of Naïve Mapping
- Many relations.
- Inefficient queries: multiple joins per query.
- Example 1b
  - XPath query: //a
  - SQL query: needs to join the relations a, b, c and d.
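The join cost in Example 1b can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the id and parent_id columns and the text column are hypothetical choices, not a schema taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Naive mapping: one relation per element type.
# The parent_id columns (an assumption) link each child row to its parent row.
cur.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY);
CREATE TABLE b (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES a(id));
CREATE TABLE c (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES b(id));
CREATE TABLE d (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES c(id), text TEXT);
INSERT INTO a VALUES (1);
INSERT INTO b VALUES (1, 1);
INSERT INTO c VALUES (1, 1);
INSERT INTO d VALUES (1, 1, 'hello');
""")

# Rebuilding the content of <a> touches all four relations.
NAIVE_QUERY = """
SELECT d.text
FROM a
JOIN b ON b.parent_id = a.id
JOIN c ON c.parent_id = b.id
JOIN d ON d.parent_id = c.id
"""
naive_rows = cur.execute(NAIVE_QUERY).fetchall()
print(naive_rows)
```

Three joins are needed to reach a single text value, and the number grows with the nesting depth of the document.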
13. Inlining Algorithms
- First proposed by Shanmugasundaram et al.
- Expanded by Lu, Lee, Chu and others.
- Extended in various directions by various researchers, e.g.,
  - Preserving XML element orders.
  - Preserving XML constraints.
- Extensions are not considered here.
14. Basic Idea of Inlining Algorithms
- Inline a child element into the relation for the parent element when appropriate.
- Different inlining algorithms differ in their inlining criteria.
- Example 1c
  - XML: <a><b><c><d>hello</d></c></b></a>
  - Inlined: relation a.
15. Inlining Algorithms
- Child elements' attributes may be inlined.
- Child elements may not have their own relations.
- Results in fewer relations.
- In general, more inlining -> fewer joins.
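A minimal sketch of the inlined counterpart, again using sqlite3. The column name b_c_d is a hypothetical naming choice; the slides do not say how inlined columns are named.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Inlined mapping: b, c and d get no relations of their own.
# The text found at a/b/c/d is stored as a column of relation a
# (the column name b_c_d is an assumption made for this sketch).
cur.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, b_c_d TEXT)")
cur.execute("INSERT INTO a (id, b_c_d) VALUES (1, 'hello')")

INLINED_QUERY = "SELECT b_c_d FROM a"
inlined_rows = cur.execute(INLINED_QUERY).fetchall()
print(inlined_rows)
```

The same value is now reachable from one relation with no joins, which is the trade the inlining criteria are meant to decide.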
16. Inlining Algorithm Structure
- Simplification of DTD.
- Generation of DTD graphs
- Generation of Relational Schemas
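As a rough illustration of the middle step, the sketch below builds a parent-to-children graph from element declarations. It is not the generic DTD graph from the technical report: the regex-based parsing, the function name dtd_graph, and the dictionary representation are all assumptions, and occurrence operators are ignored.

```python
import re

DTD = """
<!ELEMENT addressBook (person)>
<!ELEMENT person (name,email)>
<!ELEMENT name (last,first)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
"""

def dtd_graph(dtd_text):
    """Return {element: [child elements]} for each <!ELEMENT ...> declaration."""
    graph = {}
    for name, model in re.findall(r"<!ELEMENT\s+(\S+)\s+(.*?)>", dtd_text, re.S):
        # Child names are the identifiers in the content model; the negative
        # lookbehind skips the keyword in #PCDATA. ANY and EMPTY are ignored.
        children = [c for c in re.findall(r"(?<!#)\b[A-Za-z_][\w.-]*", model)
                    if c not in ("ANY", "EMPTY")]
        graph[name] = list(dict.fromkeys(children))  # drop repeats, keep order
    return graph

graph = dtd_graph(DTD)
leaves = [name for name, children in graph.items() if not children]
print(graph)
print(leaves)
```

Leaf elements such as first, last and email are the natural candidates to inline into an ancestor's relation.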
17. Our Preliminary Results
- A more complete and optimal DTD simplification algorithm.
- A generic DTD graph that can be used by inlining algorithms.
- Inlining considerations: a framework for analyzing inlining algorithms.
- A new and aggressive inlining algorithm.
18. Examples of Our Work
- Use DTD Simplification as an example of the flavor of our work.
- Show the new Inlining Algorithm.
19. Brief Introduction to DTD
- DTD: a simple language to describe XML vocabulary.
- Element declarations: contents of elements.
- Attribute declarations: types and properties of attributes.
- DTD is still very popular.
20. DTD Element Declarations
- Define element contents
  - #PCDATA: string
  - ANY: anything goes
  - EMPTY: no content (attributes only)
  - Content models: child elements.
  - Mixed contents: child elements and strings.
21. DTD Example
- Example 2: A complete DTD
  - <!ELEMENT addressBook (person)>
  - <!ELEMENT person (name,email)>
  - <!ELEMENT name (last,first)>
  - <!ELEMENT first (#PCDATA)>
  - <!ELEMENT last (#PCDATA)>
  - <!ELEMENT email (#PCDATA)>
  - <!ATTLIST person id ID #REQUIRED>
22. Operators for Element Declaration
- , sequence
- + 1 or more
- * 0 or more
- ? optional: 0 or 1
- | choice
- () parentheses (grouping)
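Because these operators are those of regular expressions, an element-only content model can be checked with Python's re module. The sketch below is an illustration, not a validating parser: the function names are hypothetical, and #PCDATA, ANY, EMPTY and mixed content are out of scope.

```python
import re

def content_model_to_regex(model):
    """Translate an element-only DTD content model into a compiled regex.

    Each child name becomes a token followed by a space; ',' is plain
    concatenation, while |, *, +, ? and () carry over unchanged.
    """
    model = re.sub(r"\s+", "", model)
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + " )",
        model,
    )
    return re.compile(pattern.replace(",", ""))

def matches(model, children):
    """True if the sequence of child element names satisfies the model."""
    text = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(text) is not None

print(matches("(name,email*)", ["name", "email", "email"]))
print(matches("(name,email*)", ["email"]))
```

The trailing space after each token keeps a name such as name from matching a prefix of a longer name.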
23. Simplification of DTD
- Mapping of DTD to relational schemas
  - Input: DTDs
  - Output: relational schemas
- DTD can be complicated => simplification.
- Example 3
  - <!ELEMENT a (b,((b,c)|(d,b,c?)),(e,f)?)>
24. Simplification Principles
- The relational schema needs to store all possible scenarios.
- Some relations/columns may not be populated in some instances.
- Example 3
  - <!ELEMENT a (b|c)> and
  - <!ELEMENT a (b,c)>
  - May be the same from the RDB's point of view.
25. Simplification Details
- Comma-separated clauses: the only operators that remain are (), the comma, and *.
- + -> *, e.g. a+ -> a*.
- Removal of | and ?, e.g. (a|b?) -> (a,b)
- Removal of (), e.g. (a, (b)) -> (a,b)
- Removal of repetition, e.g. (a, b, a) -> (a, b)
- Note that element orders are not preserved.
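The net effect of removing choice, optionality, nesting and repetition can be approximated by listing each child element once. The sketch below does only that. It is not the simplification algorithm described in these slides: it discards every occurrence operator, so it cannot tell where a * should remain. The function name distinct_children is hypothetical.

```python
import re

def distinct_children(content_model):
    """List the distinct child element names of a content model, in order of
    first appearance. Occurrence operators, choices and nesting are dropped,
    so the result is coarser than what a real simplification algorithm gives."""
    names = re.findall(r"(?<!#)\b[A-Za-z_][\w.-]*", content_model)
    return list(dict.fromkeys(names))

children = distinct_children("(b,((b,c)|(d,b,c?)),(e,f)?)")
print(children)
```

For this content model the sketch yields the children b, c, d, e and f, each of which needs somewhere to be stored in the relational schema.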
26. Previous Simplification Results
- Not complete, e.g.
  - Shanmugasundaram: does not specify how to handle .
  - Lu: does not specify how to remove ().
- Not optimal (may generate * when it is not needed).
- Example 4a: For Lu and Lee, 2 steps
  - (b|(b,c)) -> (b,b,c) -> (b,c)
27. Our Simplification Algorithm
- A set of definitions.
- A set of 7 simplification rules.
- An algorithm on how and when to use them.
- Example 4b: For us, 1 step
  - (b|(b,c)) -> (b,c)
28. Simplification Rules
29. Simplification Algorithm
30. Complexity
- Time complexity: $O(N_{op})$
- where $N_{op}$ is the total number of operators (including parentheses) in the element declarations of the DTD.
31. Advantages
- Complete: handles all DTDs.
- Optimal in the sense that * will not be generated if not needed.
- Example 5
  - <!ELEMENT a (b,((b,c)|(d,b,c?)),(e,f)?)>
  - => <!ELEMENT a (b,c,d,e,f)>
32. A New Inlining Algorithm (1)
- Aggressive in inlining.
- More complete.
- Elaborated algorithms.
- Handles more details, e.g. element types of ANY, EMPTY and mixed contents.
33. A New Inlining Algorithm (2)
34. A New Inlining Algorithm (3)
35. A New Inlining Algorithm (4)
36. Main Results
- Yue, K., Alakappan, S., and Cheung, W., A Framework of Inlining Algorithms for Mapping DTDs to Relational Schemas, Technical Report COMP-05-005, Computer Science Department, the Hong Kong Baptist University, 2005, http://www.comp.hkbu.edu.hk/en/research/?content=tech-reports.
37. Future Work
- Implemented the algorithms and tested with many DTDs.
- Need to implement the XQuery/SQL bridge for performance study.
39. Measuring Web Bias
- Search engines dominate how information is accessed.
- Search results have major social, political and commercial consequences.
- Are search engines biased?
- How biased are they?
40. Previous Work
- To measure bias, results should be compared to a norm.
- The norm may be from human experts.
- Mowshowitz and Kawaguchi: the average search result of a collection of popular search engines as the norm.
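The idea of comparing each engine against an averaged norm can be sketched as follows. The choice of one minus cosine similarity as the distance is an assumption made for illustration, and the engine names and URLs are hypothetical data.

```python
import math
from collections import Counter

def url_vector(result_urls):
    """Count how often each URL appears in one engine's results."""
    return Counter(result_urls)

def norm_vector(vectors):
    """Average the URL vectors of all engines to form the norm."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {url: count / len(vectors) for url, count in total.items()}

def bias(vector, norm):
    """1 - cosine similarity: 0 means the same direction as the norm."""
    dot = sum(vector.get(u, 0) * norm.get(u, 0) for u in set(vector) | set(norm))
    len_v = math.sqrt(sum(x * x for x in vector.values()))
    len_n = math.sqrt(sum(x * x for x in norm.values()))
    if len_v == 0 or len_n == 0:
        return 1.0
    return 1.0 - dot / (len_v * len_n)

# Hypothetical results from three engines for the same query.
engines = {
    "engine1": ["u1", "u2", "u3"],
    "engine2": ["u1", "u2", "u4"],
    "engine3": ["u5", "u6", "u7"],
}
vectors = {name: url_vector(urls) for name, urls in engines.items()}
norm = norm_vector(list(vectors.values()))
scores = {name: bias(v, norm) for name, v in vectors.items()}
print(scores)
```

In this toy data engine3 shares no URLs with the others, so it ends up farthest from the norm.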
41. Mowshowitz and Kawaguchi
42. Limitations
- Based on URL vectors -> cannot measure bias quality.
43. Our Approach
- Use Kleinberg's HITS algorithm to create clusters, authorities and hubs of the result norm URLs.
- Use them as norm clusters, authorities and hubs.
- Measure distances between norms and individual results as bias.
44. HITS
- Obtain a directed graph G where
  - Node: page
  - Edge: URL link between pages.
- Two indices: $x_{p,i}$ (authority), $y_{p,i}$ (hub)
- Iterate until steady state
  - $x_{p,i+1} \leftarrow \sum_{q:\, q \to p} y_{q,i}$
  - $y_{p,i+1} \leftarrow \sum_{q:\, p \to q} x_{q,i}$
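The two update rules can be sketched in a few lines of Python. Two assumptions are made: scores are rescaled to unit length after every round so they stay bounded, and a fixed number of rounds stands in for a steady-state test. The edge list is hypothetical.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    length = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / length for k, v in scores.items()} if length else dict(scores)

def hits(edges, iterations=50):
    """Run the authority/hub updates on (source, target) link pairs."""
    nodes = {n for edge in edges for n in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_auth = dict.fromkeys(nodes, 0.0)
        new_hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_auth[dst] += hub[src]   # authority: sum of hubs linking in
            new_hub[src] += auth[dst]   # hub: sum of authorities linked to
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical link graph: h1, h2 and h3 link to pages a1 and a2.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
auth, hub = hits(edges)
print(auth)
print(hub)
```

Page a2 receives links from all three hubs and so ends with the highest authority score, while h3, which links only to a2, ends with the lowest hub score.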
45. Our Approach
46. Current Progress
- Implemented previous results.
- Implemented vector analysis.
- Implemented HITS algorithm, but it is not accurate enough.
- Conglomerate effect.
47. Measuring Pages' Information Quality
- People find information from Web pages.
- How good is the content of a given page?
48. Previous Work
- Measuring different kinds of quality
  - Web site design quality
  - Navigational quality
- Many frameworks on how to measure information quality.
- Most result in surveys so users can rank informational quality.
- Very few automated or semi-automated tools.
49. Our Objectives
- Build automated and/or semi-automated tools to measure, and/or assist users in measuring, the information quality of a Web page.
50. Approach
- Hypothesis, measure, usage guidelines.
- Example
  - Hypothesis: a Web page with many spelling mistakes is likely to have low information quality.
  - Measures
    - Show frequencies of word occurrences.
    - Show percentage of spelling mistakes.
  - Usage guideline
    - Spelling mistakes may not be actual mistakes (e.g. UHCL).
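The two measures and the usage guideline can be sketched together. Everything here is illustrative: the function name, the tiny word list standing in for a real dictionary, and the whitelist mechanism for terms such as UHCL are assumptions, not features of the prototype.

```python
import re
from collections import Counter

def spelling_report(text, dictionary, whitelist=()):
    """Return (word frequencies, percentage of words not found in the dictionary).

    Words on the whitelist (e.g. local acronyms) are never counted as mistakes.
    """
    words = re.findall(r"[A-Za-z']+", text)
    frequencies = Counter(w.lower() for w in words)
    known = {w.lower() for w in dictionary} | {w.lower() for w in whitelist}
    unknown = [w for w in words if w.lower() not in known]
    percent = 100.0 * len(unknown) / len(words) if words else 0.0
    return frequencies, percent

# A tiny hypothetical dictionary; a real tool would load a full word list.
DICTIONARY = {"students", "find", "information", "on", "the", "web"}
TEXT = "UHCL studnts find information on the Web"

freq, pct_plain = spelling_report(TEXT, DICTIONARY)
_, pct_whitelisted = spelling_report(TEXT, DICTIONARY, whitelist=["UHCL"])
print(round(pct_plain, 1), round(pct_whitelisted, 1))
```

Whitelisting UHCL halves the reported mistake rate in this sample, which is why the raw percentage should be read alongside the usage guideline.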
51. Metrics
- Many potential metrics. Some examples:
  - Broken links
  - HTML quality
  - Domain names
  - Page ranking and popularity
  - Appearance in directory structure
  - History (e.g. Wayback Machine)
  - Currency (e.g. last modified)
  - Author (e.g. meta tag)
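A few of these metrics start from facts that can be read directly off the page. The sketch below collects links, their domain names and the author meta tag using only the standard library. The class name PageFacts and the sample page are hypothetical, and checking whether a link is actually broken would need a network request that is left out here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class PageFacts(HTMLParser):
    """Collect link targets and the author meta tag from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.author = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "author":
            self.author = attrs.get("content")

# Hypothetical page used only to exercise the parser.
PAGE = """<html><head><meta name="author" content="A. Author"></head>
<body><a href="http://example.edu/a.html">A</a>
<a href="http://example.com/b.html">B</a></body></html>"""

facts = PageFacts()
facts.feed(PAGE)
domains = sorted({urlparse(link).netloc for link in facts.links})
print(facts.author, domains)
```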
52. Current Progress
- Pre-alpha prototype: http://dcm.cl.uh.edu/yue/util/pageInfo.pl
- A capstone project
54. Conclusions
- Good time to do applied computing research in the Web and XML areas.
- Style: hands-on supervision and publications.
- Don't forget to donate a scholarship to the School if your future research leads to a windfall.
55. Questions?