Title: NF-SS: A Normal Form for Semistructured Schemata
NF-SS: A Normal Form for Semistructured Schemata
- Xiaoying Wu, Tok Wang Ling, Sin Yeung Lee, Mong Li Lee (National University of Singapore)
- Gillian Dobbie (University of Auckland, New Zealand)
Outline
- Motivation
- Semistructured schema and its data tree
- Integrity constraints for semistructured data
- NF-SS: Normal Form for Semistructured Schemata
- Designing semistructured schemata into NF-SS
- Discussion of the design approach
- Comparison with related proposals
- Summary
1. Motivation: Example 1
- <!ELEMENT department (course*)>
- <!ATTLIST department
      name ID #REQUIRED>
- <!ELEMENT course (student*)>
- <!ATTLIST course
      cid ID #REQUIRED
      title CDATA #IMPLIED>
- <!ELEMENT student (grade?)>
- <!ATTLIST student
      sid ID #REQUIRED
      name CDATA #REQUIRED
      age CDATA #IMPLIED>
- <!ELEMENT grade (#PCDATA)>
1. Motivation (cont.)
- Redundancy: the name and age of a student are repeated under every course the student takes
- Updating anomalies
- Insertion
- Rewriting
- Deletion
1. Motivation: Example 2
- <!ELEMENT teacher (ClassRoom*)>
- <!ATTLIST teacher
      tid ID #REQUIRED
      name CDATA #REQUIRED>
- <!ELEMENT ClassRoom (subject*)>
- <!ATTLIST ClassRoom
      room ID #REQUIRED>
- <!ELEMENT subject (time*)>
- <!ATTLIST subject
      cid ID #REQUIRED>
- <!ELEMENT time EMPTY>
- <!ATTLIST time
      day CDATA #REQUIRED
      hour CDATA #REQUIRED>
- Path anomaly
- The schema doesn't reflect the integrity constraint tid, day, hour → cid, room
2. Semistructured Schema and Data Tree
- A semistructured schema is defined to be D = (E, A, B, P, R, r)
- E is a finite set of object types in D.
- A is a finite set of attributes, disjoint from E.
- B is a set of basic domain types such as string, integer, Boolean, etc.
- P is a function from E to object type definitions, whose symbols in {*, +, ?, 1} are called multiplicities.
- e.g. P(course) = student*
- R is a function from E to the power set of A.
- e.g. R(student) = {sid, name, age}
- r ∈ E is called the object type of the root.
- e.g. r = department
[Figure: schema diagram annotated with E (object types), A (attributes), r (root object type) and multiplicities.]
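The six components of D can be written down directly as a small data structure. The following is only a sketch of one possible encoding: the class name `Schema`, the method `check`, and the use of `*` for every repeated child are assumptions made for illustration (the multiplicity symbols are not legible in the extracted slides), not part of the original definition.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    """D = (E, A, B, P, R, r), stored as plain Python collections."""
    E: set   # object types
    A: set   # attributes, disjoint from E
    B: set   # basic domain types
    P: dict  # object type -> list of (child object type, multiplicity)
    R: dict  # object type -> set of attributes
    r: str   # root object type

    def check(self):
        """Return True if the structural conditions of the definition hold."""
        return (
            self.E.isdisjoint(self.A)
            and self.r in self.E
            and all(o in self.E for o in self.P)
            and all(c in self.E and m in {"*", "+", "?", "1"}
                    for kids in self.P.values() for c, m in kids)
            and all(attrs <= self.A for attrs in self.R.values())
        )


# The department schema of Example 1 (multiplicities "*" are assumed).
department = Schema(
    E={"department", "course", "student", "grade"},
    A={"name", "cid", "title", "sid", "age"},
    B={"string", "integer", "boolean"},
    P={"department": [("course", "*")], "course": [("student", "*")],
       "student": [("grade", "?")], "grade": []},
    R={"department": {"name"}, "course": {"cid", "title"},
       "student": {"sid", "name", "age"}, "grade": set()},
    r="department",
)
print(department.check())
```

Keeping P and R as dictionaries keyed by object type mirrors their definition as functions on E.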
2. Semistructured Schema and Data Tree (Cont.)
A data tree T with respect to a semistructured schema D = (E, A, B, P, R, r) is defined to be a tree T = (V, lab, obj, att, val, root), showing a database instance.
[Figure: example data tree for the department schema. A department node (name CS) has two course children (cid values cs5220 and cs4221; titles "data Mining" and "database design"). Their three student children carry sid values s01, s01 and s02, names Jack, Jack and Tom, ages 21 and 21, and one grade value A.]
2. Semistructured Schema and Data Tree (Cont.)
- The path of a node n in a semistructured schema D is denoted as pathD(n). e.g. pathD(student) is /department/course/student
- The path of a node v in a data tree T is denoted as pathT(v). e.g. pathT for student s02 is /department/course/student
- The target set of a node n in T, T[n], is {v | v ∈ V, n ∈ E ∪ A, pathT(v) = pathD(n)}. e.g. the target set T[student] includes the nodes of students with sid s02, etc.
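A target set can be computed by walking the data tree and keeping the nodes whose path equals the schema path. This is a minimal sketch: the `Node` class, the `target_set` function, the choice to model attributes as child nodes, and the sample values are all assumptions for illustration.

```python
class Node:
    """A data-tree node: a tag (object type or attribute), an optional value, children."""

    def __init__(self, tag, value=None, children=()):
        self.tag, self.value, self.children = tag, value, list(children)


def target_set(root, schema_path):
    """Collect every node whose path from the root equals schema_path."""
    found = []

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        if path == schema_path:
            found.append(node)
        for child in node.children:
            walk(child, path)

    walk(root, "")
    return found


# Illustrative instance, loosely based on the department example.
tree = Node("department", children=[
    Node("name", "CS"),
    Node("course", children=[
        Node("cid", "cs4221"),
        Node("student", children=[Node("sid", "s01"), Node("name", "Jack")]),
        Node("student", children=[Node("sid", "s02"), Node("name", "Tom")]),
    ]),
    Node("course", children=[
        Node("cid", "cs5220"),
        Node("student", children=[Node("sid", "s01"), Node("name", "Jack")]),
    ]),
])

students = target_set(tree, "/department/course/student")
print(len(students))  # 3
```

All three student nodes share the path /department/course/student, so they fall into the same target set even though they sit under different courses.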
2. Semistructured Schema and Data Tree (Cont.)
- Two nodes from two data trees w.r.t. schema D satisfy value equality iff
- they are attribute nodes with the same tag and the same value,
- or they are object nodes with the same tag whose children are pairwise value equal.
- Let T1 and T2 be two data trees w.r.t. schema D = (E, A, B, P, R, r), and let X ∈ E ∪ A. T1 and T2 agree on X, denoted as T1 =X T2, iff the following condition holds: ∃ t1 ∈ T1[X], t2 ∈ T2[X] such that t1 =v t2
[Figure: the example data tree of the department schema, repeated from the previous slide.]
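Value equality is naturally recursive, and agreement on X reduces to a search over pairs of target-set nodes. The sketch below assumes a dict encoding of nodes, compares children position by position, and reads the agreement condition existentially (some pair of nodes is value equal); the function names `value_equal` and `agree` are hypothetical.

```python
def value_equal(a, b):
    """Value equality on dict-encoded nodes.

    Attribute nodes look like {"tag": ..., "value": ...};
    object nodes look like {"tag": ..., "children": [...]}.
    """
    if a["tag"] != b["tag"]:
        return False
    a_is_attr, b_is_attr = "value" in a, "value" in b
    if a_is_attr or b_is_attr:
        return a_is_attr and b_is_attr and a["value"] == b["value"]
    kids_a, kids_b = a["children"], b["children"]
    # Assumption: "pairwise" means same number of children, compared in order.
    return len(kids_a) == len(kids_b) and all(
        value_equal(x, y) for x, y in zip(kids_a, kids_b))


def agree(targets1, targets2):
    """True if some node of T1[X] is value equal to some node of T2[X]."""
    return any(value_equal(t1, t2) for t1 in targets1 for t2 in targets2)


def student(sid, name):
    return {"tag": "student", "children": [
        {"tag": "sid", "value": sid}, {"tag": "name", "value": name}]}


jack_a, jack_b, tom = student("s01", "Jack"), student("s01", "Jack"), student("s02", "Tom")
print(value_equal(jack_a, jack_b), agree([tom], [jack_b]))
```

A stricter reading of "pairwise" (for example, ignoring child order) would change only the last line of `value_equal`.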
3. Integrity Constraints for Semistructured Data
- Extended Functional Dependency (EFD)
- Let D = (E, A, B, P, R, r) be a semistructured schema, and let X ⊆ E ∪ A and Y ⊆ E ∪ A. Y is extended functionally dependent on X, denoted as X → Y.
- Let S denote a set of data trees that are images of D. S satisfies X → Y iff, for any data trees T1, T2 in S, if they agree on every component in X, then they agree on Y; that is, ∀ T1, T2 ∈ S ((∀ x ∈ X, T1 =x T2) ⇒ T1 =y T2 for every y ∈ Y).
- Inference rules for EFDs
- E1 (reflexivity): if Y ⊆ X, then X → Y, for any X, Y ⊆ E ∪ A
- E2 (augmentation): if X → Y, then XZ → YZ, for any X, Y, Z ⊆ E ∪ A
- E3 (transitivity): if X → Y and Y → Z, then X → Z, for any X, Y, Z ⊆ E ∪ A
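Rules E1 to E3 have the same shape as Armstrong's axioms for relational functional dependencies, so the usual fixed-point closure computation can be used to derive consequences of a set of EFDs. The sketch below is an illustration under that assumption; the function names `closure` and `implies` are hypothetical, components are written as strings such as `course@cid`, and the sample EFDs are taken from the course and student example.

```python
def closure(xs, efds):
    """Everything derivable from xs by repeatedly applying the given EFDs.

    Each step is justified by E2 (augmentation) followed by E3 (transitivity);
    E1 (reflexivity) accounts for xs itself being in the result.
    """
    result = set(xs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in efds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(efds, lhs, rhs):
    """True if lhs -> rhs can be derived from efds with E1-E3."""
    return set(rhs) <= closure(lhs, efds)


efds = [
    ({"course@cid"}, {"course@title"}),
    ({"student@sid"}, {"student@name", "student@age"}),
    ({"course@cid", "student@sid"}, {"grade"}),
]

print(implies(efds, {"course@cid", "student@sid"}, {"student@name", "grade"}))
print(implies(efds, {"student@sid"}, {"grade"}))
```

The first query succeeds and the second fails, because grade needs both the course and the student on the left-hand side.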
3. Integrity Constraints for Semistructured Data (Cont.)
- Notation: O1@X1, …, Oi@Xi, …, On-1@Xn-1 → On@Xn
- An EFD X → Y is a partial EFD if there exists an X' ⊂ X such that X' → Y. Otherwise, it is a full EFD.
- e.g. (1) course@cid, student@sid → student@name is a partial EFD
- (2) student@sid → student@name is a full EFD
- X → Y is said to be coherent iff /X/Y is a path in D; otherwise it is called an incoherent EFD.
- e.g. teacher@tid, time@day, @hour → subject@cid is an incoherent EFD, since /teacher/time/subject is not a path in the schema.
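The coherence test can be mechanised once "/X/Y is a path in D" is given a concrete reading. The sketch below assumes one such reading: the object types on the left-hand side, followed by the object type on the right-hand side, must appear in that order along some root-to-leaf path of the schema. The function name `is_coherent` and this interpretation are assumptions, not taken from the slides.

```python
def is_coherent(lhs_types, rhs_type, schema_paths):
    """True if the listed object types occur, in order, along one schema path."""
    wanted = list(lhs_types) + [rhs_type]
    for path in schema_paths:
        steps = iter(path.strip("/").split("/"))
        # "w in steps" advances the iterator, so this is an ordered-subsequence test.
        if all(w in steps for w in wanted):
            return True
    return False


teacher_paths = ["/teacher/ClassRoom/subject/time"]
department_paths = ["/department/course/student/grade"]

# teacher@tid, time@day, @hour -> subject@cid: time lies below subject, so the order fails.
print(is_coherent(["teacher", "time"], "subject", teacher_paths))
# course@cid, student@sid -> grade follows the schema path.
print(is_coherent(["course", "student"], "grade", department_paths))
```

Under this reading the teacher example is reported as incoherent, which matches the slide.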
3. Integrity Constraints for Semistructured Data (Cont.)
- If there exists Y ⊆ E ∪ A such that X → Y, Y → Z and Y ↛ X, then Z is transitively extended functionally dependent on X via Y.
- e.g. age is transitively dependent on course via student, since
- (1) course@cid → student@sid
- (2) student@sid → student@age, and
- (3) student@sid ↛ course@cid
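The three conditions of a transitive dependency can be checked against a given set of EFDs with a closure computation. This is a sketch under the assumption that derivability from the listed EFDs stands in for "holds"; the function names `closure` and `transitive_via` are hypothetical.

```python
def closure(xs, efds):
    """Components reachable from xs by applying the listed EFDs to a fixed point."""
    out = set(xs)
    while True:
        new = {y for lhs, rhs in efds if lhs <= out for y in rhs} - out
        if not new:
            return out
        out |= new


def transitive_via(x, y, z, efds):
    """True if X -> Y and Y -> Z are derivable while Y -> X is not."""
    return (y <= closure(x, efds)
            and z <= closure(y, efds)
            and not x <= closure(y, efds))


# EFDs (1) and (2) from the example above.
efds = [
    ({"course@cid"}, {"student@sid"}),
    ({"student@sid"}, {"student@age"}),
]

print(transitive_via({"course@cid"}, {"student@sid"}, {"student@age"}, efds))
```

Condition (3) is what separates a genuine transitive dependency from the case where X and Y determine each other.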
3. Integrity Constraints for Semistructured Data (Cont.)
- Theorem: Let D = (E, A, B, P, R, r) be a semistructured schema and X, Y, Z ⊆ E ∪ A. If Z is transitively dependent on X via Y, then there exists a data tree of D where a rewriting anomaly occurs upon updating the values of Z.
3. Integrity Constraints for Semistructured Data (Cont.)
- Key constraints based on EFD semantics
- Notation: KO = O1@X1/…/Oi@Xi/…/On@Xn/O@X denotes a key of an object type O in semistructured schema D.
- /O1/…/O is a path in D
- If n equals one, then KO is called an absolute key. Otherwise it is called a relative key.
- Example
- Kbook = book@isbn. Kbook is an absolute key
- Kchapter = book@isbn/chapter@number. Kchapter is a relative key
- Ksection = book@isbn/chapter@number/section@number. Ksection is a relative key
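The key notation is easy to parse into steps of object type and identifying attributes. The sketch below follows the three examples above (a key with a single step is treated as absolute, a key that also names ancestors as relative); the function names `parse_key` and `key_kind`, and the comma as a separator for multi-attribute steps, are assumptions for illustration.

```python
def parse_key(key):
    """Split 'book@isbn/chapter@number' into [('book', ['isbn']), ('chapter', ['number'])]."""
    steps = []
    for step in key.split("/"):
        obj, _, attrs = step.partition("@")
        steps.append((obj, attrs.split(",") if attrs else []))
    return steps


def key_kind(key):
    """Classify a key by its number of steps, as in the Kbook/Kchapter/Ksection examples."""
    return "absolute" if len(parse_key(key)) == 1 else "relative"


for k in ["book@isbn", "book@isbn/chapter@number",
          "book@isbn/chapter@number/section@number"]:
    print(k, key_kind(k))
```

A relative key identifies an object only within the context fixed by the ancestor steps, which is why a chapter number needs the book's isbn in front of it.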
3. Integrity Constraints for Semistructured Data (Cont.)
- Let D be a semistructured schema and O be its root object type. The set of basic dependencies of D, denoted as BD(D), is defined as follows:
- Let X, Y be children of O. Non-trivial extended functional dependencies of the form X → Y, where X is a key of O or Y is part of a key of O, are in BD(D).
- Let O1 be a sub-object type of O and D1 be the schema tree that is rooted at O1 with KO added as attribute(s) of O1; then BD(D1) ⊆ BD(D).
- No other non-trivial dependency that is not generated from the above is in BD(D).
4. NF-SS
- Let D be a semistructured schema and O be its root object type. D is in Normal Form for Semistructured Schemata (NF-SS) iff
- 1. O has at least one key.
- 2. For any non-trivial EFD of the form X → Y satisfied by O, where X and Y are attributes of O, either X is a key or Y is part of the key of O.
- 3. For any sub-object type O1 of O:
- (a) if KO is added to O1 as its components, with the others remaining, the schema tree rooted at O1 is in NF-SS;
- (b) KO ∩ KO1 = ∅ or KO ⊆ KO1, where KO and KO1 are the keys of O and O1 respectively;
- (c) O1 is not transitively dependent on KO.
- 4. Any non-trivial EFD in D can be derived from BD(D) by using the inference rules for EFDs.
5. Designing Semistructured Schema into NF-SS
- We adopt a restructuring approach for the design.
- We propose four heuristic restructuring rules
- Decomposing object types.
- Creating new object types.
- Regrouping components of an object type.
- Objective
- Remove transitive, partial and incoherent EFDs from the given dependency and key constraints.
5. Designing Semistructured Schema into NF-SS (cont.)
- Rule 1. (Remove Transitive Dependency by Decomposition)
- Given an object type O in a semistructured schema D, if there is some non-prime component(s) Y of O that is transitively dependent on some key of O, i.e., KO → X, X → Y and X ↛ KO, and X ∩ KO = ∅, then restructure the schema as follows.
- 1. Duplicate X to form new node(s) Z.
- 2. Move Y and all the descendants of Y and their corresponding edges under Z.
- 3. Make X a foreign key of O, and add a reference edge from the original node X to Z.
5. Designing Semistructured Schema into NF-SS (cont.)
- Example 5.1: schema D satisfies the following EFDs
- (1) department@name → course@cid
- (2) course@cid → department
- (3) course@cid → course@title
- (4) course@cid → student@sid
- (5) course@cid, student@sid → grade
- (6) student@sid → student@name, @age
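The three steps of Rule 1 can be illustrated on the student part of this example, where name and age depend on sid alone. This is only a sketch under several assumptions: the schema is encoded as a dictionary of object types, Y consists of plain attributes without descendants, the new node Z becomes a separate top-level object type, and the names `rule1_decompose`, `holder` and `student_info` are hypothetical.

```python
import copy


def rule1_decompose(schema, holder, x, y, z):
    """Return a restructured copy of schema; the input is left untouched.

    holder: object type currently carrying the attributes x and y
    x:      attributes that determine y (duplicated into z)
    y:      dependent attributes to move under z
    z:      name of the new object type (an illustrative choice)
    """
    s = copy.deepcopy(schema)
    # Steps 1 and 2: duplicate x into the new node z and move y under it.
    s[z] = {"attrs": set(x) | set(y), "children": [], "refs": {}}
    s[holder]["attrs"] -= set(y)
    # Step 3: x stays in holder as a foreign key referencing z.
    s[holder]["refs"].update({a: z for a in x})
    return s


schema = {
    "course": {"attrs": {"cid", "title"}, "children": ["student"], "refs": {}},
    "student": {"attrs": {"sid", "name", "age"}, "children": ["grade"], "refs": {}},
    "grade": {"attrs": set(), "children": [], "refs": {}},
}

after = rule1_decompose(schema, "student", {"sid"}, {"name", "age"}, "student_info")
print(sorted(after["student"]["attrs"]), sorted(after["student_info"]["attrs"]))
```

After the call, each student node under a course keeps only sid and grade, and the name and age of a student are stored once under the new object type.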
5. Designing Semistructured Schema into NF-SS (cont.)
- Rule 2. (Remove Path Anomaly by Path Splitting)
- Given a semistructured schema D, suppose there exists an incoherent EFD O1@X1, …, On@Xn → Y, where Y is either an object type or an attribute, and there exists a path P that contains O1, …, On, Y. Path P can be split into two sub-paths P1 and P2, where P1 contains only O1, …, On and Y, while P2 contains O1, …, On and (P − Y).
5. Designing Semistructured Schema into NF-SS (cont.)
- Example 5.2: schema D satisfies the following EFDs
- (1) teacher@tid, time → ClassRoom
- (2) teacher@tid, time → subject
5. Designing Semistructured Schema into NF-SS (cont.)
- Rule 3. (Remove Partial Dependency by Creating a New Object Type)
- Given an object type O in a semistructured schema, let X be a set of prime attributes of O, and Y be the set of O's attributes. Let O1 be a sub-object type of O. If (KO − X) → O1 and no proper superset of X satisfies this property, then restructure the schema as follows.
- 1. (KO ?Y X) becomes the only attribute(s) of
O while O1 - remains to be its sub-object type.
- 2.Create a new object type O2 that is a direct
component of O. - 3.Move rest of the components of O and all
their descendants and corresponding edges under
O2.
5. Designing Semistructured Schema into NF-SS (cont.)
- Example 5.3: schema D is shown in Figure (a). It satisfies the following EFDs: O@A,@B → D, O@A,@B → O2, O@A → O1, O@A → E, and the key of O is {A, B}.
5. Designing Semistructured Schema into NF-SS (cont.)
- Rule 4. (Restructuring To Satisfy Condition 3(b) of NF-SS Definition)
- Given an object type O in a semistructured schema D, let X be a set of O's attributes and single-valued atomic sub-object types, and let O1 be a complex sub-object type of O. O1 has relative key KO1, but KO ⊄ KO1 and KO1 ⊄ KO. Let Y be KO ∩ KO1 ∩ X, with Y ≠ ∅. D is restructured as follows.
- 1. O1 remains a sub-object type of O.
- 2. Make Y components of O.
- 3. Create a new object type O2 to be a child of O; the remaining components of O (excluding Y) become children of O2.
5. Designing Semistructured Schema into NF-SS (cont.)
- Example 5.4: schema D in Figure (a) satisfies the EFDs (1) O@K, @A → O1 and (2) O@K, @B → O2, and the key of O is {K, A, B}.
5. Designing Semistructured Schema into NF-SS (cont.)
- Algorithm 1: Restructuring Algorithm
- Input: a set S of semistructured schemas, and a set of EFDs for S.
- Output: a set of semistructured schemas that are in NF-SS.
- Begin
- 1. for each semistructured schema D in S do
- if D is not in NF-SS then repeat until no further change
- (1) if there exists a transitive EFD KO → X, X → Y and X ↛ KO for an object type O in D,
- Case X ∩ KO = ∅: apply Rule 1 to remove the transitive EFD.
- Case X ⊂ KO: apply Rule 3 to remove the transitive EFD.
- Case X ∩ KO ≠ ∅: apply Rule 4 to remove the transitive EFD.
- (2) if there exists an incoherent EFD then apply Rule 2 to remove it.
- 2. output S.
- End
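The case analysis in step (1) of the algorithm depends only on how X overlaps the key KO. The sketch below shows that dispatch in isolation, assuming the cases are tested in the order listed and that the second case means a proper subset; the function name `choose_rule` is hypothetical, and the rules themselves are not implemented here.

```python
def choose_rule(x, key):
    """Pick the restructuring rule for a transitive EFD KO -> X, X -> Y."""
    x, key = set(x), set(key)
    if not x & key:   # X and KO are disjoint
        return "Rule 1"
    if x < key:       # X is a proper subset of KO
        return "Rule 3"
    return "Rule 4"   # X overlaps KO without being contained in it


print(choose_rule({"sid"}, {"cid"}))
print(choose_rule({"A"}, {"A", "B"}))
print(choose_rule({"K", "C"}, {"K", "A", "B"}))
```

Testing the cases in this order matters, because a proper subset of the key also has a non-empty intersection with it and would otherwise match the last case.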
6. Discussion of Restructuring Approach for Designing
- Are the restructuring rules complete? No.
- covering is not guaranteed
- dependency preservation is not guaranteed
- Does it give a unique solution? No.
- it depends on the order in which the dependencies are examined
- The design task can be made easier if more semantics are available.
- In [5], we have proposed another approach for designing semistructured databases using ORA-SS, a semantically rich model.
- Nevertheless, it does give practical heuristics and provides insights into the normalization task for semistructured databases.
7. Comparison with Related Proposals
- The first attempt to define a normal form for semistructured data
- (ER 1999, S. Y. Lee, M. L. Lee, T. W. Ling and L. A. Kalinichenko) [3]
- Defines a schema called S3-Graph, which makes no distinction between element nodes and attribute nodes and has no cardinality specification.
- Proposes S3-NF, but is missing key constraints, an essential part of database design.
- The decomposition method may not be able to remove some other kinds of anomalies, such as partial dependency and path anomaly, that may exist in a schema.
- The most recent proposal: XNF (XML Normal Form)
- (ER 2001, D. W. Embley and W. Y. Mok) [2]
- It mainly provides algorithms to translate a schema, represented in a conceptual model called CM hypergraphs, to a scheme-tree forest in XNF.
- Like S3-Graph, a scheme tree doesn't lend itself to XML definition.
- XNF isn't formulated with the concept of key.
- The algorithms given suffer from inefficiency.
- A large set of results is expected.
8. Summary
- A normal form for semistructured schemata
- It incorporates integrity constraints.
- It guarantees no redundancy and hence no undesirable updating anomalies for the conforming semistructured databases.
- It gives more reasonable representations of real-world semantics.
- Restructuring approach for designing semistructured databases
- A set of heuristic restructuring rules is proposed.
- An algorithm for iteratively restructuring a schema into NF-SS is developed.
- It provides insights into the normalization task for semistructured databases.
References
- 1. J. Clark and S. DeRose. XML Path Language (XPath). W3C Working Draft, November 1999. http://www.w3.org/TR/xpath
- 2. D. W. Embley and W. Y. Mok. Developing XML Documents with Guaranteed Good Properties. Proceedings of the 20th International Conference on Conceptual Modeling (ER), 2001.
- 3. S. Y. Lee, M. L. Lee, T. W. Ling and L. A. Kalinichenko. Designing Good Semi-structured Databases. Proceedings of the 18th International Conference on Conceptual Modeling (ER), 1999.
- 4. T. W. Ling and L. L. Yan. NF-NR: A Practical Normal Form for Nested Relations. Journal of Systems Integration, Vol. 4, 1994, pp. 309-340.
- 5. Xiaoying Wu, Tok Wang Ling, Mong Li Lee, Gillian Dobbie. Designing Semistructured Databases Using the ORA-SS Model. Accepted for publication in Proceedings of the 2nd International Conference on Web Information Systems Engineering (WISE), IEEE Computer Society, Kyoto, Japan, December 2001.
Q&A