Title: Incremental View Maintenance and Data Lineage Tracing in Heterogeneous Database Environments
1Incremental View Maintenance and Data Lineage
Tracing in Heterogeneous Database Environments
- Hao Fan
- School of Computer Science Information Systems
- Birkbeck college, University of London
- 9th July 2002
2Outline of presentation
- Introduction
- Background of my research
- AutoMed, IQL
- Research aims and approaches
- Research results so far
- Future work
3Introduction
- What is Incremental View Maintenance
- The open problems of Incremental View Maintenance
- What is Data Lineage
- The open problems of Data Lineage
- Applications of the two topics
4What is Incremental View Maintenance (IVM)
V
V
Computing changes of the view based on the
changes in the source data-bases, which is much
cheaper than recomputing the view from scratch.
OP
OP
D1
D2
d1
d2
5The open problems of Incremental View Maintenance
- Incremental view maintenance algorithms
- How to deal with the duplicate elements
- Algorithms in multi-source environment
- Incremental view schema maintenance
- Self-maintainable view
6What is Data Lineage
For a given ware-house data item, how to identify
the exact set of source data items.
7The open problems of Data Lineage Tracing (DLT)
- Definition of Data Lineage
- Derivation Tracing Query
- Data Lineage Tracing procedures
- Lineage Tracing with set/bag semantics
- Lineage tracing using auxiliary views
8Applications
- On-line Data analysis and Data mining (OLAP/OLAM)
- Scientific Databases
- Data cleaning
- Authorization management
- Materialized view schema evolution
- View update problem
9AutoMed
- HDM (Hypergraph Data Model)
- ltNodes, Edges, Constraintsgt
- Instance / Extension Mapping (Exts,I(c))
- 8 Primitive Transformations
- Composite Transformations / Transformation
Pathways - IQL language
10Simple IQL queries
- 7. q gc aggFun D
- (aggFun max min count sum
avg) - /group and compute a bag of pairs on their
first component and apply an aggregation function
to the second component/ - 8. q p p ? D1 member D2 e /members of D1
that are members of D2/ - 9. q p p ? D1 not (member D2 e) /members
of D1 that are not members of D2/ - 10. q p ? p1 ? D1 pr ? Dr c1 ck / a
comprehension/
- 1. q D1 D2 Dr
- / bag union/
- 2. q D1 -- D2
- / bag monus /
- 3. q group D
- / group a bag of pairs on their first
component/ - 4. q sort D
- 5. q sortDistinct D
- /sort and remove duplicates/
- 6. q aggFun D
- (aggFun max min count sum
avg)
11Example for transforming between HDM schemas
person
mathematician
compScientist
dept
salary
salary
TS1,S2
avgDeptSalary
Schema S1
Schema S2
12The Transformation Pathway TS1,S2
- addNode (dept, Maths,CompSci)
- addNode (person, x x ? mathematician x
x ? compScientist) - addNode (avgDeptSalary, avg s (m,s)?_,
mathematician, salary - avg s (c,s)?_, compScientist,
salary) - addEdge (_, dept, person, ( Maths, x) x ?
mathematician - (CompSci, x) x ? compScientist)
- addEdge (_, person, salary, _,
mathematician,salary_, compScientist,
salary) - addEdge (_, dept, avgDeptSalary,(Maths, avg
s (m,s)?_,mathematician,salary), - (CompSci, avg s (c,s)?_, compScientist,
salary)) - delEdge (_, mathematician, salary, (p, s)
(d, p) ? _, dept, person - (p, s) ? _, person, salary d Maths p
p) - delEdge (_, compScientist, salary, (p, s)
(d, p) ? _, dept, person - (p, s)? _, person, salary d CompSci
p p) - delNode (mathematician, p (d, p)? _, dept,
person d Maths) - delNode (compScientist, p (d, p) ? _, dept,
person d CompSci)
13Research Aims
- Develop techniques for the two topics based on
simple IQL queries - Investigate how the Automed transformation
pathways can be used for both of them - Design and Implement procedures for IVM and DLT
using Automed transformation pathways - Explore the relationship between the two
processes, and - Derive an integrated approach to combine the two
processes
14Research Approach
- Consider simple formulae and methods for simple
IQL queries - Investigate individual transformation step in a
transformation pathway - Design entire procedures using the whole Automed
transformation pathways and Implement them - Handle the situation with arbitrary IQL queries
15Preliminary ideas results achieved so far
- Data Lineage Tracing in Automed
- Incremental View Maintenance In Automed
16Incremental View Maintenance in AutoMed
- ?D the tuples inserted into base relation, D,
- ?D the tuples deleted from base relation, D,
- ?V the tuples inserted into the materialized
view, V, - ?V the tuples deleted from the materialized
view, V, - (?D, ?D, ?V and ?V are possibly empty.)
- (Minimality Condition) ?V and ?V should satisfy
the following minimality conditions - (a) ?V ? V we only delete tuples that are in
V - (b) ?V n ?V Ø we do not delete a tuple and
then reinsert it.
17Incremental view maintenance with simple IQL
queries (1-3)
18Incremental view maintenance with simple IQL
queries (2-3)
19Incremental view maintenance with simple IQL
queries (3-3)
20Incremental view maintenance using Automed
transformation pathways
S
GS
TP tp1, , tpr
D
V
21Data Lineage Tracing in Automed
Classification of Data Lineage
- Affect-provenance
- which includes all of the source data that had
some influence on the result data
- Origin-provenance
- that is the specific data in the source
databases from which the resulting data is
exactly extracted.
22Data Lineage with set semantics in IQL
Affect-set for a simple query in IQL ts
affect-set in T1, , Tm according to q to be
qAltT1, , Tmgt(t) ltT1, , Tmgt, where T1, ,
Tm are maximal subsets of T1, , Tm such
that (a) q(T1, , Tm) t (b) ?Ti
q(T1, , Ti, , Tm) t ? Ti ? Ti (c)
?Ti ?t ? Ti q(T1, , t, ,
Tm) ? Ø
Origin-set for a simple query in IQL ts
origin-set in T1, , Tm according to q to be
qAltT1, , Tmgt(t) ltT1, , Tmgt, where T1, ,
Tm are minimal subsets of T1, , Tm such
that (a) q(T1, , Tm) t (b) ?Ti Ti ?
Ti q(T1, , Ti, , Tm) ? t (c)
?Ti ?t ? Ti q(T1, , t, ,
Tm) ? Ø
23For Example
Set T1 1, 2, 3, T2 3, 4, 5, V T1 --T2
1, 2. We obtain 1s affect-set as follows 1.
To satisfy the conditions (a) and (b), T1
1, 3 T2 3, 4, 5 2. ? 3 -- 3, 4, 5
Ø, not satisfying (c) ? delete 3 fromT1,
then T1 1 3. The affect-set of 1 (1 ? V)
is T1 1 T2 3, 4, 5
24Data Lineage with bag semantics in IQL
Affect-Pool for a simple query in IQL ts
affect-pool in T1, , Tm according to q to be
qAltT1, , Tmgt(t) ltT1, , Tmgt, where T1, ,
Tm are maximal subsets of T1, , Tm such
that (a) q(T1, , Tm) x x ? T x t
(b) ?Ti q(T1, , Ti, , Tm) x x
? T x t ? Ti ? Ti (c) ?Ti ?t ? Ti
q(T1, , t, , Tm) ? Ø
Origin-Pool for a simple query in IQL ts
origin-pool in T1, , Tm according to q to be
qAltT1, , Tmgt(t) ltT1, , Tmgt, where T1, ,
Tm are minimal subsets of T1, , Tm such
that (a) q(T1, , Tm) x x ? T x
t (b) ?Ti ?t ? Ti q(T1, , x x ? Ti x
? t, , Tm) ? x x ? T x t (c) ?Ti
?t ? Ti q(T1, , t, , Tm) ? Ø
(d) ?Ti ??t t ? Ti, t ? (Ti -- Ti)
25Affect- and Origin-pool for a tuple with IQL
simple queries
26Integrating AutoMed primitive transformation step
- addNode / addEdge ? addConstruct
- delNode / delEdge ? delConstruct
- renameNode / renameEdge ? renameConstruct
- ignore addConstraint /delConstraint
27Pre-analysis of the data Lineage problem for each
Automed transformation step
a) For an addConstruct(O, q) transformation
The lineage of data in the schema construct O
is located in the constructs that appear in the
query q. b) For an renameConstruct(O, O)
transformation The lineage of data in the
schema construct O is located in the source
construct O. c) All delConstruct(O, q)
transformations can be ignored since they create
no schema construct.
28Attributes for each input item
t the tracing tuple. O a construct in
integrated schema GS relateTP the
transformation step that created O. extent
the current extent of O. tp a transformation
step in the transformation pathway
opreatorType add , del or ren query
the query used in tp (if any) source
O (for renameConstruct (O ,O)) or all
constructs appearing in the q (for addConstrucr
(O, q)) result O (for renameConstruct (O
,O) and addConstrucr (O, q))
29The Outputs of procedures
- The result of all procedures is
- a sequence of pair
- lt(D1, O1), , (Dn, On)gt
- In which (t is the tracing tuple)
- Di a bag contains ts derivation
- Oi the construct whose extent contains Di
30Tracing derivation Algorithm for a tuple
procedure affectPoolofTuple(t, O) begin
if (O.realteTP Ø) DL lt(t, O)gt else
D (O.extent O) O ? O.relateTP.source
D TQAPD(t) DL (B, B.construct) B?
D return(DL) end
procedure originPoolofTuple(t, O) begin
if (O.realteTP Ø) DL lt(t, O)gt
else D (O.extent O) O ?
O.relateTP.source D TQOPD(t) DL (B,
B.construct) B? D return(DL) end
Tracing affect-pool for a tuple t
Tracing origin-pool for a tuple t
31Tracing derivation Algorithm for a set of tuples
procedure affectPoolOfSet(T, O) input a tracing
tuple set T t1, , tn, the construct O which
contains tuple set T. output Ts affect pool,
DL begin DL ltgt //the empty
sequence for each t ? T do DL
merge(DL, affectPoolOfTuple(ti, O))
return(DL) end
Derivation Tracing Procedures for a set of tuples
32The merge procedure
proc merge(DL, DLnew) input data lineage
sequence DL lt(D1, O1) , (Dn, On)gt new data
lineage sequence DLnew output merged data
lineage sequence DL begin for each
(Dnew, Onew) ? DLnew do if Onew Oi for some Oi
in DL then oldData Di
newData oldData x x ?
Dnew not (member oldData x) DL (DL
(oldData, Oi)) (newData, Oi) else DL
DL (Dnew, Onew) return(DL) end
33Algorithm for tracing derivation through Automed
transformation pathways
procedure traceAffectPool(B, O) begin DL lt(B,
O)gt //initiating the DL for j r downto 1
do case (tpj.transfType del)
continue case (tpj.transfType
ren) if tpj.result Oi
for some Oi in DL then
DL (DL (Di, Oi)) (Di, tpj.source)
case (tj.transfType add)
if tpj.result Oi for some Oi in DL
then DL DL
(Di, Oi) Di
sortDistinct Di
DL merge(DL, affectPoolOfSet(Di, Oi))
endfor return(DL) end
Derivation Tracing Procedures for entire
transformation pathways
34Further work (1)
- In next 9 months for my PhD research, I plan to
- Handle more complex IQL queries appearing in
transformation pathways. (1 month) - Implement my DLT algorithms. (2 months)
- Combine my approach for DLT with the problem of
IVM (2 months) - Implement the integrated approach (2 months)
- Extend my algorithms to a more expressive
transformation language. (3 months)
35Further work (2)
- Compare my approaches with other ones
- Extend my algorithms for other purposes
- Propagating security/access control information
to the integrated database - Maintaining view schema using AutoMed
transformation pathways - Extending my algorithms to environment where
there is not information of the data sources - Apply and evaluate my techniques in the area of
genomic data warehouses, in collaboration with an
ongoing Bio-informatics project at Birkbeck, UCL
and EBI ("Structural and Functional Annotation of
Genomes through Synchronised Data Warehouses").
36Acknowledge
- Prof. Alexandra Poulovassilis
- School of Computer Science Information Systems
- Birkbeck college, University of London