Incremental View Maintenance and Data Lineage Tracing in Heterogeneous Database Environments - PowerPoint PPT Presentation

1 / 36
About This Presentation
Title:

Incremental View Maintenance and Data Lineage Tracing in Heterogeneous Database Environments

Description:

Hao Fan. School of Computer Science & Information Systems. Birkbeck ... For a given ware-house data item, how to identify the exact set of source data items. ... – PowerPoint PPT presentation

Number of Views:61
Avg rating:3.0/5.0
Slides: 37
Provided by: Hao99
Category:

less

Transcript and Presenter's Notes

Title: Incremental View Maintenance and Data Lineage Tracing in Heterogeneous Database Environments


1
Incremental View Maintenance and Data Lineage
Tracing in Heterogeneous Database Environments
  • Hao Fan
  • School of Computer Science Information Systems
  • Birkbeck college, University of London
  • 9th July 2002

2
Outline of presentation
  • Introduction
  • Background of my research
  • AutoMed, IQL
  • Research aims and approaches
  • Research results so far
  • Future work

3
Introduction
  • What is Incremental View Maintenance
  • The open problems of Incremental View Maintenance
  • What is Data Lineage
  • The open problems of Data Lineage
  • Applications of the two topics

4
What is Incremental View Maintenance (IVM)
V
V
Computing changes of the view based on the
changes in the source data-bases, which is much
cheaper than recomputing the view from scratch.
OP
OP
D1
D2
d1
d2
5
The open problems of Incremental View Maintenance
  • Incremental view maintenance algorithms
  • How to deal with the duplicate elements
  • Algorithms in multi-source environment
  • Incremental view schema maintenance
  • Self-maintainable view

6
What is Data Lineage
For a given ware-house data item, how to identify
the exact set of source data items.
7
The open problems of Data Lineage Tracing (DLT)
  • Definition of Data Lineage
  • Derivation Tracing Query
  • Data Lineage Tracing procedures
  • Lineage Tracing with set/bag semantics
  • Lineage tracing using auxiliary views

8
Applications
  • On-line Data analysis and Data mining (OLAP/OLAM)
  • Scientific Databases
  • Data cleaning
  • Authorization management
  • Materialized view schema evolution
  • View update problem

9
AutoMed
  • HDM (Hypergraph Data Model)
  • ltNodes, Edges, Constraintsgt
  • Instance / Extension Mapping (Exts,I(c))
  • 8 Primitive Transformations
  • Composite Transformations / Transformation
    Pathways
  • IQL language

10
Simple IQL queries
  • 7. q gc aggFun D
  • (aggFun max min count sum
    avg)
  • /group and compute a bag of pairs on their
    first component and apply an aggregation function
    to the second component/
  • 8. q p p ? D1 member D2 e /members of D1
    that are members of D2/
  • 9. q p p ? D1 not (member D2 e) /members
    of D1 that are not members of D2/
  • 10. q p ? p1 ? D1 pr ? Dr c1 ck / a
    comprehension/
  • 1. q D1 D2 Dr
  • / bag union/
  • 2. q D1 -- D2
  • / bag monus /
  • 3. q group D
  • / group a bag of pairs on their first
    component/
  • 4. q sort D
  • 5. q sortDistinct D
  • /sort and remove duplicates/
  • 6. q aggFun D
  • (aggFun max min count sum
    avg)

11
Example for transforming between HDM schemas
person
mathematician
compScientist
dept
salary
salary
TS1,S2
avgDeptSalary
Schema S1
Schema S2
12
The Transformation Pathway TS1,S2
  • addNode (dept, Maths,CompSci)
  • addNode (person, x x ? mathematician x
    x ? compScientist)
  • addNode (avgDeptSalary, avg s (m,s)?_,
    mathematician, salary
  • avg s (c,s)?_, compScientist,
    salary)
  • addEdge (_, dept, person, ( Maths, x) x ?
    mathematician
  • (CompSci, x) x ? compScientist)
  • addEdge (_, person, salary, _,
    mathematician,salary_, compScientist,
    salary)
  • addEdge (_, dept, avgDeptSalary,(Maths, avg
    s (m,s)?_,mathematician,salary),
  • (CompSci, avg s (c,s)?_, compScientist,
    salary))
  • delEdge (_, mathematician, salary, (p, s)
    (d, p) ? _, dept, person
  • (p, s) ? _, person, salary d Maths p
    p)
  • delEdge (_, compScientist, salary, (p, s)
    (d, p) ? _, dept, person
  • (p, s)? _, person, salary d CompSci
    p p)
  • delNode (mathematician, p (d, p)? _, dept,
    person d Maths)
  • delNode (compScientist, p (d, p) ? _, dept,
    person d CompSci)

13
Research Aims
  • Develop techniques for the two topics based on
    simple IQL queries
  • Investigate how the Automed transformation
    pathways can be used for both of them
  • Design and Implement procedures for IVM and DLT
    using Automed transformation pathways
  • Explore the relationship between the two
    processes, and
  • Derive an integrated approach to combine the two
    processes

14
Research Approach
  • Consider simple formulae and methods for simple
    IQL queries
  • Investigate individual transformation step in a
    transformation pathway
  • Design entire procedures using the whole Automed
    transformation pathways and Implement them
  • Handle the situation with arbitrary IQL queries

15
Preliminary ideas results achieved so far
  • Data Lineage Tracing in Automed
  • Incremental View Maintenance In Automed

16
Incremental View Maintenance in AutoMed
  • ?D the tuples inserted into base relation, D,
  • ?D the tuples deleted from base relation, D,
  • ?V the tuples inserted into the materialized
    view, V,
  • ?V the tuples deleted from the materialized
    view, V,
  • (?D, ?D, ?V and ?V are possibly empty.)
  • (Minimality Condition) ?V and ?V should satisfy
    the following minimality conditions
  • (a) ?V ? V we only delete tuples that are in
    V
  • (b) ?V n ?V Ø we do not delete a tuple and
    then reinsert it.

17
Incremental view maintenance with simple IQL
queries (1-3)
18
Incremental view maintenance with simple IQL
queries (2-3)
19
Incremental view maintenance with simple IQL
queries (3-3)
20
Incremental view maintenance using Automed
transformation pathways
S
GS
TP tp1, , tpr
D
V
21
Data Lineage Tracing in Automed
Classification of Data Lineage
  • Affect-provenance
  • which includes all of the source data that had
    some influence on the result data
  • Origin-provenance
  • that is the specific data in the source
    databases from which the resulting data is
    exactly extracted.

22
Data Lineage with set semantics in IQL
Affect-set for a simple query in IQL ts
affect-set in T1, , Tm according to q to be
qAltT1, , Tmgt(t) ltT1, , Tmgt, where T1, ,
Tm are maximal subsets of T1, , Tm such
that (a)   q(T1, , Tm) t (b)  ?Ti
q(T1, , Ti, , Tm) t ? Ti ? Ti (c)
?Ti ?t ? Ti q(T1, , t, ,
Tm) ? Ø
Origin-set for a simple query in IQL ts
origin-set in T1, , Tm according to q to be
qAltT1, , Tmgt(t) ltT1, , Tmgt, where T1, ,
Tm are minimal subsets of T1, , Tm such
that (a)   q(T1, , Tm) t (b)  ?Ti Ti ?
Ti q(T1, , Ti, , Tm) ? t (c)
?Ti ?t ? Ti q(T1, , t, ,
Tm) ? Ø
23
For Example
Set T1 1, 2, 3, T2 3, 4, 5, V T1 --T2
1, 2. We obtain 1s affect-set as follows 1.
To satisfy the conditions (a) and (b), T1
1, 3 T2 3, 4, 5 2. ? 3 -- 3, 4, 5
Ø, not satisfying (c) ? delete 3 fromT1,
then T1 1 3. The affect-set of 1 (1 ? V)
is T1 1 T2 3, 4, 5
24
Data Lineage with bag semantics in IQL
Affect-Pool for a simple query in IQL ts
affect-pool in T1, , Tm according to q to be
qAltT1, , Tmgt(t) ltT1, , Tmgt, where T1, ,
Tm are maximal subsets of T1, , Tm such
that (a)  q(T1, , Tm) x x ? T x t
(b)  ?Ti q(T1, , Ti, , Tm) x x
? T x t ? Ti ? Ti (c) ?Ti ?t ? Ti
q(T1, , t, , Tm) ? Ø
Origin-Pool for a simple query in IQL ts
origin-pool in T1, , Tm according to q to be
qAltT1, , Tmgt(t) ltT1, , Tmgt, where T1, ,
Tm are minimal subsets of T1, , Tm such
that (a)  q(T1, , Tm) x x ? T x
t (b)  ?Ti ?t ? Ti q(T1, , x x ? Ti x
? t, , Tm) ? x x ? T x t (c) ?Ti
?t ? Ti q(T1, , t, , Tm) ? Ø
(d) ?Ti ??t t ? Ti, t ? (Ti -- Ti)
25
Affect- and Origin-pool for a tuple with IQL
simple queries

26
Integrating AutoMed primitive transformation step
  • addNode / addEdge ? addConstruct
  • delNode / delEdge ? delConstruct
  • renameNode / renameEdge ? renameConstruct
  • ignore addConstraint /delConstraint

27
Pre-analysis of the data Lineage problem for each
Automed transformation step
a) For an addConstruct(O, q) transformation
The lineage of data in the schema construct O
is located in the constructs that appear in the
query q. b) For an renameConstruct(O, O)
transformation The lineage of data in the
schema construct O is located in the source
construct O. c) All delConstruct(O, q)
transformations can be ignored since they create
no schema construct.
28
Attributes for each input item
t the tracing tuple. O a construct in
integrated schema GS relateTP the
transformation step that created O. extent
the current extent of O. tp a transformation
step in the transformation pathway
opreatorType add , del or ren query
the query used in tp (if any) source
O (for renameConstruct (O ,O)) or all
constructs appearing in the q (for addConstrucr
(O, q)) result O (for renameConstruct (O
,O) and addConstrucr (O, q))
29
The Outputs of procedures
  • The result of all procedures is
  • a sequence of pair
  • lt(D1, O1), , (Dn, On)gt
  • In which (t is the tracing tuple)
  • Di a bag contains ts derivation
  • Oi the construct whose extent contains Di

30
Tracing derivation Algorithm for a tuple
procedure affectPoolofTuple(t, O) begin
if (O.realteTP Ø) DL lt(t, O)gt else
D (O.extent O) O ? O.relateTP.source
D TQAPD(t) DL (B, B.construct) B?
D return(DL) end
procedure originPoolofTuple(t, O) begin
if (O.realteTP Ø) DL lt(t, O)gt
else D (O.extent O) O ?
O.relateTP.source D TQOPD(t) DL (B,
B.construct) B? D return(DL) end
Tracing affect-pool for a tuple t
Tracing origin-pool for a tuple t
31
Tracing derivation Algorithm for a set of tuples
procedure affectPoolOfSet(T, O) input a tracing
tuple set T t1, , tn, the construct O which
contains tuple set T. output Ts affect pool,
DL begin DL ltgt //the empty
sequence for each t ? T do DL
merge(DL, affectPoolOfTuple(ti, O))
return(DL) end
Derivation Tracing Procedures for a set of tuples
32
The merge procedure
proc merge(DL, DLnew) input data lineage
sequence DL lt(D1, O1) , (Dn, On)gt new data
lineage sequence DLnew output merged data
lineage sequence DL begin for each
(Dnew, Onew) ? DLnew do if Onew Oi for some Oi
in DL then oldData Di
newData oldData x x ?
Dnew not (member oldData x) DL (DL
(oldData, Oi)) (newData, Oi) else DL
DL (Dnew, Onew) return(DL) end
33
Algorithm for tracing derivation through Automed
transformation pathways
procedure traceAffectPool(B, O) begin DL lt(B,
O)gt //initiating the DL for j r downto 1
do case (tpj.transfType del)
continue case (tpj.transfType
ren) if tpj.result Oi
for some Oi in DL then
DL (DL (Di, Oi)) (Di, tpj.source)
case (tj.transfType add)
if tpj.result Oi for some Oi in DL
then DL DL
(Di, Oi) Di
sortDistinct Di
DL merge(DL, affectPoolOfSet(Di, Oi))
endfor return(DL) end
Derivation Tracing Procedures for entire
transformation pathways
34
Further work (1)
  • In next 9 months for my PhD research, I plan to
  • Handle more complex IQL queries appearing in
    transformation pathways. (1 month)
  • Implement my DLT algorithms. (2 months)
  • Combine my approach for DLT with the problem of
    IVM (2 months)
  • Implement the integrated approach (2 months)
  • Extend my algorithms to a more expressive
    transformation language. (3 months)

35
Further work (2)
  • Compare my approaches with other ones
  • Extend my algorithms for other purposes
  • Propagating security/access control information
    to the integrated database
  • Maintaining view schema using AutoMed
    transformation pathways
  • Extending my algorithms to environment where
    there is not information of the data sources
  • Apply and evaluate my techniques in the area of
    genomic data warehouses, in collaboration with an
    ongoing Bio-informatics project at Birkbeck, UCL
    and EBI ("Structural and Functional Annotation of
    Genomes through Synchronised Data Warehouses").

36
Acknowledge
  • Prof. Alexandra Poulovassilis
  • School of Computer Science Information Systems
  • Birkbeck college, University of London
Write a Comment
User Comments (0)
About PowerShow.com