Tracing Data Lineage Using Schema Transformation Pathways

1
Tracing Data Lineage Using Schema Transformation
Pathways
  • Hao Fan and Alexandra Poulovassilis
  • School of Computer Science and Information Systems
  • Birkbeck College, University of London
  • ECAI Workshop on Knowledge Transformation for
    the Semantic Web, 21st to 26th July 2002

2
What is Data Lineage?
For a given warehouse data item, how do we identify
the exact set of source data items that produced it?
3
The open problems of Data Lineage Tracing
  • Definition of Data Lineage
  • Derivation of Tracing Queries
  • Data Lineage Tracing procedures
  • Lineage Tracing with set/bag semantics
  • Lineage tracing using auxiliary views

4
Applications
  • On-line Data analysis and mining (OLAP/OLAM)
  • Scientific Databases
  • Data cleaning
  • Authorization management
  • View update problem

5
AutoMed
  • HDM (Hypergraph Data Model)
  • <Nodes, Edges, Constraints>
  • Instance / Extension Mapping (Ext_{S,I}(c))
  • 8 Primitive transformations
  • Composite transformations / Transformation
    Pathways
  • IQL language

6
Simple IQL queries
  • 1. q = D1 ++ D2 ++ ... ++ Dr
  • /* bag union */
  • 2. q = D1 -- D2
  • /* bag monus */
  • 3. q = group D
  • /* group a bag of pairs on their first
    component */
  • 4. q = sort D
  • 5. q = sortDistinct D
  • /* sort and remove duplicates */
  • 6. q = aggFun D
  • (aggFun = max, min, count, sum or avg)
  • 7. q = gc aggFun D
  • (aggFun = max, min, count, sum or avg)
  • /* group a bag of pairs on their first
    component and apply an aggregation function
    to the second component */
  • 8. q = [p | p <- D1; member D2 e]
  • /* members of D1 that are members of D2 */
  • 9. q = [p | p <- D1; not (member D2 e)]
  • /* members of D1 that are not members of D2 */
  • 10. q = [p | p1 <- D1; ...; pr <- Dr; c1; ...; ck]
  • /* a comprehension */
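
As a rough illustration of how some of these simple queries behave, here is a minimal Python sketch that models bags as Python lists. The function names (bag_union, bag_monus, group, sort_distinct, gc) and the exact output shapes are assumptions made for this example; they are not taken from the AutoMed IQL implementation.

```python
from collections import Counter, defaultdict

def bag_union(*bags):
    """q = D1 ++ D2 ++ ... ++ Dr: concatenate the bags, keeping duplicates."""
    return [x for bag in bags for x in bag]

def bag_monus(d1, d2):
    """q = D1 -- D2: drop one occurrence from D1 for each occurrence in D2."""
    remaining = Counter(d2)
    out = []
    for x in d1:
        if remaining[x] > 0:
            remaining[x] -= 1   # this copy is cancelled by a copy in D2
        else:
            out.append(x)
    return out

def group(pairs):
    """q = group D: group a bag of pairs on their first component."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def sort_distinct(bag):
    """q = sortDistinct D: sort and remove duplicates."""
    return sorted(set(bag))

def gc(agg_fun, pairs):
    """q = gc aggFun D: group on the first component, aggregate the second."""
    return [(key, agg_fun(values)) for key, values in group(pairs)]

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(bag_monus([1, 2, 2, 3], [2, 4]))
print(gc(max, pairs))
```

Lists are used instead of sets so that duplicate tuples survive, which is what distinguishes the bag operators from their set counterparts.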

7
Example for transforming between HDM schemas
[Figure: Schema S1 (nodes mathematician, compScientist, salary) is transformed by the pathway T_{S1,S2} into Schema S2 (nodes person, dept, salary, avgDeptSalary)]
8
The Transformation Pathway T_{S1,S2}
  • addNode (<<dept>>, {'Maths', 'CompSci'})
  • addNode (<<person>>, [x | x <- <<mathematician>>] ++
    [x | x <- <<compScientist>>])
  • addNode (<<avgDeptSalary>>,
    {avg [s | (m,s) <- <<_, mathematician, salary>>]} ++
    {avg [s | (c,s) <- <<_, compScientist, salary>>]})
  • addEdge (<<_, dept, person>>,
    [('Maths', x) | x <- <<mathematician>>] ++
    [('CompSci', x) | x <- <<compScientist>>])
  • addEdge (<<_, person, salary>>,
    <<_, mathematician, salary>> ++
    <<_, compScientist, salary>>)
  • addEdge (<<_, dept, avgDeptSalary>>,
    {('Maths', avg [s | (m,s) <- <<_, mathematician, salary>>]),
    ('CompSci', avg [s | (c,s) <- <<_, compScientist, salary>>])})
  • delEdge (<<_, mathematician, salary>>,
    [(p, s) | (d, p') <- <<_, dept, person>>;
    (p, s) <- <<_, person, salary>>; d = 'Maths'; p' = p])
  • delEdge (<<_, compScientist, salary>>,
    [(p, s) | (d, p') <- <<_, dept, person>>;
    (p, s) <- <<_, person, salary>>; d = 'CompSci'; p' = p])
  • delNode (<<mathematician>>,
    [p | (d, p) <- <<_, dept, person>>; d = 'Maths'])
  • delNode (<<compScientist>>,
    [p | (d, p) <- <<_, dept, person>>; d = 'CompSci'])
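
To make the pathway concrete, the following Python sketch evaluates the add steps over a small, made-up instance of S1 and then checks that the queries in two of the del steps rebuild the removed S1 constructs from S2. The sample names and salaries are hypothetical, and Python lists stand in for IQL bags; this is an illustration of the idea rather than AutoMed code.

```python
def avg(xs):
    return sum(xs) / len(xs)

# Hypothetical extents for schema S1 (not taken from the slides).
mathematician = ["alice", "bob"]
comp_scientist = ["carol"]
math_salary = [("alice", 30), ("bob", 50)]   # edge <<_, mathematician, salary>>
cs_salary = [("carol", 60)]                  # edge <<_, compScientist, salary>>

# add steps: each new S2 construct is defined by a query over existing constructs.
dept = ["Maths", "CompSci"]
person = mathematician + comp_scientist
dept_person = ([("Maths", x) for x in mathematician]
               + [("CompSci", x) for x in comp_scientist])
person_salary = math_salary + cs_salary
dept_avg = [("Maths", avg([s for (_, s) in math_salary])),
            ("CompSci", avg([s for (_, s) in cs_salary]))]

# del steps: each query shows how the removed S1 construct
# can be recovered from the remaining S2 constructs.
rebuilt_math_salary = [(p, s)
                       for (d, p1) in dept_person
                       for (p, s) in person_salary
                       if d == "Maths" and p1 == p]
rebuilt_mathematician = [p for (d, p) in dept_person if d == "Maths"]

print(dept_avg)
print(rebuilt_math_salary == math_salary)
```

Because every del step carries a query that reconstructs the deleted construct, the pathway can be read in either direction, which is the property the lineage tracing relies on.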

9
Data Lineage Tracing
There are two kinds of data lineage:
  • Affect-provenance: all of the source data that had
    some influence on the result data
  • Origin-provenance: the specific data in the source
    databases from which the result data is extracted

10
Our approach is to
  • Consider data lineage tracing for simple IQL
    queries
  • Handle one transformation step
  • Extend to an algorithm for whole transformation
    pathways
  • Extend to handle arbitrary IQL queries

11
Data Lineage with set semantics in IQL
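Affect-set for a simple query in IQL: t's
affect-set in $T_1, \ldots, T_m$ according to $q$ is defined to be
$q^{A}_{\langle T_1, \ldots, T_m \rangle}(t) = \langle T_1^*, \ldots, T_m^* \rangle$,
where $T_1^*, \ldots, T_m^*$ are maximal subsets of $T_1, \ldots, T_m$ such that
(a) $q(T_1^*, \ldots, T_m^*) = \{t\}$
(b) $\forall T_i' : q(T_1^*, \ldots, T_i', \ldots, T_m^*) = \{t\} \Rightarrow T_i' \subseteq T_i^*$
(c) $\forall T_i^* : \forall t^* \in T_i^* : q(T_1^*, \ldots, \{t^*\}, \ldots, T_m^*) \neq \emptyset$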
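Origin-set for a simple query in IQL: t's
origin-set in $T_1, \ldots, T_m$ according to $q$ is defined to be
$q^{O}_{\langle T_1, \ldots, T_m \rangle}(t) = \langle T_1^*, \ldots, T_m^* \rangle$,
where $T_1^*, \ldots, T_m^*$ are minimal subsets of $T_1, \ldots, T_m$ such that
(a) $q(T_1^*, \ldots, T_m^*) = \{t\}$
(b) $\forall T_i' : T_i' \subset T_i^* \Rightarrow q(T_1^*, \ldots, T_i', \ldots, T_m^*) \neq \{t\}$
(c) $\forall T_i^* : \forall t^* \in T_i^* : q(T_1^*, \ldots, \{t^*\}, \ldots, T_m^*) \neq \emptyset$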
12
Example
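Let $T_1 = \{1, 2, 3\}$, $T_2 = \{3, 4, 5\}$ and $V = T_1 \mathbin{\texttt{--}} T_2 = \{1, 2\}$.
We obtain 1's affect-set as follows:
1. To satisfy conditions (a) and (b), $T_1^* = \{1, 3\}$ and $T_2^* = \{3, 4, 5\}$.
2. Because $\{3\} \mathbin{\texttt{--}} \{3, 4, 5\} = \emptyset$, condition (c) is not satisfied,
so we delete 3 from $T_1^*$, giving $T_1^* = \{1\}$.
3. The affect-set of 1 ($1 \in V$) is $T_1^* = \{1\}$, $T_2^* = \{3, 4, 5\}$.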
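
The example can be checked by brute force. The Python sketch below enumerates every pair of subsets of T1 and T2, keeps those satisfying conditions (a) and (c), and then selects the componentwise-maximal pairs (for the affect-set) or componentwise-minimal pairs (for the origin-set). Reading conditions (b) as "maximal/minimal among the candidates that satisfy (a) and (c)" is an assumption of this sketch, and the helper names are hypothetical; it is only practical for tiny sets.

```python
from itertools import combinations

def subsets(s):
    """All subsets of s, as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def monus(t1, t2):
    """T1 -- T2 under set semantics."""
    return t1 - t2

def candidates(q, t1, t2, t):
    """Pairs (S1, S2) of subsets satisfying conditions (a) and (c)."""
    out = []
    for s1 in subsets(t1):
        for s2 in subsets(t2):
            cond_a = q(s1, s2) == {t}
            cond_c = (all(q(frozenset({x}), s2) for x in s1)
                      and all(q(s1, frozenset({x})) for x in s2))
            if cond_a and cond_c:
                out.append((s1, s2))
    return out

def affect_sets(q, t1, t2, t):
    """Candidates not contained componentwise in any other candidate."""
    cands = candidates(q, t1, t2, t)
    return [c for c in cands
            if not any(o != c and c[0] <= o[0] and c[1] <= o[1] for o in cands)]

def origin_sets(q, t1, t2, t):
    """Candidates that do not contain any other candidate componentwise."""
    cands = candidates(q, t1, t2, t)
    return [c for c in cands
            if not any(o != c and o[0] <= c[0] and o[1] <= c[1] for o in cands)]

T1, T2 = {1, 2, 3}, {3, 4, 5}
print(affect_sets(monus, T1, T2, 1))
print(origin_sets(monus, T1, T2, 1))
```

For this input the only maximal candidate is T1* = {1}, T2* = {3, 4, 5}, which agrees with step 3 of the example.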
13
Data Lineage with bag semantics in IQL
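Affect-pool for a simple query in IQL: t's
affect-pool in $T_1, \ldots, T_m$ according to $q$ is defined to be
$q^{AP}_{\langle T_1, \ldots, T_m \rangle}(t) = \langle T_1^*, \ldots, T_m^* \rangle$,
where $T_1^*, \ldots, T_m^*$ are maximal sub-bags of $T_1, \ldots, T_m$ such that
(a) $q(T_1^*, \ldots, T_m^*) = [\,x \mid x \leftarrow T;\ x = t\,]$
(b) $\forall T_i' : q(T_1^*, \ldots, T_i', \ldots, T_m^*) = [\,x \mid x \leftarrow T;\ x = t\,] \Rightarrow T_i' \subseteq T_i^*$
(c) $\forall T_i^* : \forall t^* \in T_i^* : q(T_1^*, \ldots, [t^*], \ldots, T_m^*) \neq \emptyset$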
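Origin-pool for a simple query in IQL: t's
origin-pool in $T_1, \ldots, T_m$ according to $q$ is defined to be
$q^{OP}_{\langle T_1, \ldots, T_m \rangle}(t) = \langle T_1^*, \ldots, T_m^* \rangle$,
where $T_1^*, \ldots, T_m^*$ are minimal sub-bags of $T_1, \ldots, T_m$ such that
(a) $q(T_1^*, \ldots, T_m^*) = [\,x \mid x \leftarrow T;\ x = t\,]$
(b) $\forall T_i^* : \forall t^* \in T_i^* : q(T_1^*, \ldots, [\,x \mid x \leftarrow T_i^*;\ x \neq t^*\,], \ldots, T_m^*) \neq [\,x \mid x \leftarrow T;\ x = t\,]$
(c) $\forall T_i^* : \forall t^* \in T_i^* : q(T_1^*, \ldots, [t^*], \ldots, T_m^*) \neq \emptyset$
(d) $\forall T_i^* : \neg\exists t^* : t^* \in T_i^*,\ t^* \in (T_i \mathbin{\texttt{--}} T_i^*)$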
14
Affect- and Origin-pool for a tuple with IQL
simple queries

15
Affect- and Origin-pool for a query-sequence
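$V = Q(D) = q_1 \circ q_2 \circ \cdots \circ q_r(D) = q_r(q_{r-1}(\cdots(q_1(D))\cdots))$
t's affect-pool in $D$ according to $Q$ is defined to be
$Q^{AP}_D(t) = D^*$, where $D_i^* = q_i^{AP}(D_{i+1}^*)$ $(1 \le i \le r)$,
$D_{r+1}^* = [t]$ and $D^* = D_1^*$.
t's origin-pool in $D$ according to $Q$ is defined to be
$Q^{OP}_D(t) = D^*$, where $D_i^* = q_i^{OP}(D_{i+1}^*)$ $(1 \le i \le r)$,
$D_{r+1}^* = [t]$ and $D^* = D_1^*$.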
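
The backward recursion in this definition can be sketched in Python as a fold over the query sequence in reverse order. The sketch assumes each query q_i comes with a tracing function that maps a bag of traced tuples in its output to their pool in its input. The helper copies_in is a hypothetical stand-in that is adequate only for the two unary queries used here (a filtering comprehension and sortDistinct), where the pool of a tuple is all of its copies in the input bag.

```python
def trace_sequence(tracers, t):
    """Compute D*_1 from D*_{r+1} = [t] by applying q_r's tracer first, q_1's last."""
    pool = [t]
    for tracer in reversed(tracers):
        pool = tracer(pool)
    return pool

def copies_in(bag):
    """Tracer for a query over `bag`: all copies in `bag` of the traced tuples."""
    def tracer(traced):
        wanted = set(traced)
        return [x for x in bag if x in wanted]
    return tracer

# Hypothetical query sequence over a single source bag D.
D = [3, 1, 3, 2, 1]
d2 = [x for x in D if x % 2 == 1]   # q1 = [x | x <- D; x is odd]
V = sorted(set(d2))                 # q2 = sortDistinct

tracers = [copies_in(D), copies_in(d2)]   # tracers for q1 and q2, in query order
print(trace_sequence(tracers, 3))
```

Each intermediate pool becomes the traced input of the preceding query, so the only state carried between steps is the current bag of tuples.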
16
Analysis of the data lineage problem for each
AutoMed transformation step
  • a) For an addConstruct(O, q) transformation: the
    lineage of data in schema construct O is located in
    the constructs that appear in the query q.
  • b) For a renameConstruct(O', O) transformation: the
    lineage of data in O is located in the source
    construct O'.
  • c) All delConstruct(O, q) transformations can be
    ignored, since they create no schema construct.
17
Attributes for each input item
  • t: the tracing tuple
  • O: a construct in the integrated schema GS
  • relateTP: the transformation step that created O
  • extent: the current extent of O
  • tp: a transformation step in the transformation
    pathway
  • operatorType: add, del or ren
  • query: the query used in tp (if any)
  • source: O' (for renameConstruct(O', O)) or all
    constructs appearing in q (for addConstruct(O, q))
  • result: O (for renameConstruct(O', O) and
    addConstruct(O, q))
18
Output of the procedures
  • The result of all tracing procedures is
  • a sequence of pairs
  • <(D1, O1), ..., (Dn, On)>
  • in which
  • Di is a bag containing the derivation
  • Oi is the construct whose extent contains Di

19
Tracing derivation for a tuple
Tracing affect-pool for a tuple t
procedure affectPoolOfTuple(t, O)
begin
  if (O.relateTP = Ø)
    DL := <(t, O)>
  else
    D := [(O'.extent, O') | O' <- O.relateTP.source]
    D* := TQ^AP_D(t)
    DL := [(B, B.construct) | B <- D*]
  return(DL)
end
Tracing origin-pool for a tuple t
procedure originPoolOfTuple(t, O)
begin
  if (O.relateTP = Ø)
    DL := <(t, O)>
  else
    D := [(O'.extent, O') | O' <- O.relateTP.source]
    D* := TQ^OP_D(t)
    DL := [(B, B.construct) | B <- D*]
  return(DL)
end
20
Tracing Affect-Pool for a set of tuples
procedure affectPoolOfSet(T, O)
input: a tracing tuple set T = {t1, ..., tn}; the
  construct O which contains tuple set T
output: T's affect-pool, DL
begin
  DL := <>   // the empty sequence
  for each ti ∈ T do
    DL := merge(DL, affectPoolOfTuple(ti, O))
  return(DL)
end
The procedure originPoolOfSet is similar
21
The merge procedure
proc merge(DL, DLnew)
input: data lineage sequence DL = <(D1, O1), ..., (Dn, On)>;
  new data lineage sequence DLnew
output: merged data lineage sequence DL
begin
  for each (Dnew, Onew) ∈ DLnew do
    if Onew = Oi for some Oi in DL then
      oldData := Di
      newData := oldData ++ [x | x <- Dnew; not (member oldData x)]
      DL := (DL -- [(oldData, Oi)]) ++ [(newData, Oi)]
    else
      DL := DL ++ [(Dnew, Onew)]
  return(DL)
end
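
A minimal Python sketch of the merge procedure follows, assuming a lineage sequence is a list of (bag, construct) pairs with bags as lists and constructs identified by name. As in the pseudocode, an existing pair for the same construct is removed and the extended pair is appended at the end; the representation itself is an assumption of this sketch.

```python
def merge(dl, dl_new):
    """Merge the pairs of dl_new into dl, one construct at a time."""
    dl = list(dl)  # work on a copy so the caller's sequence is untouched
    for d_new, o_new in dl_new:
        # Position of an existing pair for the same construct, if any.
        idx = next((i for i, (_, o) in enumerate(dl) if o == o_new), None)
        if idx is None:
            dl.append((list(d_new), o_new))
        else:
            old_data, _ = dl.pop(idx)
            # Add only the tuples that are not already recorded.
            new_data = old_data + [x for x in d_new if x not in old_data]
            dl.append((new_data, o_new))
    return dl

print(merge([([1, 2], "A")], [([2, 3], "A"), ([9], "B")]))
```

Tuples already present for a construct are not added again, so tracing several result tuples back to the same source data does not inflate the recorded lineage.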
22
Algorithm for tracing Affect-Pool through a
transformation pathway
procedure traceAffectPool(B, O)
begin
  DL := <(B, O)>   // initialising DL
  for j := r downto 1 do
    case (tpj.transfType = del)
      continue
    case (tpj.transfType = ren)
      if tpj.result = Oi for some Oi in DL then
        DL := (DL -- [(Di, Oi)]) ++ [(Di, tpj.source)]
    case (tpj.transfType = add)
      if tpj.result = Oi for some Oi in DL then
        DL := DL -- [(Di, Oi)]
        Di := sortDistinct Di
        DL := merge(DL, affectPoolOfSet(Di, Oi))
  endfor
  return(DL)
end
The procedure traceOriginPool is similar
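
The backward walk over the pathway can be sketched in Python as below. This is an illustration under several assumptions: the Step class and its field names are hypothetical, each add step carries its own trace function as a stand-in for affectPoolOfSet, and the sample extents and the rename of person to staff are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    kind: str                          # "add", "del" or "ren"
    result: Optional[str] = None       # construct produced (add, ren)
    source: Optional[str] = None       # renamed-from construct (ren only)
    trace: Optional[Callable] = None   # stand-in for affectPoolOfSet (add only)

def merge(dl, dl_new):
    """Merge lineage pairs, extending the bag of an already-present construct."""
    dl = list(dl)
    for d_new, o_new in dl_new:
        idx = next((i for i, (_, o) in enumerate(dl) if o == o_new), None)
        old = dl.pop(idx)[0] if idx is not None else []
        dl.append((old + [x for x in d_new if x not in old], o_new))
    return dl

def trace_affect_pool(bag, construct, pathway):
    dl = [(list(bag), construct)]
    for step in reversed(pathway):           # j = r downto 1
        if step.kind == "del":
            continue                         # del steps create no construct
        idx = next((i for i, (_, o) in enumerate(dl) if o == step.result), None)
        if idx is None:
            continue                         # step is irrelevant to the traced data
        data, _ = dl.pop(idx)
        if step.kind == "ren":
            dl.append((data, step.source))   # same data, earlier construct name
        elif step.kind == "add":
            data = sorted(set(data))         # sortDistinct
            dl = merge(dl, step.trace(data))
    return dl

# Hypothetical source extents and tracer for person = mathematician ++ compScientist.
mathematician = ["alice", "bob"]
comp_scientist = ["carol"]

def trace_person(tuples):
    out = []
    for name, extent in (("mathematician", mathematician),
                         ("compScientist", comp_scientist)):
        hits = [x for x in extent if x in tuples]
        if hits:
            out.append((hits, name))
    return out

pathway = [Step("add", result="person", trace=trace_person),
           Step("del"),
           Step("ren", result="staff", source="person")]
print(trace_affect_pool(["alice", "carol", "alice"], "staff", pathway))
```

Passing the tracer in with each add step keeps the walk independent of how tracing queries are derived for a particular IQL query.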
23
Contributions
  • We have shown how individual steps of schema
    transformation pathways can be used for tracing
    affect-pool and origin-pool
  • Although shown for the HDM data model and the IQL
    query language, our approach is more generally
    applicable to other data models and query
    languages
  • It is also applicable to inter-model
    transformation pathways
  • In particular, it could be applied to data lineage
    tracing for derived information in the Semantic
    Web, to trace the source resources for the
    information

24
Ongoing work
  • Handling more complex IQL queries appearing in
    transformation pathways.
  • Implementing our lineage tracing algorithms.
  • Combining our approach for tracing data lineage
    with the problem of incremental view maintenance.
  • Implementing the integrated approach.
  • Extending the algorithms to a more expressive
    transformation language.