Title: A Graphbased Framework for the Modeling of ETL Activities
1A Graph-based Framework for the Modeling of ETL
Activities
- Panos Vassiliadis1, Alkis Simitsis2, Spiros
Skiadopoulos2, Manolis Terrovitis2, Panos
Georgrantas2 - (1) University of Ioannina
- pvassil_at_cs.uoi.gr
- (2) National Technical University of Athens
- asimi,spiros,mter,pgeor_at_dbnet.ece.ntua.gr
2Outline
- Background Motivation
- Graph Model for ETL activities
- Importance Metrics for ETL activities
- Template Activities
- Conclusions and Future Work
3Outline
- Background Motivation
- Graph Model for ETL activities
- Importance Metrics for ETL activities
- Template Activities
- Conclusions and Future Work
4Extract-Transform-Load (ETL)
5Motivation
DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY
DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY
DS.PS_NEW1
SUPPKEY1
COST
DATE
DS.PS1
SK1
2
A2EDate
DIFF1
Add_SPK1
DS.PS_OLD1
U
rejected
rejected
rejected
Log
Log
Log
DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY
DS.PS_NEW2
DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY
SUPPKEY2
COST
DATESYSDATE
QTYgt0
DS.PS2
AddDate
NotNULL
Add_SPK2
SK2
CheckQTY
DIFF2
DS.PS_OLD2
rejected
rejected
Log
Log
DSA
PKEY, DAY MIN(COST)
DW.PARTSUPP
S1_PARTSUPP
V1
Aggregate1
FTP1
PKEY, MONTH AVG(COST)
DW.PARTSUPP.DATE, DAY
TIME
S2_PARTSUPP
V2
Aggregate2
??
FTP2
Sources
DW
6Motivation
7Motivation
- What part of the scenario is affected if we
delete an attribute? - Which attributes/tables are involved in the
population of attribute DW.PARTSUPP.SUPPKEY ? - How good is my design of the ETL scenario?
8Outline
- Background Motivation
- Graph Model for ETL activities
- Importance Metrics for ETL activities
- Template Activities
- Conclusions and Future Work
9Graph Modeling of ETL Scenarios
- To support the needs of the designer and the
administrator, we model each ETL scenario as a
graph, which we call Architecture Graph - We take advantage of the data-centric,
script-based nature of ETL activities to model
their internals as graphs, too - We give simple visual representations for reasons
of intuition
10Preliminaries
- Data types
- Constants
- Attributes
- RecordSets
- Function types
- Functions
1
PKEY
R
2
my2
11Activities
- Name
- Input Schemata
- Output Schema
- Rejections Schema
- Parameter List
- Output/Rejection Operational Semantics
12Activities
13Relationships
- Instance-Of Relationships
- Part-Of Relationships
- Regulator Relationships
- Provider Relationships
- Derived Provider
- Relationships
14Part-Of Relationships
15Regulator Relationships
16Provider Relationships
17Derived Provider Relationships
IN
IN
OUT
OUT
PAR
OUT
IN
PKEY
PKEY
SOURCE
LU_PKEY
LU_SOURCE
LU_SKEY
18Outline
- Background Motivation
- Graph Model for ETL activities
- Importance Metrics for ETL activities
- Template Activities
- Conclusions and Future Work
19Importance Metrics
- Dependency the in-degree of the node with
respect to the provider edges - Responsibility the out-degree of the node with
respect to the provider edges - Degree dependency responsibility
- Local vs. Transitive
20Importance Metrics
IN
IN
OUT
OUT
LOOKUP_PS
21Answers to motivating questions
- What part of the scenario is affected if we
delete an attribute? - Which attributes/tables are involved in the
population of an attribute?
Transitive dependencies and responsibilities
22Answers to motivating questions
- How good is my design of the ETL scenario?
- Vulnerable points
- Detection of inconsistencies
- Detection of important data stores
- Detection of useless (source) attributes.
23Outline
- Background Motivation
- Graph Model for ETL activities
- Importance Metrics for ETL activities
- Template Activities
- Conclusions and Future Work
24Template Activities
25Template Activities
- Name
- Parameter List
- Expression in LDL (a Datalog variant)
- Mapping a set of bindings, mapping input to
output attributes, possibly through intermediate
placeholders.
26Notation
- Variables to be replaced by their values at
instantiation time - ArityOf() returns the arity of an input, output
or parameter schema - Loops the loop body is repeated at instantiation
time as many times as the iterator constraint
defines - Keywords to simplify the creation of unique
predicate and attribute names.
27Example Function Application
- fa12_out(FA12_OUT_1, FA12_OUT_2, FA12_OUT_3,
OUTFIELD)lt- -
- fa12_in(FA12_IN_1, FA12_IN_2, FA12_IN_3),
-
- f1(FA12_IN_2, FA12_IN_3,OUTFIELD),
-
- FA12_OUT_1FA12_IN_1,FA12_OUT_1FA12_IN_2,
- FA12_OUT_1FA12_IN_3.
FA12_IN_1
FA12_OUT_1
FA12_IN_2
FA12_OUT_2
FA12_IN_3
FA12_OUT_3
OUTFIELD
f1
28Example Function Application
29Example Function Application
a_out(iltarityOf(a_in)1A_OUT_i, OUTFIELD)
lt- a_in(ilt arityOf(a_in) A_IN_i,
iarityOf(a_in) A_IN_i), _at_FUNCTION(ilt
arityOf(_at_PARAMi)1 _at_PARAMi,
OUTFIELD), iltarityOf(a_in) A_OUT_iA_IN_i,
iarityOf(a_in) A_OUT_iA_IN_i
- _at_FUNCTION F1
- _at_PARAM1A_IN_2
- _at_PARAM2A_IN_3
- a_out(A_OUT_1, A_OUT_2, A_OUT_3, OUTFIELD)lt-
- a_in(A_IN_1, A_IN_2, A_IN_3),
- _at_FUNCTION(_at_PARAM1,_at_PARAM2,OUTFIELD),
- A_OUT_1A_IN_1,
- A_OUT_1A_IN_2,
- A_OUT_1A_IN_3.
Parameter Init. Loop production
30Example Function Application
Parameter replacement
a_out(A_OUT_1, A_OUT_2, A_OUT_3,
OUTFIELD)lt- a_in(A_IN_1, A_IN_2,
A_IN_3), f1(A_IN_2, A_IN_3,OUTFIELD) A_OUT_1A_
IN_1, A_OUT_1A_IN_2, A_OUT_1A_IN_3.
31Example Function Application
- fa12_out(FA12_OUT_1, FA12_OUT_2, FA12_OUT_3,
OUTFIELD)lt- -
- fa12_in(FA12_IN_1, FA12_IN_2, FA12_IN_3),
-
- f1(FA12_IN_2, FA12_IN_3,OUTFIELD),
-
- FA12_OUT_1FA12_IN_1,FA12_OUT_1FA12_IN_2,
- FA12_OUT_1FA12_IN_3.
Keyword Renaming
32Loops
- The general form of loops is
- ltsimple constraintgt ltloop bodygt
- where simple constraint has the form
- ltlower boundgt ltcomparison operatorgt ltiteratorgt
ltcomparison operatorgt ltupper boundgt
33Keywords
34Instantiation Order
- arityOf() functions and loop boundaries are
calculated first. - Loop productions are done by instantiating the
appearances of the iterators. This leads to
intermediate results without any loops. - All parameter variables are instantiated.
- Keywords are recognized and renamed.
35Miscellaneous Templates
36Arktos II
37Outline
- Background Motivation
- Graph Model for ETL activities
- Importance Metrics for ETL activities
- Template Activities
- Conclusions and Future Work
38Answers to motivating questions
- What part of the scenario is affected if we
delete an attribute? - Which attributes/tables are involved in the
population of an attribute?
Transitive dependencies and responsibilities
39Answers to motivating questions
- How good is my design of the ETL scenario?
- Vulnerable points
- Detection of inconsistencies
- Detection of important data stores
- Detection of useless (source) attributes.
40On-going/Future Work
- The Arktos II project is aimed towards the
- Conceptual modeling
- Logical modeling
- Optimization
- What-if analysis
- of ETL scenarios
http//www.dblab.ece.ntua.gr/pvassil/projects/ark
tos_II
41Questions
42Graph Transformations
Transformation Zoom_Out Input the architecture
graph G(V,E) and a structured entity A Output a
new architecture graph G(V,E) Begin G
G " node tÎV, s.t. Ø edge (A,t)ÎPo Ù edge
(A,x)ÎPo " edge (t,x)ÎPr Pr Pr
È(t,A)-(t,x) " edge (x,t)ÎPr Pr Pr
È(A,t)-(x,t) " edge (t,x)ÎRr Rr Rr
È(t,A)-(t,x) " node tÎV, s.t. edges
(A,t)ÎPo, (A,x)ÎPo " edge (t,x)ÎPr Pr
Pr -(t,x) " edge (x,t)ÎPr Pr Pr -(x,t)
" edge (t,x)ÎRr Rr Rr -(t,x) remove t
End
43Graph Transformations
Transformation Major_Flow Input the
architecture graph G(V,E) and the set of target
recordsets T. Output a sub graph G(V,E)
containing information for the major flow of data
from sources to targets. Begin Let TO be the
set of attributes of all the recordsets of T
VTÈTO do V"? "tÎV",aÎV,e(a,t)ÎPr
V"V"ÈtEEÈe "tÎV",aÎV,e(a,t)ÎPo
V"V"ÈtEEÈe "tÎV",aÎV,e(t,a)ÎPo
V"V"ÈtEEÈe "tÎV",aÎV,e(a,t)ÎRr
V"V"ÈtEEÈe VVÈV" while V"¹?
End
44Conceptual Model