A Graphbased Framework for the Modeling of ETL Activities - PowerPoint PPT Presentation

1 / 44
About This Presentation
Title:

A Graphbased Framework for the Modeling of ETL Activities

Description:

A Graph-based Framework for the Modeling of ETL Activities ... Compress/Decompress (Z/dZ) Encrypt/Decrypt (Cr/dCr) Unary operations. Push. Aggregation (?) ... – PowerPoint PPT presentation

Number of Views:31
Avg rating:3.0/5.0
Slides: 45
Provided by: asi87
Category:

less

Transcript and Presenter's Notes

Title: A Graphbased Framework for the Modeling of ETL Activities


1
A Graph-based Framework for the Modeling of ETL
Activities
  • Panos Vassiliadis1, Alkis Simitsis2, Spiros
    Skiadopoulos2, Manolis Terrovitis2, Panos
    Georgrantas2
  • (1) University of Ioannina
  • pvassil_at_cs.uoi.gr
  • (2) National Technical University of Athens
  • asimi,spiros,mter,pgeor_at_dbnet.ece.ntua.gr

2
Outline
  • Background Motivation
  • Graph Model for ETL activities
  • Importance Metrics for ETL activities
  • Template Activities
  • Conclusions and Future Work

3
Outline
  • Background Motivation
  • Graph Model for ETL activities
  • Importance Metrics for ETL activities
  • Template Activities
  • Conclusions and Future Work

4
Extract-Transform-Load (ETL)
5
Motivation
DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY
DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY
DS.PS_NEW1
SUPPKEY1
COST
DATE
DS.PS1
SK1
2
A2EDate
DIFF1
Add_SPK1
DS.PS_OLD1
U
rejected
rejected
rejected
Log
Log
Log
DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY
DS.PS_NEW2
DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY
SUPPKEY2
COST
DATESYSDATE
QTYgt0
DS.PS2
AddDate
NotNULL
Add_SPK2
SK2
CheckQTY
DIFF2
DS.PS_OLD2
rejected
rejected
Log
Log
DSA
PKEY, DAY MIN(COST)
DW.PARTSUPP
S1_PARTSUPP
V1
Aggregate1
FTP1
PKEY, MONTH AVG(COST)
DW.PARTSUPP.DATE, DAY
TIME
S2_PARTSUPP
V2
Aggregate2
??
FTP2
Sources
DW
6
Motivation
7
Motivation
  • What part of the scenario is affected if we
    delete an attribute?
  • Which attributes/tables are involved in the
    population of attribute DW.PARTSUPP.SUPPKEY ?
  • How good is my design of the ETL scenario?

8
Outline
  • Background Motivation
  • Graph Model for ETL activities
  • Importance Metrics for ETL activities
  • Template Activities
  • Conclusions and Future Work

9
Graph Modeling of ETL Scenarios
  • To support the needs of the designer and the
    administrator, we model each ETL scenario as a
    graph, which we call Architecture Graph
  • We take advantage of the data-centric,
    script-based nature of ETL activities to model
    their internals as graphs, too
  • We give simple visual representations for reasons
    of intuition

10
Preliminaries
  • Data types
  • Constants
  • Attributes
  • RecordSets
  • Function types
  • Functions

1
PKEY
R
2
my2
11
Activities
  • Name
  • Input Schemata
  • Output Schema
  • Rejections Schema
  • Parameter List
  • Output/Rejection Operational Semantics

12
Activities
13
Relationships
  • Instance-Of Relationships
  • Part-Of Relationships
  • Regulator Relationships
  • Provider Relationships
  • Derived Provider
  • Relationships

14
Part-Of Relationships
15
Regulator Relationships
16
Provider Relationships
17
Derived Provider Relationships
IN
IN
OUT
OUT
PAR
OUT
IN
PKEY
PKEY
SOURCE
LU_PKEY
LU_SOURCE
LU_SKEY
18
Outline
  • Background Motivation
  • Graph Model for ETL activities
  • Importance Metrics for ETL activities
  • Template Activities
  • Conclusions and Future Work

19
Importance Metrics
  • Dependency the in-degree of the node with
    respect to the provider edges
  • Responsibility the out-degree of the node with
    respect to the provider edges
  • Degree dependency responsibility
  • Local vs. Transitive

20
Importance Metrics
IN
IN
OUT
OUT
LOOKUP_PS
21
Answers to motivating questions
  • What part of the scenario is affected if we
    delete an attribute?
  • Which attributes/tables are involved in the
    population of an attribute?

Transitive dependencies and responsibilities
22
Answers to motivating questions
  • How good is my design of the ETL scenario?
  • Vulnerable points
  • Detection of inconsistencies
  • Detection of important data stores
  • Detection of useless (source) attributes.

23
Outline
  • Background Motivation
  • Graph Model for ETL activities
  • Importance Metrics for ETL activities
  • Template Activities
  • Conclusions and Future Work

24
Template Activities
25
Template Activities
  • Name
  • Parameter List
  • Expression in LDL (a Datalog variant)
  • Mapping a set of bindings, mapping input to
    output attributes, possibly through intermediate
    placeholders.

26
Notation
  • Variables to be replaced by their values at
    instantiation time
  • ArityOf() returns the arity of an input, output
    or parameter schema
  • Loops the loop body is repeated at instantiation
    time as many times as the iterator constraint
    defines
  • Keywords to simplify the creation of unique
    predicate and attribute names.

27
Example Function Application
  • fa12_out(FA12_OUT_1, FA12_OUT_2, FA12_OUT_3,
    OUTFIELD)lt-
  • fa12_in(FA12_IN_1, FA12_IN_2, FA12_IN_3),
  • f1(FA12_IN_2, FA12_IN_3,OUTFIELD),
  • FA12_OUT_1FA12_IN_1,FA12_OUT_1FA12_IN_2,
  • FA12_OUT_1FA12_IN_3.

FA12_IN_1
FA12_OUT_1
FA12_IN_2
FA12_OUT_2
FA12_IN_3
FA12_OUT_3
OUTFIELD
f1
28
Example Function Application
29
Example Function Application
a_out(iltarityOf(a_in)1A_OUT_i, OUTFIELD)
lt- a_in(ilt arityOf(a_in) A_IN_i,
iarityOf(a_in) A_IN_i), _at_FUNCTION(ilt
arityOf(_at_PARAMi)1 _at_PARAMi,
OUTFIELD), iltarityOf(a_in) A_OUT_iA_IN_i,
iarityOf(a_in) A_OUT_iA_IN_i
  • _at_FUNCTION F1
  • _at_PARAM1A_IN_2
  • _at_PARAM2A_IN_3
  • a_out(A_OUT_1, A_OUT_2, A_OUT_3, OUTFIELD)lt-
  • a_in(A_IN_1, A_IN_2, A_IN_3),
  • _at_FUNCTION(_at_PARAM1,_at_PARAM2,OUTFIELD),
  • A_OUT_1A_IN_1,
  • A_OUT_1A_IN_2,
  • A_OUT_1A_IN_3.

Parameter Init. Loop production
30
Example Function Application
Parameter replacement
a_out(A_OUT_1, A_OUT_2, A_OUT_3,
OUTFIELD)lt- a_in(A_IN_1, A_IN_2,
A_IN_3), f1(A_IN_2, A_IN_3,OUTFIELD) A_OUT_1A_
IN_1, A_OUT_1A_IN_2, A_OUT_1A_IN_3.
31
Example Function Application
  • fa12_out(FA12_OUT_1, FA12_OUT_2, FA12_OUT_3,
    OUTFIELD)lt-
  • fa12_in(FA12_IN_1, FA12_IN_2, FA12_IN_3),
  • f1(FA12_IN_2, FA12_IN_3,OUTFIELD),
  • FA12_OUT_1FA12_IN_1,FA12_OUT_1FA12_IN_2,
  • FA12_OUT_1FA12_IN_3.

Keyword Renaming
32
Loops
  • The general form of loops is
  • ltsimple constraintgt ltloop bodygt
  • where simple constraint has the form
  • ltlower boundgt ltcomparison operatorgt ltiteratorgt
    ltcomparison operatorgt ltupper boundgt

33
Keywords
34
Instantiation Order
  • arityOf() functions and loop boundaries are
    calculated first.
  • Loop productions are done by instantiating the
    appearances of the iterators. This leads to
    intermediate results without any loops.
  • All parameter variables are instantiated.
  • Keywords are recognized and renamed.

35
Miscellaneous Templates
36
Arktos II
37
Outline
  • Background Motivation
  • Graph Model for ETL activities
  • Importance Metrics for ETL activities
  • Template Activities
  • Conclusions and Future Work

38
Answers to motivating questions
  • What part of the scenario is affected if we
    delete an attribute?
  • Which attributes/tables are involved in the
    population of an attribute?

Transitive dependencies and responsibilities
39
Answers to motivating questions
  • How good is my design of the ETL scenario?
  • Vulnerable points
  • Detection of inconsistencies
  • Detection of important data stores
  • Detection of useless (source) attributes.

40
On-going/Future Work
  • The Arktos II project is aimed towards the
  • Conceptual modeling
  • Logical modeling
  • Optimization
  • What-if analysis
  • of ETL scenarios

http//www.dblab.ece.ntua.gr/pvassil/projects/ark
tos_II
41
Questions
42
Graph Transformations
Transformation Zoom_Out Input the architecture
graph G(V,E) and a structured entity A Output a
new architecture graph G(V,E) Begin G
G " node tÎV, s.t. Ø edge (A,t)ÎPo Ù edge
(A,x)ÎPo " edge (t,x)ÎPr Pr Pr
È(t,A)-(t,x) " edge (x,t)ÎPr Pr Pr
È(A,t)-(x,t) " edge (t,x)ÎRr Rr Rr
È(t,A)-(t,x) " node tÎV, s.t. edges
(A,t)ÎPo, (A,x)ÎPo " edge (t,x)ÎPr Pr
Pr -(t,x) " edge (x,t)ÎPr Pr Pr -(x,t)
" edge (t,x)ÎRr Rr Rr -(t,x) remove t
End
43
Graph Transformations
Transformation Major_Flow Input the
architecture graph G(V,E) and the set of target
recordsets T. Output a sub graph G(V,E)
containing information for the major flow of data
from sources to targets. Begin Let TO be the
set of attributes of all the recordsets of T
VTÈTO do V"? "tÎV",aÎV,e(a,t)ÎPr
V"V"ÈtEEÈe "tÎV",aÎV,e(a,t)ÎPo
V"V"ÈtEEÈe "tÎV",aÎV,e(t,a)ÎPo
V"V"ÈtEEÈe "tÎV",aÎV,e(a,t)ÎRr
V"V"ÈtEEÈe VVÈV" while V"¹?
End
44
Conceptual Model
Write a Comment
User Comments (0)
About PowerShow.com