Data Warehouse - PowerPoint PPT Presentation

1 / 27
About This Presentation
Title:

Data Warehouse

Description:

Operational transactions do not usually map one-to-one with ... Recode applications - Compare tables. Dimension table. Time representation - Point in time ... – PowerPoint PPT presentation

Number of Views:27
Avg rating:3.0/5.0
Slides: 28
Provided by: Jen5
Category:
Tags: data | recode | warehouse

less

Transcript and Presenter's Notes

Title: Data Warehouse


1
Data Warehouse
  • Chapter 11

2
Multiple Files Problem
  • Added complexity of multiple source files
  • Start simple

Logic to detect Correct source
Multiple Source files
Extracted data
3
Transforming Data from Multiple files
File
File
File
File
File
File
File
File
File
4
Missing Values Problem
  • Solution
  • Ignore
  • Wait
  • Mark rows
  • Extract when time-stamped





A



If NULL then FieldA
5
Duplicate Value Problem
  • Solution
  • SQL self-join techniques
  • RDMBS constrains utilities

SELECT FROM table_a, table_b WHERE
table_a.key()table_b.key UNION SELECT FROM
table_a, table_b WHERE table_a.keytable_b.key()
ACME Inc
ACME Inc
ACME Inc
ACME Inc
ACME Inc
6
Element Names Problem
  • Solution
  • CTAS
  • SQLLoader

Customer
Customer
Client
Contact
Name
7
Element Meaning Problem
All customer details
All details Except name
Customers name
  • Avoid misinterpretation
  • Complex solution
  • Document meaning in metadata

8
Input Format Problem
EBCDIC
ASCII
12373
123-73
9
Referential Integrity Problem
  • Solution
  • SQL anti-join
  • Server constraints
  • Dedicated tools
  • Emp Name Department
  • Smith 10
  • Jones 20
  • Doe 50
  • 6787 Harris 60

Department 10 20 30 40
10
Name and Address Problem
  • No unique key
  • Missing values
  • Personal and commercial names mixed
  • Different addresses for same member
  • Different names and spelling for same number
  • Many names on one line
  • One name on two lines

11
Name and Address Problem
  • Single-field format
  • Multiple-field format

Mr.J.Smith, 100 Main St., Bigtown, County Luth,
23565
Name Street Town County Code Mr.J.Smith 100 Main St. Bigtown County Luth 23565
12
Clean and Organize
  • Create atomic values.
  • Standardize formats.
  • Verify data accuracy.
  • Match with other records.
  • Identify private and commercial addresses and
    inhabitants.
  • Document in metadata.
  • Requires sophisticated tools and techniques

13
Merging Data
  • Operational transactions do not usually map
    one-to-one with warehouse data
  • Data for the warehouse is merged to provide
    information for analysis

Pizza sales/return by day, hour, seconds
Sale 1/2/98 120001 Ham Pizza
10.00
Sale 1/2/98 120002 Cheese Pizza
15.00
Sale 1/2/98 120002 Anchovy Pizza
12.00
Return 1/2/98 120003 Ham Pizza
-12.00
Sale 1/2/98 120004 Sausage Pizza
11.00
14
Merging Data
Sale 1/2/98 120001 Ham Pizza
10.00
Sale 1/2/98 120002 Cheese Pizza
15.00
Sale 1/2/98 120002 Anchovy Pizza
12.00
Return 1/2/98 120003 Ham Pizza
-12.00
Sale 1/2/98 120004 Sausage Pizza
11.00
Sale 1/2/98 120001 Ham Pizza
10.00
Sale 1/2/98 120002 Cheese Pizza
10.00
Sale 1/2/98 120004 Sausage Pizza
11.00
15
Adding a Date Stamp
  • Enables time analysis
  • Label loaded data with a date stamp
  • Add time to fact and dimension data

16
Adding a Date Stamp
Store Table Store_id District_id Time_key
Product Table Product_id Time_key Product_desc
Sales Fact Table Item_id Store_id Time_key Sales_d
ollars Sales_units
Item_Table Item_id Dept_id Time_key
Time Table Week_id Period_id Year_id Time_key
17
Adding a Date Stamp
  • Fact table
  • - Add triggers
  • - Recode applications
  • - Compare tables
  • Dimension table
  • Time representation
  • - Point in time
  • - Time span

18
Adding Keys to Data
1 Sale 1/2/98 120001 Ham Pizza
10.00
2 Sale 1/2/98 120002 Cheese Pizza
15.00
3 Sale 1/2/98 120002 Anchovy Pizza
12.00
4 Sale 1/2/98 120003 Ham Pizza
-12.00
5 Sale 1/2/98 120004 Sausage Pizza
11.00
Data values or artificial keys
dw1 Sale 1/2/98 120001 Ham Pizza
10.00
dw2 Sale 1/2/98 120002 Cheese Pizza
10.00
dw3 Sale 1/2/98 120004 Sausage Pizza
11.00
19
Summarizing Data
  • During extraction on staging area
  • After loading onto the warehouse server

Warehouse database
Operational databases
Staging area
20
Maintaining Transformation Metadata
Contains transformation rules, algorithms, and
routines
Sources Stages Rules
Publish Extract Transform
Load Query
21
Transformation Timing and Location
  • Transformation is performed
  • - Before load
  • - In parallel
  • May be initiated at different points

Unlikely
Probable
Possible
22
Choosing a Transformation Point
  • Workload Network bandwidth
  • Environment Parallel execution
  • CPU use Load window time
  • Disk space User information
  • needs

23
Monitoring and Tracking
  • Transformations should
  • Be self-documenting
  • Provides summary statistics
  • Handle process exceptions

24
Designing Transformation Processes
  • Analysis
  • - Sources and target mappings, business
  • rules
  • - Key users, metadata, grain
  • Design options PL/SQL, replication, custom,
    third-party tools
  • Design issues
  • - Performance
  • - Size of the staging area
  • - Exception handling, integrity maintenance

25
Transformation Tools
  • Purchased
  • SQLLoader
  • In-house developed

26
Data Management, Quality, and Auditing Tools
  • Data management
  • - Innovative Systems
  • - Postalsoft
  • - Vality Technology
  • Data quality and auditing
  • - Innovative Systems
  • - Vality Technology

27
Summary
  • This lesson discussed the following topics
  • Importance of data quality
  • Transformation processes
  • Data transformation issuess
  • Data anomalies
  • Name and address management
  • Tools
Write a Comment
User Comments (0)
About PowerShow.com