Data Warehousing


Transcript and Presenter's Notes

Title: Data Warehousing

  • A Look Back, Moving Forward

Dale Sanders June 2005
Introduction Warnings
  • Why am I here?
  • Teach
  • Stimulate some thought
  • Share some of my experiences and lessons
  • Learn
  • From you, please
  • Ask questions, challenge opinions, share your own
  • I'll do my best to live up to my end of the bargain
  • Warnings
  • The pictures in this presentation
  • May or may not have any relevance
  • whatsoever to the topic or slide
  • Mostly intended to break up the monotony

Expectation Management
  • My DW Strengths (according to others)
  • I know what not to do as much as I know what to do
  • Seen and made all the big mistakes
  • Vision, strategy, system architecture, data
    management DW modeling, complex cultural
    issues, leapfrog problem solving
  • My DW weaknesses
  • My programming skills suck and I'm not a DBA
  • Haven't written a decent line of code in four years
  • Some might say it's been 24 years
  • Knowledge of leading products is very rusty
  • Though I'm beefing up on Microsoft and Cognos

Within these expectations, make no mistake about
it: I know data warehousing
Today's Material
Giving due credit.
  • I am a good Idea Guy
  • But, ideas are worthless without someone to
    implement and enhance them
  • Steve Barlow, Dan Lidgard, Jon Despain, Chuck
    Lyon, Laure Shull, Kris Mitchell, Peter Hess,
    Nancy McLeod, Ron Gault, Rob Carpenter, Tom
    Robison, Stan Smith, my wife, and many others
  • My greatest strength and blessing
  • The ability to recognize, listen to, and hold onto good people
  • (Knock on wood)
  • My achievements in personal and professional life
  • More a function of those around me than a
    reflection on me

DW Best Practices The Most Important Metrics
  • Employee satisfaction
  • Without it, long-term customer satisfaction is impossible
  • Customer satisfaction
  • That's the nature of the Information Services career field
  • Some people in our profession still don't get it
  • We are here to serve
  • The Organizational Laugh Metric
  • How many times do you hear laughter in the
    day-to-day operations of your team?
  • It is the single most important vital sign to
    organizational health and business success

My Background
  • Three eight-year chapters
  • Captain, Information Systems Engineer, US Air Force
  • Nuclear warfare battle management
  • Force status data integration
  • Intelligence and attack warning data fusion
  • Consultant in several capacities
  • TRW
  • National Security Agency (NSA)
  • Intel New Mexico Data Repository (NMDR)
  • Air Force
  • Integrated Minuteman Data Base (IMDB)
  • Peacekeeper Information Retrieval System (PIRS)
  • Information Technology International (ITI, Inc.)
  • Healthcare
  • Intermountain Health Care Enterprise Data Warehouse
  • Consultant to other healthcare organizations
    data warehouses
  • Now at Northwestern University Healthcare System

  • Data warehousing history
  • According to Sanders
  • Why and how did this become a sub-specialty in
    information systems?
  • What have we learned so far?
  • My take on Best Practices
  • Key lessons-learned
  • My thoughts on the most popular authors in the field
  • What they contribute, where they detract

Data Warehousing History
Newspaper Rock 100 B.C.
American Retail 2005 A.D.
What Happened in the Cloud?
  • Stage 1: Laziness
  • Operators grew tired of hanging tapes
  • In response to requests for historical financial reports
  • They stored data on-line, in unauthorized mainframe databases
  • Stage 2: End of the mainframe bully
  • Computing moved out from finance to the rest of the business
  • Unix and relational databases
  • Distributed computing created islands of data
  • Stage 2.1: The government gets involved
  • Consolidating IRS and military databases to save money on mainframes
  • "Hey, look what I can do with this data"
  • Stage 3: Deming comes along
  • Push towards constant business reengineering
  • Cultural emphasis on continuous quality improvement and business innovation drives the need for data

The Real Truth
  • Data warehousing is a symptom of a problem
  • Technological inability to deploy single-platform
    information systems that
  • Capture data once and reuse it throughout an enterprise
  • Support high-transaction rates (single record
    CREATE, SELECT, UPDATE, DELETE) and analytic
    queries on the same computing platform, with the
    same data, at the same time
  • Someday, maybe we will address the root cause
  • Until then, it's a good way to make a living

The Ideal Library Analogy
  • Stores all of the books and other reference
    material you need to conduct your research
  • The Enterprise data warehouse
  • A single place to visit
  • One database environment
  • Contents are kept current and refreshed
  • Timely, well-choreographed data loads
  • Staffed with friendly, knowledgeable people that
    can help you find your way around
  • Your Data Warehouse team
  • Organized for easy navigation and use
  • Metadata
  • Data models
  • User friendly naming conventions
  • Solid architectural infrastructure
  • Hardware, software, standards, metrics

Cultural Detractors
  • The two biggies
  • The business supported by the data warehouse must
    be motivated by a desire for constant improvement
    and fact-based decision making
  • The data warehouse team falls victim to the
    Politics of Data
  • Through naivety
  • Through misguided motives, themselves

Business Culture
I love data!
  • Does your CEO
  • Talk about constant improvement, constantly?
  • Drive corporate goals that are SMART?
  • Specific, Measurable, Attainable, Realistic, Time-bound
  • Crave data to make better informed decisions?
  • Become visibly, buoyantly excited at a demo for a
    data cube?
  • If so, the success of your data warehouse is right around the corner… sort of

Political Best Practices
  • You will be called a data thief
  • Get used to it
  • Encourage life cycle ownership of the OLTP data,
    even in the EDW
  • You will be called dangerous
  • "You don't understand our data!"
  • OLTP owners know their data better than you do; acknowledge it and leverage it
  • You will be blamed for poor data quality in the
    OLTP systems
  • This is a natural reaction
  • Data warehouses raise the visibility of poor data quality
  • Use the EDW as a tool for raising overall data quality
  • You will be called a job robber
  • EDW is perceived as a replacement for OLTP
  • Educate people The EDW depends on OLTP systems
    for its existence
  • Stick to your values and pure motives
  • The politics will fade away

Data Quality
  • Pitfall
  • Taking accountability for data quality on the
    source system
  • Spending gobs of time and money cleansing data before it's loaded into the DW
  • It's a never-ending, never-win battle
  • You will always be one step behind data quality problems
  • You will always be in the cross-hairs of blame
  • Best Practice
  • Push accountability where it belongs: to the source system
  • Use the data warehouse as a tool to reveal data
    quality, either good or bad
  • Be prepared to weather the initial storm of blame

Measuring Data Quality
  • Data Quality = Completeness x Validity
  • Can it be measured objectively?
  • Measuring Completeness
  • Number of null values in a column
  • Measuring Validity
  • Cardinality is a simple way to measure validity
  • "We only have four standard regions in the business, but we have 18 distinct values in the region column."
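The Completeness x Validity arithmetic above can be sketched against a toy table. This is a minimal sketch: the table name, column, and region values are invented for illustration, not from the deck.

```python
import sqlite3

# Hypothetical sales table; one null region, one invalid value ("Nrth").
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER, region TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, "North"), (2, "South"), (3, None), (4, "Nrth"), (5, "East")])

def completeness(con, table, column):
    """Fraction of rows where the column is non-null."""
    total = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    filled = con.execute(f"SELECT COUNT({column}) FROM {table}").fetchone()[0]
    return filled / total if total else 1.0

def validity(con, table, column, allowed):
    """Fraction of non-null values that fall in the allowed domain."""
    rows = con.execute(
        f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL").fetchall()
    if not rows:
        return 1.0
    ok = sum(1 for (v,) in rows if v in allowed)
    return ok / len(rows)

REGIONS = {"North", "South", "East", "West"}
c = completeness(con, "sales", "region")       # 4 of 5 rows filled -> 0.8
v = validity(con, "sales", "region", REGIONS)  # "Nrth" is invalid  -> 0.75
quality = c * v                                # Completeness x Validity -> 0.6
```

A cardinality check like the 18-distinct-regions example falls out of the same pattern: compare `COUNT(DISTINCT region)` against the size of the allowed domain.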

Business Validity
  • How can you measure it? You can't
  • "I collect this data from our customers, but I have to guess sometimes because I don't speak their language."
  • "This data is valid for trend analysis decisions before 9/11/2001, but should not be used after that date, due to changes in security."
  • "This customer purchased four copies of Zamfir, Master of the Pan Flute, therefore he loves everything about Zamfir."
  • What Amazon didn't know: "I bought them for my mom and her sewing circle."
  • You can't use insurance billing and reimbursement data to make clinical, patient care decisions

Where do you capture subjective data quality?
The Importance of Metadata
  • Maybe the most over-hyped, underserved area of
    data warehousing common sense
  • Vendors want to charge you big for their metadata products
  • Consultants would like you to think that it's the Holy Grail in disguise and only they can help you find it
  • Authors who have never been in an operational environment would have you chasing your tail in pursuit of an esoteric, mythological Metadata ideal
  • Don't listen to the confusing messages! You know the answer; just listen to your common sense

Metadata Keep It Simple!
  • Ultimately, what are the most valuable business
    motives behind metadata?
  • Make data more understandable to those who are not familiar with it
  • Data quality issues
  • Data timeliness and temporal issues
  • Context in which it was collected
  • Translating physical names to natural language
  • Make data more findable to those who don't know where it is
  • Organize it
  • Take a lesson from library science and the card catalog

Metadata on the Web
Example Metadata Entry
  • Physical Table Name
  • Plain Language Name
  • Postal Code Master - IHC
  • Description
  • Contains Postal (Zip) codes for the IHC referral region and IHC-specific descriptions. These descriptions allow for specific IHC groupings used in various analyses. Data Steward: Jim Allred, ext. 3518

Table Metadata
  • Required Elements
  • Long Name (or English name)
  • Description
  • Semi-optional
  • Source
  • Example
  • Data Steward

Column Metadata
  • Required
  • Long Name
  • Description
  • Notable ETL transforms
  • Optional
  • Value Range
  • Data Quality
  • Associated Lookup

The Data Model
Some Metadata Is Free
  • It can be collected from the data dictionary
  • For example
  • Primary and Foreign Keys
  • Indexed Columns
  • Table Creation Dates

The Most Valuable Metadata is Subjective
  • The Human Factor
  • Most metadata is not automatically collected by
    tools because it does NOT exist in that form
  • Interviews with data stewards are the key
  • It can take months of effort to collect the initial metadata
Holding Feet to the Fire
  • Data architects
  • Responsible for metadata in their subject areas
  • Torturing it from data stewards
  • Firm Principle
  • No new data marts go live without metadata

Is Metadata Worth The Effort?
  • Good metadata will push the ROI of a ho-hum DW into the stratosphere
  • It does for DWing what the Yellow Pages did for
    the business ROI of the telephone

It Gets Used
  • At Intermountain Health Care
  • 210 web hits on average each weekday
  • (23,000 employees, $2B revenue)

What's New In Metadata
Report Quality
  • A function of
  • Data quality
  • How well does the report reflect the intent
    behind the question being asked?
  • "This report doesn't make sense. I'm trying to find out how many widgets we can produce next year, based on the last four years' production."
  • "That's not what you asked for."
  • SQL and other programming accuracy
  • Statistical validity (population size of the data)
  • Timeliness of the data relative to the decision
  • Event Correlation
  • Best Practice
  • An accompanying "meta-report" for every report that involves significant, high-risk decisions

Meta Report
  • A document, associated with a published report,
    which defines the report

  • A central place for storing and sharing
    information about business reports

IHC Analysts' Use of Meta Reports
Meta Report
  • Recommended Elements
  • Business Owner
  • Run Frequency
  • Intended Audience
  • Statistical Tests
  • Software Used
  • Source Code
  • Formulas
  • Relevant Issues Commentary
  • Core Elements
  • Author Information
  • Report Name
  • Report Purpose
  • Data Source(s)
  • Report Methods

  • Title
  • Location
  • Author
  • Owner

  • Purpose
  • Frequency
  • Audience
  • Data Source(s)

  • Selection Criteria
  • Statistics
  • Software
  • Source Code
  • Formulas

What's It Look Like?
Utilization and Creation Rate
EDW Operations Best Practices
  • Think Mission Control
  • Mission Control Commander
  • Dan Lidgard, EDW Operations Manager
  • Operations metrics
  • Customized ETL Library
  • Schedule of operations
  • Alerting tool
  • Storage strategies / backups
  • Performance monitoring and tuning

IHC Architecture
  • EDW
  • Oracle v on AIX 5.2
  • Storage: IBM SAN (Shark), >3 TB
  • ETL tools
  • Ascential's DataStage
  • Korn shell (Unix), SQL scripts, PL/SQL scripting
  • OLAP: MS Analysis Services
  • BI: Business Objects (Crystal Enterprise)
  • With a Cube presentation layer
  • Dashboard: Visual Mining's NetCharts
  • EDW Team: 16 FTEs, plus SAs and DBAs

Customized ETL Library
  • One of the IHC ETL programmers noticed he kept
    doing the same things over and over for all of
    his ETL jobs
  • Rather than copying and pasting this repetitive
    code, he created a library.
  • Now everyone uses it on the team
  • Named the library EDW_UTIL (EDW Utilities)

  • Executes via Oracle stored procedures
  • Supported by associated tables to hold data when needed
  • Error table
  • QA table
  • Index table

  • Provides standardization
  • Eliminates code rewrites
  • Can hide complexities
  • Such as the appropriate way to analyze and gather statistics on tables
  • Very accessible to all of our ETL tools
  • Simply an Oracle stored procedure call

Index Management
  • Past process included
  • Dropping the table's indexes with a script
  • Loading the table
  • Creating the indexes with a script
  • The past process resulted in messy scripts to manage and coordinate

Index Management
  • New process includes
  • Capturing a table's existing index metadata
  • Dropping the table's indexes with a single procedure call
  • Loading the table
  • Recreating the indexes with a single procedure call
  • There are no more messy scripts to manage and coordinate
  • No more indexes lost because they were neglected when updating the create-index script

Index Management
  • Samples
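The capture/drop/load/recreate pattern can be sketched with SQLite's catalog standing in for Oracle's data dictionary. IHC's version lived in EDW_UTIL as Oracle stored procedures; the function names and table here are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (id INTEGER, val TEXT)")
con.execute("CREATE INDEX ix_fact_id ON fact (id)")

def capture_indexes(con, table):
    """Step 1: capture the table's existing index DDL from the catalog."""
    return [ddl for (ddl,) in con.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type='index' AND tbl_name=? AND sql IS NOT NULL", (table,))]

def drop_indexes(con, table):
    """Step 2: drop the indexes so the bulk load runs unencumbered."""
    names = con.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type='index' AND tbl_name=? AND sql IS NOT NULL",
        (table,)).fetchall()
    for (name,) in names:
        con.execute(f"DROP INDEX {name}")

saved = capture_indexes(con, "fact")   # remember the DDL
drop_indexes(con, "fact")              # drop before the load
con.executemany("INSERT INTO fact VALUES (?, ?)",
                [(i, f"row{i}") for i in range(1000)])  # bulk load
for ddl in saved:                      # recreate from captured metadata
    con.execute(ddl)
```

Because the DDL is captured from the live catalog each run, an index added later is never "lost" by a stale script, which is the failure mode the new process removes.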

Background Table Loads
  • Motive Higher availability ratings
  • Solution
  • Load data into tables which are not accessible to end users. A simple rename puts them into production.
  • Helps transfer the identical attributes from the
    live to the background table
  • Samples
  • CREATE_TABLE_INDEXES(Create on background table,
    identical to production table)

Tracking Load Times, Errors, QA
  • We had no idea who was loading what and when
  • Each staff member logged in their own way and for
    their own interest
  • ETL error capturing and QA was difficult
  • We can now capture errors and QA information in
    a somewhat standardized fashion

Load Times, Errors, QA
  • Samples
  • (e.g., CASEMIX)
  • COMPLETE_LOAD_TIME(Begin and end together)
  • LOAD_TIME_ERROR(Alert on these errors)
  • LOAD_TIME_METRICSQA (row counts)
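A minimal sketch of standardized load logging in the spirit of the COMPLETE_LOAD_TIME / LOAD_TIME_ERROR / LOAD_TIME_METRICSQA calls. The table layout, column names, and wrapper function are assumptions, not the IHC schema.

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE load_times (
    job TEXT, started REAL, ended REAL, status TEXT, rows_loaded INTEGER)""")

def complete_load_time(con, job, load_fn):
    """Wrap a load: record begin and end together, errors, and QA row counts."""
    start = time.time()
    try:
        rows = load_fn()          # run the actual ETL job
        status = "OK"
    except Exception as exc:      # LOAD_TIME_ERROR equivalent: alertable
        rows, status = 0, f"ERROR: {exc}"
    con.execute("INSERT INTO load_times VALUES (?, ?, ?, ?, ?)",
                (job, start, time.time(), status, rows))
    return status

def casemix_load():
    return 42  # stand-in for a real load; pretend we loaded 42 rows

status = complete_load_time(con, "CASEMIX", casemix_load)
```

Because every job logs through the same wrapper, "who loaded what and when" becomes a single query against `load_times` rather than a hunt through per-developer logs.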

Miscellaneous Procedures
  • Motive: Hide the gory details and make Oracle look pretty
  • Such as Oracle's table analyze command
  • Gives consistent application of system-wide parameters, such as
  • A new box with a different number of CPUs (parallel slaves), or
  • A new version of Oracle
  • We populate some metadata too, such as last load time

DW Schedule of Operations
  • Motives
  • Choreography of operations
  • Avoid stepping on each other's toes
  • Maximize shrinking windows of opportunity
  • Communicating with customers
  • What's happening when
  • Challenges to a complete schedule
  • Some loads are ad hoc, not scheduled
  • Users query in an ad hoc fashion
  • Use of a variety of ETL tools
  • Use of a variety of user categories
  • DBA, SA, ETL user, end users
  • Lots of servers to manage
  • Production EDW, Stage EDW, ETL servers, OLAP
    servers, presentation layer servers

  • Focus on operations around the production EDW
  • Still working on all the reporting aspects
  • A sample on the next slide
  • Pull this information out of the load times
    data captured by these ETL library calls

Sample Report
DW Alerting Tool
  • Motive: Business dashboard alerts
  • Aggregate data alerts, such as, "The number of patients seen in the ER the last week with chief complaint of flu indicates the onset of an outbreak"
  • A simple tool was created which sends a text email, based on existence of data returned from a query
  • Healthcare isn't ready for this, yet
  • But, the EDW Team took advantage of the tool in unplanned ways

  • Web based
  • Open to all EDW users
  • Flexible run schedules
  • Daily, weekly, every two weeks, monthly, etc.
  • Wakes every 5 minutes
  • Ability to enter query in SQL
  • Alert (email) on 3 situations
  • Query returns data
  • Query returns no data
  • Always
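The three alert modes can be sketched as follows. `check_alert` and its arguments are hypothetical, and the email send is replaced by a callback so the sketch stays self-contained.

```python
import sqlite3

def check_alert(con, sql, mode, send):
    """Run a user-supplied query; mode is 'on_data', 'on_no_data', or 'always'."""
    rows = con.execute(sql).fetchall()
    fire = bool(mode == "always"
                or (mode == "on_data" and rows)
                or (mode == "on_no_data" and not rows))
    if fire:
        send(f"alert fired: {len(rows)} row(s)")   # stand-in for the text email
    return fire

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sessions (age_hours INTEGER)")
con.execute("INSERT INTO sessions VALUES (80)")    # a session older than 72h

sent = []
fired = check_alert(con, "SELECT * FROM sessions WHERE age_hours > 72",
                    "on_data", sent.append)        # fires: query returned data
```

A scheduler that wakes every few minutes and runs each registered alert's query in its configured mode is all the rest of the tool amounts to.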

User Interface
  • 100 alerts in use
  • Live performance check
  • Every 4 hours
  • Alert: Check for inactive sessions holding active slaves
  • Daily
  • Alert: Look for any active sessions older than 72 hours
  • ETL monitoring
  • Alert only if problem
  • Alert on errors logged via the ETL_UTIL library (manage by exception)
  • Alert on existence of bad records captured during ETL

Storage and Backup
  • The inherited state of affairs
  • DW running like any OLTP database
  • High end expensive SANs
  • FULL nightly online backups
  • Out of space? Just buy more

Nightmare in the Making
  • Exponential growth
  • More data sources
  • More summary tables
  • More indexes
  • No data has yet been purged
  • Relaxed attitude
  • "Disk is cheap"
  • Reality: Disk management is expensive
  • Consequence: Negating the cost benefits of cheaper storage by becoming careless with the disk
Looming Crisis
  • Backups running 16 hours or more
  • Performance degradation witnessed by users
  • Good backups obtained less than 50% of the time
  • Squeezing into load windows
  • Running out of space
  • Gross underestimating
  • Financial cost
  • The system administrators (SAs) quadrupled the price of disk purchases from the previous budget
  • They included the price of tape drives, etc.

Major Changes in Operations
  • Stop performing full daily backups
  • Move towards less expensive disk option
  • IBM offers several levels of SANs
  • Transfer some disk ownership AND backup responsibilities to the DW team, away from SAs and DBAs
  • EDW team more aware of upcoming space demands
  • EDW team more in tune with which data sets are easily recreated from the source and don't need a backup

Tracking and Predicting Storage Use
Changes to Backup Strategy
  • Perform full backup once monthly during downtime
  • Perform no data backup on DEV/STAGE environments
  • Do backup DDL (all code) daily in all environments
  • Implement daily incremental backup

Daily Incremental Backups
  • Easier said than done
  • Resorted to a table level backup
  • The DW team owns which tables are exported
  • DW team populates a table, the "export table list", with each table's export frequency
  • Populated via an application in development
  • The DBAs run an export based on the "export table list"
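A sketch of how an "export table list" with per-table frequencies might drive which tables get exported on a given day. The table names, frequencies, and scheduling rules are invented for illustration.

```python
import datetime

# Stand-in for the DW team's export table list.
EXPORT_TABLE_LIST = {
    "CASEMIX_FACT":   "daily",
    "REGION_LOOKUP":  "weekly",
    "CLAIMS_ARCHIVE": "monthly",
}

def tables_due(today):
    """Return the tables whose export frequency falls due today."""
    due = []
    for table, freq in EXPORT_TABLE_LIST.items():
        if (freq == "daily"
                or (freq == "weekly" and today.weekday() == 6)  # Sundays
                or (freq == "monthly" and today.day == 1)):     # 1st of month
            due.append(table)
    return sorted(due)

# On a 1st of the month that is also a Sunday, everything is due:
due = tables_due(datetime.date(2023, 10, 1))
```

The DBAs' export job then only has to iterate `tables_due(today)`; ownership of *what* gets backed up stays with the DW team, exactly the split the slide describes.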

Use Cheaper Disk
  • General practice You can take greater risks with
    DW reliability and availability vs. OLTP systems
  • Use it to your advantage
  • Our SAN vendor (IBM) offers a few levels of SANs.
    Next level down is a big step down in price,
    small step down in features.
  • Feature loss
  • Read cache (referring to disk cache, not box cache)
  • We rarely read the same thing twice anyway
  • No "phone home" to IBM (auto paging)
  • Mean time to failure is higher, but still acceptable
Performance Monitoring Tuning
  • Err on the side of freedom and empowerment
  • How much harm can really be done?
  • We'd rather not constrain our customers
  • Pounding queries will find their way to the surface
  • Opportunity to educate users
  • Opportunity to tune underlying structures

Performance Focus Areas
  • Indexing
  • Well-defined criteria for when and how to apply indexes
  • Is this a lost art?
  • Big use of BITMAP indexes
  • Composite index trick (acts like a table)
  • Partitioning for performance, rather than data management
  • Exploiting Oracle's Direct Path INSERT feature
  • Avoiding UPDATE and DELETE commands
  • Copy with MINUS instead
  • Implementing Oracle's Parallel Query
  • Turn off referential integrity in the DW (no kidding)
  • That's the job of the source system
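The "copy with MINUS" trick can be demonstrated with SQLite, which spells Oracle's MINUS operator as EXCEPT: rather than DELETEing rows in place, rebuild the table from a set difference and swap it in. Table and column names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (id INTEGER, val TEXT)")
con.executemany("INSERT INTO fact VALUES (?, ?)",
                [(1, "keep"), (2, "purge"), (3, "keep")])
con.execute("CREATE TABLE purge_list (id INTEGER, val TEXT)")
con.execute("INSERT INTO purge_list VALUES (2, 'purge')")

# Rebuild instead of DELETE: copy everything except the rows to purge.
# In Oracle this would be CREATE TABLE ... AS SELECT ... MINUS SELECT ...
con.execute("""CREATE TABLE fact_new AS
               SELECT * FROM fact EXCEPT SELECT * FROM purge_list""")
con.execute("DROP TABLE fact")
con.execute("ALTER TABLE fact_new RENAME TO fact")

rows = con.execute("SELECT id FROM fact ORDER BY id").fetchall()
```

The appeal for a warehouse is that the rebuild is a bulk, loggable, direct-path-friendly operation, where row-at-a-time DELETEs generate undo/redo and fragment the table.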

DW Monitoring Empowering End Users
  • Motive
  • Too many calls from end users about their queries
  • "Please kill it."
  • "Is it still running, or is my PC locked up?"
  • "Why is the DW so slow?"
  • Give them the insight and tools
  • Give them the ability to kill their own queries
  • Still in the works

The Insight
Tracking Long-Running Queries
  • At IHC, we used Pinecone (from Ambeo) to monitor
    all queries and the SQL
  • Each week, review the longest queries
  • Typical outcome?
  • Add indexes
  • Denormalize
  • Contact the user and assist them with writing a better query

The DW Sandbox
  • Motives
  • Lots of little MS Access databases containing valuable data, spread all over the place
  • Analysts were joining with DW data
  • Costly to maintain
  • Security issues
  • PC hogs
  • Solution
  • Provide customers with their own sandbox on the DW, with DBA-like privileges

Sandbox Freedom
  • Web based tool for creating tables and loading
    disparate data to the DW
  • Simple, easy to use interface
  • Privileges
  • Users have full rights to the tables they create
  • Can grant rights to others
  • Big, big victory for customer service and data access
  • 10% of DW customers use the Sandbox
  • About 600 tables in use now
  • About 2 GB of data

Best Practices in the Design Phase
  • Design horizontally, build vertically
  • Start by building data marts that address
    analytic needs in one area of the business with a
    fairly limited data set
  • But, design with the horizontal needs of the
    company in mind, so that you will eventually
    tie all of these vertical data marts together
    with a common semantic layer, or data bus

Creating Value In Both Axes
For Example
Oncology Data Integration Strategy
Top down reporting requirements and data model
An Integrated Analytic Data Model For Cancer Care
Disparate Sources connected semantically to the
data bus
Cancer Registry
Mammography Radiology
Continuing Care And Follow-Up
Quality of Life Survey
Radiation Therapy
Health Plans Claims
Ambulatory Casemix
Acute Care Casemix
Development Process
Architectural Data Flow Diagram
Logical Data Model
Physical Data Model
Analytic Reporting Requirements
ETL Requirements
Source-to-Target Data Map and Logic
ETL Programming
Analytic Reporting System
Data Processing Requirements
Analytic Application Development
End User Interface Options
The Logic Layer in Data Warehouses
Transaction Systems
Data Layer
Logic Layer
Presentation Layer
Analytic Systems
Source System
Data Warehouse
ETL Process
Not Here
Evidence of Business Process Alignment
  • Map out your high-level business process
  • Don't fall prey to analysis paralysis with endless business process modeling diagrams!
  • Identify and associate the transaction systems that support those processes
  • Identify the common, overlapping semantics/data attributes and their utilization rates
  • Build your data marts within an enterprise architecture that is aligned with the processes you are trying to understand

For example
Claims Processing
Healthcare business process
Health Need
Patient Perception
Results Outcomes
Supported by non-integrated data in Transaction Systems
Integrated in the Data Warehouse
Data Warehouse
Event Correlation
  • A leading edge Best Practice
  • The third dimension to rows and columns
  • Overlays the data that underlies a report or graph
  • "In 2004, we experienced a drop in revenue as a result of the earthquake that destroyed our plant in the Philippines."
  • "In January of 2005, we saw a spike in the North America market for snow shovel sales that coincided with an increase in sales for pain relievers. This correlates to the record snowfall in that region and should not be considered a trend. Barring major product innovation, we consider the market for snow shovels in this area as saturated. Sales will be slow for the next several years."

Standardizing Semantics
  • Sweet irony: the many synonyms for "standard"
  • Data dictionary
  • Vocabulary
  • Dimensions
  • Data elements
  • Data attributes
  • The bottom line issue Standardizing the terms
    you use to describe key facts about your business

Standardizing Names of Things
  • You'd better do it within the first two months of your data warehouse project
  • If you are beyond that point, you'd better stop and do it now, lest you pay a bigger price later
  • Don't
  • Push the standard on the source systems, unless it's easy to accomplish
  • This was one of the common pitfalls of early data warehousing project failures
  • Try to standardize everything under the sun!
  • Focus on the high-value facts

Where Are The High Value Semantics?
In the high-overlap, high-utilization areas
Source System X
Another Perspective
Semantic Overlap
Semantic Utilization
The Standard Semantic Layer
Data Modeling
  • Star schemas are great and simple, but they aren't the end-all, be-all of analytic data modeling
  • Best practice: Do what makes sense; don't be a schema bigot
  • I've seen great analytic value from 3NF models
  • Maintain data familiarity for your customers
  • When meeting vertical needs
  • Don't make massive changes to the way the model looks and feels, nor the naming conventions; you will alienate existing users of the data
  • Use views to achieve new or standards-compliant perspectives on data
  • When meeting horizontal needs
  • When meeting horizontal needs

For Example
Source perspective
DW perspective
Horizontal data customer
The Case For Timely Updates
Generally, to minimize Total Cost of Ownership
(TCO), your update frequency should be no
greater than the decision-making cycle associated
with the data. But everyone wants more timely data.
Best Practice Measure Yourself
The Data Warehouse Dashboard
  • Total number of query-able tables
  • Total number of query-able columns
  • Number of users
  • Average rows delivered per month
  • Storage utilization
  • CPU utilization
  • Downtime per month by data mart
  • Employee satisfaction
  • Customer satisfaction
  • Average number of queries/month
  • Number of queries above a threshold (30 minutes?)
  • Average query response time
  • Total number of records

Other Best Practices
  • The Data Warehouse Information Systems Team
    reports to the CIO
  • Most data analysts can and probably should report
    to the business units
  • Change management/service level agreements with
    the source systems
  • No changes in the source systems unless they are coordinated with the data warehouse team

More Best Practices
  • Skills of the Data Warehouse IS Team
  • Experienced chief architect/project manager
  • Procedural/script programmers
  • SQL/declarative programmers
  • Data warehouse storage management architects
  • Data warehouse hardware architects and system administrators
  • Data architects/modelers
  • DBAs

More Best Practices
  • Evidence of project collaboration
  • A cross section of members and expertise from the
    data warehouse IS team
  • Statisticians and data analysts who understand
    the business domain
  • A customer that understands the process(es) being
    measured and can influence change
  • A data steward, usually someone from the front lines, who knows how the data is collected
  • Project: Complex reports or a data mart

More Best Practices
  • Whenever possible, extract as close to the source as possible

The Most Popular Authors
  • I appreciate
  • The interest they stir
  • The vocabulary and semantics of this new specialty that they helped create
  • The downside
  • The buzzwords that are more buzz than substance
  • Corporate Information Factories
  • Endless, meaningless debate
  • "That's not an Operational Data Store!"
  • "Do you follow Kimball or Inmon?"
  • Follow your own common sense
  • Most of these authors have not had to build a
    data warehouse from scratch and live with their
    decisions through a complete lifecycle

ETL Operations
  • Besides the cultural risks and challenges, the
    riskiest part of a data warehouse
  • Good book
  • Westerman, WalMart Data Warehousing
  • The Extract, Transform, and Load processes
  • Worthy of its own Best Practices discussion
  • Suffice to say, mitigate risks in this area
    carefully and deliberately
  • The major design errors don't show up until late in the lifecycle, when the cost of repair is great

Two Essential ETL Functions
  • Initial loads
  • How far back do we go in history?
  • Maintenance loads
  • Differential loads or total refresh?
  • How often?
  • You will run and tune these processes several
    times before you go into production
  • How many records are we dealing with?
  • How long will this take to run?
  • What's the impact on the source system?

Maintenance Loads
  • Total refresh vs. incremental loads
  • Total refresh: Truncate and reload everything from the source system
  • Incremental: Load only the new and updated records
  • For small data sets, a total refresh strategy is the easiest to implement
  • How do you define small? You'll know it when you see it.
  • Sometimes the fastest strategy when you are trying to show quick results
  • Grab and go

Incremental Loads
  • How do we get a snapshot of the data that has
    changed since the last load?
  • Many source systems will have an existing log
    file of some kind
  • Take advantage of these when you can, otherwise
    incremental loads can be complicated

File Transfer Formats
  • Design your extract so that it uses
  • Fixed, predetermined lengths for all records and fields
  • Avoid variable length if at all possible
  • A unique character that separates each field in a record, such as
  • A standard format for header records across all source systems
  • Such as the first three records in each file
  • Include name of source system, file, record count, and number of fields in the record
  • This will be handy for monitoring jobs and collecting load metadata
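A sketch of such a transfer format: fixed-width fields plus a three-record header carrying source system, file name, record count, and field count. The field widths, header labels, and file path are invented for illustration.

```python
import os
import tempfile

FIELDS = [("id", 6), ("region", 10)]   # (field name, fixed width)

def write_transfer_file(path, source, rows):
    with open(path, "w") as f:
        f.write(f"SOURCE {source}\n")                   # header record 1
        f.write(f"FILE   {os.path.basename(path)}\n")   # header record 2
        f.write(f"COUNTS {len(rows)} {len(FIELDS)}\n")  # header record 3
        for row in rows:
            f.write("".join(str(v).ljust(w)
                            for v, (_, w) in zip(row, FIELDS)) + "\n")

def read_transfer_file(path):
    with open(path) as f:
        lines = f.read().splitlines()
    n_records, n_fields = map(int, lines[2].split()[1:])
    assert n_fields == len(FIELDS)     # format sanity check
    records = []
    for line in lines[3:]:
        pos, rec = 0, []
        for _, width in FIELDS:        # fixed widths make parsing trivial
            rec.append(line[pos:pos + width].rstrip())
            pos += width
        records.append(tuple(rec))
    assert len(records) == n_records   # header count doubles as a load check
    return records

path = os.path.join(tempfile.gettempdir(), "extract.dat")
write_transfer_file(path, "CASEMIX", [(1, "North"), (2, "South")])
recs = read_transfer_file(path)
```

The header-count assertion is the hook the next slide exploits: the same three records feed a Load Status table for free.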

Benefits of Standard File Transfer Format
  • Compatible with standard database and operating
    system utilities
  • Dynamically create initial and maintenance load scripts
  • Read the table definitions (DDL) then merge that
    with the standard transfer file format
  • Dynamically generate load monitoring data
  • Read the header row, insert that into a Load Status table with status "Running", # of records, start time
  • At EOF, change status to "Complete" and capture end-of-load time
  • I wish I had thought about this topic more, and earlier in my career

Westerman Makes A Good Point
  • If you design for instantaneous updates from the
    beginning, it translates to less ETL maintenance
    and labor time for the EDW staff, later
  • My experience: ETL is the least tasteful and productive use of a veteran EDW Team member, so I like Westerman's insight on this topic

Messaging Applied to ETL
  • Motive behind the concept
  • Continuous updates in a mixed workload
  • Mixed: Analytical processing at the same time as transaction-oriented, constant updates, deletes, and inserts
  • Basic concepts
  • Use a load message queue for records that need to
    be updated, coming from the source systems
  • When the EDW analytical processing workload is
    low (off-peak), pick the next message off the
    load queue and load the data
  • Run this in parallel so that you can process
    several load messages at the same time while you
    have a window of opportunity
  • Sometimes called throttling
  • Speed up and slow down based upon traffic
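The queue-and-throttle idea above can be sketched with Python's standard `queue` module; the batch sizes standing in for "off-peak" and "peak" workload are arbitrary assumptions:

```python
from queue import Queue

# Load messages from source systems wait on a queue; a drain step
# applies only as many as the current analytical workload allows.
load_queue = Queue()
for message in [("update", 1), ("insert", 2), ("delete", 3), ("insert", 4)]:
    load_queue.put(message)

applied_ops = []  # stand-in for the warehouse tables being maintained

def drain(queue, batch_size):
    """Apply up to batch_size messages, then yield back to query traffic."""
    applied = 0
    while applied < batch_size and not queue.empty():
        applied_ops.append(queue.get())   # stand-in for the real apply step
        applied += 1
    return applied

first = drain(load_queue, batch_size=3)   # off-peak window: speed up
second = drain(load_queue, batch_size=1)  # analytical load rose: throttle down
```

In a real implementation the drain step would run several workers in parallel during the off-peak window, as the slide suggests.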

ETL Message Queue Process
Four Data Maintenance Processes
  • Initial load
  • Loading into an empty table
  • Append load
  • Update process
  • Delete process
  • As much as practical, use your database utilities
    for these processes
  • Study and know your database utilities for data
    warehousing; they are getting better all the time
  • I see some bad strategies in this area, e.g.,
    companies spending time building their own
    utilities from scratch

A Few Planning Thoughts
  • Understand the percentage of records that will be
    updated, deleted, or inserted
  • You'll probably develop a different process for
    90% inserts vs. 90% updates
  • Logging
  • In general, turn logging off during the
    processes, if logging was on at all
  • Field vs. Record level updates
  • Some folks, in the interest of purity, will build
    complex update processes for passing only field
    (attribute) level changes
  • No-brainer: pass the whole record

Initial Load
  • Every table will, at some time, require an
    initial load
  • For some tables, it will be the best choice for
    data maintenance
  • Total data refresh
  • Best for small tables
  • Simple process to implement
  • Simply delete (or truncate) and reload with
    fresh data

A Better Initial Load Process
  • Background load
  • Safer protects against corrupt files
  • Higher availability to customers
  • Three or four steps, maybe six
  • Create a temporary table
  • Load the temporary table
  • Run quality checks
  • Rename the temporary table to the production
    table name
  • Delete the old table
  • Regrant rights, if necessary
  • Westerman: You want to use as many initial load
    processes as possible.
  • I agree!
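The background-load steps above might look like this in SQLite; a real warehouse would also re-grant rights after the swap, and all table names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER, name TEXT);
    INSERT INTO products VALUES (1, 'stale');
""")

def background_refresh(conn, fresh_rows):
    """Load a temp table in the background, quality-check it, then swap
    names; readers see the old table until the instant of the rename."""
    conn.execute("CREATE TABLE products_new (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO products_new VALUES (?, ?)", fresh_rows)
    # Quality check: refuse to swap in an empty (possibly corrupt) load.
    (n,) = conn.execute("SELECT COUNT(*) FROM products_new").fetchone()
    if n == 0:
        conn.execute("DROP TABLE products_new")
        raise RuntimeError("refusing to swap in an empty load")
    conn.execute("ALTER TABLE products RENAME TO products_old")
    conn.execute("ALTER TABLE products_new RENAME TO products")
    conn.execute("DROP TABLE products_old")  # then regrant rights, if needed

background_refresh(conn, [(1, "fresh"), (2, "fresher")])
rows = conn.execute("SELECT * FROM products ORDER BY id").fetchall()
```

The rename step is a single fast DDL operation, which is what gives this approach its safety (corrupt files never reach production) and its high availability (customers lose access only for the instant of the swap).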

Append Load
  • For larger tables that accumulate historical data
  • There are no updates, just appends
  • A hard fact that will not change
  • Example
  • Sales that are closed
  • Lab results

Append Load Options
  • Load a single part of a table
  • Load a partition and attach it to the table
  • Create a new, empty partition
  • Load the new records
  • Attach the partition to the table
  • Look for a LOAD APPEND (or equivalent) command in
    your database's load utility

Another Append Option
  • Create a temp table identical to the one you are
    appending to
  • Load the new records into the empty temp table
  • Insert the temp table's contents into the
    production table
  • Delete the temp table
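A minimal SQLite sketch of this append option (table names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lab_results (id INTEGER, result TEXT);
    INSERT INTO lab_results VALUES (1, 'A');
""")
# Step 1: temp table identical in shape to the production table.
conn.execute("CREATE TEMP TABLE lab_results_stage (id INTEGER, result TEXT)")
# Step 2: load the new records into the empty temp table.
conn.executemany("INSERT INTO lab_results_stage VALUES (?, ?)",
                 [(2, "B"), (3, "C")])
# Step 3: append the staged rows, then drop the temp table.
conn.execute("INSERT INTO lab_results SELECT * FROM lab_results_stage")
conn.execute("DROP TABLE lab_results_stage")
rows = conn.execute("SELECT * FROM lab_results ORDER BY id").fetchall()
```

Staging through the temp table means a corrupt transfer file fails before any rows touch the production table.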

Update Process
  • The most difficult and risky to build
  • Use this process only if the tables are too large
    for a complete refresh (the Initial Load process)
  • Updates affect data that changes over time
  • Like Purchase Orders, hospital transactions, etc.
  • Medical records, if you treat the data
    maintenance at the macroscopic level

Update Process Options
  • Simple process
  • Separate the affected records into an update
    file, insert file, or delete file
  • Do this on the source system, if possible
  • Transfer the files to the data warehouse staging
    area
  • Create and run two processes
  • A delete process for deleting the records in the
    production table that need to be updated or
    deleted
  • An insert process for inserting the entirely new
    updated record into the production table, as
    well as the true inserts
  • Simple, but typically not very fast

Simple Process
  • The Delete Process identifies records for
    deletion from the Production Table based upon the
    contents of the Updates and Deletes files, then
    deletes them from the Production Table.
  • The Insert Process identifies records for insert
    based upon the contents of the Updates and
    Inserts files, then inserts them into the
    Production Table.

Source System
Updated records
Deleted records
New records
EDW Staging Area
EDW Production Table
Delete Process
Insert Process
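The delete-then-insert flow in the diagram can be sketched in SQLite; the three staged files are represented here as Python lists, and all names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod (id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO prod VALUES (1, 'old'), (2, 'keep'), (3, 'gone');
""")
# Stand-ins for the three files staged from the source system.
updates = [(1, "new")]          # changed records, as full rows
inserts = [(4, "brand-new")]    # entirely new records
deletes = [3]                   # keys to remove

# Delete process: remove rows that will be updated or deleted.
conn.executemany("DELETE FROM prod WHERE id = ?",
                 [(u[0],) for u in updates] + [(d,) for d in deletes])
# Insert process: add the full updated rows plus the true inserts.
conn.executemany("INSERT INTO prod VALUES (?, ?)", updates + inserts)
rows = conn.execute("SELECT * FROM prod ORDER BY id").fetchall()
```

Note how the updated record rides through both processes: it is deleted first, then re-inserted whole, which is why the slide recommends passing the entire record rather than field-level changes.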
When You Are Unsure
  • Sometimes, source system log and audit files make
    it difficult to know if a record was updated or
    inserted (i.e. created)
  • Try this
  • Load the records into a temp table that is
    identical to the production table to be updated
  • Delete the corresponding records from the
    production table
  • DELETE FROM prod_table WHERE key_field IN
    (SELECT temp_key_field FROM temp_table)
  • Insert all the records from the temp table into
    the production table
  • Most databases now support this pattern directly
    with an UPSERT or MERGE statement
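The temp-table technique above, end to end, in SQLite (names follow the slide's `prod_table`/`temp_table`; the `val` column is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod_table (key_field INTEGER, val TEXT);
    INSERT INTO prod_table VALUES (1, 'old'), (2, 'keep');
    CREATE TEMP TABLE temp_table (key_field INTEGER, val TEXT);
    INSERT INTO temp_table VALUES (1, 'updated'), (3, 'created');
""")
# Delete any production rows that also appear in the temp table...
conn.execute("""DELETE FROM prod_table WHERE key_field
                IN (SELECT key_field FROM temp_table)""")
# ...then insert everything from the temp table: updates and true
# inserts both land correctly without knowing which was which.
conn.execute("INSERT INTO prod_table SELECT * FROM temp_table")
rows = conn.execute(
    "SELECT * FROM prod_table ORDER BY key_field").fetchall()
```

This is exactly why the technique helps when the source logs are ambiguous: row 1 behaves as an update and row 3 as an insert, with one identical code path for both.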

Massive Deletes
  • Just as with Updates and Inserts, the number of
    Deletes you have to manage is inversely
    proportional to the frequency of your ETL
  • Infrequent ETL Massive data operations
  • Partitions work well for this, again
  • E.g., keeping a 5-year window of data
  • Insert the most recent year as a new partition
  • Drop the oldest year's partition
  • Blazing fast!

Delete partition
Insert partition
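SQLite has no native partitioning, so this sketch simulates the rolling five-year window with per-year tables behind a UNION ALL view; a real warehouse would use the database's own partition DDL, but the shape of the operation (one fast DDL step, no row-by-row deletes) is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table per year stands in for one partition per year.
for year in (2001, 2002, 2003, 2004, 2005):
    conn.execute(f"CREATE TABLE sales_{year} (id INTEGER, amt REAL)")
    conn.execute(f"INSERT INTO sales_{year} VALUES ({year}, 1.0)")

def rotate(conn, new_year, years):
    """Keep a fixed window: add the newest 'partition', drop the oldest."""
    conn.execute(f"CREATE TABLE sales_{new_year} (id INTEGER, amt REAL)")
    years.append(new_year)
    oldest = years.pop(0)
    conn.execute(f"DROP TABLE sales_{oldest}")  # one DDL op, no row scans
    conn.execute("DROP VIEW IF EXISTS sales")
    conn.execute("CREATE VIEW sales AS " +
                 " UNION ALL ".join(f"SELECT * FROM sales_{y}" for y in years))
    return years

years = rotate(conn, 2006, [2001, 2002, 2003, 2004, 2005])
(n,) = conn.execute("SELECT COUNT(*) FROM sales").fetchone()
```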
Raw Data Standards for ETL
  • Makes the process of communicating with your
    source system partners much easier
  • Data types (e.g., the format for date/time
    stamps)
  • File formats (ASCII vs. EBCDIC)
  • Header records
  • Control characters
  • Rule of thumb
  • Never transfer data at the binary level unless
    you are transferring between binary compatible
    computer systems
  • Use only text-displayable characters
  • The tradeoff: a little more storage space and
    transfer time, but much less rework time
  • Storage and CPU time are cheap compared to labor

Last ThoughtIndexing Strategies
  • Define these early, practice them religiously,
    use them extensively
  • This is Database Design 101
  • Don't fall prey to this most common performance
    problem!

In Summary
  • Best Practices
  • They are as much about what you should do, as
    what you should not do
  • Riskiest areas of data warehousing
  • Poor risk management
  • Risk = Probability x Consequence
  • Data quality and ownership politics lead to
    project quagmire
  • Lack of data warehousing skills on the IS team
  • Bad basic old IS operations around the DW
  • Like storage, backup, indexing, and day-to-day
    management
  • Exploding and imploding ETL
  • A corporate culture that doesn't care about data
My Thanks
  • For being invited
  • For your time and attention
  • For the many folks who have worked for and with
    me over the years that made me look better as a
  • Please contact me if you have any questions
  • PH 312-695-8618