Data Warehousing


Transcript and Presenter's Notes

Title: Data Warehousing

  • A Look Back, Moving Forward

Dale Sanders June 2005
Introduction Warnings
  • Why am I here?
  • Teach
  • Stimulate some thought
  • Share some of my experiences and lessons
  • Learn
  • From you, please
  • Ask questions, challenge opinions, share your own
  • I'll do my best to live up to my end of the bargain
  • Warnings
  • The pictures in this presentation
  • May or may not have any relevance
  • whatsoever to the topic or slide
  • Mostly intended to break up the monotony

Expectation Management
  • My DW Strengths (according to others)
  • I know what not to do as much as I know what to do
  • Seen and made all the big mistakes
  • Vision, strategy, system architecture, data
    management DW modeling, complex cultural
    issues, leapfrog problem solving
  • My DW weaknesses
  • My programming skills suck and I'm not a DBA
  • Haven't written a decent line of code in four years
  • Some might say it's been 24 years
  • Knowledge of leading products is very rusty
  • Though I'm beefing up on Microsoft and Cognos

Within these expectations, make no mistake about
it: I know data warehousing
Today's Material
Giving due credit.
  • I am a good Idea Guy
  • But, ideas are worthless without someone to
    implement and enhance them
  • Steve Barlow, Dan Lidgard, Jon Despain, Chuck
    Lyon, Laure Shull, Kris Mitchell, Peter Hess,
    Nancy McLeod, Ron Gault, Rob Carpenter, Tom
    Robison, Stan Smith, my wife, and many others
  • My greatest strength and blessing
  • The ability to recognize, listen to, and hold onto good people
  • (Knock on wood)
  • My achievements in personal and professional life
  • More a function of those around me than a
    reflection on me

DW Best Practices The Most Important Metrics
  • Employee satisfaction
  • Without it, long-term customer satisfaction is impossible
  • Customer satisfaction
  • That's the nature of the Information Services career field
  • Some people in our profession still don't get it
  • We are here to serve
  • The Organizational Laugh Metric
  • How many times do you hear laughter in the
    day-to-day operations of your team?
  • It is the single most important vital sign to
    organizational health and business success

My Background
  • Three eight-year chapters
  • Captain, Information Systems Engineer, US Air Force
  • Nuclear warfare battle management
  • Force status data integration
  • Intelligence and attack warning data fusion
  • Consultant in several capacities
  • TRW
  • National Security Agency (NSA)
  • Intel New Mexico Data Repository (NMDR)
  • Air Force
  • Integrated Minuteman Data Base (IMDB)
  • Peacekeeper Information Retrieval System (PIRS)
  • Information Technology International (ITI, Inc.)
  • Healthcare
  • Intermountain Health Care Enterprise Data Warehouse
  • Consultant to other healthcare organizations
    data warehouses
  • Now at Northwestern University Healthcare System

  • Data warehousing history
  • According to Sanders
  • Why and how did this become a sub-specialty in
    information systems?
  • What have we learned so far?
  • My take on Best Practices
  • Key lessons-learned
  • My thoughts on the most popular authors in the field
  • What they contribute, where they detract

Data Warehousing History
Newspaper Rock 100 B.C.
American Retail 2005 A.D.
What Happened in the Cloud?
  • Stage 1: Laziness
  • Operators grew tired of hanging tapes
  • In response to requests for historical financial reports
  • They stored data on-line, in unauthorized mainframe databases
  • Stage 2: End of the mainframe bully
  • Computing moved out from finance to the rest of the business
  • Unix and relational databases
  • Distributed computing created islands of data
  • Stage 2.1: The government gets involved
  • Consolidating IRS and military databases to save money on mainframes
  • "Hey, look what I can do with this data"
  • Stage 3: Deming comes along
  • Push towards constant business reengineering
  • Cultural emphasis on continuous quality improvement and business innovation drives the need for data

The Real Truth
  • Data warehousing is a symptom of a problem
  • Technological inability to deploy single-platform
    information systems that
  • Capture data once and reuse it throughout an enterprise
  • Support high-transaction rates (single record
    CREATE, SELECT, UPDATE, DELETE) and analytic
    queries on the same computing platform, with the
    same data, at the same time
  • Someday, maybe we will address the root cause
  • Until then, it's a good way to make a living

The Ideal Library Analogy
  • Stores all of the books and other reference
    material you need to conduct your research
  • The Enterprise data warehouse
  • A single place to visit
  • One database environment
  • Contents are kept current and refreshed
  • Timely, well-choreographed data loads
  • Staffed with friendly, knowledgeable people that
    can help you find your way around
  • Your Data Warehouse team
  • Organized for easy navigation and use
  • Metadata
  • Data models
  • User friendly naming conventions
  • Solid architectural infrastructure
  • Hardware, software, standards, metrics

Cultural Detractors
  • The two biggies
  • The business supported by the data warehouse must
    be motivated by a desire for constant improvement
    and fact-based decision making
  • The data warehouse team falls victim to the
    Politics of Data
  • Through naivety
  • Through misguided motives, themselves

Business Culture
I love data!
  • Does your CEO
  • Talk about constant improvement, constantly?
  • Drive corporate goals that are SMART?
  • Specific, Measurable, Attainable, Realistic, Time-bound
  • Crave data to make better informed decisions?
  • Become visibly, buoyantly excited at a demo for a
    data cube?
  • If so, the success of your data warehouse is right around the corner… sort of

Political Best Practices
  • You will be called a data thief
  • Get used to it
  • Encourage life cycle ownership of the OLTP data,
    even in the EDW
  • You will be called dangerous
  • "You don't understand our data!"
  • OLTP owners know their data better than you do; acknowledge it and leverage it
  • You will be blamed for poor data quality in the
    OLTP systems
  • This is a natural reaction
  • Data warehouses raise the visibility of poor data quality
  • Use the EDW as a tool for raising overall data quality
  • You will be called a job robber
  • EDW is perceived as a replacement for OLTP
  • Educate people The EDW depends on OLTP systems
    for its existence
  • Stick to your values and pure motives
  • The politics will fade away

Data Quality
  • Pitfall
  • Taking accountability for data quality on the
    source system
  • Spending gobs of time and money cleansing data before it's loaded into the DW
  • It's a never-ending, never-win battle
  • You will always be one step behind data quality problems
  • You will always be in the cross-hairs of blame
  • Best Practice
  • Push accountability where it belongs: to the source system
  • Use the data warehouse as a tool to reveal data
    quality, either good or bad
  • Be prepared to weather the initial storm of blame

Measuring Data Quality
  • Data Quality = Completeness x Validity
  • Can it be measured objectively?
  • Measuring Completeness
  • Number of null values in a column
  • Measuring Validity
  • Cardinality is a simple way to measure validity
  • "We only have four standard regions in the business, but we have 18 distinct values in the region column."
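The Completeness x Validity arithmetic above can be sketched against a toy table. This is a minimal sketch: the table name, column, and region values are invented for illustration, not from the deck.

```python
import sqlite3

# Hypothetical sales table; one null region, one invalid value ("Nrth").
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER, region TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, "North"), (2, "South"), (3, None), (4, "Nrth"), (5, "East")])

def completeness(con, table, column):
    """Fraction of rows where the column is non-null."""
    total = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    filled = con.execute(f"SELECT COUNT({column}) FROM {table}").fetchone()[0]
    return filled / total if total else 1.0

def validity(con, table, column, allowed):
    """Fraction of non-null values that fall in the allowed domain."""
    rows = con.execute(
        f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL").fetchall()
    if not rows:
        return 1.0
    ok = sum(1 for (v,) in rows if v in allowed)
    return ok / len(rows)

REGIONS = {"North", "South", "East", "West"}
c = completeness(con, "sales", "region")       # 4 of 5 rows filled -> 0.8
v = validity(con, "sales", "region", REGIONS)  # "Nrth" is invalid  -> 0.75
quality = c * v                                # Completeness x Validity -> 0.6
```

A cardinality check like the 18-distinct-regions example falls out of the same pattern: compare `COUNT(DISTINCT region)` against the size of the allowed domain.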

Business Validity
  • How can you measure it? You can't
  • "I collect this data from our customers, but I have to guess sometimes because I don't speak their language."
  • "This data is valid for trend analysis decisions before 9/11/2001, but should not be used after that date, due to changes in security."
  • "This customer purchased four copies of Zamfir, Master of the Pan Flute, therefore he loves everything about Zamfir."
  • What Amazon didn't know: "I bought them for my mom and her sewing circle."
  • You can't use insurance billing and reimbursement data to make clinical, patient care decisions

Where do you capture subjective data quality?
The Importance of Metadata
  • Maybe the most over-hyped, underserved area of
    data warehousing common sense
  • Vendors want to charge you big for their metadata products
  • Consultants would like you to think that it's the Holy Grail in disguise and only they can help you find it
  • Authors who have never been in an operational environment would have you chasing your tail in pursuit of an esoteric, mythological Metadata ideal
  • Don't listen to the confusing messages! You know the answer; just listen to your common sense

Metadata Keep It Simple!
  • Ultimately, what are the most valuable business
    motives behind metadata?
  • Make data more understandable to those who are not familiar with it
  • Data quality issues
  • Data timeliness and temporal issues
  • Context in which it was collected
  • Translating physical names to natural language
  • Make data more findable to those who don't know where it is
  • Organize it
  • Take a lesson from library science and the card catalog

Metadata on the Web
Example Metadata Entry
  • Physical Table Name
  • Plain Language Name
  • Postal Code Master - IHC
  • Description
  • Contains Postal (Zip) codes for the IHC referral region and IHC-specific descriptions. These descriptions allow for specific IHC groupings used in various analyses. Data Steward: Jim Allred, ext. 3518

Table Metadata
  • Required Elements
  • Long Name (or English name)
  • Description
  • Semi-optional
  • Source
  • Example
  • Data Steward

Column Metadata
  • Required
  • Long Name
  • Description
  • Notable ETL transforms
  • Optional
  • Value Range
  • Data Quality
  • Associated Lookup

The Data Model
Some Metadata Is Free
  • It can be collected from the data dictionary
  • For example
  • Primary and Foreign Keys
  • Indexed Columns
  • Table Creation Dates

The Most Valuable Metadata is Subjective
  • The Human Factor
  • Most metadata is not automatically collected by
    tools because it does NOT exist in that form
  • Interviews with data stewards are the key
  • It can take months of effort to collect the initial metadata
Holding Feet to the Fire
  • Data architects
  • Responsible for metadata in their subject areas
  • Torturing it from data stewards
  • Firm Principle
  • No new data marts go live without metadata

Is Metadata Worth The Effort?
  • Good metadata will push the ROI of a ho-hum DW into the stratosphere
  • It does for DWing what the Yellow Pages did for
    the business ROI of the telephone

It Gets Used
  • At Intermountain Health Care
  • 210 web hits on average each weekday
  • (23,000 employees, $2B revenue)

What's New In Metadata
Report Quality
  • A function of
  • Data quality
  • How well does the report reflect the intent
    behind the question being asked?
  • "This report doesn't make sense. I'm trying to find out how many widgets we can produce next year, based on the last four years' production."
  • "That's not what you asked for."
  • SQL and other programming accuracy
  • Statistical validity (population size of the data)
  • Timeliness of the data relative to the decision
  • Event Correlation
  • Best Practice
  • An accompanying "meta-report" for every report that involves significant, high-risk decisions

Meta Report
  • A document, associated with a published report,
    which defines the report

  • A central place for storing and sharing
    information about business reports

IHC Analysts' Use of Meta Reports
Meta Report
  • Recommended Elements
  • Business Owner
  • Run Frequency
  • Intended Audience
  • Statistical Tests
  • Software Used
  • Source Code
  • Formulas
  • Relevant Issues Commentary
  • Core Elements
  • Author Information
  • Report Name
  • Report Purpose
  • Data Source(s)
  • Report Methods

  • Title
  • Location
  • Author
  • Owner

  • Purpose
  • Frequency
  • Audience
  • Data Source(s)

  • Selection Criteria
  • Statistics
  • Software
  • Source Code
  • Formulas

What's It Look Like?
Utilization and Creation Rate
EDW Operations Best Practices
  • Think Mission Control
  • Mission Control Commander
  • Dan Lidgard, EDW Operations Manager
  • Operations metrics
  • Customized ETL Library
  • Schedule of operations
  • Alerting tool
  • Storage strategies / backups
  • Performance monitoring and tuning

IHC Architecture
  • EDW
  • Oracle v on AIX 5.2
  • Storage: IBM SAN (Shark), >3 TB
  • ETL tools
  • Ascential's DataStage
  • Korn shell (Unix), SQL scripts, PL/SQL scripting
  • OLAP: MS Analysis Services
  • BI: Business Objects (Crystal Enterprise)
  • With a Cube presentation layer
  • Dashboard: Visual Mining's NetCharts
  • EDW Team: 16 FTEs, plus SAs and DBAs

Customized ETL Library
  • One of the IHC ETL programmers noticed he kept
    doing the same things over and over for all of
    his ETL jobs
  • Rather than copying and pasting this repetitive
    code, he created a library.
  • Now everyone uses it on the team
  • Named the library EDW_UTIL (EDW Utilities)

  • Executes via Oracle stored procedures
  • Supported by associated tables to hold data when needed
  • Error table
  • QA table
  • Index table

  • Provides standardization
  • Eliminates code rewrites
  • Can hide complexities
  • Such as the appropriate way to analyze and gather statistics on tables
  • Very accessible to all of our ETL tools
  • Simply an Oracle stored procedure call

Index Management
  • Past process included
  • Dropping the table's indexes with a script
  • Loading the table
  • Creating the indexes with a script
  • The past process resulted in messy scripts to manage and coordinate

Index Management
  • New process includes
  • Capturing a table's existing index metadata
  • Dropping the table's indexes with a single procedure call
  • Loading the table
  • Recreating the indexes with a single procedure call
  • There are no more messy scripts to manage and coordinate
  • No more indexes lost because they were neglected when updating the create-index script

Index Management
  • Samples
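The capture/drop/load/recreate pattern can be sketched with SQLite's catalog standing in for Oracle's data dictionary. IHC's version lived in EDW_UTIL as Oracle stored procedures; the function names and table here are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (id INTEGER, val TEXT)")
con.execute("CREATE INDEX ix_fact_id ON fact (id)")

def capture_indexes(con, table):
    """Step 1: capture the table's existing index DDL from the catalog."""
    return [ddl for (ddl,) in con.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type='index' AND tbl_name=? AND sql IS NOT NULL", (table,))]

def drop_indexes(con, table):
    """Step 2: drop the indexes so the bulk load runs unencumbered."""
    names = con.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type='index' AND tbl_name=? AND sql IS NOT NULL",
        (table,)).fetchall()
    for (name,) in names:
        con.execute(f"DROP INDEX {name}")

saved = capture_indexes(con, "fact")   # remember the DDL
drop_indexes(con, "fact")              # drop before the load
con.executemany("INSERT INTO fact VALUES (?, ?)",
                [(i, f"row{i}") for i in range(1000)])  # bulk load
for ddl in saved:                      # recreate from captured metadata
    con.execute(ddl)
```

Because the DDL is captured from the live catalog each run, an index added later is never "lost" by a stale script, which is the failure mode the new process removes.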

Background Table Loads
  • Motive Higher availability ratings
  • Solution
  • Load data into tables which are not accessible to end users. A simple rename puts them into production.
  • Helps transfer the identical attributes from the
    live to the background table
  • Samples
  • CREATE_TABLE_INDEXES(Create on background table,
    identical to production table)

Tracking Load Times, Errors, QA
  • We had no idea who was loading what and when
  • Each staff member logged in their own way and for
    their own interest
  • ETL error capturing and QA was difficult
  • We can now capture errors and QA information in
    a somewhat standardized fashion

Load Times, Errors, QA
  • Samples
  • (e.g., CASEMIX)
  • COMPLETE_LOAD_TIME(Begin and end together)
  • LOAD_TIME_ERROR(Alert on these errors)
  • LOAD_TIME_METRICSQA (row counts)
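A minimal sketch of standardized load logging in the spirit of the COMPLETE_LOAD_TIME / LOAD_TIME_ERROR / LOAD_TIME_METRICSQA calls. The table layout, column names, and wrapper function are assumptions, not the IHC schema.

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE load_times (
    job TEXT, started REAL, ended REAL, status TEXT, rows_loaded INTEGER)""")

def complete_load_time(con, job, load_fn):
    """Wrap a load: record begin and end together, errors, and QA row counts."""
    start = time.time()
    try:
        rows = load_fn()          # run the actual ETL job
        status = "OK"
    except Exception as exc:      # LOAD_TIME_ERROR equivalent: alertable
        rows, status = 0, f"ERROR: {exc}"
    con.execute("INSERT INTO load_times VALUES (?, ?, ?, ?, ?)",
                (job, start, time.time(), status, rows))
    return status

def casemix_load():
    return 42  # stand-in for a real load; pretend we loaded 42 rows

status = complete_load_time(con, "CASEMIX", casemix_load)
```

Because every job logs through the same wrapper, "who loaded what and when" becomes a single query against `load_times` rather than a hunt through per-developer logs.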

Miscellaneous Procedures
  • Motive: Hide the gory details and make Oracle look pretty
  • Such as Oracle's table analyze command
  • Gives consistent application of system-wide parameters, such as
  • A new box with a different number of CPUs (parallel slaves), or
  • A new version of Oracle
  • We populate some metadata too, such as last load time

DW Schedule of Operations
  • Motives
  • Choreography of operations
  • Avoid stepping on each other's toes
  • Maximize shrinking windows of opportunity
  • Communicating with customers
  • What's happening when
  • Challenges to a complete schedule
  • Some loads are ad hoc, not scheduled
  • Users query in an ad hoc fashion
  • Use of a variety of ETL tools
  • Use of a variety of user categories
  • DBA, SA, ETL user, end users
  • Lots of servers to manage
  • Production EDW, Stage EDW, ETL servers, OLAP
    servers, presentation layer servers

  • Focus on operations around the production EDW
  • Still working on all the reporting aspects
  • A sample on the next slide
  • Pull this information out of the load times
    data captured by these ETL library calls

Sample Report
DW Alerting Tool
  • Motive: Business dashboard alerts
  • Aggregate data alerts, such as, "The number of patients seen in the ER the last week with chief complaint of flu indicates the onset of an outbreak"
  • A simple tool was created which sends a text email, based on existence of data returned from a query
  • Healthcare isn't ready for this, yet
  • But, the EDW Team took advantage of the tool in unplanned ways

  • Web based
  • Open to all EDW users
  • Flexible run schedules
  • Daily, weekly, every two weeks, monthly, etc.
  • Wakes every 5 minutes
  • Ability to enter query in SQL
  • Alert (email) on 3 situations
  • Query returns data
  • Query returns no data
  • Always
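The three alert modes can be sketched as follows. `check_alert` and its arguments are hypothetical, and the email send is replaced by a callback so the sketch stays self-contained.

```python
import sqlite3

def check_alert(con, sql, mode, send):
    """Run a user-supplied query; mode is 'on_data', 'on_no_data', or 'always'."""
    rows = con.execute(sql).fetchall()
    fire = bool(mode == "always"
                or (mode == "on_data" and rows)
                or (mode == "on_no_data" and not rows))
    if fire:
        send(f"alert fired: {len(rows)} row(s)")   # stand-in for the text email
    return fire

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sessions (age_hours INTEGER)")
con.execute("INSERT INTO sessions VALUES (80)")    # a session older than 72h

sent = []
fired = check_alert(con, "SELECT * FROM sessions WHERE age_hours > 72",
                    "on_data", sent.append)        # fires: query returned data
```

A scheduler that wakes every few minutes and runs each registered alert's query in its configured mode is all the rest of the tool amounts to.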

User Interface
  • 100 alerts in use
  • Live performance check
  • Every 4 hours
  • Alert: Check for inactive sessions holding active slaves
  • Daily
  • Alert: Look for any active sessions older than 72 hours
  • ETL monitoring
  • Alert only if problem
  • Alert on errors logged via the ETL_UTIL library (manage by exception)
  • Alert on existence of bad records captured during ETL

Storage and Backup
  • The inherited state of affairs
  • DW running like any OLTP database
  • High end expensive SANs
  • FULL nightly online backups
  • Out of space? Just buy more

Nightmare in the Making
  • Exponential growth
  • More data sources
  • More summary tables
  • More indexes
  • No data has yet been purged
  • Relaxed attitude
  • "Disk is cheap"
  • Reality: Disk management is expensive
  • Consequence: Negating the cost benefits of cheaper storage by becoming careless with the disk
Looming Crisis
  • Backups running 16 hours or more
  • Performance degradation witnessed by users
  • Good backups obtained less than 50% of the time
  • Squeezing into load windows
  • Running out of space
  • Gross underestimating
  • Financial cost
  • The system administrators (SAs) quadrupled the price of disk purchases from the previous budget
  • They included the price of tape drives, etc.

Major Changes in Operations
  • Stop performing full daily backups
  • Move towards less expensive disk option
  • IBM offers several levels of SANs
  • Transfer some disk ownership AND backup responsibilities to the DW team, away from SAs and DBAs
  • EDW team more aware of upcoming space demands
  • EDW team more in tune with which data sets are easily recreated from the source and don't need a backup

Tracking and Predicting Storage Use
Changes to Backup Strategy
  • Perform full backup once monthly during downtime
  • Perform no data backup on DEV/STAGE environments
  • Do backup DDL (all code) daily in all environments
  • Implement daily incremental backup

Daily Incremental Backups
  • Easier said than done
  • Resorted to a table level backup
  • The DW team owns which tables are exported
  • DW team populates a table, the "export table list", with each table's export frequency
  • Populated via an application in development
  • The DBAs run an export based on the "export table list"
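A sketch of how an "export table list" with per-table frequencies might drive which tables get exported on a given day. The table names, frequencies, and scheduling rules are invented for illustration.

```python
import datetime

# Stand-in for the DW team's export table list.
EXPORT_TABLE_LIST = {
    "CASEMIX_FACT":   "daily",
    "REGION_LOOKUP":  "weekly",
    "CLAIMS_ARCHIVE": "monthly",
}

def tables_due(today):
    """Return the tables whose export frequency falls due today."""
    due = []
    for table, freq in EXPORT_TABLE_LIST.items():
        if (freq == "daily"
                or (freq == "weekly" and today.weekday() == 6)  # Sundays
                or (freq == "monthly" and today.day == 1)):     # 1st of month
            due.append(table)
    return sorted(due)

# On a 1st of the month that is also a Sunday, everything is due:
due = tables_due(datetime.date(2023, 10, 1))
```

The DBAs' export job then only has to iterate `tables_due(today)`; ownership of *what* gets backed up stays with the DW team, exactly the split the slide describes.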

Use Cheaper Disk
  • General practice You can take greater risks with
    DW reliability and availability vs. OLTP systems
  • Use it to your advantage
  • Our SAN vendor (IBM) offers a few levels of SANs.
    Next level down is a big step down in price,
    small step down in features.
  • Feature loss
  • Read cache (referring to disk cache, not box cache)
  • We rarely read the same thing twice anyway
  • No "phone home" to IBM (auto paging)
  • Mean time to failure is higher, but still acceptable
Performance Monitoring Tuning
  • Err on the side of freedom and empowerment
  • How much harm can really be done?
  • We'd rather not constrain our customers
  • Pounding queries will find their way to the surface
  • Opportunity to educate users
  • Opportunity to tune underlying structures

Performance Focus Areas
  • Indexing
  • Well-defined criteria for when and how to apply indexes
  • Is this a lost art?
  • Big use of BITMAP indexes
  • Composite index trick (acts like a table)
  • Partitioning for performance, rather than data management
  • Exploiting Oracle's Direct Path INSERT feature
  • Avoiding UPDATE and DELETE commands
  • Copy with MINUS instead
  • Implementing Oracle's Parallel Query
  • Turn off referential integrity in the DW (no kidding)
  • That's the job of the source system
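The "copy with MINUS" trick can be demonstrated with SQLite, which spells Oracle's MINUS operator as EXCEPT: rather than DELETEing rows in place, rebuild the table from a set difference and swap it in. Table and column names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (id INTEGER, val TEXT)")
con.executemany("INSERT INTO fact VALUES (?, ?)",
                [(1, "keep"), (2, "purge"), (3, "keep")])
con.execute("CREATE TABLE purge_list (id INTEGER, val TEXT)")
con.execute("INSERT INTO purge_list VALUES (2, 'purge')")

# Rebuild instead of DELETE: copy everything except the rows to purge.
# In Oracle this would be CREATE TABLE ... AS SELECT ... MINUS SELECT ...
con.execute("""CREATE TABLE fact_new AS
               SELECT * FROM fact EXCEPT SELECT * FROM purge_list""")
con.execute("DROP TABLE fact")
con.execute("ALTER TABLE fact_new RENAME TO fact")

rows = con.execute("SELECT id FROM fact ORDER BY id").fetchall()
```

The appeal for a warehouse is that the rebuild is a bulk, loggable, direct-path-friendly operation, where row-at-a-time DELETEs generate undo/redo and fragment the table.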

DW Monitoring Empowering End Users
  • Motive
  • Too many calls from end users about their queries
  • "Please kill it."
  • "Is it still running, or is my PC locked up?"
  • "Why is the DW so slow?"
  • Give them the insight and tools
  • Give them the ability to kill their own queries
  • Still in the works

The Insight
Tracking Long-Running Queries
  • At IHC, we used Pinecone (from Ambeo) to monitor
    all queries and the SQL
  • Each week, review the longest queries
  • Typical outcome?
  • Add indexes
  • Denormalize
  • Contact the user and assist them with writing a better query

The DW Sandbox
  • Motives
  • Lots of little MS Access databases containing valuable data, spread all over the place
  • Analysts were joining with DW data
  • Costly to maintain
  • Security issues
  • PC hogs
  • Solution
  • Provide customers with their own sandbox on the DW, with DBA-like privileges

Sandbox Freedom
  • Web based tool for creating tables and loading
    disparate data to the DW
  • Simple, easy to use interface
  • Privileges
  • Users have full rights to the tables they create
  • Can grant rights to others
  • Big, big victory for customer service and data access
  • 10% of DW customers use the Sandbox
  • About 600 tables in use now
  • About 2 GB of data

Best Practices in the Design Phase
  • Design horizontally, build vertically
  • Start by building data marts that address
    analytic needs in one area of the business with a
    fairly limited data set
  • But, design with the horizontal needs of the
    company in mind, so that you will eventually
    tie all of these vertical data marts together
    with a common semantic layer, or data bus

Creating Value In Both Axes
For Example
Oncology Data Integration Strategy
Top down reporting requirements and data model
An Integrated Analytic Data Model For Cancer Care
Disparate Sources connected semantically to the
data bus
Cancer Registry
Mammography Radiology
Continuing Care And Follow-Up
Quality of Life Survey
Radiation Therapy
Health Plans Claims
Ambulatory Casemix
Acute Care Casemix
Development Process
Architectural Data Flow Diagram
Logical Data Model
Physical Data Model
Analytic Reporting Requirements
ETL Requirements
Source-to-Target Data Map and Logic
ETL Programming
Analytic Reporting System
Data Processing Requirements
Analytic Application Development
End User Interface Options
The Logic Layer in Data Warehouses
Transaction Systems
Data Layer
Logic Layer
Presentation Layer
Analytic Systems
Source System
Data Warehouse
ETL Process
Not Here
Evidence of Business Process Alignment
  • Map out your high-level business process
  • Don't fall prey to analysis paralysis with endless business process modeling diagrams!
  • Identify and associate the transaction systems that support those processes
  • Identify the common, overlapping semantics/data attributes and their utilization rates
  • Build your data marts within an enterprise architecture that is aligned with the processes you are trying to understand

For example
Claims Processing
Healthcare business process
Health Need
Patient Perception
Results Outcomes
Supported by non-integrated data in Transaction Systems
Integrated in the Data Warehouse
Data Warehouse
Event Correlation
  • A leading edge Best Practice
  • The third dimension to rows and columns
  • Overlays the data that underlies a report or graph
  • "In 2004, we experienced a drop in revenue as a result of the earthquake that destroyed our plant in the Philippines."
  • "In January of 2005, we saw a spike in the North America market for snow shovel sales that coincided with an increase in sales for pain relievers. This correlates to the record snowfall in that region and should not be considered a trend. Barring major product innovation, we consider the market for snow shovels in this area as saturated. Sales will be slow for the next several years."

Standardizing Semantics
  • Sweet irony: the many synonyms for "standard"
  • Data dictionary
  • Vocabulary
  • Dimensions
  • Data elements
  • Data attributes
  • The bottom line issue Standardizing the terms
    you use to describe key facts about your business

Standardizing Names of Things
  • You'd better do it within the first two months of your data warehouse project
  • If you are beyond that point, you'd better stop and do it now, lest you pay a bigger price later
  • Don't
  • Push the standard on the source systems, unless it's easy to accomplish
  • This was one of the common pitfalls of early data warehousing project failures
  • Try to standardize everything under the sun!
  • Focus on the high-value facts

Where Are The High Value Semantics?
In the high-overlap, high-utilization areas
Source System X
Another Perspective
Semantic Overlap
Semantic Utilization
The Standard Semantic Layer
Data Modeling
  • Star schemas are great and simple, but they aren't the end-all, be-all of analytic data modeling
  • Best practice: Do what makes sense; don't be a schema bigot
  • I've seen great analytic value from 3NF models
  • Maintain data familiarity for your customers
  • When meeting vertical needs
  • Don't make massive changes to the way the model looks and feels, nor the naming conventions; you will alienate existing users of the data
  • Use views to achieve new or standards-compliant perspectives on data
  • When meeting horizontal needs
  • When meeting horizontal needs

For Example
Source perspective
DW perspective
Horizontal data customer
The Case For Timely Updates
Generally, to minimize Total Cost of Ownership
(TCO), your update frequency should be no
greater than the decision-making cycle associated
with the data. But everyone wants more timely data.
Best Practice Measure Yourself
The Data Warehouse Dashboard
  • Total number of query-able tables
  • Total number of query-able columns
  • Number of users
  • Average rows delivered per month
  • Storage utilization
  • CPU utilization
  • Downtime per month by data mart
  • Employee satisfaction
  • Customer satisfaction
  • Average number of queries/month
  • Number of queries above a threshold (30 minutes?)
  • Average query response time
  • Total number of records

Other Best Practices
  • The Data Warehouse Information Systems Team
    reports to the CIO
  • Most data analysts can and probably should report
    to the business units
  • Change management/service level agreements with
    the source systems
  • No changes in the source systems unless they are coordinated with the data warehouse team

More Best Practices
  • Skills of the Data Warehouse IS Team
  • Experienced chief architect/project manager
  • Procedural/script programmers
  • SQL/declarative programmers
  • Data warehouse storage management architects
  • Data warehouse hardware architects and system administrators
  • Data architects/modelers
  • DBAs

More Best Practices
  • Evidence of project collaboration
  • A cross section of members and expertise from the
    data warehouse IS team
  • Statisticians and data analysts who understand
    the business domain
  • A customer that understands the process(es) being
    measured and can influence change
  • A data steward, usually someone from the front lines, who knows how the data is collected
  • Project: Complex reports or a data mart

More Best Practices
  • Whenever possible, extract as close to the source as possible

The Most Popular Authors
  • I appreciate
  • The interest they stir
  • The vocabulary and semantics of this new specialty that they helped create
  • The downside
  • The buzzwords that are more buzz than substance
  • Corporate Information Factories
  • Endless, meaningless debate
  • "That's not an Operational Data Store!"
  • "Do you follow Kimball or Inmon?"
  • Follow your own common sense
  • Most of these authors have not had to build a
    data warehouse from scratch and live with their
    decisions through a complete lifecycle

ETL Operations
  • Besides the cultural risks and challenges, the
    riskiest part of a data warehouse
  • Good book
  • Westerman, WalMart Data Warehousing
  • The Extract, Transform, and Load processes
  • Worthy of its own Best Practices discussion
  • Suffice to say, mitigate risks in this area
    carefully and deliberately
  • The major design errors don't show up until late in the lifecycle, when the cost of repair is great

Two Essential ETL Functions
  • Initial loads
  • How far back do we go in history?
  • Maintenance loads
  • Differential loads or total refresh?
  • How often?
  • You will run and tune these processes several
    times before you go into production
  • How many records are we dealing with?
  • How long will this take to run?
  • What's the impact on the source system?

Maintenance Loads
  • Total refresh vs. incremental loads
  • Total refresh: Truncate and reload everything from the source system
  • Incremental: Load only the new and updated records
  • For small data sets, a total refresh strategy is the easiest to implement
  • How do you define small? You'll know it when you see it.
  • Sometimes the fastest strategy when you are trying to show quick results
  • Grab and go

Incremental Loads
  • How do we get a snapshot of the data that has
    changed since the last load?
  • Many source systems will have an existing log
    file of some kind
  • Take advantage of these when you can, otherwise
    incremental loads can be complicated

File Transfer Formats
  • Design your extract so that it uses
  • Fixed, predetermined lengths for all records and fields
  • Avoid variable length if at all possible
  • A unique character that separates each field in a record, such as
  • A standard format for header records across all source systems
  • Such as the first three records in each file
  • Include name of source system, file, record count, and number of fields in the record
  • This will be handy for monitoring jobs and collecting load metadata
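A sketch of such a transfer format: fixed-width fields plus a three-record header carrying source system, file name, record count, and field count. The field widths, header labels, and file path are invented for illustration.

```python
import os
import tempfile

FIELDS = [("id", 6), ("region", 10)]   # (field name, fixed width)

def write_transfer_file(path, source, rows):
    with open(path, "w") as f:
        f.write(f"SOURCE {source}\n")                   # header record 1
        f.write(f"FILE   {os.path.basename(path)}\n")   # header record 2
        f.write(f"COUNTS {len(rows)} {len(FIELDS)}\n")  # header record 3
        for row in rows:
            f.write("".join(str(v).ljust(w)
                            for v, (_, w) in zip(row, FIELDS)) + "\n")

def read_transfer_file(path):
    with open(path) as f:
        lines = f.read().splitlines()
    n_records, n_fields = map(int, lines[2].split()[1:])
    assert n_fields == len(FIELDS)     # format sanity check
    records = []
    for line in lines[3:]:
        pos, rec = 0, []
        for _, width in FIELDS:        # fixed widths make parsing trivial
            rec.append(line[pos:pos + width].rstrip())
            pos += width
        records.append(tuple(rec))
    assert len(records) == n_records   # header count doubles as a load check
    return records

path = os.path.join(tempfile.gettempdir(), "extract.dat")
write_transfer_file(path, "CASEMIX", [(1, "North"), (2, "South")])
recs = read_transfer_file(path)
```

The header-count assertion is the hook the next slide exploits: the same three records feed a Load Status table for free.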

Benefits of Standard File Transfer Format
  • Compatible with standard database and operating
    system utilities
  • Dynamically create initial and maintenance load scripts
  • Read the table definitions (DDL) then merge that
    with the standard transfer file format
  • Dynamically generate load monitoring data
  • Read the header row, insert that into a Load Status table with status "Running", # of records, start time
  • At EOF, change status to "Complete" and capture end-of-load time
  • I wish I had thought about this topic more, and earlier in my career

Westerman Makes A Good Point
  • If you design for instantaneous updates from the
    beginning, it translates to less ETL maintenance
    and labor time for the EDW staff, later
  • My experience: ETL is the least tasteful and productive use of a veteran EDW Team member, so I like Westerman's insight on this topic

Messaging Applied to ETL
  • Motive behind the concept
  • Continuous updates in a mixed workload
  • Mixed: Analytical processing at the same time as transaction-oriented, constant updates, deletes, and inserts
  • Basic concepts
  • Use a load message queue for records that need to
    be updated, coming from the source systems
  • When the EDW analytical processing workload is
    low (off-peak), pick the next message off the
    load queue and load the data
  • Run this in parallel so that you can process
    several load messages at the same time while you
    have a window of opportunity
  • Sometimes called throttling
  • Speed up and slow down based upon traffic
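The queue-and-throttle idea above can be sketched with Python's standard `queue` module; the batch sizes standing in for "off-peak" and "peak" workload are arbitrary assumptions:

```python
from queue import Queue

# Load messages from source systems wait on a queue; a drain step
# applies only as many as the current analytical workload allows.
load_queue = Queue()
for message in [("update", 1), ("insert", 2), ("delete", 3), ("insert", 4)]:
    load_queue.put(message)

applied_ops = []  # stand-in for the warehouse tables being maintained

def drain(queue, batch_size):
    """Apply up to batch_size messages, then yield back to query traffic."""
    applied = 0
    while applied < batch_size and not queue.empty():
        applied_ops.append(queue.get())   # stand-in for the real apply step
        applied += 1
    return applied

first = drain(load_queue, batch_size=3)   # off-peak window: speed up
second = drain(load_queue, batch_size=1)  # analytical load rose: throttle down
```

In a real implementation the drain step would run several workers in parallel during the off-peak window, as the slide suggests.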

ETL Message Queue Process
Four Data Maintenance Processes
  • Initial load
  • Loading into an empty table
  • Append load
  • Update process
  • Delete process
  • As much as practical, use your database utilities
    for these processes
  • Study and know your database utilities for data
    warehousing; they are getting better all the time
  • I see some bad strategies in this area, e.g.,
    companies spending time building their own
    utilities from scratch

A Few Planning Thoughts
  • Understand the percentage of records that will be
    updated, deleted, or inserted
  • You'll probably develop a different process for
    90% inserts vs. 90% updates
  • Logging
  • In general, turn logging off during the
    processes, if logging was on at all
  • Field vs. Record level updates
  • Some folks, in the interest of purity, will build
    complex update processes for passing only field
    (attribute) level changes
  • No-brainer: pass the whole record

Initial Load
  • Every table will, at some time, require an
    initial load
  • For some tables, it will be the best choice for
    data maintenance
  • Total data refresh
  • Best for small tables
  • Simple process to implement
  • Simply delete (or truncate) and reload with
    fresh data

A Better Initial Load Process
  • Background load
  • Safer protects against corrupt files
  • Higher availability to customers
  • Three or four steps, maybe six
  • Create a temporary table
  • Load the temporary table
  • Run quality checks
  • Rename the temporary table to the production
    table name
  • Delete the old table
  • Regrant rights, if necessary
  • Westerman: You want to use as many initial load
    processes as possible.
  • I agree!
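The background-load steps above might look like this in SQLite; a real warehouse would also re-grant rights after the swap, and all table names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER, name TEXT);
    INSERT INTO products VALUES (1, 'stale');
""")

def background_refresh(conn, fresh_rows):
    """Load a temp table in the background, quality-check it, then swap
    names; readers see the old table until the instant of the rename."""
    conn.execute("CREATE TABLE products_new (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO products_new VALUES (?, ?)", fresh_rows)
    # Quality check: refuse to swap in an empty (possibly corrupt) load.
    (n,) = conn.execute("SELECT COUNT(*) FROM products_new").fetchone()
    if n == 0:
        conn.execute("DROP TABLE products_new")
        raise RuntimeError("refusing to swap in an empty load")
    conn.execute("ALTER TABLE products RENAME TO products_old")
    conn.execute("ALTER TABLE products_new RENAME TO products")
    conn.execute("DROP TABLE products_old")  # then regrant rights, if needed

background_refresh(conn, [(1, "fresh"), (2, "fresher")])
rows = conn.execute("SELECT * FROM products ORDER BY id").fetchall()
```

The rename step is a single fast DDL operation, which is what gives this approach its safety (corrupt files never reach production) and its high availability (customers lose access only for the instant of the swap).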

Append Load
  • For larger tables that accumulate historical data
  • There are no updates, just appends
  • A hard fact that will not change
  • Example
  • Sales that are closed
  • Lab results

Append Load Options
  • Load a single part of a table
  • Load a partition and attach it to the table
  • Create a new, empty partition
  • Load the new records
  • Attach the partition to the table
  • Look for a LOAD APPEND (or equivalent) command in
    your database's load utility

Another Append Option
  • Create a temp table identical to the one you are
    appending to
  • Load the new records into the empty temp table
  • Insert the temp table's contents into the
    production table
  • Delete the temp table
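A minimal SQLite sketch of this append option (table names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lab_results (id INTEGER, result TEXT);
    INSERT INTO lab_results VALUES (1, 'A');
""")
# Step 1: temp table identical in shape to the production table.
conn.execute("CREATE TEMP TABLE lab_results_stage (id INTEGER, result TEXT)")
# Step 2: load the new records into the empty temp table.
conn.executemany("INSERT INTO lab_results_stage VALUES (?, ?)",
                 [(2, "B"), (3, "C")])
# Step 3: append the staged rows, then drop the temp table.
conn.execute("INSERT INTO lab_results SELECT * FROM lab_results_stage")
conn.execute("DROP TABLE lab_results_stage")
rows = conn.execute("SELECT * FROM lab_results ORDER BY id").fetchall()
```

Staging through the temp table means a corrupt transfer file fails before any rows touch the production table.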

Update Process
  • The most difficult and risky to build
  • Use this process only if the tables are too large
    for a complete refresh (the Initial Load process)
  • Updates affect data that changes over time
  • Like Purchase Orders, hospital transactions, etc.
  • Medical records, if you treat the data
    maintenance at the macroscopic level

Update Process Options
  • Simple process
  • Separate the affected records into an update
    file, insert file, or delete file
  • Do this on the source system, if possible
  • Transfer the files to the data warehouse staging
    area
  • Create and run two processes
  • A delete process for deleting the records in the
    production table that need to be updated or
    deleted
  • An insert process for inserting the entirely new
    updated record into the production table, as
    well as the true inserts
  • Simple, but typically not very fast

Simple Process
  • The Delete Process identifies records for
    deletion from the Production Table based upon the
    contents of the Updates and Deletes files, then
    deletes them from the Production Table.
  • The Insert Process identifies records for insert
    based upon the contents of the Updates and
    Inserts files, then inserts them into the
    Production Table.

Source System
Updated records
Deleted records
New records
EDW Staging Area
EDW Production Table
Delete Process
Insert Process
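The delete-then-insert flow in the diagram can be sketched in SQLite; the three staged files are represented here as Python lists, and all names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod (id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO prod VALUES (1, 'old'), (2, 'keep'), (3, 'gone');
""")
# Stand-ins for the three files staged from the source system.
updates = [(1, "new")]          # changed records, as full rows
inserts = [(4, "brand-new")]    # entirely new records
deletes = [3]                   # keys to remove

# Delete process: remove rows that will be updated or deleted.
conn.executemany("DELETE FROM prod WHERE id = ?",
                 [(u[0],) for u in updates] + [(d,) for d in deletes])
# Insert process: add the full updated rows plus the true inserts.
conn.executemany("INSERT INTO prod VALUES (?, ?)", updates + inserts)
rows = conn.execute("SELECT * FROM prod ORDER BY id").fetchall()
```

Note how the updated record rides through both processes: it is deleted first, then re-inserted whole, which is why the slide recommends passing the entire record rather than field-level changes.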
When You Are Unsure
  • Sometimes, source system log and audit files make
    it difficult to know if a record was updated or
    inserted (i.e. created)
  • Try this
  • Load the records into a temp table that is
    identical to the production table to be updated
  • Delete the corresponding records from the
    production table
  • DELETE FROM prod_table WHERE key_field IN
    (SELECT temp_key_field FROM temp_table)
  • Insert all the records from the temp table into
    the production table
  • Most databases now support this pattern directly
    with an UPSERT or MERGE statement
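The temp-table technique above, end to end, in SQLite (names follow the slide's `prod_table`/`temp_table`; the `val` column is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod_table (key_field INTEGER, val TEXT);
    INSERT INTO prod_table VALUES (1, 'old'), (2, 'keep');
    CREATE TEMP TABLE temp_table (key_field INTEGER, val TEXT);
    INSERT INTO temp_table VALUES (1, 'updated'), (3, 'created');
""")
# Delete any production rows that also appear in the temp table...
conn.execute("""DELETE FROM prod_table WHERE key_field
                IN (SELECT key_field FROM temp_table)""")
# ...then insert everything from the temp table: updates and true
# inserts both land correctly without knowing which was which.
conn.execute("INSERT INTO prod_table SELECT * FROM temp_table")
rows = conn.execute(
    "SELECT * FROM prod_table ORDER BY key_field").fetchall()
```

This is exactly why the technique helps when the source logs are ambiguous: row 1 behaves as an update and row 3 as an insert, with one identical code path for both.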

Massive Deletes
  • Just as with Updates and Inserts, the number of
    Deletes you have to manage is inversely
    proportional to the frequency of your ETL
  • Infrequent ETL Massive data operations
  • Partitions work well for this, again
  • E.g., keeping a 5-year window of data
  • Insert the most recent year as a new partition
  • Drop the oldest year's partition
  • Blazing fast!

Delete partition
Insert partition
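SQLite has no native partitioning, so this sketch simulates the rolling five-year window with per-year tables behind a UNION ALL view; a real warehouse would use the database's own partition DDL, but the shape of the operation (one fast DDL step, no row-by-row deletes) is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table per year stands in for one partition per year.
for year in (2001, 2002, 2003, 2004, 2005):
    conn.execute(f"CREATE TABLE sales_{year} (id INTEGER, amt REAL)")
    conn.execute(f"INSERT INTO sales_{year} VALUES ({year}, 1.0)")

def rotate(conn, new_year, years):
    """Keep a fixed window: add the newest 'partition', drop the oldest."""
    conn.execute(f"CREATE TABLE sales_{new_year} (id INTEGER, amt REAL)")
    years.append(new_year)
    oldest = years.pop(0)
    conn.execute(f"DROP TABLE sales_{oldest}")  # one DDL op, no row scans
    conn.execute("DROP VIEW IF EXISTS sales")
    conn.execute("CREATE VIEW sales AS " +
                 " UNION ALL ".join(f"SELECT * FROM sales_{y}" for y in years))
    return years

years = rotate(conn, 2006, [2001, 2002, 2003, 2004, 2005])
(n,) = conn.execute("SELECT COUNT(*) FROM sales").fetchone()
```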
Raw Data Standards for ETL
  • Makes the process of communicating with your
    source system partners much easier
  • Data types (e.g., the format for date/time
    stamps)
  • File formats (ASCII vs. EBCDIC)
  • Header records
  • Control characters
  • Rule of thumb
  • Never transfer data at the binary level unless
    you are transferring between binary compatible
    computer systems
  • Use only text-displayable characters
  • The tradeoff: a little more storage space and
    transfer time, but much less rework time
  • Storage and CPU time are cheap compared to labor

Last ThoughtIndexing Strategies
  • Define these early, practice them religiously,
    use them extensively
  • This is Database Design 101
  • Don't fall prey to this most common performance
    problem!

In Summary
  • Best Practices
  • They are as much about what you should do, as
    what you should not do
  • Riskiest areas of data warehousing
  • Poor risk management
  • Risk = Probability x Consequence
  • Data quality and ownership politics lead to
    project quagmire
  • Lack of data warehousing skills on the IS team
  • Bad basic old IS operations around the DW
  • Like storage, backup, indexing, and day-to-day
    management
  • Exploding and imploding ETL
  • A corporate culture that doesn't care about data
My Thanks
  • For being invited
  • For your time and attention
  • For the many folks who have worked for and with
    me over the years that made me look better as a
  • Please contact me if you have any questions
  • PH 312-695-8618