Data Warehousing: Intermediate - PowerPoint PPT Presentation


PPT – Data Warehousing: Intermediate PowerPoint presentation | free to view - id: c0cea-ZDc1Z


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation

Data Warehousing: Intermediate


... in Information Technology consulting in the Minneapolis/St. Paul area since 1987 ... St. Salem NH 01458. 187 N. Park St Salem NH 01458. 187 N. Park Ave. ... – PowerPoint PPT presentation

Number of Views:55
Avg rating:3.0/5.0
Slides: 28
Provided by: ericm71


Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Data Warehousing: Intermediate

Data Warehousing Intermediate Advanced
Topics Common Problems, Uncommon Solutions
  • Eric Mellum
  • October 10, 2002

  • Introduction
  • Objectives
  • State of the Industry
  • Common Problems and Solutions, by Life Cycle
  • Recommended Approach
  • Requirements Gathering
  • DW/DM Design
  • ETL Design
  • Data Quality Management
  • Metadata Management
  • Performance Tuning
  • Training and Implementation
  • Ongoing DW Management
  • Q A

  • Eric Mellum
  • Practice Leader Talent Software Services
  • Consultant for over 12 years, 8 in DW and BI
  • Designed over 15 DW solutions, including 8 from
    conception through implementation
  • All life cycle phases, all roles, most often as
    architect and project manager
  • Talent Software Services, Inc.
  • A leader in Information Technology consulting in
    the Minneapolis/St. Paul area since 1987
  • Strong focus on Data Management since inception
  • Have assisted a wide variety of clients with all
    phases of the data warehousing life cycle, and
    all components of the business intelligence

  • Address the issues facing many organizations,
    recognizing the diverse audience
  • Emphasize what you most need to know, rather than
    regurgitating Kimball, Inmon, etc.
  • Help you maximize ROI, and minimize TCO
  • Clarify trade-offs and practical choices rather
    than preaching what youre supposed to do

State of the Industry
  • Most organizations have some data warehouse or
    data marts in place
  • Many fail to meet expectations
  • Even successful solutions have a number of
    ongoing challenges
  • Most discussion and press centers around tools
    and technologies the goals of the DW often seem
    to get lost

Data Warehousing Rationale
  • Companies want high-quality analytical
    information available on a timely basis, without
    having to spend a lot of time compiling data,
    reconciling or double-checking numbers, etc.
  • They want answers to questions about things like
  • Sales Performance
  • Market Penetration
  • Product Performance
  • Supplier Performance
  • Customer and Product Profitability
  • Promotion Effectiveness
  • But unfortunately

Data Warehousing Rationale
  • Data is dirty, un-integrated, and difficult to
  • Business rules are applied inconsistently,
    creating confusion about which are the right
  • Users spend too much time on data consolidation
    rather than data analysis and decision making
  • Business rules are embedded in code, invisible to
    end users
  • Ad hoc analysis is dependent on programmers
    delivery of new reports is slow and expensive
  • Multiple extracts exist to support similar
    reporting needs, without appropriate change
  • Maintainability small changes require changes
    to many extracts
  • Slow query performance

Data Warehousing Trends
  • A more business-driven approach (driving from
    business requirements and goals rather than
  • CRM revolution is driving DW changes
  • Design discipline and the incorporation of more
    best practices
  • Improved data quality
  • Consistency in business rules, useful
    business-oriented metadata, etc.
  • Timely addition of new data (more frequent loads
    of more current information)
  • Ease of use
  • Focus on requirements, perceived ROI, expanding
    data, integrating external data
  • Improved data administration processes
  • Performance (including both query performance and
    data load performance)
  • Integration of proprietary data with external
    data for improved market analysis information
  • Application of newer technologies (e.g.,
    web-based reporting solutions, alerts and report
    bursting, publication to wireless devices, data
    mining tools, extranets, portals, etc.) for
    increased effectiveness

Data Warehousing Architecture
Data and Metadata Repository Layer
Presentation Layer
ETL Layer
Source Systems
  • ExecutionSystems
  • CRM
  • ERP
  • Legacy
  • e-Commerce
  • Extract, Transformation, and Load (ETL) Layer
  • Cleanse Data
  • Filter Records
  • Standardize Values
  • Decode Values
  • Apply Business Rules
  • Householding
  • Dedupe Records
  • Merge Records

Enterprise Data Warehouse
Reporting Tools OLAP Tools Ad Hoc Query
Tools Data Mining Tools
Data Mart
Data Mart
  • External
  • Data
  • Purchased Market Data
  • Spreadsheets

Metadata Repository
Data Mart
Sample Technologies
  • Oracle
  • SQL Server
  • Teradata
  • DB2
  • Custom Tools
  • HTML Reports
  • Cognos
  • Business Objects
  • MicroStrategy
  • Oracle Discoverer
  • Brio
  • Data Mining Tools
  • Portals
  • ETL Tools
  • Informatica PowerMart
  • ETI
  • Oracle Warehouse Builder
  • Custom programs
  • SQL scripts
  • PeopleSoft
  • SAP
  • Siebel
  • Oracle Applications
  • Manugistics
  • Custom Systems

Common Problems and Solutions, by Life Cycle Phase
  • Recommended Approach
  • Requirements Gathering
  • DW/DM Design
  • ETL Design
  • Data Quality Management
  • Metadata Management
  • Performance Tuning
  • Training and Implementation
  • Ongoing DW Management

Recommended Approach
  • Drive development based on business goals and
    requirements rather than choosing a
    technology-driven approach
  • Establish strong business executive sponsorship
  • Involve the user community early and often
  • Plan for iterative development employing Rapid
    Application Development (RAD) avoid the big
    bang approach
  • Focus on a key business activity to deliver a
    timely solution quickly
  • Manage the scope (limit the subject areas, data
    elements, etc.)
  • Engage an experienced data warehouse modeler
  • Target quality rather than quantity thoroughly
    validate the data, business rule definitions, and
  • Design for scalability, and allow time for
    performance tuning
  • Include key reports as project deliverables
  • Target Business Education versus Technical

Requirements Gathering
  • Ask not just what do you need, but why
  • Expect a paradigm shift for end users they
    wont fully know what they want
  • Manage scope carefully, but expect scope creep
  • Be consultative in the process
  • Dont believe them when they tell you they all
    know and agree on what ______ means
  • Look beyond report re-creation
  • Consider the delivery of some reporting with the
    release to be a requirement
  • Begin data analysis while gathering requirements

DW/DM Design Problems
  • Database Designs tend to be too complex
  • Poor query performance
  • Subject areas not integrated or easily combined
  • Disappointing ROI

DW/DM Design Principles
  • Drive design from business requirements, not
    technical issues
  • Strive for simplicity in design
  • Use only the data of high quality and analytical
    value dont include it all just because
  • Develop star schemas based on a specific business
  • Utilize Conformed Dimensions where possible
  • Snowflaking just say no
  • Avoid letting source system structure dictate the
  • Table and Column Names use naming standards,
    balance familiar names with what makes sense
    make deviations clear
  • Plan for change, and design for flexibility
    acquisitions, mergers, and new source systems
  • 5-15 dimensions per star schema as a rule of
  • Use degenerate dimensions sparingly

DW/DM Design Decisions
  • Fact table granularities transaction, snapshot,
    or summary
  • Multiple Product or Customer Hierarchies
  • Physical Attributes vs. Arbitrary or Internal
  • Hot Swappable Dimensions
  • Beyond the Geography Dimension - Customers and
  • Mini-dimensions as an alternative to snowflaking
  • Slowly Changing Dimensions Types 1, 2, 3,
    hybrid, trade-offs
  • Mystery Dimensions
  • A solution to having several degenerate
  • Populated with Cartesian vs. encountered values
    vs. possible values
  • Time Dimension
  • Bridge Tables for many-to-many relationships
  • Fact table Primary Key
  • Aggregated measures as dimensional attributes

ETL Problems
  • Technical challenges moving, integrating, and
    transforming data from disparate environments
  • Short load windows, long load times
  • Inconsistent, difficult to maintain business
  • Lack of exposure of business rules to end users
  • Source systems missing certain critical data

ETL Design Decisions Tips
  • Use an ETL tool or not pros and cons
  • How much T to do this is where the value is,
    success vs. failure (devil is in the details)
  • Consolidate business rules in a single layer of
    the ETL, for consistency and ease of maintenance
  • Staging area or not, inclusion of audit
    attributes (raw values, etc.)
  • Surrogate Key assignment
  • Enforcing RI if where, how
  • Dealing with Missing or Unknown values
  • Incremental load strategies, dealing with flood
    from 1-time fixes
  • Error recovery and notification, what if load
    fails or data is bad?
  • To reject records or not only if you also
  • Determine that impact is immaterial
  • Alert end users (at the point of access!) and
    source system owners
  • Get the data cleaned up
  • Sequencing mappings for process efficiency

Data Quality Management
  • Importance, definition of clean
  • Not just that each field has valid values, but
    that together, the fields and records make
    business sense and support meaningful analysis
  • Is 99 good enough? It depends
  • Challenges with history, conversions, external
  • Garbage in, garbage out
  • Business rules change over time, as do system
  • Its always dirtier than you think

The Anomalies Nightmare
Digital Equipment
187 N. PARK St. Salem NH 01458
187 N. Pk. St. Salem NH 01458
187 N. Park St Salem NH 01458

Digital Corp
187 N. Park Ave. Salem NH 01458
Digital Consulting
15 Main Street Andover MA 02341
Digital Info Service
PO Box 9 Boston MA 02210
Mail List
Digital Integration
Park Blvd. Boston MA 04106
Noise in Blank Fields
No Unique Key
No Standardization
How does one correctly identify and consolidate
anomalies from millions of records?
Data Quality Management
  • Role of source system capture whats needed,
    cleanup errors
  • Data Validation
  • Allow time for discovery and cleanup 1 month
    per fact table
  • Early on, document relationships, valid values,
    encountered values, outdated values, etc.
  • Reconcile key measures to trusted sources and
  • Document and publish known differences
  • Automated Data Validation
  • Build sanity checks into ETL
  • Audit files, COBOL FD diff, etc.
  • Data Cleansing
  • Manual, custom routines, ISV data cleansing
  • Sample types of cleansing de-duping, filtering,
    merging records, decoding values, householding,
    claim form coding, and standardizing values,
    addresses, and business rules
  • Promoting data trust, role of metadata
  • Establish a process for early communication of
    source system changes

Metadata Problems
  • The term and concept is often used but not often
  • Metadata is often undervalued
  • Metadata is often implemented last, with low
  • Lack of maturity in tool market

Metadata Management
  • An essential piece of any DW solution
  • Improves decision making and productivity
  • Types of Metadata
  • User-oriented should be online, searchable,
    integrated with Front-End app
  • Table and column definitions, cautionary notes,
    calculation formulas, business rules, data
    freshness, etc.
  • Technical ideally shared between tools
  • Data source and mapping information, load status
    information, rejected record info, transformation
    rules, source system info, ETL window info,
    contact info, etc.
  • Keep metadata model simple

Performance Tuning
  • Establish reasonable performance requirements,
    and stop when theyre met
  • Follow a process, and dont corrupt architectural
  • Do proof of concept testing early on
  • Remove dormant columns
  • Tune the SQL when possible
  • Examine use of indexes, partitioning, PQO
  • Analyze database design
  • Build aggregates useful for multiple reports

Training and Implementation
  • Stress data orientation in addition to tool
  • Use real data and metadata
  • Prepare first impressions are extremely
  • Establish trust in data in key leaders before

Ongoing DW Management
  • Efficient database design is the first step in
    managing TCO
  • Expect large volume of enhancement requests
    keep data and metadata content fresh
  • Maintain architectural and design principles as
    you grow, involve architect in enhancement
    prioritization decisions
  • Consider utilizing specialized DW management
  • Monitor usage, and be able to answer
  • Who uses the DW, when, what data do they access,
    using what tools, what do they do with it?
  • Which are the longest running queries? Why do
    they run long?
  • Monitor dormant columns and be willing to drop
  • Monitor the use of indexes and partitioning
  • Measure Return On Investment track successes
  • Build good communication channels with source
    system owners
  • Be religious about data quality keep the trust

Thank You!
  • For additional information, please contact
  • Mike Egeland
  • 952.417.3646