Fundamentals of Data Warehousing - PowerPoint PPT Presentation


PPT – Fundamentals of Data Warehousing PowerPoint presentation | free to download - id: 6f59a0-ZTU2M


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation

Fundamentals of Data Warehousing


Fundamentals of Data Warehousing Dr. Akhtar Ali School of Computing, Engineering and Information Sciences – PowerPoint PPT presentation

Number of Views:158
Avg rating:3.0/5.0
Slides: 21
Provided by: UNIVERSI182
Learn more at:


Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Fundamentals of Data Warehousing

Fundamentals of Data Warehousing
  • Dr. Akhtar Ali
  • School of Computing, Engineering and Information

Lecture Outline
  • Inmons Four Characteristics of a DW
  • Subject-Oriented, Integrated, Time Variant and
  • Some Useful Definitions
  • Issues in Data Warehousing
  • DW Design
  • Extraction
  • Integration
  • Creation and Loading
  • DW Maintenance
  • Querying and Optimization

1.1 Subject-Oriented
  • Data is categorized and stored by business
    subject rather than
  • by application.

Equity Plans
Operational Systems
1.2 Integrated
Data on a given subject is defined and stored
Savings Application
No Application Flavor
Current Accounts Application
Loans Application
Subject Customer
Data Warehouse
Operational Environment
1.3 Time Variant
Data is stored as a series of snapshots, each
representing a period of time.
Data Warehouse
1.4 Non-Volatile
Typically data in the data warehouse is not
directly updated or deleted.
2.1 What is an Operational Data Store?
  • An operational data store (ODS) is the point of
    integration for operational/transaction-oriented
  • For example Banks typically have several
    independent systems set up to support different
    financial products e.g. loans, checking accounts,
    savings accounts etc.
  • The advent of ATMs helped push many banks to
    create an ODS to integrate current balances and
    recent transactional data from these separate
    accounts under one customer number.
  • Such ODSs are normally kept separate from a DW.
  • An ODS may be seen as a lowest layer of a DW for
    lower-management to access detailed as well as
    integrated data.
  • This means that an ODS may be seen as a front
    edge of a DW.
  • Such ODSs are normally kept as part of a DW.

2.2 What is a Data Mart?
  • A data mart (DMT) is a logical subset of a
    complete DW.
  • A DMT is a complete pie-wedge of the overall DW
  • A DW is a made up of the union of all its DMTs.
  • Some people take the definition literally. They
    create several independent DMTs to meet the needs
    of several departments.
  • Will everyone be happy? Well, maybe. There may be
    serious issues of integrating these DMTs
  • A DMT is an extension of a DW.
  • Data is integrated as it enters the DW. DMTs then
    derive data from the central source, the DW.
  • Each department gets its own DMT.
  • Each department determines which of the data
    warehouse contents are of interest .
  • These subject areas are then replicated into the
    smaller and local DMT so that users can get to
    the data they want with less interference from
    other departments.

2.3 What is On-Line Analytical Processing (OLAP) ?
  • OLAP is complementary to data warehousing.
  • OLAP embodies general activities of querying and
    presenting text and number data from DWs.
  • OLAP is based on dimensional modelling as opposed
    to entity-relationship (ER) modelling.
  • A dimensional model may contain the same
    information as an ER model but packages data in a
    symmetric form.
  • A dimensional model is geared towards user
    understandability and high performance query
  • ROLAP (Relational OLAP)
  • A set of user interfaces and applications that
    give a relational database (RDB) a dimensional
  • MOLAP (Multi-dimensional OLAP)
  • A set of user interfaces, applications and
    proprietary database technologies that have a
    strong dimensional flavour.

2.4 What is Data Mining (DM) ?
  • DM is often defined as finding hidden
    information in a database.
  • Alternatively, DM is exploratory data analysis,
    data driven discovery, and deductive learning.
  • Data mining software is a class of tools that
    apply artificial intelligence techniques to the
    analysis of data.
  • Given access to data, DM tools dig through the
    data looking for patterns and discovering
    relationships that the user might have never
  • DM tools work against an operational database or
    a DW.
  • Since data in a DW is usually integrated and
    summarized it may be more efficient to use it for
  • But a DM tool may find more useful information
    from an operational database (compared to a DW)
    as a DW usually hosts data to support anticipated
    DSS and may miss out data useful for DM.

3.0 Generic Data WarehouseArchitecture
3.1 Warehouse Design
  • Influenced by both maintenance and querying
  • Many trade-offs
  • Space vs. update time vs. query performance
  • Logical model of data
  • ER vs. Dimensional, Relational vs. OR vs. OO,
  • Identify sources of data
  • Identify warehouse data what to materialize?
  • Which summary tables?
  • Which fact/dimensional tables?
  • Which indices?
  • Choose software and hardware

3.2 Data Extraction
  • Selecting relevant data from data sources (DSs)
    and moving it into DW.
  • DS types
  • Database (e.g. relational), flat file, WWW, XML,
    COBOL, etc
  • How to obtain the data?
  • Using data replication servers/tools
  • Dump file or Export tools
  • Third party Wrappers/Middleware/Agents
  • Other activities data transformation, change
    detection (monitoring), cleansing etc.

3.2.1 Monitors
  • Detecting changes (of interest to a DW) in data
    sources and propagate to DW.
  • How?
  • Triggers
  • Replication servers/tools
  • Log Sniffer
  • Compare query results
  • Compare snapshots/dumps/exported data

3.2.1 Data Cleansing
  • Finds and removes duplicate tuples
  • For example Judie Harris Morris vs. Judie H.
  • Detect inconsistent or wrong data
  • Attribute values that do not match (because of
    wrong data types or violating certain constraints
    e.g. for Gender attribute a value of N meaning
    neither may be rejected).
  • Unreadable or incomplete data
  • Notify DSs of errors found during the cleansing

3.3 Data Integration
  • Receive data (changes) from multiple
    wrappers/monitors/data cleansers and integrate
    into DW.
  • Often Rule-based
  • Actions
  • Resolve inconsistencies
  • Eliminate duplicates
  • Integrate into DW
  • Summarize data
  • Fetch more data from DSs
  • Notify users that DW is now up-to-date

3.4 .1 Warehouse Loading
  • Includes all of the previous processes
  • Similar to loading/populating a database but
    complex due to heterogeneity of data and dealing
    with multiple DSs, possibly remote and external.
  • Building indices
  • Checking integrity constraints, etc.
  • Issue huge volumes of data but small time window
    to complete the process.
  • Computation of additional data
  • Auxiliary data to facilitate DW maintenance and
    support/speed up querying and analysis.

3.4.2 Warehouse Creation
  • A DW can be seen as a collection of materialized
    views (MVs) over DSs.
  • Contains a copy of data (collected from DSs)
    tailored to end-users.
  • Steps
  • Create DW schema (e.g. creating fact and
    dimensional tables, defining MVs)
  • Load warehouse
  • Start monitoring for changes at DSs
  • Update/Maintain DW as needed.

3.5 DW Maintenance
  • DSs on which a DW is based may change over time.
  • Changes at DSs may require changes at a DW.
  • How often to propagate changes to a DW?
  • At night, weekly/fortnightly/monthly,
    immediately, etc.
  • Off-line or on-line
  • Most current vendor products take a warehouse
    off-line during maintenance
  • How to propagate changes to a DW?
  • Completely re-build all affected tables at the DW
    (easy but inefficient)
  • Apply changes to affected tables incrementally
    (efficient but difficult)
  • Read my paper about MOVIE or wait until we
    discuss this topic in detail.

3.6 Querying and Optimization
  • Queries are long-running and complex
  • Multiple/Nested joins and aggregation
  • Usually touch all tuples kind of queries
  • Query language or analysis tools must support
    multi-dimensional operations
  • Pivot, Slice/Dice, Rollup, Percentile, etc
  • Standard SQL does not provide adequate operations
  • Solution pre-compute partial answers and reuse
  • Drawback it may increase DW maintenance
  • Emergence of warehouse management systems (WHMS
    e.g. ADMS, WHIPS)