What is a Data Warehouse And Why Are So Many Schools Setting Them Up - PowerPoint PPT Presentation

1 / 37
About This Presentation
Title:

What is a Data Warehouse And Why Are So Many Schools Setting Them Up

Description:

To create an ODS you. Build a separate/simplified version of an OLTP system ... Answer (raw) Count (1) Oops. Not a star. Snowflaked! ... – PowerPoint PPT presentation

Number of Views:108
Avg rating:3.0/5.0
Slides: 38
Provided by: richardlgo
Category:

less

Transcript and Presenter's Notes

Title: What is a Data Warehouse And Why Are So Many Schools Setting Them Up


1
What is a Data Warehouse?And Why Are So Many
Schools Setting Them Up?
  • Richard Goerwitz

2
What Is a Data Warehouse?
  • Nobody can agree
  • So Im not actually going to define a DW
  • Dont feel cheated, though
  • By the end of this talk, youll
  • Understand key concepts that underlie all
    warehouse implementations (talk the talk)
  • Understand the various components out of which DW
    architects construct real-world data warehouses
  • Understand what a data warehouse project looks
    like

3
Why Are Schools Setting UpData Warehouses?
  • A data warehouse makes it easier to
  • Optimize classroom, computer lab usage
  • Refine admissions ratings systems
  • Forecast future demand for courses, majors
  • Tie private spreadsheet data into central
    repositories
  • Correlate admissions and IR data with outcomes
    such as
  • GPAs
  • Placement rates
  • Happiness, as measured by alumni surveys
  • Notify advisors when extra help may be needed
    based on
  • Admissions data (student vitals SAT, etc.)
  • Special events A-student suddenly gets a C in
    his/her major
  • Slower trends Students GPA falls for gt 2
    semesters/terms
  • (Many other examples could be given!)
  • Better information better decisions
  • Better admission decisions
  • Better retention rates
  • More effective fund raising, etc.

4
Talking The Talk
  • To think and communicate usefully about data
    warehouses youll need to understand a set of
    common terms and concepts
  • OLTP
  • ODS
  • OLAP, ROLAP, MOLAP
  • ETL
  • Star schema
  • Conformed dimension
  • Data mart
  • Cube
  • Metadata
  • Even if youre not an IT person, pay heed
  • Youll have to communicate with IT people
  • More importantly
  • Evidence shows that IT will only build a
    successful warehouse if you are intimately
    involved!

5
OLTP
  • OLTP online transaction processing
  • The process of moving data around to handle
    day-to-day affairs
  • Scheduling classes
  • Registering students
  • Tracking benefits
  • Recording payments, etc.
  • Systems supporting this kind of activity are
    called transactional systems

6
Transactional Systems
  • Transactional systems are optimized primarily for
    the here and now
  • Can support many simultaneous users
  • Can support heavy read/write access
  • Allow for constant change
  • Are big, ugly, and often dont give people the
    data they want
  • As a result a lot of data ends up in shadow
    databases
  • Some ends up locked away in private spreadsheets
  • Transactional systems dont record all previous
    data states
  • Lots of data gets thrown away or archived, e.g.
  • Admissions data
  • Enrollment data
  • Asset tracking data (How many computers did we
    support each year, from 1996 to 2006, and where
    do we expect to be in 2010?)

7
Simple Transactional Database
  • Map of Microsoft Windows Update Service (WUS)
    back-end database
  • Diagrammed using Sybase PowerDesigner
  • Each green box is a database table
  • Arrows are joins or foreign keys
  • This is simple for an OLTP back end

8
More Complex Example
  • Recruitment Plus back-end database
  • Used by many admissions offices
  • Note again
  • Green boxes are tables
  • Lines are foreign key relationships
  • Purple boxes are views
  • Considerable expertise is required to report off
    this database!
  • Imagine what its like for even more complex
    systems
  • Colleague
  • SCT Banner (over 4,000 tables)

9
The Reporting Problem
  • Often we require OLTP data as a snapshot, in a
    spreadsheet or report
  • Reports require querying back-end OLTP support
    databases
  • But OLTP databases are often very complex, and
    typically
  • Contain many, often obscure, tables
  • Utilize cryptic, unintuitive field/column names
  • Dont store all necessary historical data
  • As a result, reporting becomes a problem
  • Requires special expertise
  • May require modifications to production OLTP
    systems
  • Becomes harder and harder for staff to keep up!

10
Workarounds
  • Ways of working around the reporting problem
    include
  • Have OLTP system vendors do the work
  • Provide canned reports
  • Write reporting GUIs for their products
  • Hire more specialists
  • To create simplified views of OLTP data
  • To write reports, create snapshots
  • Periodically copy data from OLTP systems to a
    place where
  • The data is easier to understand
  • The data is optimized for reporting
  • Easily pluggable into reporting tools

11
ODS
  • ODS operational data store
  • ODSs were an early workaround to the reporting
    problem
  • To create an ODS you
  • Build a separate/simplified version of an OLTP
    system
  • Periodically copy data into it from the live OLTP
    system
  • Hook it to operational reporting tools
  • An ODS can be an integration point or real-time
    reporting database for an operational system
  • Its not enough for full enterprise-level,
    cross-database analytical processing

12
OLAP
  • OLAP online analytical processing
  • OLAP is the process of creating and summarizing
    historical, multidimensional data
  • To help users understand the data better
  • Provide a basis for informed decisions
  • Allow users to manipulate and explore data
    themselves, easily and intuitively
  • More than just reporting
  • Reporting is just one (static) product of OLAP

13
OLAP Support Databases
  • OLAP systems require support databases
  • These databases typically
  • Support fewer simultaneous users than OLTP back
    ends
  • Are structured simply i.e., denormalized
  • Can grow large
  • Hold snapshots of data in OLTP systems
  • Provide history/time depth to our analyses
  • Are optimized for read (not write) access
  • Updated via periodic batch (e.g., nightly) ETL
    processes

14
ETL Processes
  • ETL extract, transform, load
  • Extract data from various sources
  • Transform and clean the data from those sources
  • Load the data into databases used for analysis
    and reporting
  • ETL processes are coded in various ways
  • By hand in SQL, UniBASIC, etc.
  • Using more general programming languages
  • In semi-automated fashion using specialized ETL
    tools like Cognos Decision Stream
  • Most institutions do hand ETL but note well
  • Hand ETL is slow
  • Requires specialized knowledge
  • Becomes extremely difficult to maintain as code
    accumulates and databases/personnel change!

15
Where Does the Data Go?
  • What sort of a database do the ETL processes dump
    data into?
  • Typically, into very simple table structures
  • These table structures are
  • Denormalized
  • Minimally branched/hierarchized
  • Structured into star schemas

16
So What Are Star Schemas?
  • Star schemas are collections of data arranged
    into star-like patterns
  • They have fact tables in the middle, which
    contain amounts, measures (like counts, dollar
    amounts, GPAs)
  • Dimension tables around the outside, which
    contain labels and classifications (like names,
    geocodes, majors)
  • For faster processing, aggregate fact tables are
    sometimes also used (e.g., counts pre-averaged
    for an entire term)
  • Star schemas should
  • Have descriptive column/field labels
  • Be easy for users to understand
  • Perform well on queries

17
A Very Simple Star Schema
  • Data Center UPS
  • Power Output
  • Dimensions
  • Phase
  • Time
  • Date
  • Facts
  • Volts
  • Amps
  • Etc.

18
A More Complex Star Schema
  • Freshman survey data (HERI/CIRP)
  • Dimensions
  • Questions
  • Survey years
  • Data about test takers
  • Facts
  • Answer (text)
  • Answer (raw)
  • Count (1)
  • Oops
  • Not a star
  • Snowflaked!

Oops, answers should have been placed in their
own dimension (creating a factless fact table).
Ill demo a better version of this star later!
19
Data Marts
  • One definition
  • One or more star schemas that present data on a
    single or related set of business processes
  • Data marts should not be built in isolation
  • They need to be connected via dimensional tables
    that are
  • The same or subsets of each other
  • Hierarchized the same way internally
  • So, e.g., if I construct data marts for
  • GPA trends, student major trends, enrollments
  • Freshman survey data, senior survey data, etc.
  • I connect these marts via a conformed student
    dimension
  • Makes correlation of data across star schemas
    intuitive
  • Makes it easier for OLAP tools to use the data
  • Allows nonspecialists to do much of the work

20
Simple Data Mart Example
  • UPS
  • Battery star
  • By battery
  • Run-time
  • charged
  • Current
  • Input star
  • By phase
  • Voltage
  • Current
  • Output star
  • By phase
  • Voltage
  • Current
  • Sensor star
  • By sensor
  • Temp
  • Humidity

Note conformed date, time dimensions!
21
CIRP Star/Data Mart
  • CIRP Freshman survey data
  • Corrected from a previous slide
  • Note the CirpAnswer dimension
  • Note student dimension (ties in with other marts)

22
CIRP Mart in Cognos BI 8
23
ROLAP, MOLAP
  • ROLAP OLAP via direct relational query
  • E.g., against a (materialized) view
  • Against star schemas in a warehouse
  • MOLAP OLAP via multidimensional database (MDB)
  • MDB is a special kind of database
  • Treats data kind of like a big, fast spreadsheet
  • MDBs typically draw data in from a data warehouse
  • Built to work best with star schemas

24
Data Cubes
  • The term data cube means different things to
    different people
  • Various definitions
  • A star schema
  • Any DB view used for reporting
  • A three-dimensional array in a MDB
  • Any multidimensional MDB array (really a
    hypercube)
  • Which definition do you suppose is technically
    correct?

25
Metadata
  • Metadata data about data
  • In a data warehousing context it can mean many
    things
  • Information on data in source OLTP systems
  • Information on ETL jobs and what they do to the
    data
  • Information on data in marts/star schemas
  • Documentation in OLAP tools on the data they
    manipulate
  • Many institutions make metadata available via
    data malls or warehouse portals, e.g.
  • University of New Mexico
  • UC Davis
  • Rensselear Polytechnic Institute
  • University of Illinois
  • Good ETL tools automate the setup of
    malls/portals!

26
The Data Warehouse
  • OK now were experts in terms like OLTP, OLAP,
    star schema, metadata, etc.
  • Lets use some of these terms to describe how a
    DW works
  • Provides ample metadata data about the data
  • Utilizes easy-to-understand column/field names
  • Feeds multidimensional databases (MDBs)
  • Is updated via periodic (mainly nightly) ETL jobs
  • Presents data in a simplified, denormalized form
  • Utilizes star-like fact/dimension table schemas
  • Encompasses multiple, smaller data marts
  • Supports OLAP tools (Access/Excel, Safari, Cognos
    BI)
  • Derives data from (multiple) back-end OLTP
    systems
  • Houses historical data, and can grow very big

27
A Data Warehouse is Not
  • Vendor and consultant proclamations aside, a data
    warehouse is not
  • A project
  • With a specific end date
  • A product you buy from a vendor
  • Like an ODS (such as SCTs)
  • A canned warehouse supplied by iStrategy
  • Cognos ReportNet
  • A database schema or instance
  • Like Oracle
  • SQL Server
  • A cut-down version of your live transactional
    database

28
Kimball Casertas Definition
  • According to Ralph Kimball and Joe Caserta, a
    data warehouse is
  • A system that extracts, cleans, conforms, and
    delivers source data into a dimensional data
    store and then supports and implements querying
    and analysis for the purpose of decision making.
  • Another def. The union of all the enterprises
    data marts
  • Aside The Kimball model is not without some
    critics
  • E.g., Bill Inmon

29
Example Data Warehouse (1)
  • This one is RPIs
  • 5 parts
  • Sources
  • ETL stuff
  • DW proper
  • Cubes etc.
  • OLAP apps

30
Example Data Warehouse (2)
  • Caltechs DW
  • Five Parts
  • Source systems
  • ETL processes
  • Data marts
  • FM/metadata
  • Reporting and analysis tools
  • Note Theyre also customers of Cognos!

31
So Where is Colorado College?
  • Phil Goldstein (Educause Center for Applied
    Research fellow) identifies the major deployment
    levels
  • Level 1 Transactional systems only
  • Level 2a ODS or single data mart no ETL
  • Level 2 ODS or single data mart with ETL tools
  • Level 3a Warehouse or multiple marts no ETL
    OLAP
  • Level 3b Warehouse or multiple marts ETL OLAP
  • Level 3 Enterprise-wide warehouse or multiple
    marts ETL tools OLAP tools
  • Goldsteins study was just released in late 2005
  • Its very good based on real survey data
  • Which level is Colorado College at?

32
Implementing a Data Warehouse
  • In many organizations IT people want to huddle
    and work out a warehousing plan, but in fact
  • The purpose of a DW is decision support
  • The primary audience of a DW is therefore College
    decision makers
  • It is College decision makers therefore who must
    determine
  • Scope
  • Priority
  • Resources
  • Decision makers cant make these determinations
    without an understanding of data warehouses
  • It is therefore imperative that key decision
    makers first be educated about data warehouses
  • Once this occurs, it is possible to
  • Elicit requirements (a critical step thats often
    skipped)
  • Determine priorities/scope
  • Formulate a budget
  • Create a plan and timeline, with real milestones
    and deliverables!

33
Is This Really a Good Plan?
  • Sure, according to Phil Goldstein (Educause
    Center for Applied Research)
  • Hes conducted extensive surveys on academic
    analytics ( business intelligence for higher
    ed)
  • His four recommendations for improving analytics
  • Key decisionmakers must lead the way
  • Technologists must collaborate
  • Must collect requirements
  • Must form strong partnerships with functional
    sponsors
  • IT must build the needed infrastructure
  • Carleton violated this rule with Cognos BI
  • As we discovered, without an ETL/warehouse
    infrastructure, success with OLAP is elusive
  • Staff must train and develop deep analysis skills
  • Goldsteins findings mirror closely the advice of
    industry heavyweights Ralph Kimball, Laura
    Reeves, Margie Ross, Warren Thornthwaite, etc.

34
Isnt a DW a Huge Undertaking?
  • Sure, it can be huge
  • Dont hold on too tightly to the big-sounding
    word, warehouse
  • Luminaries like Ralph Kimball have shown that a
    data warehouse can be built incrementally
  • Can start with just a few data marts
  • Targeted consulting help will ensure proper,
    extensible architecture and tool selection

35
What Takes Up the Most Time?
  • You may be surprised to learn what DW step takes
    the most time
  • Try guessing which
  • Hardware
  • Physical database setup
  • Database design
  • ETL
  • OLAP setup

Acc. to Kimball Caserta, ETL will eat up 70 of
the time.Other analysts give estimates ranging
from 50 to 80. The most often underestimated
part of the warehouse project!
36
Eight Month Initial Deployment
37
Conclusion
  • Information is held in transactional systems
  • But transactional systems are complex
  • They dont talk to each other well each is a
    silo
  • They require specially trained people to report
    off of
  • For normal people to explore institutional data,
    data in transactional systems needs to be
  • Renormalized as star schemas
  • Moved to a system optimized for analysis
  • Merged into a unified whole in a data warehouse
  • Note This process must be led by customers
  • Yes, IT people must build the infrastructure
  • But IT people arent the main customers
  • So who are the customers?
  • Admissions officers trying to make good admission
    decisions
  • Student counselors trying to find/help students
    at risk
  • Development offers raising funds that support the
    College
  • Alumni affairs people trying to manage volunteers
  • Faculty deans trying to right-size departments
  • IT people managing software/hardware assets, etc.
Write a Comment
User Comments (0)
About PowerShow.com