Data%20Warehousing - PowerPoint PPT Presentation

About This Presentation
Title:

Data%20Warehousing

Description:

collection of data used in support of management decision making processes. ... DW Definition... Subject-Oriented: ... Stored collection of diverse data. A ... – PowerPoint PPT presentation

Number of Views:69
Avg rating:3.0/5.0
Slides: 63
Provided by: ValuedGate1
Category:

less

Transcript and Presenter's Notes

Title: Data%20Warehousing


1
Data Warehousing
  • University of California, Berkeley
  • School of Information Management and Systems
  • SIMS 257 Database Management

2
Lecture Outline
  • Review
  • Extending OR database systems
  • Java and JDBC
  • Data Warehouses
  • Introduction to Data Warehouses
  • Data Warehousing
  • (Based on lecture notes from Joachim Hammer,
    University of Florida, and Joe Hellerstein and
    Mike Stonebraker of UCB)

3
Lecture Outline
  • Review
  • Extending OR database systems
  • Java and JDBC
  • Data Warehouses
  • Introduction to Data Warehouses
  • Data Warehousing
  • (Based on lecture notes from Joachim Hammer,
    University of Florida, and Joe Hellerstein and
    Mike Stonebraker of UCB)

4
PostgreSQL Extensibility
  • Postgres is extensible because its operation is
    catalog-driven
  • RDBMS store information about databases, tables,
    columns, etc., in what are commonly known as
    system catalogs. (Some systems call this the data
    dictionary).
  • One key difference between Postgres and standard
    RDBMS is that Postgres stores much more
    information in its catalogs
  • not only information about tables and columns,
    but also information about its types, functions,
    access methods, etc.
  • These classes can be modified by the user, and
    since Postgres bases its internal operation on
    these classes, this means that Postgres can be
    extended by users
  • By comparison, conventional database systems can
    only be extended by changing hardcoded procedures
    within the DBMS or by loading modules
    specially-written by the DBMS vendor.

5
A more complex function
  • To illustrate a simple SQL function, consider the
    following, which might be used to debit a bank
    account
  • create function TP1 (int4, float8) returns int4
  • as update BANK set balance BANK.balance -
    2
  • where BANK.acctountno 1
  • select balance from bank
  • where accountno 1 language
    'sql'
  • A user could execute this function to debit
    account 17 by 100.00 as follows
  • select (x TP1( 17,100.0))

6
SQL Functions on Composite Types
  • When creating functions with composite types, you
    have to include the attributes of that argument.
    If EMP is a table containing employee data,
    (therefore also the name of the composite type
    for each row of the table) a function to double
    salary might be
  • CREATE FUNCTION double_salary(EMP) RETURNS
    integer
  • AS ' SELECT 1.salary 2 AS salary '
    LANGUAGE SQL
  • SELECT name, double_salary(EMP) AS dream FROM EMP
    WHERE EMP.cubicle point '(2,1)'
  • name dream
  • -------------
  • Sam 2400
  • Notice the use of the syntax 1.salary to select
    one field of the argument row value. Also notice
    how the calling SELECT command uses a table name
    to denote the entire current row of that table as
    a composite value.

7
New Type Definition
  • In the external language (usually C) functions
    are written for
  • Type input
  • From a text representation to the internal
    representation
  • Type output
  • From the internal represenation to a text
    representation
  • Can also define function and operators to
    manipulate the new type

8
Rules System
  • CREATE RULE name AS ON event
  • TO object WHERE condition
  • DO INSTEAD action NOTHING
  • Rules can be triggered by any event (select,
    update, delete, etc.)

9
Java and JDBC
  • Java is probably the high-level language used in
    most software development today one of the
    earliest enterprise additions to Java was JDBC
  • JDBC is an API that provides a mid-level access
    to DBMS from Java applications
  • Intended to be an open cross-platform standard
    for database access in Java
  • Similar in intent to Microsofts ODBC

10
JDBC
  • Provides a standard set of interfaces for any
    DBMS with a JDBC driver using SQL to specify
    the databases operations.

11
JDBC Simple Java Implementation
import java.sql. import oracle.jdbc. public
class JDBCSample public static void
main(java.lang.String args) try //
this is where the driver is loaded
//Class.forName("jdbc.oracle.thin")
DriverManager.registerDriver(new
OracleDriver()) catch (SQLException e)
System.out.println("Unable to load driver
Class") return
12
JDBC Simple Java Impl.
try //All DB access is within the
try/catch block... // make a connection to
ORACLE on Dream Connection con
DriverManager.getConnection(
"jdbcoraclethin_at_dream.sims.berkel
ey.edu1521dev", mylogin",
myoraclePW") // Do an SQL statement...
Statement stmt con.createStatement()
ResultSet rs stmt.executeQuery("SELECT NAME
FROM DIVECUST")
13
JDBC Simple Java Impl.
// show the Results... while(rs.next())
System.out.println(rs.getString("NAME"))
// Release the database
resources... rs.close()
stmt.close() con.close() catch
(SQLException se) // inform user of
errors... System.out.println("SQL Exception
" se.getMessage()) se.printStackTrace(Syst
em.out)
14
Lecture Outline
  • Review
  • Extending OR database systems
  • Java and JDBC
  • Data Warehouses
  • Introduction to Data Warehouses
  • Data Warehousing
  • (Based on lecture notes from Joachim Hammer,
    University of Florida, and Joe Hellerstein and
    Mike Stonebraker of UCB)

15
Overview
  • Data Warehouses and Merging Information Resources
  • What is a Data Warehouse?
  • History of Data Warehousing
  • Types of Data and Their Uses
  • Data Warehouse Architectures
  • Data Warehousing Problems and Issues

16
Problem Heterogeneous Information Sources
Heterogeneities are everywhere
Personal Databases
World Wide Web
Scientific Databases
Digital Libraries
  • Different interfaces
  • Different data representations
  • Duplicate and inconsistent information

Slide credit J. Hammer
17
Problem Data Management in Large Enterprises
  • Vertical fragmentation of informational systems
    (vertical stove pipes)
  • Result of application (user)-driven development
    of operational systems

Sales Planning
Suppliers
Num. Control
Stock Mngmt
Debt Mngmt
Inventory
...
...
...
Sales Administration
Finance
Manufacturing
...
Slide credit J. Hammer
18
Goal Unified Access to Data
Personal Databases
Digital Libraries
Scientific Databases
  • Collects and combines information
  • Provides integrated view, uniform user interface
  • Supports sharing

Slide credit J. Hammer
19
The Traditional Research Approach
  • Query-driven (lazy, on-demand)

Clients
Metadata
Integration System
. . .
Wrapper
Wrapper
Wrapper
. . .
Source
Source
Source
Slide credit J. Hammer
20
Disadvantages of Query-Driven Approach
  • Delay in query processing
  • Slow or unavailable information sources
  • Complex filtering and integration
  • Inefficient and potentially expensive for
    frequent queries
  • Competes with local processing at sources
  • Hasnt caught on in industry

Slide credit J. Hammer
21
The Warehousing Approach
  • Information integrated in advance
  • Stored in WH for direct querying and analysis

Slide credit J. Hammer
22
Advantages of Warehousing Approach
  • High query performance
  • But not necessarily most current information
  • Doesnt interfere with local processing at
    sources
  • Complex queries at warehouse
  • OLTP at information sources
  • Information copied at warehouse
  • Can modify, annotate, summarize, restructure,
    etc.
  • Can store historical information
  • Security, no auditing
  • Has caught on in industry

Slide credit J. Hammer
23
Not Either-Or Decision
  • Query-driven approach still better for
  • Rapidly changing information
  • Rapidly changing information sources
  • Truly vast amounts of data from large numbers of
    sources
  • Clients with unpredictable needs

Slide credit J. Hammer
24
Data Warehouse Evolution
Building the DW Inmon (1992)
Data Replication Tools
Relational Databases
Company DWs
2000
1995
1990
1985
1980
1960
1975
Information- Based Management
Data Revolution
Middle Ages
Prehistoric Times
TIME
PCs and Spreadsheets
End-user Interfaces
1st DW Article
DW Confs.
Vendor DW Frameworks
Slide credit J. Hammer
25
What is a Data Warehouse?
  • A Data Warehouse is a
  • subject-oriented,
  • integrated,
  • time-variant,
  • non-volatile
  • collection of data used in support of management
    decision making processes.
  • -- Inmon Hackathorn, 1994 viz. Hoffer, Chap 11

26
DW Definition
  • Subject-Oriented
  • The data warehouse is organized around the key
    subjects (or high-level entities) of the
    enterprise. Major subjects include
  • Customers
  • Patients
  • Students
  • Products
  • Etc.

27
DW Definition
  • Integrated
  • The data housed in the data warehouse are defined
    using consistent
  • Naming conventions
  • Formats
  • Encoding Structures
  • Related Characteristics

28
DW Definition
  • Time-variant
  • The data in the warehouse contain a time
    dimension so that they may be used as a
    historical record of the business

29
DW Definition
  • Non-volatile
  • Data in the data warehouse are loaded and
    refreshed from operational systems, but cannot be
    updated by end-users

30
What is a Data Warehouse?A Practitioners
Viewpoint
  • A data warehouse is simply a single, complete,
    and consistent store of data obtained from a
    variety of sources and made available to end
    users in a way they can understand and use it in
    a business context.
  • -- Barry Devlin, IBM Consultant

Slide credit J. Hammer
31
A Data Warehouse is...
  • Stored collection of diverse data
  • A solution to data integration problem
  • Single repository of information
  • Subject-oriented
  • Organized by subject, not by application
  • Used for analysis, data mining, etc.
  • Optimized differently from transaction-oriented
    db
  • User interface aimed at executive decision makers
    and analysts

32
Contd
  • Large volume of data (Gb, Tb)
  • Non-volatile
  • Historical
  • Time attributes are important
  • Updates infrequent
  • May be append-only
  • Examples
  • All transactions ever at WalMart
  • Complete client histories at insurance firm
  • Stockbroker financial information and portfolios

Slide credit J. Hammer
33
Warehouse is a Specialized DB
  • Standard DB
  • Mostly updates
  • Many small transactions
  • Mb - Gb of data
  • Current snapshot
  • Index/hash on p.k.
  • Raw data
  • Thousands of users (e.g., clerical users)
  • Warehouse
  • Mostly reads
  • Queries are long and complex
  • Gb - Tb of data
  • History
  • Lots of scans
  • Summarized, reconciled data
  • Hundreds of users (e.g., decision-makers,
    analysts)

Slide credit J. Hammer
34
Summary
Business Information Guide
Business Information Interface
Data Warehouse
Data Warehouse Catalog
Data Warehouse Population
Operational Systems
Enterprise Modeling
Slide credit J. Hammer
35
Warehousing and Industry
  • Warehousing is big business
  • 2 billion in 1995
  • 3.5 billion in early 1997
  • Predicted 8 billion in 1998 Metagroup
  • WalMart has largest warehouse
  • 900-CPU, 2,700 disk, 23 TB Teradata system
  • 7TB in warehouse
  • 40-50GB per day

Slide credit J. Hammer
36
Types of Data
  • Business Data - represents meaning
  • Real-time data (ultimate source of all business
    data)
  • Reconciled data
  • Derived data
  • Metadata - describes meaning
  • Build-time metadata
  • Control metadata
  • Usage metadata
  • Data as a product - intrinsic meaning
  • Produced and stored for its own intrinsic value
  • e.g., the contents of a text-book

Slide credit J. Hammer
37
Data Warehousing Architecture
38
Ingest
39
Data Warehouse Architectures Conceptual View
  • Single-layer
  • Every data element is stored once only
  • Virtual warehouse
  • Two-layer
  • Real-time derived data
  • Most commonly used approach in
  • industry today

Slide credit J. Hammer
40
Three-layer Architecture Conceptual View
  • Transformation of real-time data to derived data
    really requires two steps

View level Particular informational needs
Physical Implementation of the Data Warehouse
Slide credit J. Hammer
41
Issues in Data Warehousing
  • Warehouse Design
  • Extraction
  • Wrappers, monitors (change detectors)
  • Integration
  • Cleansing merging
  • Warehousing specification Maintenance
  • Optimizations
  • Miscellaneous (e.g., evolution)

Slide credit J. Hammer
42
Data Warehousing Two Distinct Issues
  • (1) How to get information into warehouse
  • Data warehousing
  • (2) What to do with data once its in warehouse
  • Warehouse DBMS
  • Both rich research areas
  • Industry has focused on (2)

Slide credit J. Hammer
43
Data Extraction
  • Source types
  • Relational, flat file, WWW, etc.
  • How to get data out?
  • Replication tool
  • Dump file
  • Create report
  • ODBC or third-party wrappers

Slide credit J. Hammer
44
Wrapper
  • Converts data and queries from one data model to
    another

Queries
Data Model A
Data
  • Extends query capabilities for sources with
    limited capabilities

Queries
Slide credit J. Hammer
45
Wrapper Generation
  • Solution 1 Hard code for each source
  • Solution 2 Automatic wrapper generation

Wrapper Generator
Definition
Wrapper
Slide credit J. Hammer
46
Data Transformations
  • Convert data to uniform format
  • Byte ordering, string termination
  • Internal layout
  • Remove, add reorder attributes
  • Add key
  • Add data to get history
  • Sort tuples

Slide credit J. Hammer
47
Monitors
  • Goal Detect changes of interest and propagate to
    integrator
  • How?
  • Triggers
  • Replication server
  • Log sniffer
  • Compare query results
  • Compare snapshots/dumps

Slide credit J. Hammer
48
Data Integration
  • Receive data (changes) from multiple
    wrappers/monitors and integrate into warehouse
  • Rule-based
  • Actions
  • Resolve inconsistencies
  • Eliminate duplicates
  • Integrate into warehouse (may not be empty)
  • Summarize data
  • Fetch more data from sources (wh updates)
  • etc.

Slide credit J. Hammer
49
Data Cleansing
  • Find ( remove) duplicate tuples
  • e.g., Jane Doe vs. Jane Q. Doe
  • Detect inconsistent, wrong data
  • Attribute values that dont match
  • Patch missing, unreadable data
  • Notify sources of errors found

Slide credit J. Hammer
50
Warehouse Maintenance
  • Warehouse data ? materialized view
  • Initial loading
  • View maintenance
  • View maintenance

Slide credit J. Hammer
51
Differs from Conventional View Maintenance...
  • Warehouses may be highly aggregated and
    summarized
  • Warehouse views may be over history of base data
  • Process large batch updates
  • Schema may evolve

Slide credit J. Hammer
52
Differs from Conventional View Maintenance...
  • Base data doesnt participate in view maintenance
  • Simply reports changes
  • Loosely coupled
  • Absence of locking, global transactions
  • May not be queriable

Slide credit J. Hammer
53
Warehouse Maintenance Anomalies
  • Materialized view maintenance in loosely coupled,
    non-transactional environment
  • Simple example

Slide credit J. Hammer
54
Warehouse Maintenance Anomalies
Slide credit J. Hammer
55
Maintenance Anomaly - Solutions
  • Incremental update algorithms (ECA, Strobe, etc.)
  • Research issues Self-maintainable views
  • What views are self-maintainable
  • Store auxiliary views so original auxiliary
    views are self-maintainable

Slide credit J. Hammer
56
Self-Maintainability Examples
  • Sold(item,clerk,age)
  • Sale(item,clerk) Emp(clerk,age)
  • Inserts into Emp
  • If Emp.clerk is key and Sale.clerk is foreign key
    (with ref. int.) then no effect
  • Inserts into Sale
  • Maintain auxiliary view
  • Emp-?clerk,age(Sold)
  • Deletes from Emp
  • Delete from Sold based on clerk

Slide credit J. Hammer
57
Self-Maintainability Examples
  • Deletes from Sale
  • Delete from Sold based on item,clerk
  • Unless age at time of sale is relevant
  • Auxiliary views for self-maintainability
  • Must themselves be self-maintainable
  • One solution all source data
  • But want minimal set

Slide credit J. Hammer
58
Partial Self-Maintainability
  • Avoid (but dont prohibit) going to sources
  • SoldSale(item,clerk) Emp(clerk,age)
  • Inserts into Sale
  • Check if clerk already in Sold, go to source if
    not
  • Or replicate all clerks over age 30
  • Or ...

Slide credit J. Hammer
59
Warehouse Specification (ideally)
View Definitions
Warehouse Configuration Module
Warehouse
Integration rules
Integrator
Metadata
Change Detection Requirements
Extractor/ Monitor
Extractor/ Monitor
Extractor/ Monitor
...
Slide credit J. Hammer
60
Optimization
  • Update filtering at extractor
  • Similar to irrelevant updates in constraint and
    view maintenance
  • Multiple view maintenance
  • If warehouse contains several views
  • Exploit shared sub-views

Slide credit J. Hammer
61
Additional Research Issues
  • Historical views of non-historical data
  • Expiring outdated information
  • Crash recovery
  • Addition and removal of information sources
  • Schema evolution

Slide credit J. Hammer
62
More Information on DW
  • Agosta, Lou, The Essential Guide to Data
    Warehousing. Prentise Hall PTR, 1999.
  • Devlin, Barry, Data Warehouse, from Architecture
    to Implementation. Addison-Wesley, 1997.
  • Inmon, W.H., Building the Data Warehouse. John
    Wiley, 1992.
  • Widom, J., Research Problems in Data
    Warehousing. Proc. of the 4th Intl. CIKM Conf.,
    1995.
  • Chaudhuri, S., Dayal, U., An Overview of Data
    Warehousing and OLAP Technology. ACM SIGMOD
    Record, March 1997.
Write a Comment
User Comments (0)
About PowerShow.com