Title: Data%20Warehousing
1Data Warehousing
- University of California, Berkeley
- School of Information Management and Systems
- SIMS 257 Database Management
2Lecture Outline
- Review
- Extending OR database systems
- Java and JDBC
- Data Warehouses
- Introduction to Data Warehouses
- Data Warehousing
- (Based on lecture notes from Joachim Hammer,
University of Florida, and Joe Hellerstein and
Mike Stonebraker of UCB) -
3Lecture Outline
- Review
- Extending OR database systems
- Java and JDBC
- Data Warehouses
- Introduction to Data Warehouses
- Data Warehousing
- (Based on lecture notes from Joachim Hammer,
University of Florida, and Joe Hellerstein and
Mike Stonebraker of UCB) -
4PostgreSQL Extensibility
- Postgres is extensible because its operation is
catalog-driven - RDBMS store information about databases, tables,
columns, etc., in what are commonly known as
system catalogs. (Some systems call this the data
dictionary). - One key difference between Postgres and standard
RDBMS is that Postgres stores much more
information in its catalogs - not only information about tables and columns,
but also information about its types, functions,
access methods, etc. - These classes can be modified by the user, and
since Postgres bases its internal operation on
these classes, this means that Postgres can be
extended by users - By comparison, conventional database systems can
only be extended by changing hardcoded procedures
within the DBMS or by loading modules
specially-written by the DBMS vendor.
5A more complex function
- To illustrate a simple SQL function, consider the
following, which might be used to debit a bank
account - create function TP1 (int4, float8) returns int4
- as update BANK set balance BANK.balance -
2 - where BANK.acctountno 1
- select balance from bank
- where accountno 1 language
'sql' - A user could execute this function to debit
account 17 by 100.00 as follows - select (x TP1( 17,100.0))
6SQL Functions on Composite Types
- When creating functions with composite types, you
have to include the attributes of that argument.
If EMP is a table containing employee data,
(therefore also the name of the composite type
for each row of the table) a function to double
salary might be - CREATE FUNCTION double_salary(EMP) RETURNS
integer - AS ' SELECT 1.salary 2 AS salary '
LANGUAGE SQL - SELECT name, double_salary(EMP) AS dream FROM EMP
WHERE EMP.cubicle point '(2,1)' - name dream
- -------------
- Sam 2400
- Notice the use of the syntax 1.salary to select
one field of the argument row value. Also notice
how the calling SELECT command uses a table name
to denote the entire current row of that table as
a composite value.
7New Type Definition
- In the external language (usually C) functions
are written for - Type input
- From a text representation to the internal
representation - Type output
- From the internal represenation to a text
representation - Can also define function and operators to
manipulate the new type
8Rules System
- CREATE RULE name AS ON event
- TO object WHERE condition
- DO INSTEAD action NOTHING
- Rules can be triggered by any event (select,
update, delete, etc.)
9Java and JDBC
- Java is probably the high-level language used in
most software development today one of the
earliest enterprise additions to Java was JDBC - JDBC is an API that provides a mid-level access
to DBMS from Java applications - Intended to be an open cross-platform standard
for database access in Java - Similar in intent to Microsofts ODBC
10JDBC
- Provides a standard set of interfaces for any
DBMS with a JDBC driver using SQL to specify
the databases operations.
11JDBC Simple Java Implementation
import java.sql. import oracle.jdbc. public
class JDBCSample public static void
main(java.lang.String args) try //
this is where the driver is loaded
//Class.forName("jdbc.oracle.thin")
DriverManager.registerDriver(new
OracleDriver()) catch (SQLException e)
System.out.println("Unable to load driver
Class") return
12JDBC Simple Java Impl.
try //All DB access is within the
try/catch block... // make a connection to
ORACLE on Dream Connection con
DriverManager.getConnection(
"jdbcoraclethin_at_dream.sims.berkel
ey.edu1521dev", mylogin",
myoraclePW") // Do an SQL statement...
Statement stmt con.createStatement()
ResultSet rs stmt.executeQuery("SELECT NAME
FROM DIVECUST")
13JDBC Simple Java Impl.
// show the Results... while(rs.next())
System.out.println(rs.getString("NAME"))
// Release the database
resources... rs.close()
stmt.close() con.close() catch
(SQLException se) // inform user of
errors... System.out.println("SQL Exception
" se.getMessage()) se.printStackTrace(Syst
em.out)
14Lecture Outline
- Review
- Extending OR database systems
- Java and JDBC
- Data Warehouses
- Introduction to Data Warehouses
- Data Warehousing
- (Based on lecture notes from Joachim Hammer,
University of Florida, and Joe Hellerstein and
Mike Stonebraker of UCB) -
15Overview
- Data Warehouses and Merging Information Resources
- What is a Data Warehouse?
- History of Data Warehousing
- Types of Data and Their Uses
- Data Warehouse Architectures
- Data Warehousing Problems and Issues
16Problem Heterogeneous Information Sources
Heterogeneities are everywhere
Personal Databases
World Wide Web
Scientific Databases
Digital Libraries
- Different interfaces
- Different data representations
- Duplicate and inconsistent information
Slide credit J. Hammer
17Problem Data Management in Large Enterprises
- Vertical fragmentation of informational systems
(vertical stove pipes) - Result of application (user)-driven development
of operational systems
Sales Planning
Suppliers
Num. Control
Stock Mngmt
Debt Mngmt
Inventory
...
...
...
Sales Administration
Finance
Manufacturing
...
Slide credit J. Hammer
18Goal Unified Access to Data
Personal Databases
Digital Libraries
Scientific Databases
- Collects and combines information
- Provides integrated view, uniform user interface
- Supports sharing
Slide credit J. Hammer
19The Traditional Research Approach
- Query-driven (lazy, on-demand)
Clients
Metadata
Integration System
. . .
Wrapper
Wrapper
Wrapper
. . .
Source
Source
Source
Slide credit J. Hammer
20Disadvantages of Query-Driven Approach
- Delay in query processing
- Slow or unavailable information sources
- Complex filtering and integration
- Inefficient and potentially expensive for
frequent queries - Competes with local processing at sources
- Hasnt caught on in industry
Slide credit J. Hammer
21The Warehousing Approach
- Information integrated in advance
- Stored in WH for direct querying and analysis
Slide credit J. Hammer
22Advantages of Warehousing Approach
- High query performance
- But not necessarily most current information
- Doesnt interfere with local processing at
sources - Complex queries at warehouse
- OLTP at information sources
- Information copied at warehouse
- Can modify, annotate, summarize, restructure,
etc. - Can store historical information
- Security, no auditing
- Has caught on in industry
Slide credit J. Hammer
23Not Either-Or Decision
- Query-driven approach still better for
- Rapidly changing information
- Rapidly changing information sources
- Truly vast amounts of data from large numbers of
sources - Clients with unpredictable needs
Slide credit J. Hammer
24Data Warehouse Evolution
Building the DW Inmon (1992)
Data Replication Tools
Relational Databases
Company DWs
2000
1995
1990
1985
1980
1960
1975
Information- Based Management
Data Revolution
Middle Ages
Prehistoric Times
TIME
PCs and Spreadsheets
End-user Interfaces
1st DW Article
DW Confs.
Vendor DW Frameworks
Slide credit J. Hammer
25What is a Data Warehouse?
- A Data Warehouse is a
- subject-oriented,
- integrated,
- time-variant,
- non-volatile
- collection of data used in support of management
decision making processes. - -- Inmon Hackathorn, 1994 viz. Hoffer, Chap 11
26DW Definition
- Subject-Oriented
- The data warehouse is organized around the key
subjects (or high-level entities) of the
enterprise. Major subjects include - Customers
- Patients
- Students
- Products
- Etc.
27DW Definition
- Integrated
- The data housed in the data warehouse are defined
using consistent - Naming conventions
- Formats
- Encoding Structures
- Related Characteristics
28DW Definition
- Time-variant
- The data in the warehouse contain a time
dimension so that they may be used as a
historical record of the business
29DW Definition
- Non-volatile
- Data in the data warehouse are loaded and
refreshed from operational systems, but cannot be
updated by end-users
30What is a Data Warehouse?A Practitioners
Viewpoint
- A data warehouse is simply a single, complete,
and consistent store of data obtained from a
variety of sources and made available to end
users in a way they can understand and use it in
a business context. - -- Barry Devlin, IBM Consultant
Slide credit J. Hammer
31A Data Warehouse is...
- Stored collection of diverse data
- A solution to data integration problem
- Single repository of information
- Subject-oriented
- Organized by subject, not by application
- Used for analysis, data mining, etc.
- Optimized differently from transaction-oriented
db - User interface aimed at executive decision makers
and analysts
32 Contd
- Large volume of data (Gb, Tb)
- Non-volatile
- Historical
- Time attributes are important
- Updates infrequent
- May be append-only
- Examples
- All transactions ever at WalMart
- Complete client histories at insurance firm
- Stockbroker financial information and portfolios
Slide credit J. Hammer
33Warehouse is a Specialized DB
- Standard DB
- Mostly updates
- Many small transactions
- Mb - Gb of data
- Current snapshot
- Index/hash on p.k.
- Raw data
- Thousands of users (e.g., clerical users)
- Warehouse
- Mostly reads
- Queries are long and complex
- Gb - Tb of data
- History
- Lots of scans
- Summarized, reconciled data
- Hundreds of users (e.g., decision-makers,
analysts)
Slide credit J. Hammer
34Summary
Business Information Guide
Business Information Interface
Data Warehouse
Data Warehouse Catalog
Data Warehouse Population
Operational Systems
Enterprise Modeling
Slide credit J. Hammer
35Warehousing and Industry
- Warehousing is big business
- 2 billion in 1995
- 3.5 billion in early 1997
- Predicted 8 billion in 1998 Metagroup
- WalMart has largest warehouse
- 900-CPU, 2,700 disk, 23 TB Teradata system
- 7TB in warehouse
- 40-50GB per day
Slide credit J. Hammer
36Types of Data
- Business Data - represents meaning
- Real-time data (ultimate source of all business
data) - Reconciled data
- Derived data
- Metadata - describes meaning
- Build-time metadata
- Control metadata
- Usage metadata
- Data as a product - intrinsic meaning
- Produced and stored for its own intrinsic value
- e.g., the contents of a text-book
Slide credit J. Hammer
37Data Warehousing Architecture
38Ingest
39Data Warehouse Architectures Conceptual View
- Single-layer
- Every data element is stored once only
- Virtual warehouse
- Two-layer
- Real-time derived data
- Most commonly used approach in
- industry today
Slide credit J. Hammer
40Three-layer Architecture Conceptual View
- Transformation of real-time data to derived data
really requires two steps
View level Particular informational needs
Physical Implementation of the Data Warehouse
Slide credit J. Hammer
41Issues in Data Warehousing
- Warehouse Design
- Extraction
- Wrappers, monitors (change detectors)
- Integration
- Cleansing merging
- Warehousing specification Maintenance
- Optimizations
- Miscellaneous (e.g., evolution)
Slide credit J. Hammer
42Data Warehousing Two Distinct Issues
- (1) How to get information into warehouse
- Data warehousing
- (2) What to do with data once its in warehouse
- Warehouse DBMS
- Both rich research areas
- Industry has focused on (2)
Slide credit J. Hammer
43Data Extraction
- Source types
- Relational, flat file, WWW, etc.
- How to get data out?
- Replication tool
- Dump file
- Create report
- ODBC or third-party wrappers
Slide credit J. Hammer
44Wrapper
- Converts data and queries from one data model to
another
Queries
Data Model A
Data
- Extends query capabilities for sources with
limited capabilities
Queries
Slide credit J. Hammer
45Wrapper Generation
- Solution 1 Hard code for each source
- Solution 2 Automatic wrapper generation
Wrapper Generator
Definition
Wrapper
Slide credit J. Hammer
46Data Transformations
- Convert data to uniform format
- Byte ordering, string termination
- Internal layout
- Remove, add reorder attributes
- Add key
- Add data to get history
- Sort tuples
Slide credit J. Hammer
47Monitors
- Goal Detect changes of interest and propagate to
integrator - How?
- Triggers
- Replication server
- Log sniffer
- Compare query results
- Compare snapshots/dumps
Slide credit J. Hammer
48Data Integration
- Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse - Rule-based
- Actions
- Resolve inconsistencies
- Eliminate duplicates
- Integrate into warehouse (may not be empty)
- Summarize data
- Fetch more data from sources (wh updates)
- etc.
Slide credit J. Hammer
49Data Cleansing
- Find ( remove) duplicate tuples
- e.g., Jane Doe vs. Jane Q. Doe
- Detect inconsistent, wrong data
- Attribute values that dont match
- Patch missing, unreadable data
- Notify sources of errors found
Slide credit J. Hammer
50Warehouse Maintenance
- Warehouse data ? materialized view
- Initial loading
- View maintenance
- View maintenance
Slide credit J. Hammer
51Differs from Conventional View Maintenance...
- Warehouses may be highly aggregated and
summarized - Warehouse views may be over history of base data
- Process large batch updates
- Schema may evolve
Slide credit J. Hammer
52Differs from Conventional View Maintenance...
- Base data doesnt participate in view maintenance
- Simply reports changes
- Loosely coupled
- Absence of locking, global transactions
- May not be queriable
Slide credit J. Hammer
53Warehouse Maintenance Anomalies
- Materialized view maintenance in loosely coupled,
non-transactional environment - Simple example
Slide credit J. Hammer
54Warehouse Maintenance Anomalies
Slide credit J. Hammer
55Maintenance Anomaly - Solutions
- Incremental update algorithms (ECA, Strobe, etc.)
- Research issues Self-maintainable views
- What views are self-maintainable
- Store auxiliary views so original auxiliary
views are self-maintainable
Slide credit J. Hammer
56Self-Maintainability Examples
- Sold(item,clerk,age)
- Sale(item,clerk) Emp(clerk,age)
- Inserts into Emp
- If Emp.clerk is key and Sale.clerk is foreign key
(with ref. int.) then no effect - Inserts into Sale
- Maintain auxiliary view
- Emp-?clerk,age(Sold)
- Deletes from Emp
- Delete from Sold based on clerk
Slide credit J. Hammer
57Self-Maintainability Examples
- Deletes from Sale
- Delete from Sold based on item,clerk
- Unless age at time of sale is relevant
- Auxiliary views for self-maintainability
- Must themselves be self-maintainable
- One solution all source data
- But want minimal set
-
Slide credit J. Hammer
58Partial Self-Maintainability
- Avoid (but dont prohibit) going to sources
- SoldSale(item,clerk) Emp(clerk,age)
- Inserts into Sale
- Check if clerk already in Sold, go to source if
not - Or replicate all clerks over age 30
- Or ...
Slide credit J. Hammer
59Warehouse Specification (ideally)
View Definitions
Warehouse Configuration Module
Warehouse
Integration rules
Integrator
Metadata
Change Detection Requirements
Extractor/ Monitor
Extractor/ Monitor
Extractor/ Monitor
...
Slide credit J. Hammer
60Optimization
- Update filtering at extractor
- Similar to irrelevant updates in constraint and
view maintenance - Multiple view maintenance
- If warehouse contains several views
- Exploit shared sub-views
Slide credit J. Hammer
61Additional Research Issues
- Historical views of non-historical data
- Expiring outdated information
- Crash recovery
- Addition and removal of information sources
- Schema evolution
Slide credit J. Hammer
62More Information on DW
- Agosta, Lou, The Essential Guide to Data
Warehousing. Prentise Hall PTR, 1999. - Devlin, Barry, Data Warehouse, from Architecture
to Implementation. Addison-Wesley, 1997. - Inmon, W.H., Building the Data Warehouse. John
Wiley, 1992. - Widom, J., Research Problems in Data
Warehousing. Proc. of the 4th Intl. CIKM Conf.,
1995. - Chaudhuri, S., Dayal, U., An Overview of Data
Warehousing and OLAP Technology. ACM SIGMOD
Record, March 1997.