Data Warehousing - PowerPoint PPT Presentation

About This Presentation
Title:

Data Warehousing

Description:

Data Warehousing University of California, Berkeley School of Information Management and Systems SIMS 257: Database Management Lecture Outline Lecture Outline A ... – PowerPoint PPT presentation

Number of Views:1697
Avg rating:3.0/5.0
Slides: 70
Provided by: ValuedGate1333
Category:

less

Transcript and Presenter's Notes

Title: Data Warehousing


1
Data Warehousing
  • University of California, Berkeley
  • School of Information Management and Systems
  • SIMS 257 Database Management

2
Lecture Outline
  • Review
  • Application of Object Relational DBMS the
    Berkeley Environmental Digital Library
  • Data Warehouses
  • Introduction to Data Warehouses
  • Data Warehousing
  • (Based on lecture notes from Joachim Hammer,
    University of Florida, and Joe Hellerstein and
    Mike Stonebraker of UCB)

3
Lecture Outline
  • Review
  • Application of Object Relational DBMS the
    Berkeley Environmental Digital Library
  • Data Warehouses
  • Introduction to Data Warehouses
  • Data Warehousing
  • (Based on lecture notes from Joachim Hammer,
    University of Florida, and Joe Hellerstein and
    Mike Stonebraker of UCB)

4
A Digital Library Infrastructure Model
5
UC Berkeley Digital Library Project
  • Focus Work-centered digital information
    services
  • Testbed Digital Library for the California
    Environment
  • Research Technical agenda supporting
    user-oriented access to large distributed
    collections of diverse data types.
  • Part of the NSF/NASA/DARPA Digital Library
    Initiative (Phases 1 and 2)

6
The Environmental Library - Contents
  • As of late 2002, the collection represents over
    one terabyte of data, including over 183,000
    digital images, about 300,000 pages of
    environmental documents, and over 2 million
    records in geographical and botanical databases.

7
Botanical Data
  • The CalFlora Database contains taxonomical and
    distribution information for more than 8000
    native California plants. The Occurrence Database
    includes over 600,000 records of California plant
    sightings from many federal, state, and private
    sources. The botanical databases are linked to
    the CalPhotos collection of California plants,
    and are also linked to external collections of
    data, maps, and photos.

8
Geographical Data
  • Much of the geographical data in the collection
    has been used to develop our web-based GIS
    Viewer. The Street Finder uses 500,000 Tiger
    records of S.F. Bay Area streets along with the
    70,000-records from the USGS GNIS database.
    California Dams is a database of information
    about the 1395 dams under state jurisdiction. An
    additional 11 GB of geographical data represents
    maps and imagery that have been processed for
    inclusion as layers in our GIS Viewer. This
    includes Digital Ortho Quads and DRG maps for the
    S.F. Bay Area.

9
Documents
  • Most of the 300,000 pages of digital documents
    are environmental reports and plans that were
    provided by California state agencies. This
    collection includes documents, maps, articles,
    and reports on the California environment
    including Environmental Impact Reports (EIRs),
    educational pamphlets, water usage bulletins, and
    county plans. Documents in this collection come
    from the California Department of Water Resources
    (DWR), California Department of Fish and Game
    (DFG), San Diego Association of Governments
    (SANDAG), and many other agencies. Among the most
    frequently accessed documents are County General
    Plans for every California county and a survey of
    125 Sacramento Delta fish species.

10
Multivalent Documents
11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
GIS Viewer Example
http//elib.cs.berkeley.edu/annotations/gis/buildi
ngs.html
15
(No Transcript)
16
(No Transcript)
17
(No Transcript)
18
Blobworld use regions for retrieval
  • We want to find general objects? Represent
    images based on coherent regions

19
(No Transcript)
20
(No Transcript)
21
Lecture Outline
  • Review
  • Application of Object Relational DBMS the
    Berkeley Environmental Digital Library
  • Data Warehouses
  • Introduction to Data Warehouses
  • Data Warehousing
  • (Based on lecture notes from Joachim Hammer,
    University of Florida, and Joe Hellerstein and
    Mike Stonebraker of UCB)

22
Overview
  • Data Warehouses and Merging Information Resources
  • What is a Data Warehouse?
  • History of Data Warehousing
  • Types of Data and Their Uses
  • Data Warehouse Architectures
  • Data Warehousing Problems and Issues

23
Problem Heterogeneous Information Sources
Heterogeneities are everywhere
Personal Databases
World Wide Web
Scientific Databases
Digital Libraries
  • Different interfaces
  • Different data representations
  • Duplicate and inconsistent information

Slide credit J. Hammer
24
Problem Data Management in Large Enterprises
  • Vertical fragmentation of informational systems
    (vertical stove pipes)
  • Result of application (user)-driven development
    of operational systems

Sales Planning
Suppliers
Num. Control
Stock Mngmt
Debt Mngmt
Inventory
...
...
...
Sales Administration
Finance
Manufacturing
...
Slide credit J. Hammer
25
Goal Unified Access to Data
Personal Databases
Digital Libraries
Scientific Databases
  • Collects and combines information
  • Provides integrated view, uniform user interface
  • Supports sharing

Slide credit J. Hammer
26
The Traditional Research Approach
  • Query-driven (lazy, on-demand)

Clients
Metadata
Integration System
. . .
Wrapper
Wrapper
Wrapper
. . .
Source
Source
Source
Slide credit J. Hammer
27
Disadvantages of Query-Driven Approach
  • Delay in query processing
  • Slow or unavailable information sources
  • Complex filtering and integration
  • Inefficient and potentially expensive for
    frequent queries
  • Competes with local processing at sources
  • Hasnt caught on in industry

Slide credit J. Hammer
28
The Warehousing Approach
  • Information integrated in advance
  • Stored in WH for direct querying and analysis

Slide credit J. Hammer
29
Advantages of Warehousing Approach
  • High query performance
  • But not necessarily most current information
  • Doesnt interfere with local processing at
    sources
  • Complex queries at warehouse
  • OLTP at information sources
  • Information copied at warehouse
  • Can modify, annotate, summarize, restructure,
    etc.
  • Can store historical information
  • Security, no auditing
  • Has caught on in industry

Slide credit J. Hammer
30
Not Either-Or Decision
  • Query-driven approach still better for
  • Rapidly changing information
  • Rapidly changing information sources
  • Truly vast amounts of data from large numbers of
    sources
  • Clients with unpredictable needs

Slide credit J. Hammer
31
Data Warehouse Evolution
Building the DW Inmon (1992)
Data Replication Tools
Relational Databases
Company DWs
2000
1995
1990
1985
1980
1960
1975
Information- Based Management
Data Revolution
Middle Ages
Prehistoric Times
TIME
PCs and Spreadsheets
End-user Interfaces
1st DW Article
DW Confs.
Vendor DW Frameworks
Slide credit J. Hammer
32
What is a Data Warehouse?
  • A Data Warehouse is a
  • subject-oriented,
  • integrated,
  • time-variant,
  • non-volatile
  • collection of data used in support of management
    decision making processes.
  • -- Inmon Hackathorn, 1994 viz. Hoffer, Chap 11

33
DW Definition
  • Subject-Oriented
  • The data warehouse is organized around the key
    subjects (or high-level entities) of the
    enterprise. Major subjects include
  • Customers
  • Patients
  • Students
  • Products
  • Etc.

34
DW Definition
  • Integrated
  • The data housed in the data warehouse are defined
    using consistent
  • Naming conventions
  • Formats
  • Encoding Structures
  • Related Characteristics

35
DW Definition
  • Time-variant
  • The data in the warehouse contain a time
    dimension so that they may be used as a
    historical record of the business

36
DW Definition
  • Non-volatile
  • Data in the data warehouse are loaded and
    refreshed from operational systems, but cannot be
    updated by end-users

37
What is a Data Warehouse?A Practitioners
Viewpoint
  • A data warehouse is simply a single, complete,
    and consistent store of data obtained from a
    variety of sources and made available to end
    users in a way they can understand and use it in
    a business context.
  • -- Barry Devlin, IBM Consultant

Slide credit J. Hammer
38
A Data Warehouse is...
  • Stored collection of diverse data
  • A solution to data integration problem
  • Single repository of information
  • Subject-oriented
  • Organized by subject, not by application
  • Used for analysis, data mining, etc.
  • Optimized differently from transaction-oriented
    db
  • User interface aimed at executive decision makers
    and analysts

39
Contd
  • Large volume of data (Gb, Tb)
  • Non-volatile
  • Historical
  • Time attributes are important
  • Updates infrequent
  • May be append-only
  • Examples
  • All transactions ever at WalMart
  • Complete client histories at insurance firm
  • Stockbroker financial information and portfolios

Slide credit J. Hammer
40
Warehouse is a Specialized DB
  • Standard DB
  • Mostly updates
  • Many small transactions
  • Mb - Gb of data
  • Current snapshot
  • Index/hash on p.k.
  • Raw data
  • Thousands of users (e.g., clerical users)
  • Warehouse
  • Mostly reads
  • Queries are long and complex
  • Gb - Tb of data
  • History
  • Lots of scans
  • Summarized, reconciled data
  • Hundreds of users (e.g., decision-makers,
    analysts)

Slide credit J. Hammer
41
Summary
Business Information Guide
Business Information Interface
Data Warehouse
Data Warehouse Catalog
Data Warehouse Population
Operational Systems
Enterprise Modeling
Slide credit J. Hammer
42
Warehousing and Industry
  • Warehousing is big business
  • 2 billion in 1995
  • 3.5 billion in early 1997
  • Predicted 8 billion in 1998 Metagroup
  • WalMart has largest warehouse
  • 900-CPU, 2,700 disk, 23 TB Teradata system
  • 7TB in warehouse
  • 40-50GB per day

Slide credit J. Hammer
43
Types of Data
  • Business Data - represents meaning
  • Real-time data (ultimate source of all business
    data)
  • Reconciled data
  • Derived data
  • Metadata - describes meaning
  • Build-time metadata
  • Control metadata
  • Usage metadata
  • Data as a product - intrinsic meaning
  • Produced and stored for its own intrinsic value
  • e.g., the contents of a text-book

Slide credit J. Hammer
44
Data Warehousing Architecture
45
Ingest
46
Data Warehouse Architectures Conceptual View
  • Single-layer
  • Every data element is stored once only
  • Virtual warehouse
  • Two-layer
  • Real-time derived data
  • Most commonly used approach in
  • industry today

Slide credit J. Hammer
47
Three-layer Architecture Conceptual View
  • Transformation of real-time data to derived data
    really requires two steps

View level Particular informational needs
Physical Implementation of the Data Warehouse
Slide credit J. Hammer
48
Issues in Data Warehousing
  • Warehouse Design
  • Extraction
  • Wrappers, monitors (change detectors)
  • Integration
  • Cleansing merging
  • Warehousing specification Maintenance
  • Optimizations
  • Miscellaneous (e.g., evolution)

Slide credit J. Hammer
49
Data Warehousing Two Distinct Issues
  • (1) How to get information into warehouse
  • Data warehousing
  • (2) What to do with data once its in warehouse
  • Warehouse DBMS
  • Both rich research areas
  • Industry has focused on (2)

Slide credit J. Hammer
50
Data Extraction
  • Source types
  • Relational, flat file, WWW, etc.
  • How to get data out?
  • Replication tool
  • Dump file
  • Create report
  • ODBC or third-party wrappers

Slide credit J. Hammer
51
Wrapper
  • Converts data and queries from one data model to
    another

Queries
Data Model A
Data
  • Extends query capabilities for sources with
    limited capabilities

Queries
Slide credit J. Hammer
52
Wrapper Generation
  • Solution 1 Hard code for each source
  • Solution 2 Automatic wrapper generation

Wrapper Generator
Definition
Wrapper
Slide credit J. Hammer
53
Data Transformations
  • Convert data to uniform format
  • Byte ordering, string termination
  • Internal layout
  • Remove, add reorder attributes
  • Add key
  • Add data to get history
  • Sort tuples

Slide credit J. Hammer
54
Monitors
  • Goal Detect changes of interest and propagate to
    integrator
  • How?
  • Triggers
  • Replication server
  • Log sniffer
  • Compare query results
  • Compare snapshots/dumps

Slide credit J. Hammer
55
Data Integration
  • Receive data (changes) from multiple
    wrappers/monitors and integrate into warehouse
  • Rule-based
  • Actions
  • Resolve inconsistencies
  • Eliminate duplicates
  • Integrate into warehouse (may not be empty)
  • Summarize data
  • Fetch more data from sources (wh updates)
  • etc.

Slide credit J. Hammer
56
Data Cleansing
  • Find ( remove) duplicate tuples
  • e.g., Jane Doe vs. Jane Q. Doe
  • Detect inconsistent, wrong data
  • Attribute values that dont match
  • Patch missing, unreadable data
  • Notify sources of errors found

Slide credit J. Hammer
57
Warehouse Maintenance
  • Warehouse data ? materialized view
  • Initial loading
  • View maintenance
  • View maintenance

Slide credit J. Hammer
58
Differs from Conventional View Maintenance...
  • Warehouses may be highly aggregated and
    summarized
  • Warehouse views may be over history of base data
  • Process large batch updates
  • Schema may evolve

Slide credit J. Hammer
59
Differs from Conventional View Maintenance...
  • Base data doesnt participate in view maintenance
  • Simply reports changes
  • Loosely coupled
  • Absence of locking, global transactions
  • May not be queriable

Slide credit J. Hammer
60
Warehouse Maintenance Anomalies
  • Materialized view maintenance in loosely coupled,
    non-transactional environment
  • Simple example

Slide credit J. Hammer
61
Warehouse Maintenance Anomalies
Slide credit J. Hammer
62
Maintenance Anomaly - Solutions
  • Incremental update algorithms (ECA, Strobe, etc.)
  • Research issues Self-maintainable views
  • What views are self-maintainable
  • Store auxiliary views so original auxiliary
    views are self-maintainable

Slide credit J. Hammer
63
Self-Maintainability Examples
  • Sold(item,clerk,age)
  • Sale(item,clerk) Emp(clerk,age)
  • Inserts into Emp
  • If Emp.clerk is key and Sale.clerk is foreign key
    (with ref. int.) then no effect
  • Inserts into Sale
  • Maintain auxiliary view
  • Emp-?clerk,age(Sold)
  • Deletes from Emp
  • Delete from Sold based on clerk

Slide credit J. Hammer
64
Self-Maintainability Examples
  • Deletes from Sale
  • Delete from Sold based on item,clerk
  • Unless age at time of sale is relevant
  • Auxiliary views for self-maintainability
  • Must themselves be self-maintainable
  • One solution all source data
  • But want minimal set

Slide credit J. Hammer
65
Partial Self-Maintainability
  • Avoid (but dont prohibit) going to sources
  • SoldSale(item,clerk) Emp(clerk,age)
  • Inserts into Sale
  • Check if clerk already in Sold, go to source if
    not
  • Or replicate all clerks over age 30
  • Or ...

Slide credit J. Hammer
66
Warehouse Specification (ideally)
View Definitions
Warehouse Configuration Module
Warehouse
Integration rules
Integrator
Metadata
Change Detection Requirements
Extractor/ Monitor
Extractor/ Monitor
Extractor/ Monitor
...
Slide credit J. Hammer
67
Optimization
  • Update filtering at extractor
  • Similar to irrelevant updates in constraint and
    view maintenance
  • Multiple view maintenance
  • If warehouse contains several views
  • Exploit shared sub-views

Slide credit J. Hammer
68
Additional Research Issues
  • Historical views of non-historical data
  • Expiring outdated information
  • Crash recovery
  • Addition and removal of information sources
  • Schema evolution

Slide credit J. Hammer
69
More Information on DW
  • Agosta, Lou, The Essential Guide to Data
    Warehousing. Prentise Hall PTR, 1999.
  • Devlin, Barry, Data Warehouse, from Architecture
    to Implementation. Addison-Wesley, 1997.
  • Inmon, W.H., Building the Data Warehouse. John
    Wiley, 1992.
  • Widom, J., Research Problems in Data
    Warehousing. Proc. of the 4th Intl. CIKM Conf.,
    1995.
  • Chaudhuri, S., Dayal, U., An Overview of Data
    Warehousing and OLAP Technology. ACM SIGMOD
    Record, March 1997.
Write a Comment
User Comments (0)
About PowerShow.com