Chapters 24, 26, 27 Database System Architectures

About This Presentation

Title:

Chapters 24, 26, 27 Database System Architectures

Description:

Chapters 24, 26, 27 Database System Architectures Prof. Steven A. Demurjian, Sr. Computer Science & Engineering Department The University of Connecticut – PowerPoint PPT presentation

Number of Views:530

Avg rating:3.0/5.0

Slides: 103

Provided by: Stev114

Category:

more less

Transcript and Presenter's Notes

Title: Chapters 24, 26, 27 Database System Architectures

1
Chapters 24, 26, 27Database System Architectures
Prof. Steven A. Demurjian, Sr. Computer Science
Engineering Department The University of
Connecticut 191 Auditorium Road, Box
U-155 Storrs, CT 06269-3155
steve_at_engr.uconn.edu http//www.engr.uconn.edu/st
eve (860) 486 - 4818

A portion of these slides are being used with the
permission of Dr. Ling Lui, Associate Professor,
College of Computing, Georgia Tech.
Remaining slides represent new material.

2
Classical and Distributed Architectures

Classic/Centralized DBMS Dominated the Commercial
Market from 1970s Forward
Problems of this Approach
Difficult to Scale w.r.t. Performance Gains
If DB Overloaded, replace with a Faster Computer
this can Only Go So Far - Disk Bottlenecks
Distributed DBMS have Evolved to Address a Number
of Issues
Improved Performance
Putting Data Near Location where it is Needed
Replication of Data for Fault Tolerance
Vertical and Horizontal Partitioning of DB Tuples

3
Common Features of Centralized DBMS

Data Independence
High-Level Representation via Conceptual and
External Schemas
Physical Representation (Internal Schema) Hidden
Program Independence
Multiple Applications can Share Data
Views/External Schema Support this Capability
Reduction of Program/Data Redundancy
Single, Unique, Conceptual Schema
Shared Database
Almost No Data Redundancy
Controlled Data Access Reduces Inconsistencies
Programs Execute with Consistent Results

4
Common Features of Centralized DBMS

Promote Sharing Automatically Provided via CC
No Longer Programmatic Issue
Most DBMS Offer Locking for Key Shared Data
Oracle Allows Locks on Data Item (Attributes)
For Example, Controlling Access to Shared
Identifier
Coherent and Central DB Administration
Semantic DB Integrity via the Automatic
Enforcement of Data Consistency via Integrity
Constraints/Rules
Data Resiliency
Physical Integrity of Data in the Presence of
Faults and Errors
Supported by DB Recovery
Data Security Control Access for Authorized
Users Against Sensitive Data

5
Shared Nothing Architecture

In this Architecture, Each DBMS Operates
Autonomously
there is No Sharing
Three Separate DBMSs on Three Different Computers
Applications/Clients Must Know About the External
Schemas of all Three DBMSs for
Database Retrieval
Client Processing
Complicates Client
Different DBMS Platforms(Oracle, Sybase,
Informix, ..)
Different Access Modes(Query, Embedded, ODBC)
Difficult for SWE to Code

6
NW Architecture with Centralized DB

High-Speed NWs/WANs Spawned Centralized DB
Accessible Worldwide
Clients at Any Site can Access Repository
Data May be Far Away - Increased Access Time
In Practice, Each Remote Site Needs only Portion
of the Data in DB1 and/or DB2
Inefficient, no Replication w.r.t. Failure

7
Fully Distributed Architecture

The Five Sites (Chicago, SF, LA, NY, Atlanta)
each have a Portion of the Database - its
Distributed
Replication is Possible for Fault Tolerance
Queries at one Site May Need to Access Data at
Another Site (e.g., for a Join)
Increased Transaction Processing Complexity

8
Goals of DDBMS

Support User Distribution Across Multiple Sites
Remote Access by Users Regardless of Location
Distribution and Replication of Database Content
Provide Location Transparency
Users Manipulate their Own Data
Non-Local Sites Appear Local to Any User
Provide Transaction Control Akin to Centralized
Case
Transaction Control Hides Distribution
CC and Serializability - Must be Extended
Minimize Communications Cost
Optimize Use of Network - a Critical Issue
Distribute DB Design Supported by Partitioning
(Fragmentation) and Replication

9
Goals of DDBMS

Improve Response Time for DB Access
Use a More Sophisticated Load Control for
Transaction Processing
However, Synchronization Across Sites May
Introduce Additional Overhead
System Availability
Site Independence in the Presence of Site Failure
Subset of Database is Always Available
Replication can Keep All Data Available, Even
When Multiple Sites Fail
Modularity
Incremental Growth with the Addition of Sites
Dedicate Sites to Specific Tasks

10
Advantages of DDBMS

There are Four Major Advantages
Transparency
Distribution/NW Transparency
User Doesnt Know about NW Configuration
(Location Transparency)
User can Find Object at any Site (Naming
Transparency)
Replication Transparency (see next PPT)
User Doesnt Know Location of Data
Replicas are Transparently Accessible
Fragmentation Transparency
Horizontal Fragmentation (Distribute by Row)
Vertical Fragmentation (Distribute by Column)

11
Data Distribution and Replication
12
Advantages of DDBMS

Increased Reliability and Availability
Reliability - System Always Running
Availability - Data Always Present
Achieved via Replication and Distribution
Improved Performance
Sites Able to Utilize Data that is Local for
Majority of Queries
Easier Expansion
Improve Performance of Site by
Upgrading Processor of Computer
Adding Additional Disks
Splitting a Site into Two or More Sites
Expansion over Time as Business Grows

13
Challenges of DDBMS

Tracking Data - Meta Data More Complex
Must Track Distribution (where is the Data)
V H Fragmentation (How is Data Split)
Replication (Multiple Copies for Consistency)
Distributed Query Processing
Optimization, Accessibility, etc., More Complex
Block Analysis of Data Size Must also Now
Consider the NW Transmitting Time
Distributed Transaction Processing
TP Potentially Spans Multiple Sites
Submit Query to Multiple Sites
Collect and Collate Results

14
Challenges of DDBMS

Replicated Data Management
TP Must Choose the Replica to Access
Updates Must Modify All Replica Copies
Distributed Database Recovery
Recovery of Individual Sites
Recovery Across DDBMS
Security
Local and Remote Authorization
During TP, be Able to Verify Remote Privileges
Distributed Directory Management
Meta-Data on Database - Local and Remote
Must maintain Replicas of this - Every Site
Tracks the Meta-Data for All Sites

15
A Complete Schema with Keys ...
Keys Allow us to Establish Links Between
Relations
what is this Similar to in ER?
16
and Corresponding DB Tables
which Represent Tuples/Instances of Each Relation
A S C null W B null null
1 4 5 5
17
with Remaining DB Tables
18
What is Fragmentation?

Fragmentation Divides a DB Across Multiple Sites
Two Types of Fragmentation
Horizontal Fragmentation
Given a Relation R with n Total Tuples, Spread
Entire Tuples Across Multiple Sites
Each Site has a Subset of the n Tuples
Essentially Fragmentation is a Selection
Vertical Fragmentation
Given a Relation R with m Attributes and n Total
Tuples, Spread the Columns Across Multiple Sites
Essentially Fragmentation is a Projection
In Both Cases, Sites can Overlap for Replication

19
Horizontal Fragmentation

Site 2 Tracks All Information Related to Dept. 5

20
Horizontal Fragmentation

Site 3 Tracks All Information Related to Dept. 4
Note that an Employee Could be Listed in Both
Cases, if s/he Works on a Project for Both
Departments

21
Refined Horizontal Fragmentation

Further Fragment from Site 2 based on Dept. that
Employee Works in
Notice that G1 G2 G3 is the Same as WORKS_ON5
there is no Overlap

22
Refined Horizontal Fragmentation

Further Fragment from Site 3 based on Dept. that
Employee Works in
Notice that G4 G5 G6 is the Same as WORKS_ON4
Note Some Fragments can be Empty

23
Homogeneous DDBMS

Homogeneous
Identical Software (w.r.t. Database)
One DB Product (e.g., Oracle) is Distributed and
Available at All Sites
Uniformity w.r.t. Administration, Maintenance,
Client Access, Users, Security, etc.
Interaction by Programmatic Clients is Consistent
(e.g., JDBC or ODBC or )

24
Non-Federated Heterogeneous DDBMS

Non-Federated Heterogeneous
Different Software (w.r.t. Database)
Multiple DB Products (e.g., Oracle at One Site,
Access another, Sybase, Informix, etc.)
Replicated Administration (e.g., Users Needs
Accounts on Multiple Systems)
Varied Programmatic Access - SWEs Must Know All
Platforms/Client Software Complicated

25
Federated DDBMS

Federated
Multiple DBMS Platforms Overlaid with a Global
Schema View
Single External Schema Combines Schemas from all
Sites
Multiple Data Models
Relational in one Component DBS
Object in another Component DBS
Hierarchical in a 3rd DBS

26
Federated DBMS Issues

Differences in Data Models
Reconcile Relation vs. Object-Oriented Models
Each Different Model has Different Capabilities
These Differences Must be Addressed in Order to
Present a Federated Schema
Differences in Constraints
Referential Integrity Constraints in Different
DBSs
Different Constraints on Similar Data
Federated Schema Must Deal with these Conflicts
Differences in Query Languages
SQL-89, SQL-92, SQL2, SQL3
Specific Types in Different DBMS (Blobs in Oracle)

27
Databases on WWW

Web has changed the way we do Business Research
Facts
Industry Saw an Opportunity, knew it had to Move
Quickly to Capitalize
Lots of Action, Lots of Money, Lots of Releases
Line Between RD is Very Narrow
Many Researchers Moved to Industry (Trying to
Return Back to Academia)
Emergence of Java
Java changed the way that Software was Designed,
Developed, and Utilized
Particularly w.r.t. Web-Based Applications,
Database Interoperability, Web Architectures,
etc.
Emergence of Enterprise Computing

28
Internet and the Web

A Major Opportunity for Business
A Global Marketplace
Business Across State and Country Boundaries
A Way of Extending Services
Online Payment vs. VISA, Mastercard
A Medium for Creation of New Services
Publishers, Travel Agents, Teller, Virtual Yellow
Pages, Online Auctions
A Boon for Academia
Research Interactions and Collaborations
Free Software for Classroom/Research Usage
Opportunities for Exploration of Technologies in
Student Projects

29
Market Forecast New Internet Servers
30
WWW Three Market Segments
Server

Business to Business
Information sharing
Ordering info./status
Targeted electronic commerce

Corporate Network
Server
Internet

Intranet
Decision support
Mfg.. System monitoring
corporate repositories
Workgroups

Corporate Network
Server
Server

Internet
Sales
Marketing
Information
Services

31
Information Delivery Problems on the Net

Everyone can Publish Information on the Web
Independently at Any Time
Consequently, there is an Information Explosion
Identifying Information Content More Difficult
There are too Many Search Engines but too Few
Capable of Returning High Quality Data
Most Search Engines are Useful for Ad-hoc
Searches but Awkward for Tracking Changes

32
Example Web Applications

Scenario 1 World Wide Wait
A Major Event is Underway and the Latest,
Up-to-the Minute Results are Being Posted on the
Web
You Want to Monitor the Results for this
Important Event, so you Fire up your Trusty Web
Browser, Pointing at the Result Posting Site, and
Wait, and Wait, and Wait
What is the Problem?
The Scalability Problems are the Result of a
Mismatch Between the Data Access Characteristics
of the Application and the Technology Used to
Implement the Application

33
Example Web Applications

Scenario 2
Many Applications Today have the Need for
Tracking Changes in Local and Remote Data Sources
and Notifying Changes If Some Condition Over the
Data Source(s) is Met
If You Want to Monitor the Changes on Web, You
Need to Fire Your Trusty Web Browser from Time to
Time, and Cache the Most Recent Result, and do
the Difference Manually Each Time You Poll the
Data Source(s)
What is the Problem?
Pure Pull is Not the Answer to All Problems

34
What is the Problem?

Applications are Asymmetric but the Web is Not
Computation Centric vs. Information Flow Centric
Type of Asymmetry
Network Asymmetry
Satellite, CATV, Mobile Clients, Etc.
Client to Server Ratio
Too Many Clients can Swamp Servers
Data Volume
Mouse and Key Click vs. Content Delivery
Update and Information Creation
Clients Need to be Informed or Must Poll

35
Useful Solutions

Combination/Interleave of Pull and Push Protocols
User-initiated, Comprehensive Search-based
Information Delivery (Pull)
Server-initiated Information Dissemination (Push)
Provide Support for a Variety of Data Delivery
Protocols, Frequencies, and Delivery Modes
Information Delivery Frequencies
Periodic, Conditional, Ad-Hoc
Information Delivery Modes
Information Delivery Protocols (IDP)
Request/Respond, Polling, Publish/Subscribe,
Broadcast
Information Delivery Styles (IDS)
Pull, Push, Hybrid

36
Information Delivery Frequencies

Periodic
Data is Delivered from a Server to Clients
Periodically
Period can be Defined by System-default or by
Clients Using their Profiles
Period can be Influenced by Client and Bandwidth
PDA/Cell Phone vs. PC w/Modem
PC w/DSL vs. PC w/Cable Modem
Conditional (Aperiodic)
Data is Delivered from a Server when Conditions
Installed by Clients in their Profiles are
Satisfied
Ad-hoc (or Irregular)

37
Information Delivery Modes

Uni-cast
Data is Sent from a Data Source (a Single Server)
to Another Machine
1-to-n
Data is Sent by a Single Data Source and Received
by Multiple Machines
Multicast vs. Broadcast
Multicast Data is Sent to a Specific Set of
Clients
Broadcast Sending Data Over a Medium which an
Unidentified or Unbounded Set of Clients can
Listen

38
IDP Request/Respond

Semantics of Request/Respond
Clients Send their Request to Servers to Ask the
Information of their Interest
Servers Respond to the Client Request by
Delivering the Information Requested
Client can Wait (Synchronous) or Not
Applications
Most Database Systems and Web Search Engines are
Using the Request/Respond Protocol for
Client-Server Communication

39
IDP Programmed Polling vs. User Polling

Semantics
Programmed Polling a System Periodically Sends
Requests to Other Sites to Obtain Status
Information or Detect Changed Values
User Polling a User or Application Periodically
or Aperiodically Polls the Data Sites and Obtains
the Changes
Applications
Programmed Polling Save the Users from having to
Click, but does Nothing to Solve the Scalability
Problems Caused by the Request/Respond Mechanism

40
IDP Publish/Subscribe

Semantics Servers Publish/Clients Subscribe
Servers Publish Information Online
Clients Subscribe to the Information of Interest
(Subscription-based Information Delivery)
Data Flow is Initiated by the Data Sources
(Servers) and is Aperiodic
Danger Subscriptions can Lead to Other Unwanted
Subscriptions
Applications
Unicast Database Triggers and Active Databases
1-to-n Online News Groups

41
IDP Broadcast

Semantics Server Broadcast/Client(s) Listen
Clients who Require Access to a Data Item Need to
Wait Until the Item Appears
Selective Broadcast vs. Random Broadcast
Applications
Unicast Doesnt Address Network Bandwidth
Problems
1-to-n Most Likely to Prevail

42
Information Delivery Styles

Pull-Based System
Transfer of Data from Server to Client is
Initiated by a Client Pull
Clients Determine when to Get Information
Potential for Information to be Old Unless Client
Periodically Pulls
Push-Based System
Transfer of Data from Server to Client is
Initiated by a Server Push
Clients may get Overloaded if Push is Too
Frequent
Hybrid
Pull and Push Combined
Pull First and then Push Continually

43
Summary Pull vs. Push
Request/
Publish/
Broadcast
Periodic
Conditional
Ad-hoc
Respond
Subscribe
Pure Pull
Y
Y
Pure Push
Y
Y
Y
Y
Hybrid
Y
Y
Y
Y
Y
Y
44
Design Options for Nodes

Three Types of Nodes
Data Sources
Provide Base Data which is to be Disseminated
Clients
Who are the Net Consumers of the Information
Information Brokers
which Acquire Information from Other Data
Sources, Add Value to that Information and then
Distribute this Information to Other Consumers
By Creating a Hierarchy of Brokers, Information
Delivery can be Tailored to the Need of Many Users

45
The Next Big Challenge

Interoperability
Heterogeneous Distributed Databases
Heterogeneous Distributed Systems
Autonomous Applications
Scalability
Rapid and Continuous Growth
Amount of Data
Variety of Data Types
Different Privacy Levels or Ownerships of Data

46
Interoperability A Classic View
Simple Federation
Multiple Nested Federation
FDB Global Schema
FDB Global Schema 4
Federated Integration
Federated Integration
Local Schema
Local Schema
Local Schema
FDB 1
Local Schema
FDB3
Federation
Federation
47
CORBA-Based Development
IDL file
Object Implementation
Client Application
IDL Compiler
IDL Compiler
Stub
Skeleton
ORB/IIOP
ORB/IIOP
48
ORB IntegrationJava Client Legacy Application
Java Client
Legacy Application
Java Wrapper
Object Request Broker (ORB)
CORBA is the Medium of Info. Exchange Requires
Java/CORBA Capabilities
49
Java Client with Wrapper to Legacy Application
Java Client
Interactions Between Java Client and Legacy Appl.
via C and RPC C is the Medium of Info.
Exchange Java Client with C/C Wrapper
Java Application Code
WRAPPER
Mapping Classes
JAVA LAYER
NATIVE LAYER
Native Functions (C) RPC Client Stubs (C)
Legacy Application
Network
50
COTS and Legacy Appls. to Java Clients
COTS Application
Legacy Application
Java Application Code
Java Application Code
Native Functions that Map to COTS Appl
Native Functions that Map to Legacy Appl
NATIVE LAYER
NATIVE LAYER
JAVA LAYER
JAVA LAYER
Mapping Classes
Mapping Classes
JAVA NETWORK WRAPPER
JAVA NETWORK WRAPPER
Network
Java Client
Java Client
Java is Medium of Info. Exchange - C/C Appls
with Java Wrappers
51
Database Interoperability in the Internet

Technology
Web/HTTP, JDBC/ODBC, CORBA (ORBs IIOP), XML
Architecture

52
Java Client to Legacy App via RDBS
Transformed Legacy Data
Java Client
Relational Database System(RDS)
Updated Data
Extract and Generate Data
Transform and Store Data
Legacy Application
53
JDBC

JDBC API Provides DB Access Protocols for Open,
Query, Close, etc.
Different Drivers for Different DB Platforms

JDBC API
Java Application
Driver Manager
Driver
Driver
Oracle
Sybase
Access
54
Connecting a DB to the Web

Web Server are Stateless
DB Interactions Tend to be Stateful
Invoking a CGI Script on Each DB Interaction is
Very Expensive, Mainly Due to the Cost of DB Open

DBMS
CGI Script Invocation or JDBC Invocation
Web Server
Internet
Browser
55
Connecting More Efficiently

To Avoid Cost of Opening Database, One can Use
Helper Processes that Always Keep Database Open
and Outlive Web Connection
Newly Invoked CGI Scripts Connect to a
Preexisting Helper Process
System is Still Stateless

DBMS
Helper Processes
CGI Script or JDBC Invocation
Web Server
Internet
Browser
56
DB-Internet Architecture
WWW Client (Netscape)
WWW Client (HotJava)
WWW client (Info. Explore)
Internet
HTTP Server
DBWeb Gateway
DBWeb Gateway
DBWeb Dispatcher
DBWeb Gateway
DBWeb Gateway
57
EJB Architecture
58
Technology Push

Computer/Communication Technology (Almost Free)
Plenty of Affordable CPU, Memory, Disk, Network
Bandwidth
Next Generation Internet Gigabit Now
Wireless Ubiquitous, High Bandwidth
Information Growth
Massively Parallel Generation of Information on
the Internet and from New Generation of Sensors
Disk Capacity on the Order of Peta-bytes
Small, Handy Devices to Access Information

59
More Information Moves to Cyberspace

What Grows Faster than Internet?
Why?
Low Rent 10x Cheaper100 Letters (1MB) Disk
10 File Cabinet 5001 Picture Disk
10 Printed 40
Easy Access and Search
Robot can Find All Docs Matching a Predicate
Access from Anywhere
Human Costs 15/hr
How Much Information is there?

Gray, Turing Lecture 1998, ACM FCRC Atlanta
60
Sizes of Search Engines
KEY AVAltaVista, NLNorthern Light,
INKInktomi, EXExcite, LYLycos, ISInfoseek,
WCWebCrawler
as of February, 1999
61
Research Challenges

Inherent Complexity
Coping with Latency (Sometimes Unpredictable)
Failure Detection and Recovery (Partial Failure)
Concurrency, Load Balancing, Availability, Scale
Service Partitioning
Ordering of Distributed Events
Accidental Complexity
Heterogeneity Beyond the Local Case Platform,
Protocol, Plus All Local Heterogeneity in Spades.
Autonomy Change and Evolve Autonomously
Tool Deficiencies Language Support
(Sockets,rpc), Debugging, Etc.

Ubiquitous/Pervasive Many computers and
information appliances everywhere, networked
together
62
Infosphere
Problem too many sources,too much information
63
Current State-of-Art
Web Server
Mainframe Database Server
Thin Client
64
Infosphere Scenario
Infotaps Fat Clients
Sensors
Many sources
Database Server
65
Heterogeneity and Autonomy

Heterogeneity
How Much can we Really Integrate?
Syntactic Integration
Different Formats and Models
Web/SQL Query Languages
Semantic Interoperability
Basic Research on Ontology, Etc
Autonomy
No Central DBA on the Net
Independent Evolution of Schema and Content
Interoperation is Voluntary
Interface Technology (Support for Isvs)
DCOM Microsoft Standard
CORBA, Etc...

66
Security and Data Quality

Security
System Security in the Broad Sense
Attacks Penetrations, Denial of Service
System (and Information) Survivability
Security Fault Tolerance
Replication for Performance, Availability, and
Survivability
Data Quality
Web Data Quality Problems
Local Updates with Global Effects
Unchecked Redundancy (Mutual Copying)
Registration of Unchecked Information
Spam on the Rise

67
Legacy Data Challenge

Legacy Applications and Data
Definition Important and Difficult to Replace
Typically, Mainframe Mission Critical Code
Most are OLTP and Database Applications
Evolution of Legacy Databases
Client-server Architectures
Wrappers
Expensive and Gradual in Any Case

68
Potential Value Added/Jumping on Bandwagon

Sophisticated Query Capability
Combining SQL with Keyword Queries
Consistent Updates
Atomic Transactions and Beyond
But Everything has to be in a Database!
Only If we Stick with Classic DB Assumptions
Relaxing DB Assumptions
Interoperable Query Processing
Extended Transaction Updates
Commodities DB Software
A Little Help is Still Good If it is Cheap
Internet Facilitates Software Distribution
Databases as Middleware

69
Data Warehousing and Data Mining

Data Warehousing
Provide Access to Data for Complex Analysis,
Knowledge Discovery, and Decision Making
Underlying Infrastructure in Support of Mining
Provides Means to Interact with Multiple DBs
OLAP (on-Line Analytical Processing) vs. OLTP
Data Mining
Discovery of Information in a Vast Data Sets
Search for Patterns and Common Features based
Discover Information not Previously Known
Medical Records Accessible Nationwide
Research/Discover Cures for Rare Diseases
Relies on Knowledge Discovery in DBs (KDD)

70
Data Warehousing and OLAP

A Data Warehouse
Database is Maintained Separately from an
Operational Database
A Subject-Oriented, Integrated, Time-Variant,
and Non-Volatile Collection of Data in Support
for Managements Decision Making Process
W.H.Inmon
OLAP (on-Line Analytical Processing)
Analysis of Complex Data in the Warehouse
Attempt to Attain Value through Analysis
Relies on Trained and Adept Skilled Knowledge
Workers who Discover Information
Data Mart
Organized Data for a Subset of an Organization

71
Building a Data Warehouse

Option 1
Leverage Existing Repositories
Collate and Collect
May Not Capture All Relevant Data

Option 2
Start from Scratch
Utilize Underlying Corporate Data

Corporate data warehouse
Option 1 Consolidate Data Marts
Option 2 Build from scratch
Data Mart
Data Mart
Data Mart
Data Mart
...
Corporate data
72
Data Warehouse Characteristics

Utilizes a Multi-Dimensional Data Model
Warehouse Comprised of
Store of Integrated Data from Multiple Sources
Processed into Multi-Dimensional Model
Warehouse Supports of
Times Series and Trend Analysis
Super-Excel Integrated with DB Technologies
Data is Less Volatile than Regular DB
Doesnt Dramatically Change Over Time
Updates at Regular Intervals
Specific Refresh Policy Regarding Some Data

73
Three Tier Architecture
monitor
OLAP Server
integrator
External data sources
Summarization report
Extraxt Transform Load Refresh
Operational databases
Data Warehouse
serve
Query report
Data mining
metadata
Data marts
74
Data Warehouse Design

Most of Data Warehouses use a Start Schema to
Represent Multi-Dimensional Data Model
Each Dimension is Represented by a Dimension
Table that Provides its Multidimensional
Coordinates and Stores Measures for those
Coordinates
A Fact Table Connects All Dimension Tables with a
Multiple Join
Each Tuple in Fact Table Represents the Content
of One Dimension
Each Tuple in the Fact Table Consists of a
Pointer to Each of the Dimensional Tables
Links Between the Fact Table and the Dimensional
Tables for a Shape Like a Star

75
What is a Multi-Dimensional Data Cube?

Representation of Information in Two or More
Dimensions
Typical Two-Dimensional - Spreadsheet
In Practice, to Track Trends or Conduct Analysis,
Three or More Dimensions are Useful

76
Multi-Dimensional Schemas

Supporting Multi-Dimensional Schemas Requires Two
Types of Tables
Dimension Table Tuples of Attributes for Each
Dimension
Fact Table Measured/Observed Variables with
Pointers into Dimension Table
Star Schema
Characterizes Data Cubes by having a Single Fact
Table for Each Dimension
Snowflake Schema
Dimension Tables from Star Schema are Organized
into Hierarchy via Normalization
Both Represent Storage Structures for Cubes

77
Example of Star Schema
Date
Sale Fact Table
Date Month Year
Date Product Store Customer Unit_Sales Dollar_Sale
s
Store
StoreID City State Country Region
78
A Second Example of Star Schema
79
and Corresponding Snowflake Schema
80
Data Warehouse Issues

Data Acquisition
Extraction from Heterogeneous Sources
Reformatted into Warehouse Context - Names,
Meanings, Data Domains Must be Consistent
Data Cleaning for Validity and Qualityis the
Data as Expected w.r.t. Content? Value?
Transition of Data into Data Model of Warehouse
Loading of Data into the Warehouse
Other Issues Include
How Current is the Data? Frequency of Update?
Availability of Warehouse? Dependencies of Data?
Distribution, Replication, and Partitioning
Needs?
Loading Time (Clean, Format, Copy, Transmit,
Index Creation, etc.)?

81
OLAP Strategies

OLAP Strategies
Roll-Up Summarization of Data
Drill-Down from the General to Specific
(Details)
Pivot Cross Tabulate the Data Cubes
Slide and Dice Projection Operations Across
Dimensions
Sorting Ordering Result Sets
Selection Access by Value or Value Range
Implementation Issues
Persistent with Infrequent Updates (Loading)
Optimization for Performance on Queries is More
Complex - Across Multi-Dimensional Cubes
Recovery Less Critical - Mostly Read Only
Temporal Aspects of Data (Versions) Important

82
Knowledge Discovery

Data Warehousing Requires Knowledge Discovery to
Organize/Extract Information Meaningfully
Knowledge Discovery
Technology to Extract Interesting Knowledge
(Rules, Patterns, Regularities, Constraints) from
a Vast Data Set
Process of Non-trivial Extraction of Implicit,
Previously Unknown, and Potentially Useful
Information from Large Collection of Data
Data Mining
A Critical Step in the Knowledge Discovery
Process
Extracts Implicit Information from Large Data Set

83
Steps in a KDD Process

Learning the Application Domain (goals)
Gathering and Integrating Data
Data Cleaning
Data Integration
Data Transformation/Consolidation
Data Mining
Choosing the Mining Method(s) and Algorithm(s)
Mining Search for Patterns or Rules of Interest
Analysis and Evaluation of the Mining Results
Use of Discovered Knowledge in Decision Making
Important Caveats
This is Not an Automated Process!
Requires Significant Human Interaction!

84
On-Line Analytical Processing

Data Cube
A Multidimensonal Array
Each Attribute is a Dimension
In Example Below, the Data Must be Interpreted so
that it Can be Aggregated by Region/Product/Date

Product
Pants
Diapers
Beer
Nuts
West East Central Mountain South
Region
Jan Feb March April
Date
85
Examples of Data Mining

The Slicing Action
A Vertical or Horizontal Slice Across Entire Cube

Slice on city Atlanta
Multi-Dimensional Data Cube
86
Examples of Data Mining

The Dicing Action
A Slide First Identifies on Dimension
A Selection of Any Cube within the Slice which
Essentially Constrains All Three Dimensions

March 2000
Atlanta
Electronics
Dice on Electronics and Atlanta
87
Examples of Data Mining

Drill Down - Takes a Facet (e.g., Q1)
and Decomposes into Finer Detail

Drill down on Q1
Atlanta Columbus Gainesville Savannah
Arizona California Georgia Iowa
Q1 Q2 Q3 Q4
Roll Up on Location (State, USA)
Products Sales
Roll Up Combines Multiple Dimensions From
Individual Cities to State
88
Mining Other Types of Data

Analysis and Access Dramatically More Complicated!

Spatial databases Multimedia databases World
Wide Web
Time series data
Geographical and Satellite Data
89
Advantages/Objectives of Data Mining

Descriptive Mining
Discover and Describe General Properties
60 People who buy Beer on Friday also have
Bought Nuts or Chips in the Past Three Months
Predictive Mining
Infer Interesting Properties based on Available
Data
People who Buy Beer on Friday usually also Buy
Nuts or Chips
Result of Mining
Order from Chaos
Mining Large Data Sets in Multiple Dimensions
Allows Businesses, Individuals, etc. to Learn
about Trends, Behavior, etc.
Impact on Marketing Strateg

90
Data Mining Methods (1)

Association
Discover the Frequency of Items Occurring
Together in a Transaction or an Event
Example
80 Customers who Buy Milk also Buy BreadHence -
Bread and Milk Adjacent in Supermarket
50 of Customers Forget to Buy Milk/Soda/DrinksHe
nce - Available at Register
Prediction
Predicts Some Unknown or Missing Information
based on Available Data
Example
Forecast Sale Value of Electronic Products for
Next Quarter via Available Data from Past Three
Quarters

91
Association Rules

Motivated by Market Analysis
Rules of the Form
Item1Item2 Itemk?Itemk1 Itemn
Example
Beer Soft Drink ? Pop Corn
Problem Discovering All Interesting Association
Rules in a Large Database is Difficult!
Issues
Interestingness
Completeness
Efficiency
Basic Measurement for Association Rules
Support of the Rule
Confidence of the Rule

92
Data Mining Methods (2)

Classification
Determine the Class or Category of an Object
based on its Properties
Example
Classify Companies based on the Final Sale
Results in the Past Quarter
Clustering
Organize a Set of Multi-dimensional Data Objects
in Groups to Minimize Inter-group Similarity is
and Maximize Intra-group Similarity
Example
Group Crime Locations to Find Distribution
Patterns

93
Classification

Two Stages
Learning Stage Construction of a Classification
Function or Model
Classification Stage Predication of Classes of
Objects Using the Function or Model
Tools for Classification
Decision Tree
Bayesian Network
Neural Network
Regression
Problem
Given a Set of Objects whose Classes are Known
(Training Set), Derive a Classification Model
which can Correctly Classify Future Objects

94
An Example

Attributes
Class Attribute - Play/Dont Play the Game
Training Set
Values that Set the Condition for the
Classification
What are the Pattern Below?

95
Data Mining Methods (3)

Summarization
Characterization (Summarization) of General
Features of Objects in the Target Class
Example
Characterize Peoples Buying Patterns on the
Weekend
Potential Impact on Sale Items When Sales
Start
Department Stores with Bonus Coupons
Discrimination
Comparison of General Features of Objects Between
a Target Class and a Contrasting Class
Example
Comparing Students in Engineering and in Art
Attempt to Arrive at Commonalities/Differences

96
Summarization Technique

Attribute-Oriented Induction
Generalization using Concert hierarchy (Taxonomy)

barcode category brand content
size
14998 milk diaryland Skim
2L
food
12998 mechanical MotorCraft valve 23a 12in
Milk bread
...
Skim milk 2 milk
White whole bread wheat
Category Content Count
Lucern Dairyland
Wonder Safeway
milk skim 280 milk 2
98 ...
97
Why is Data Mining Popular?

Technology Push
Technology for Collecting Large Quantity of Data
Bar Code, Scanners, Satellites, Cameras
Technology for Storing Large Collection of Data
Databases, Data Warehouses
Variety of Data Repositories, such as Virtual
Worlds, Digital Media, World Wide Web
Corporations want to Improve Direct Marketing and
Promotions - Driving Technology Advances
Targeted Marketing by Age, Region, Income, etc.
Exploiting User Preferences/Customized Shopping

98
Requirements Challenges in Data Mining

Security and Social
What Information is Available to Mine?
Preferences via Store Cards/Web Purchases
What is Your Comfort Level with Trends?
User Interfaces and Visualization
What Tools Must be Provided for End Users of Data
Mining Systems?
How are Results for Multi-Dimensional Data
Displayed?
Performance Guarantees
Range from Real-Time for Some Queries to
Long-Term for Other Queries
Data Sources of Complex Data Types or
Unstructured Data - Ability to Format, Clean, and
Load Data Sets

99
Data Mining Visualization

Leverage Improving 3D Graphics and Increased PC
Processing Power for Displaying Results
Significant Research in Visualization w.r.t.
Displaying Multi-Dimensional Data

100
Successful Data Mining Applications

Business Data Analysis and Decision Support
Marketing, Customer Profiling, Market Analysis
and Management, Risk Analysis and Management
Fraud Detection
Detecting Telephone Fraud, Automotive and Health
Insurance Fraud, Credit-card Fraud, Suspicious
Money Transactions (Money Laundering)
Text Mining
Message Filtering (Email, Newsgroups, Etc.)
Newspaper Articles Analysis
Sports
IBM Advanced Scout Analyzed NBA Game Statistics
(Shots Blocked, Assists and Fouls) to Gain
Competitive Advantage

101
Select Data Mining Products
102
Concluding Remarks

Four-Fold Objective
Distributed Database Processing
Web-Based Architectures for DB Interoperability
Data Warehouses
Data Mining of Vast Information Repositories
All Three are Tightly Related
DDBMS can Improve Performance of Mining
Repositories as Backend Database Processors
Web-Based Architectures Provide Access Means for
DDBMS or Mining
Warehouses are Infrastructure to Facilitate
Mining
Clinical Data Repositories, Geographic
Information Systems, Deductive DBMS, Multi-Media
DBMS, Mobile DBMS, Embedded/Real-Time DBMS, etc.