The Software Infrastructure for Electronic Commerce - PowerPoint PPT Presentation

About This Presentation

Title:

The Software Infrastructure for Electronic Commerce

Description:

Quarter. Month. Week. Day. Data Reduction: Aggregation ... Time(tkey, day, month, quarter, year) Products(pkey, pname, price, pid, category, industry) ... – PowerPoint PPT presentation

Number of Views:101

Avg rating:3.0/5.0

Slides: 86

Provided by: johanne46

Learn more at: http://www.cs.cornell.edu

Category:

more less

Transcript and Presenter's Notes

Title: The Software Infrastructure for Electronic Commerce

1
The Software Infrastructurefor Electronic
Commerce

Databases and Data Mining
Lecture 3 An Introduction To Data Mining (I)
Johannes Gehrke
johannes_at_cs.cornell.edu
http//www.cs.cornell.edu/johannes

2
Lectures Three and Four

Data preprocessing
Multidimensional data analysis
Data mining
Association rules
Classification trees
Clustering

3
Why Data Preprocessing?

Quality decisions come from quality data.
Problems with real life data
Data needs to be integrated from different
sources
Missing values
Noisy and inconsistent values
Data is not at the right level of aggregation

4
Recall The Relational Data Model

A relational database is a set of relations
A relation has two components
The relation instance. Basically a table, with
rows (also records, tuples) and columns (also
fields, attributes).Number of records in the
relation instance Cardinality.
The relation schema. Specifies name of the
relation, plus name and type of each column.
Turing award (Nobel price in CS) for Codd in 1981
for his work on the relational model

5
Example Customer Relation

Relation schemaCustomers (cid integer, name
string, byear integer, state string)
Relation instance

6
Data Integration

Integrate data from multiple sources into a
common format for data mining.
Note A good data warehouse has already taken
care of this step.

Data Warehouse Server
OLTPDBMSs
Extract, clean,transform, aggregate,load, update
Other Data Sources
Data Marts
7
Data Integration (Contd.)

Problem Heterogeneous schema integration
Different attribute names
Different units Sales in , sales in Yen, sales
in DM

8
Data Integration (Contd.)

Problem Heterogeneous schema integration
Different scales Sales in dollars versus sales
in pennies
Derived attributes Annual salary versus monthly
salary

9
Data Integration (Contd.)

Problem Inconsistency due to redundancy
Customer with customer-id 150 has three children
in relation1 and four children in relation2
Computation of annual salary from monthly salary
in relation1 does not match annual-salary
attribute in relation2

10
Missing Values

Often the values of some attributes in a record
are not known.
Reasons for missing values
Attribute does not apply (e.g., maiden name)
Inconsistency with other recorded data
Equipment malfunction
Human errors
Attribute introduced recently (e.g., email
address)

11
Missing Values Approaches

Ignore the record
Complete the missing value
Manual completion Tedious and likely to be
infeasible
Fill in a global constant, e.g., NULL,
unknown
Use the attribute mean or mode
Construct a data mining model that predicts the
missing value

12
Noisy Data

Examples
Faulty data collection instruments
Data entry problems, misspellings
Data transmission problems
Technology limitation
Inconsistency in naming conventions
Duplicate records with different values for a
common field

13
Noisy Data Remove Outliers
14
Noisy Data Smoothing
y
Y1
Y1
x
X1
15
Noisy Data Normalization

Scale data to fall within a small, specified
range
Leave out extreme order statistics
Min-max normalization
Z-score normalization
Normalization by decimal scaling

16
Data Reduction

Problem
Data might not be at the right scale for
analysis.Example Individual phone calls versus
monthly phone call usage
Complex data mining tasks might run a very long
time.Example Multi-terabyte data warehousesOne
disk drive About 20MB/s

17
Data Reduction Attribute Selection

Select the relevant attributes for the data
mining task
If there are k attributes, there are 2k-1
different subsets
Example salary,children,byear,
salary,children, salary,byear,
children,byear, salary, children, byear
Choice of the right subset depends on
Data mining task
Underlying probability distribution

18
Attribute Selection (Contd.)

How to choose relevant attributes
Forward selection Select greedily one attribute
at a time
Backward elimination Start with all attributes,
eliminate irrelevant attributes
Combination of forward selection and backward
elimination

19
Data Reduction Parametric Models

Main idea
Fit a parametric model to the data (e.g.,
multivariate normal distribution)
Store the model parameters, discard the data
(except for outliers)

20
Parametric Models Example

Instead of storing (x,y) pairs, store only the
x-value.Then recompute the y-value usingy ax
b

y
x
21
Data Reduction Sampling

Choose a representative subset of the data
Simple random sampling may have very poor
performance in the presence of skew

22
Data Reduction Sampling (Contd.)

Stratified sampling Biased sampling
Example Keep population group ratios
Example Keep minority population group count

23
Data Reduction Histograms

Divide data into buckets and store average (sum)
for each bucket
Can be constructed optimally for one attribute
using dynamic programming
ExampleDataset 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,
5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram
(range, count, sum)(1-2,12,16), (3-6,8,36),
(7-9,6,48), (10-12,6,66)

24
Histograms (Contd.)

Equal-width histogram
Divides the domain of an attribute into k
intervals of equal size
Interval width (Max Min)/k
Computationally easy
Problems with data skew and outliers
ExampleDataset 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,
5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram
(range, count, sum)(1-3,14,22), (4-6,6,30),
(7-9,6,48), (10-12,6,66)

25
Histograms (Contd.)

Equal-depth histogram
Divides the domain of an attribute into k
intervals, each containing the same number of
records
Variable interval width
Computationally easy
ExampleDataset 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,
5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram
(range, count, sum)(1,8,8), (2-4,8,22),
(5-8,8,52), (9-12,8,84)

26
Data Reduction Discretization

Same concept as histograms
Divide domain of a numerical attribute into
intervals.
Replace attribute value with label for interval.
Example
Dataset (age salary)(2530,000),(3080,000),(27
50,000),(6070,000),(5055,000),(2825,000)
Discretized dataset (age, discretizedSalary)(25,
low),(30,high),(27,medium),(60,high),(50,medium),
(28,low)

27
Discretization Natural Hierarchies

Replace low-level concepts with high-level
concepts
Example replacements
Product SKU by category
City by state

Year
Industry
Country
Quarter
Category
State
Month
Week
Product
City
Day
28
Data Reduction Aggregation

Natural hierarchies on attributes can be used to
aggregate data along the hierarchy

Year
Industry
Country
Quarter
Category
State
Month
Week
Product
City
Day
29
Aggregation Example
30
Data Reduction Other Methods

Principal component analysis
Fourier transformation
Wavelet transformation

31
Data Preprocessing Summary

Problems during data integration
Different attribute names
Different units
Different scales
Derived attributes
Redundant data
Missing values
Imputation
Prediction

Noisy data
Outlier removal
Smoothing
Normalization
Data Reduction
Attribute selection
Fitting parametric models
Sampling
Histograms
Discretization
Aggregation

32
Data Analysis

Data preprocessing
Multidimensional data analysis
Data mining
Association rules
Classification trees
Clustering

33
Multidimensional Data Analysis

Recall
Transactions(ckey, timekey, pkey, units, price)
Customers(ckey, cid, name, byear, city, state,
country)
Time(tkey, day, month, quarter, year)
Products(pkey, pname, price, pid, category,
industry)
Hierarchies on dimensions

Year
Industry
Country
Quarter
Category
State
Month
Week
Product
City
Day
34
Multidimensional Data Analysis
Year
Industry
CountryUSA
Quarter
Category
State
Month
Week
Product
City
Day
35
Corresponding Query in SQL

SELECT SUM(units)FROM Transactions T, Products
P, Customers CWHERE T.pkey P.pkey AND T.ckey
C.ckey AND C.country USAGROUP BY
P.industry, C.state
We think that Industry3 in CA is interesting.

Year
Industry
CountryUSA
Quarter
Category
State
Month
Week
Product
City
Day
36
Slice and Drill-Down
Year
IndustryIndustry3
Country
Quarter
Category
StateCA
Month
Week
Product
City
Day
37
Corresponding Query in SQL

SELECT SUM(units)FROM Transactions T, Products
P, Customers CWHERE T.pkey P.pkey AND T.ckey
C.ckey AND P.industry Industry3 AND C.state
CAGROUP BY P.category, C.city
We think that Category3 is interesting.

Year
IndustryIndustry3
Country
Quarter
Category
StateCA
Month
Week
Product
City
Day
38
Slice and Drill-Down
Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
39
Corresponding Query in SQL

SELECT SUM(units)FROM Transactions T, Products
P, Customers CWHERE T.pkey P.pkey AND T.ckey
C.ckey AND C.state CA AND P.category
Category3GROUP BY P.product, C.city
Nothing new in this view of the data.

Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
40
Pivot To (City, Year)
Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
41
Corresponding Query in SQL

SELECT SUM(units)FROM Transactions T, Products
P, Customers CWHERE T.pkey P.pkey AND T.ckey
C.ckey AND C.state CA AND P.category
Category3GROUP BY C.city, T.year

Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
42
Multidimensional Data Analysis

Set of data manipulation operators
Roll-up Go up one step in a dimension hierarchy.
Example month -gt quarter
Drill-down Go down one step in a dimension
hierarchy. Example quarter -gt month
Slice Select a subset of the values of a
dimension. Example All categories -gt only
Category3
Dice Select all values of a dimension. Example
Only Category3 -gt all categories
Pivot Select new dimensions to visualize the
data. Example Pivot to Time(quarter) and
Customer(state)

43
Visual Intuition Cube
roll-up to category
Customer Data Mart
roll-up to state
SH
SF
LA
Product1 Product2Product3 Product4Product5Produ
ct6
20 30 20 15 10 50
Product
roll-up to week
M T W Th F S S
Time
50 Units of Product6 sold on Monday in LA
44
OLAP Server Architectures

Relational OLAP (ROLAP)
Relational DBMS stores data mart
OLAP middleware
Aggregation and navigation logic
Optimized for DBMS in the background, but slow
and complex
Basically only one vendor Microstrategy
Multidimensional OLAP (MOLAP)
Specialized array-based storage structure
Vendors Hyperion (Essbase), Appix (iTM1),
Oracle, Microsoft

45
OLAP Server Architectures

Desktop OLAP (DOLAP)
Performs OLAP directly at your PC
Vendors Cognos (Powerplay), Business Objects,
Brio Technology, Hummingbird
Hybrids and Application OLAP
More www.olapreport.com

46
The OLAP Market
47
Enterprise Reporting Tools
48
Summary Multidimensional Analysis

Spreadsheet style data analysis
Roll-up, drill-down, slice, dice, and pivot your
way to interesting cells in the CUBE
Mainstream technology
Established enterprises already have OLAP
installations
When establishing your e-business, OLAP will be
your first step in data analysis

49
Data Mining
50
Definition

Data mining is the exploration and analysis of
large quantities of data in order to discover
valid, novel, potentially useful, and ultimately
understandable patterns in data.
Example pattern (Census Bureau Data)If
(relationship husband), then (gender male).
99.6

51
Definition (Cont.)

Data mining is the exploration and analysis of
large quantities of data in order to discover
valid, novel, potentially useful, and ultimately
understandable patterns in data.
Valid The patterns hold in general.
Novel We did not know the pattern beforehand.
Useful We can devise actions from the patterns.
Understandable We can interpret and comprehend
the patterns.

52
Why Use Data Mining Today?

Human analysis skills are inadequate
Volume and dimensionality of the data
High data growth rate
Availability of
Data
Storage
Computational power
Off-the-shelf software

53
An Abundance of Data

Supermarket scanners, POS data
Preferred customer cards
Credit card transactions
Direct mail response
Call center records
Web server logs
Customer web site trails
ATM machines
Demographic data

54
Evolution of Database Technology

1960s IMS and network DBMS
1970s The relational data model, small
relational DBMS implementations
1980s Maturing RDBMS, application-specific DBMS
(spatial data, scientific data, images, etc.)
1990s Mature, high-performance RDBMS technology,
parallel DBMS, terabyte data warehouses,
object-relational DBMS, middleware and web
technology
2000s High availability, maintainability,
seamless integration into business processes

55
Computational Power

Moores Law In 1965, Intel Corp. cofounder
Gordon Moore predicted that the density of
transistors in an integrated circuit would double
every year.
Later changed to reflect 18 months progress.
Experts on ants estimate that there are 1016 to
1017 ants on earth. In the year 1997, we produced
one transistor per ant.

56
OTS-Software for Data Mining

ANGOSS KnowledgeStudio
IBM Intelligent Miner
Metaputer PolyAnalyst
SAS Enterprise Miner
SGI Mineset
SPSS Clementine
Many others
More at http//www.kdnuggets.com/software

57
Why Use Data Mining Today?

And competitive pressure!
The secret of success is to know something that
nobody else knows.
Aristotle Onassis
Competition on service, not only on price (Banks,
phone companies, hotel chains, rental car
companies)
Personalization
Your competitors already apply data mining

58
The Knowledge Discovery Process

Steps
Identify business problem
Data preprocessing
Data selection Identify target datasets and
relevant fields
Data cleaning Remove noise and outliers
Data transformation
Create common units
Generate new fields
Data mining and model evaluation
Deployment and measurement of the results

59
Preprocessing and Mining
Knowledge
Patterns
PreprocessedData
TargetData
Interpretation
ModelConstruction
Original Data
Preprocessing
DataIntegrationand Selection
60
What is a Data Mining Model?

A data mining model is a description of a
specific aspect of a dataset. It produces output
values for an assigned set of input values.
Examples
Linear regression model
Classification model
Clustering

61
Data Mining Models (Contd.)

A data mining model can be described at two
levels
Functional level
Describes model in terms of its intended
usage.Examples Classification, clustering
Representational level
Specific representation of a model.Example
Log-linear model, classification tree, nearest
neighbor method.
Black-box models versus transparent models

62
Data Mining Techniques

Supervised learning
Classification and regression
Unsupervised learning
Clustering
Dependency modeling
Associations, summarization, causality
Outlier and deviation detection
Trend analysis and change detection

63
Fields Related to Data Mining

Database systems
Machine learning
Statistics
Visualization
Parallel computing

64
Traditional Analysis Tools

Predefined set of reports
Capture most important known business questions
Generated at fixed points in time
No support for impromptu queries
SQL Write your own SQL query.
Statistics department within a company
Slow reaction to users needs
Limited understanding of the business questions
Problems communicating the findings

65
Example Application Sports

IBM Advanced Scout analyzesNBA game statistics
Shots blocked
Assists
Fouls
http//www.research.ibm.com/scout/home.html

66
Advanced Scout

Example pattern An analysis of thedata from a
game played betweenthe New York Knicks and the
CharlotteHornets revealed that When Glenn Rice
played the shooting guard position, he shot 5/6
(83) on jump shots."
Pattern is interestingThe average shooting
percentage for the Charlotte Hornets during that
game was 54.

67
Example Application Sky Survey

Input data 3 TB of image data with 2 billion sky
objects, took more than six years to complete
Goal Generate a catalog with all objects and
their type
Method Use decision trees as data mining model
Results
94 accuracy in predicting sky object classes
Increased number of faint objects classified by
300
Helped team of astronomers to discover 16 new
high red-shift quasars in one order of magnitude
less observation time

68
Gold Nuggets?

Investment firm mailing list Discovered that old
people do not respond to IRA mailings
Bank clustered their customers. One cluster
Older customers, no mortgage, less likely to have
a credit card
Bank of 1911
Customer churn example

69
Market Basket Analysis

Consider shopping cart filled with several items
Market basket analysis tries to answer the
following questions
Who makes purchases?
What do customers buy together?
In what order do customers purchase items?

70
Market Basket Analysis

Given
A database of customer transactions
Each transaction is a set of items
ExampleTransaction with TID 111 contains items
Pen, Ink, Milk, Juice

71
Market Basket Analysis (Contd.)

Coocurrences
80 of all customers purchase items X, Y and Z
together.
Association rules
60 of all customers who purchase X and Y also
buy Z.
Sequential patterns
60 of customers who first buy X also purchase Y
within three weeks.

72
Confidence and Support

We prune the set of all possible association
rules using two interestingness measures
Confidence of a rule
X gt Y has confidence c if P(YX) c
Support of a rule
X gt Y has support s if P(XY) s
We can also define
Support of an itemset (a coocurrence) XY
XY has support s if P(XY) s

73
Example

Examples
Pen gt MilkSupport 75Confidence 75
Ink gt PenSupport 100Confidence 100

74
Exercise

Find all itemsets withsupport gt 75?

75
Exercise

Can you find all association rules with support
gt 50?

76
Extensions

Imposing constraints
Only find rules involving the dairy department
Only find rules involving expensive products
Only find expensive rules
Only find rules with whiskey on the right hand
side
Only find rules with milk on the left hand side
Hierarchies on the items
Calendars (every Sunday, every 1st of the month)

77
Market Basket Analysis Applications

Sample Applications
Direct marketing
Fraud detection for medical insurance
Floor/shelf planning
Web site layout
Cross-selling

78
Beyond Support and Confidence

Example 5000 students
3000 students play basketball
3750 students eat cereal
2000 students both play basketball and eat cereal

79
Misleading Association Rules

Basketball gt Cereal (support 40, confidence
66.7) is misleading because 75 of students eat
cereal
Basketball gt No cereal (support 20,
confidence 33.3) is more interesting, although
with lower support and confidence

80
Interest