The Software Infrastructure for Electronic Commerce - PowerPoint PPT Presentation

About This Presentation
Title:

The Software Infrastructure for Electronic Commerce

Description:

Quarter. Month. Week. Day. Data Reduction: Aggregation ... Time(tkey, day, month, quarter, year) Products(pkey, pname, price, pid, category, industry) ... – PowerPoint PPT presentation

Number of Views:101
Avg rating:3.0/5.0
Slides: 86
Provided by: johanne46
Category:

less

Transcript and Presenter's Notes

Title: The Software Infrastructure for Electronic Commerce


1
The Software Infrastructurefor Electronic
Commerce
  • Databases and Data Mining
  • Lecture 3 An Introduction To Data Mining (I)
  • Johannes Gehrke
  • johannes_at_cs.cornell.edu
  • http//www.cs.cornell.edu/johannes

2
Lectures Three and Four
  • Data preprocessing
  • Multidimensional data analysis
  • Data mining
  • Association rules
  • Classification trees
  • Clustering

3
Why Data Preprocessing?
  • Quality decisions come from quality data.
  • Problems with real life data
  • Data needs to be integrated from different
    sources
  • Missing values
  • Noisy and inconsistent values
  • Data is not at the right level of aggregation

4
Recall The Relational Data Model
  • A relational database is a set of relations
  • A relation has two components
  • The relation instance. Basically a table, with
    rows (also records, tuples) and columns (also
    fields, attributes).Number of records in the
    relation instance Cardinality.
  • The relation schema. Specifies name of the
    relation, plus name and type of each column.
  • Turing award (Nobel price in CS) for Codd in 1981
    for his work on the relational model

5
Example Customer Relation
  • Relation schemaCustomers (cid integer, name
    string, byear integer, state string)
  • Relation instance

6
Data Integration
  • Integrate data from multiple sources into a
    common format for data mining.
  • Note A good data warehouse has already taken
    care of this step.

Data Warehouse Server
OLTPDBMSs
Extract, clean,transform, aggregate,load, update
Other Data Sources
Data Marts
7
Data Integration (Contd.)
  • Problem Heterogeneous schema integration
  • Different attribute names
  • Different units Sales in , sales in Yen, sales
    in DM

8
Data Integration (Contd.)
  • Problem Heterogeneous schema integration
  • Different scales Sales in dollars versus sales
    in pennies
  • Derived attributes Annual salary versus monthly
    salary

9
Data Integration (Contd.)
  • Problem Inconsistency due to redundancy
  • Customer with customer-id 150 has three children
    in relation1 and four children in relation2
  • Computation of annual salary from monthly salary
    in relation1 does not match annual-salary
    attribute in relation2

10
Missing Values
  • Often the values of some attributes in a record
    are not known.
  • Reasons for missing values
  • Attribute does not apply (e.g., maiden name)
  • Inconsistency with other recorded data
  • Equipment malfunction
  • Human errors
  • Attribute introduced recently (e.g., email
    address)

11
Missing Values Approaches
  • Ignore the record
  • Complete the missing value
  • Manual completion Tedious and likely to be
    infeasible
  • Fill in a global constant, e.g., NULL,
    unknown
  • Use the attribute mean or mode
  • Construct a data mining model that predicts the
    missing value

12
Noisy Data
  • Examples
  • Faulty data collection instruments
  • Data entry problems, misspellings
  • Data transmission problems
  • Technology limitation
  • Inconsistency in naming conventions
  • Duplicate records with different values for a
    common field

13
Noisy Data Remove Outliers
14
Noisy Data Smoothing
y
Y1
Y1
x
X1
15
Noisy Data Normalization
  • Scale data to fall within a small, specified
    range
  • Leave out extreme order statistics
  • Min-max normalization
  • Z-score normalization
  • Normalization by decimal scaling

16
Data Reduction
  • Problem
  • Data might not be at the right scale for
    analysis.Example Individual phone calls versus
    monthly phone call usage
  • Complex data mining tasks might run a very long
    time.Example Multi-terabyte data warehousesOne
    disk drive About 20MB/s

17
Data Reduction Attribute Selection
  • Select the relevant attributes for the data
    mining task
  • If there are k attributes, there are 2k-1
    different subsets
  • Example salary,children,byear,
    salary,children, salary,byear,
    children,byear, salary, children, byear
  • Choice of the right subset depends on
  • Data mining task
  • Underlying probability distribution

18
Attribute Selection (Contd.)
  • How to choose relevant attributes
  • Forward selection Select greedily one attribute
    at a time
  • Backward elimination Start with all attributes,
    eliminate irrelevant attributes
  • Combination of forward selection and backward
    elimination

19
Data Reduction Parametric Models
  • Main idea
  • Fit a parametric model to the data (e.g.,
    multivariate normal distribution)
  • Store the model parameters, discard the data
    (except for outliers)

20
Parametric Models Example
  • Instead of storing (x,y) pairs, store only the
    x-value.Then recompute the y-value usingy ax
    b

y
x
21
Data Reduction Sampling
  • Choose a representative subset of the data
  • Simple random sampling may have very poor
    performance in the presence of skew

22
Data Reduction Sampling (Contd.)
  • Stratified sampling Biased sampling
  • Example Keep population group ratios
  • Example Keep minority population group count

23
Data Reduction Histograms
  • Divide data into buckets and store average (sum)
    for each bucket
  • Can be constructed optimally for one attribute
    using dynamic programming
  • ExampleDataset 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,
    5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram
    (range, count, sum)(1-2,12,16), (3-6,8,36),
    (7-9,6,48), (10-12,6,66)

24
Histograms (Contd.)
  • Equal-width histogram
  • Divides the domain of an attribute into k
    intervals of equal size
  • Interval width (Max Min)/k
  • Computationally easy
  • Problems with data skew and outliers
  • ExampleDataset 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,
    5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram
    (range, count, sum)(1-3,14,22), (4-6,6,30),
    (7-9,6,48), (10-12,6,66)

25
Histograms (Contd.)
  • Equal-depth histogram
  • Divides the domain of an attribute into k
    intervals, each containing the same number of
    records
  • Variable interval width
  • Computationally easy
  • ExampleDataset 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,
    5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram
    (range, count, sum)(1,8,8), (2-4,8,22),
    (5-8,8,52), (9-12,8,84)

26
Data Reduction Discretization
  • Same concept as histograms
  • Divide domain of a numerical attribute into
    intervals.
  • Replace attribute value with label for interval.
  • Example
  • Dataset (age salary)(2530,000),(3080,000),(27
    50,000),(6070,000),(5055,000),(2825,000)
  • Discretized dataset (age, discretizedSalary)(25,
    low),(30,high),(27,medium),(60,high),(50,medium),
    (28,low)

27
Discretization Natural Hierarchies
  • Replace low-level concepts with high-level
    concepts
  • Example replacements
  • Product SKU by category
  • City by state

Year
Industry
Country
Quarter
Category
State
Month
Week
Product
City
Day
28
Data Reduction Aggregation
  • Natural hierarchies on attributes can be used to
    aggregate data along the hierarchy

Year
Industry
Country
Quarter
Category
State
Month
Week
Product
City
Day
29
Aggregation Example
30
Data Reduction Other Methods
  • Principal component analysis
  • Fourier transformation
  • Wavelet transformation

31
Data Preprocessing Summary
  • Problems during data integration
  • Different attribute names
  • Different units
  • Different scales
  • Derived attributes
  • Redundant data
  • Missing values
  • Imputation
  • Prediction
  • Noisy data
  • Outlier removal
  • Smoothing
  • Normalization
  • Data Reduction
  • Attribute selection
  • Fitting parametric models
  • Sampling
  • Histograms
  • Discretization
  • Aggregation

32
Data Analysis
  • Data preprocessing
  • Multidimensional data analysis
  • Data mining
  • Association rules
  • Classification trees
  • Clustering

33
Multidimensional Data Analysis
  • Recall
  • Transactions(ckey, timekey, pkey, units, price)
  • Customers(ckey, cid, name, byear, city, state,
    country)
  • Time(tkey, day, month, quarter, year)
  • Products(pkey, pname, price, pid, category,
    industry)
  • Hierarchies on dimensions

Year
Industry
Country
Quarter
Category
State
Month
Week
Product
City
Day
34
Multidimensional Data Analysis
Year
Industry
CountryUSA
Quarter
Category
State
Month
Week
Product
City
Day
35
Corresponding Query in SQL
  • SELECT SUM(units)FROM Transactions T, Products
    P, Customers CWHERE T.pkey P.pkey AND T.ckey
    C.ckey AND C.country USAGROUP BY
    P.industry, C.state
  • We think that Industry3 in CA is interesting.

Year
Industry
CountryUSA
Quarter
Category
State
Month
Week
Product
City
Day
36
Slice and Drill-Down
Year
IndustryIndustry3
Country
Quarter
Category
StateCA
Month
Week
Product
City
Day
37
Corresponding Query in SQL
  • SELECT SUM(units)FROM Transactions T, Products
    P, Customers CWHERE T.pkey P.pkey AND T.ckey
    C.ckey AND P.industry Industry3 AND C.state
    CAGROUP BY P.category, C.city
  • We think that Category3 is interesting.

Year
IndustryIndustry3
Country
Quarter
Category
StateCA
Month
Week
Product
City
Day
38
Slice and Drill-Down
Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
39
Corresponding Query in SQL
  • SELECT SUM(units)FROM Transactions T, Products
    P, Customers CWHERE T.pkey P.pkey AND T.ckey
    C.ckey AND C.state CA AND P.category
    Category3GROUP BY P.product, C.city
  • Nothing new in this view of the data.

Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
40
Pivot To (City, Year)
Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
41
Corresponding Query in SQL
  • SELECT SUM(units)FROM Transactions T, Products
    P, Customers CWHERE T.pkey P.pkey AND T.ckey
    C.ckey AND C.state CA AND P.category
    Category3GROUP BY C.city, T.year

Year
Country
Industry
Quarter
StateCA
CategoryCategory3
Month
Week
City
Day
Product
42
Multidimensional Data Analysis
  • Set of data manipulation operators
  • Roll-up Go up one step in a dimension hierarchy.
    Example month -gt quarter
  • Drill-down Go down one step in a dimension
    hierarchy. Example quarter -gt month
  • Slice Select a subset of the values of a
    dimension. Example All categories -gt only
    Category3
  • Dice Select all values of a dimension. Example
    Only Category3 -gt all categories
  • Pivot Select new dimensions to visualize the
    data. Example Pivot to Time(quarter) and
    Customer(state)

43
Visual Intuition Cube
roll-up to category
Customer Data Mart
roll-up to state
SH
SF
LA
Product1 Product2Product3 Product4Product5Produ
ct6
20 30 20 15 10 50
Product
roll-up to week
M T W Th F S S
Time
50 Units of Product6 sold on Monday in LA
44
OLAP Server Architectures
  • Relational OLAP (ROLAP)
  • Relational DBMS stores data mart
  • OLAP middleware
  • Aggregation and navigation logic
  • Optimized for DBMS in the background, but slow
    and complex
  • Basically only one vendor Microstrategy
  • Multidimensional OLAP (MOLAP)
  • Specialized array-based storage structure
  • Vendors Hyperion (Essbase), Appix (iTM1),
    Oracle, Microsoft

45
OLAP Server Architectures
  • Desktop OLAP (DOLAP)
  • Performs OLAP directly at your PC
  • Vendors Cognos (Powerplay), Business Objects,
    Brio Technology, Hummingbird
  • Hybrids and Application OLAP
  • More www.olapreport.com

46
The OLAP Market
47
Enterprise Reporting Tools
48
Summary Multidimensional Analysis
  • Spreadsheet style data analysis
  • Roll-up, drill-down, slice, dice, and pivot your
    way to interesting cells in the CUBE
  • Mainstream technology
  • Established enterprises already have OLAP
    installations
  • When establishing your e-business, OLAP will be
    your first step in data analysis

49
Data Mining
50
Definition
  • Data mining is the exploration and analysis of
    large quantities of data in order to discover
    valid, novel, potentially useful, and ultimately
    understandable patterns in data.
  • Example pattern (Census Bureau Data)If
    (relationship husband), then (gender male).
    99.6

51
Definition (Cont.)
  • Data mining is the exploration and analysis of
    large quantities of data in order to discover
    valid, novel, potentially useful, and ultimately
    understandable patterns in data.
  • Valid The patterns hold in general.
  • Novel We did not know the pattern beforehand.
  • Useful We can devise actions from the patterns.
  • Understandable We can interpret and comprehend
    the patterns.

52
Why Use Data Mining Today?
  • Human analysis skills are inadequate
  • Volume and dimensionality of the data
  • High data growth rate
  • Availability of
  • Data
  • Storage
  • Computational power
  • Off-the-shelf software

53
An Abundance of Data
  • Supermarket scanners, POS data
  • Preferred customer cards
  • Credit card transactions
  • Direct mail response
  • Call center records
  • Web server logs
  • Customer web site trails
  • ATM machines
  • Demographic data

54
Evolution of Database Technology
  • 1960s IMS and network DBMS
  • 1970s The relational data model, small
    relational DBMS implementations
  • 1980s Maturing RDBMS, application-specific DBMS
    (spatial data, scientific data, images, etc.)
  • 1990s Mature, high-performance RDBMS technology,
    parallel DBMS, terabyte data warehouses,
    object-relational DBMS, middleware and web
    technology
  • 2000s High availability, maintainability,
    seamless integration into business processes

55
Computational Power
  • Moores Law In 1965, Intel Corp. cofounder
    Gordon Moore predicted that the density of
    transistors in an integrated circuit would double
    every year.
  • Later changed to reflect 18 months progress.
  • Experts on ants estimate that there are 1016 to
    1017 ants on earth. In the year 1997, we produced
    one transistor per ant.

56
OTS-Software for Data Mining
  • ANGOSS KnowledgeStudio
  • IBM Intelligent Miner
  • Metaputer PolyAnalyst
  • SAS Enterprise Miner
  • SGI Mineset
  • SPSS Clementine
  • Many others
  • More at http//www.kdnuggets.com/software

57
Why Use Data Mining Today?
  • And competitive pressure!
  • The secret of success is to know something that
    nobody else knows.
  • Aristotle Onassis
  • Competition on service, not only on price (Banks,
    phone companies, hotel chains, rental car
    companies)
  • Personalization
  • Your competitors already apply data mining

58
The Knowledge Discovery Process
  • Steps
  • Identify business problem
  • Data preprocessing
  • Data selection Identify target datasets and
    relevant fields
  • Data cleaning Remove noise and outliers
  • Data transformation
  • Create common units
  • Generate new fields
  • Data mining and model evaluation
  • Deployment and measurement of the results

59
Preprocessing and Mining
Knowledge
Patterns
PreprocessedData
TargetData
Interpretation
ModelConstruction
Original Data
Preprocessing
DataIntegrationand Selection
60
What is a Data Mining Model?
  • A data mining model is a description of a
    specific aspect of a dataset. It produces output
    values for an assigned set of input values.
  • Examples
  • Linear regression model
  • Classification model
  • Clustering

61
Data Mining Models (Contd.)
  • A data mining model can be described at two
    levels
  • Functional level
  • Describes model in terms of its intended
    usage.Examples Classification, clustering
  • Representational level
  • Specific representation of a model.Example
    Log-linear model, classification tree, nearest
    neighbor method.
  • Black-box models versus transparent models

62
Data Mining Techniques
  • Supervised learning
  • Classification and regression
  • Unsupervised learning
  • Clustering
  • Dependency modeling
  • Associations, summarization, causality
  • Outlier and deviation detection
  • Trend analysis and change detection

63
Fields Related to Data Mining
  • Database systems
  • Machine learning
  • Statistics
  • Visualization
  • Parallel computing

64
Traditional Analysis Tools
  • Predefined set of reports
  • Capture most important known business questions
  • Generated at fixed points in time
  • No support for impromptu queries
  • SQL Write your own SQL query.
  • Statistics department within a company
  • Slow reaction to users needs
  • Limited understanding of the business questions
  • Problems communicating the findings

65
Example Application Sports
  • IBM Advanced Scout analyzesNBA game statistics
  • Shots blocked
  • Assists
  • Fouls
  • http//www.research.ibm.com/scout/home.html

66
Advanced Scout
  • Example pattern An analysis of thedata from a
    game played betweenthe New York Knicks and the
    CharlotteHornets revealed that When Glenn Rice
    played the shooting guard position, he shot 5/6
    (83) on jump shots."
  • Pattern is interestingThe average shooting
    percentage for the Charlotte Hornets during that
    game was 54.

67
Example Application Sky Survey
  • Input data 3 TB of image data with 2 billion sky
    objects, took more than six years to complete
  • Goal Generate a catalog with all objects and
    their type
  • Method Use decision trees as data mining model
  • Results
  • 94 accuracy in predicting sky object classes
  • Increased number of faint objects classified by
    300
  • Helped team of astronomers to discover 16 new
    high red-shift quasars in one order of magnitude
    less observation time

68
Gold Nuggets?
  • Investment firm mailing list Discovered that old
    people do not respond to IRA mailings
  • Bank clustered their customers. One cluster
    Older customers, no mortgage, less likely to have
    a credit card
  • Bank of 1911
  • Customer churn example

69
Market Basket Analysis
  • Consider shopping cart filled with several items
  • Market basket analysis tries to answer the
    following questions
  • Who makes purchases?
  • What do customers buy together?
  • In what order do customers purchase items?

70
Market Basket Analysis
  • Given
  • A database of customer transactions
  • Each transaction is a set of items
  • ExampleTransaction with TID 111 contains items
    Pen, Ink, Milk, Juice

71
Market Basket Analysis (Contd.)
  • Coocurrences
  • 80 of all customers purchase items X, Y and Z
    together.
  • Association rules
  • 60 of all customers who purchase X and Y also
    buy Z.
  • Sequential patterns
  • 60 of customers who first buy X also purchase Y
    within three weeks.

72
Confidence and Support
  • We prune the set of all possible association
    rules using two interestingness measures
  • Confidence of a rule
  • X gt Y has confidence c if P(YX) c
  • Support of a rule
  • X gt Y has support s if P(XY) s
  • We can also define
  • Support of an itemset (a coocurrence) XY
  • XY has support s if P(XY) s

73
Example
  • Examples
  • Pen gt MilkSupport 75Confidence 75
  • Ink gt PenSupport 100Confidence 100

74
Exercise
  • Find all itemsets withsupport gt 75?

75
Exercise
  • Can you find all association rules with support
    gt 50?

76
Extensions
  • Imposing constraints
  • Only find rules involving the dairy department
  • Only find rules involving expensive products
  • Only find expensive rules
  • Only find rules with whiskey on the right hand
    side
  • Only find rules with milk on the left hand side
  • Hierarchies on the items
  • Calendars (every Sunday, every 1st of the month)

77
Market Basket Analysis Applications
  • Sample Applications
  • Direct marketing
  • Fraud detection for medical insurance
  • Floor/shelf planning
  • Web site layout
  • Cross-selling

78
Beyond Support and Confidence
  • Example 5000 students
  • 3000 students play basketball
  • 3750 students eat cereal
  • 2000 students both play basketball and eat cereal

79
Misleading Association Rules
  • Basketball gt Cereal (support 40, confidence
    66.7) is misleading because 75 of students eat
    cereal
  • Basketball gt No cereal (support 20,
    confidence 33.3) is more interesting, although
    with lower support and confidence

80
Interest
  • Interest of rule A gt B P(AB)/(P(A)P(B))
  • Symmetric (uses both P(A) and P(B))
  • Note that confidence is not symmetric(confidence
    of rule A gt B P(AB)/P(A))
  • Interest values
  • Interest 1 A and B are independent
    (P(AB)P(B)P(A))
  • Interest gt 1 A and B are positively correlated
  • Interest lt 1 A and B are negatively correlated

81
Interest Example
82
Extensions
  • Imposing constraints
  • Only find rules involving the dairy department
  • Only find rules involving expensive products
  • Only find expensive rules
  • Only find rules with whiskey on the right hand
    side
  • Only find rules with milk on the left hand side
  • Hierarchies on the items
  • Calendars (every Sunday, every 1st of the month)

83
Applications
  • Sample Applications
  • Direct marketing
  • Fraud detection for medical insurance
  • Floor/shelf planning
  • Web site layout
  • Cross-selling
  • Case study

84
Summary
  • Data preprocessing
  • Quality analysis comes from quality data
  • Multidimensional data analysis
  • Interactive, spreadsheet-style data analysis
  • Data mining
  • The knowledge discovery process
  • Association rules

85
Questions?
  • (In the fourth lecture More data mining.)
Write a Comment
User Comments (0)
About PowerShow.com