Data Mining: An Overview

David Madigan
dmadigan_at_rci.rutgers.edu
http://stat.rutgers.edu/madigan

Overview

- Brief Introduction to Data Mining
- Data Mining Algorithms
- Specific Examples
- Algorithms: Disease Clusters
- Algorithms: Model-Based Clustering
- Algorithms: Frequent Items and Association Rules
- Future Directions, etc.

Of Laws, Monsters, and Giants

- Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)
- Its more aggressive cousin: disk storage capacity doubles every 9 months

What is Data Mining?

- Finding interesting structure in data
- Structure refers to statistical patterns, predictive models, hidden relationships
- Examples of tasks addressed by Data Mining:
- Predictive Modeling (classification, regression)
- Segmentation (Data Clustering)
- Summarization
- Visualization


Ronny Kohavi, ICML 1998


Stories: Online Retailing

Chapter 4: Data Analysis and Uncertainty

- Elementary statistical concepts: random variables, distributions, densities, independence, point and interval estimation, bias, variance, MLE
- Model (global, represents prominent structures) vs. Pattern (local, idiosyncratic deviations)
- Frequentist vs. Bayesian
- Sampling methods

Bayesian Estimation

e.g., the beta-binomial model: prior θ ~ Beta(α, β); x successes in n Bernoulli trials; posterior θ | x ~ Beta(α + x, β + n − x)

Predictive distribution: p(x_new | x) = ∫ p(x_new | θ) p(θ | x) dθ

Issues to do with p-values

- Using thresholds of 0.05 or 0.01 regardless of sample size
- Multiple testing (e.g., Friedman (1983): selecting highly significant regressors from noise)
- Subtle interpretation. Jeffreys (1980): "I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened."

p-value as measure of evidence

Schervish (1996): if hypothesis H implies hypothesis H', then there should be at least as much support for H' as for H. This is not satisfied by p-values.

Grimmet and Ridenhour (1996): one might expect an outlying data point to lend support to the alternative hypothesis in, for instance, a one-way analysis of variance. Yet the value of the outlying data point that minimizes the significance level can lie within the range of the data.

Chapter 5 Data Mining Algorithms

A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns.

"Well-defined": can be encoded in software. "Algorithm": must terminate after some finite number of steps.

Data Mining Algorithms

A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns. (Hand, Mannila, and Smyth)

"Well-defined": can be encoded in software. "Algorithm": must terminate after some finite number of steps.

Algorithm Components

1. The task the algorithm is used to address (e.g., classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to the data (e.g., a linear regression model)
3. The score function used to judge the quality of the fitted models or patterns (e.g., accuracy, BIC, etc.)
4. The search or optimization method used to search over parameters and/or structures (e.g., steepest descent, MCMC, etc.)
5. The data management technique used for storing, indexing, and retrieving data (critical when data are too large to reside in memory)


Backpropagation data mining algorithm

[Figure: feed-forward network with 4 inputs x1, x2, x3, x4, 2 hidden units h1, h2, and 1 output y]

- vector of p input values multiplied by p × d1 weight matrix
- resulting d1 values individually transformed by a non-linear function
- resulting d1 values multiplied by d1 × d2 weight matrix


Backpropagation (cont.)

Parameters

Score

Search: steepest descent. Search for structure?
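As a concrete sketch, here is the 4-2-1 network above with logistic hidden units, a squared-error score, and a steepest-descent search over the two weight matrices. All numerical values (initial weights, learning rate, the training pair) are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# 4-2-1 network: p = 4 inputs, d1 = 2 hidden units, one linear output.
W1 = rng.normal(scale=0.5, size=(4, 2))  # p x d1 weight matrix
W2 = rng.normal(scale=0.5, size=(2, 1))  # d1 x d2 weight matrix

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(x @ W1)        # d1 values, individually transformed
    return h, (h @ W2).item()  # multiplied by the d1 x d2 matrix

def backprop_step(x, y, lr=0.1):
    """One steepest-descent step on squared error for a single (x, y) pair."""
    global W1, W2
    h, y_hat = forward(x)
    err = y_hat - y                      # d(score)/d(y_hat)
    dh = err * W2[:, 0] * h * (1 - h)    # chain rule through the sigmoid
    W2 -= lr * err * h[:, None]          # gradient step for W2
    W1 -= lr * np.outer(x, dh)           # gradient step for W1
    return 0.5 * err ** 2

x = np.array([1.0, 0.0, -1.0, 0.5])
losses = [backprop_step(x, y=1.0) for _ in range(50)]
print(losses[0] > losses[-1])  # steepest descent drives the score down: True
```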

Models and Patterns


Models

Probability Distributions

Structured Data

Prediction

- Linear regression
- Piecewise linear
- Nonparametric regression
- Classification

logistic regression, naïve Bayes/TAN/Bayesian networks, NN, support vector machines, trees, etc.

Models

Probability Distributions

Structured Data

Prediction

- Linear regression
- Piecewise linear
- Nonparametric regression
- Classification

- Parametric models
- Mixtures of parametric models
- Graphical Markov models (categorical, continuous, mixed)

Models

Probability Distributions

Structured Data

Prediction

- Time series
- Markov models
- Mixture Transition Distribution models
- Hidden Markov models
- Spatial models

- Linear regression
- Piecewise linear
- Nonparametric regression
- Classification

- Parametric models
- Mixtures of parametric models
- Graphical Markov models (categorical, continuous, mixed)

Markov Models

First-order: p(yt | y1, …, yt−1) = p(yt | yt−1)

e.g., yt = g(yt−1) + et; with g linear this is the standard first-order auto-regressive model

[Figure: chain y1 → y2 → y3 → … → yT]
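With g linear and Gaussian noise, the first-order model can be simulated in a few lines; the coefficient and noise scale below are arbitrary illustration values:

```python
import numpy as np

def simulate_ar1(a, s, T, y0=0.0, seed=0):
    """Simulate y_t = a * y_{t-1} + e_t with e_t ~ N(0, s^2)."""
    rng = np.random.default_rng(seed)
    y = np.empty(T)
    y[0] = y0
    for t in range(1, T):
        y[t] = a * y[t - 1] + rng.normal(scale=s)
    return y

y = simulate_ar1(a=0.8, s=1.0, T=5000)
# For |a| < 1 the stationary variance is s^2 / (1 - a^2) = 1 / 0.36 ≈ 2.78,
# and the sample variance of a long simulated path should be close to it.
print(y.var())
```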

First-Order HMM/Kalman Filter

[Figure: hidden states x1 → x2 → x3 → … → xT, each emitting an observation y1, y2, y3, …, yT]

Note: to compute p(y1, …, yT) we need to sum/integrate over all possible state sequences...
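That sum over state sequences is exactly what the forward recursion computes, in O(T·K²) time rather than K^T. A sketch for a discrete two-state HMM (all probabilities below are made-up illustration values):

```python
import numpy as np

def hmm_likelihood(pi, A, B, obs):
    """p(y_1, ..., y_T) for a discrete HMM via the forward recursion.
    pi: initial state probabilities (K,), A: K x K transition matrix,
    B: K x M emission matrix, obs: list of observed symbol indices."""
    alpha = pi * B[:, obs[0]]          # alpha_1(k) = pi_k * p(y_1 | x_1 = k)
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]  # propagate, then weight by emission
    return alpha.sum()                 # sum over the final hidden state

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Same number as brute-force summation over all 2^3 hidden state sequences.
print(hmm_likelihood(pi, A, B, obs=[0, 1, 0]))
```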

Bias-Variance Tradeoff

High bias / low variance vs. low bias / high variance (overfitting: modeling the random component)

The score function should embody the compromise

The Curse of Dimensionality

X ~ MVNp(0, I)

- Gaussian kernel density estimation
- Bandwidth chosen to minimize MSE at the mean
- Suppose we want the relative MSE to be small; the required number of data points grows explosively with dimension:

Dimension    Data points
1            4
2            19
3            67
6            2,790
10           842,000

Patterns

Local:

- Outlier detection
- Changepoint detection
- Bump hunting
- Scan statistics
- Association rules

Global:

- Clustering via partitioning
- Hierarchical Clustering
- Mixture Models

Scan Statistics via Permutation Tests

[Figure: a curving road marked with x's at accident locations]

The curve represents a road. Each x marks an accident. A red x denotes an injury accident; a black x means no injury. Is there a stretch of road where there is an unusually large fraction of injury accidents?

Scan with Fixed Window

- If we know the length of the stretch of road that we seek, we could slide this window along the road and find the most unusual window location

[Figure: the same road of x's with a fixed-length window slid along it]

How Unusual is a Window?

- Let p_in and p_out denote the true probability of being red inside and outside the window, respectively. Let (x_in, n_in) and (x_out, n_out) denote the corresponding counts
- Use the GLRT for comparing H0: p_in = p_out versus H1: p_in ≠ p_out
- λ measures how unusual a window is
- −2 log λ here has an asymptotic chi-square distribution with 1 df
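The statistic can be computed directly from the four counts (the window counts below are invented for illustration):

```python
from math import log

def _bernoulli_ll(x, n, p):
    """Bernoulli log-likelihood, treating 0 * log(0) as 0."""
    ll = 0.0
    if x > 0:
        ll += x * log(p)
    if n - x > 0:
        ll += (n - x) * log(1 - p)
    return ll

def neg2_log_lambda(x_in, n_in, x_out, n_out):
    """-2 log lambda for H0: p_in = p_out vs H1: p_in != p_out."""
    p0 = (x_in + x_out) / (n_in + n_out)          # pooled MLE under H0
    ll0 = _bernoulli_ll(x_in, n_in, p0) + _bernoulli_ll(x_out, n_out, p0)
    ll1 = (_bernoulli_ll(x_in, n_in, x_in / n_in)
           + _bernoulli_ll(x_out, n_out, x_out / n_out))
    return -2.0 * (ll0 - ll1)  # asymptotically chi-square with 1 df

# Window with 8 injury accidents out of 10, versus 6 of 26 outside it.
print(round(neg2_log_lambda(8, 10, 6, 26), 2))  # 10.02
```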

Permutation Test

- Since we look at the smallest λ over all window locations, we need the distribution of the smallest λ under the null hypothesis that there are no clusters
- Look at the distribution of the smallest λ over, say, 999 random relabellings of the colors of the x's

smallest-λ for four example relabellings:

xx x xxx x xx x xx x   0.376
xx x xxx x xx x xx x   0.233
xx x xxx x xx x xx x   0.412
xx x xxx x xx x xx x   0.222

- Look at the position of the observed smallest λ in this distribution to get the scan statistic p-value (e.g., if the observed smallest λ is 5th smallest, the p-value is 0.005)
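A self-contained sketch of the whole procedure: scan every fixed-width window with a GLRT, then rank the observed best window among random relabellings. The road layout, window width, and permutation count below are illustrative choices:

```python
import random
from math import log

def _ll(x, n, p):
    """Bernoulli log-likelihood, treating 0 * log(0) as 0."""
    ll = 0.0
    if x > 0:
        ll += x * log(p)
    if n - x > 0:
        ll += (n - x) * log(1 - p)
    return ll

def glrt(x_in, n_in, x_out, n_out):
    """-2 log lambda for equal vs unequal injury rates in/out of a window."""
    p0 = (x_in + x_out) / (n_in + n_out)
    return -2.0 * (_ll(x_in, n_in, p0) + _ll(x_out, n_out, p0)
                   - _ll(x_in, n_in, x_in / n_in)
                   - _ll(x_out, n_out, x_out / n_out))

def best_window(labels, w):
    """Most unusual width-w window (largest GLRT; 1 = injury accident)."""
    total, n = sum(labels), len(labels)
    return max(glrt(sum(labels[i:i + w]), w, total - sum(labels[i:i + w]), n - w)
               for i in range(n - w + 1))

def scan_p_value(labels, w, n_perm=999, seed=1):
    rng = random.Random(seed)
    observed = best_window(labels, w)
    perm, hits = list(labels), 0
    for _ in range(n_perm):
        rng.shuffle(perm)                    # random relabelling of the x's
        if best_window(perm, w) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)         # rank-based p-value

# A run of injury accidents (1s) clustered mid-road should look unusual.
road = [0] * 15 + [1] * 6 + [0] * 15
print(scan_p_value(road, w=6, n_perm=199))
```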

Variable Length Window

- No need to use a fixed-length window. Examine all possible windows up to, say, half the length of the entire road

[Figure legend: O fatal accident; O non-fatal accident]

Spatial Scan Statistics

- The spatial scan statistic uses, e.g., circles instead of line segments


Spatial-Temporal Scan Statistics

- Spatial-temporal scan statistics use cylinders, where the height of the cylinder represents a time window

Other Issues

- Poisson model also common (instead of the Bernoulli model)
- Covariate adjustment
- Andrew Moore's group at CMU: efficient algorithms for scan statistics

Software: SaTScan and others

http://www.satscan.org
http://www.phrl.org
http://www.terraseer.com

Association Rules Support and Confidence

[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]

- Find all the rules Y ⇒ Z with minimum confidence and support
- support, s: probability that a transaction contains Y ∪ Z
- confidence, c: conditional probability that a transaction containing Y also contains Z
- Let minimum support = 50% and minimum confidence = 50%; we have
- A ⇒ C (50%, 66.6%)
- C ⇒ A (50%, 100%)

Mining Association Rules: An Example

Min. support 50%; min. confidence 50%

- For rule A ⇒ C:
- support = support(A ∪ C) = 50%
- confidence = support(A ∪ C) / support(A) = 66.6%
- The Apriori principle: any subset of a frequent itemset must be frequent

Mining Frequent Itemsets the Key Step

- Find the frequent itemsets: the sets of items that have minimum support
- A subset of a frequent itemset must also be a frequent itemset
- i.e., if AB is a frequent itemset, both A and B must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules.
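Support and confidence are easy to compute directly. The four-transaction database below is chosen so the numbers match the A ⇒ C (50%, 66.6%) and C ⇒ A (50%, 100%) example above; the exact transaction table is an assumption reconstructed from those numbers:

```python
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with lhs also has rhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))                 # 0.5   -> 50% support
print(round(confidence({"A"}, {"C"}), 3))  # 0.667 -> 66.6% for A => C
print(confidence({"C"}, {"A"}))            # 1.0   -> 100%  for C => A
```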

The Apriori Algorithm

- Join Step: Ck is generated by joining Lk−1 with itself
- Prune Step: Any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code:

  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k
  L1 = {frequent items}
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk
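The pseudo-code translates almost line for line into Python. This sketch uses a tiny invented database and a 50% minimum support:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent itemsets via the Apriori join / prune / scan loop."""
    n = len(transactions)

    def freq(itemset):
        return sum(itemset <= t for t in transactions) / n >= min_support

    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items if freq(frozenset([i]))]  # L1
    frequent = list(current)
    k = 1
    while current:
        # Join step: unions of L_k members that have size k + 1.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        lk = set(current)
        candidates = [c for c in candidates
                      if all(frozenset(s) in lk for s in combinations(c, k))]
        # Scan D: keep candidates meeting min_support.
        current = [c for c in candidates if freq(c)]
        frequent.extend(current)
        k += 1
    return frequent

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
for itemset in apriori(db, min_support=0.5):
    print(sorted(itemset))  # ['A'], ['B'], ['C'], then ['A', 'C']
```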

The Apriori Algorithm Example

[Figure: Apriori trace on a small database D: scan D to count C1 and keep L1; join to form C2, scan D, keep L2; join to form C3, scan D, keep L3]

Association Rule Mining A Road Map

- Boolean vs. quantitative associations (based on the types of values handled)
- buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
- age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
- Single-dimension vs. multi-dimensional associations (see examples above)
- Single-level vs. multiple-level analysis
- What brands of beers are associated with what brands of diapers?
- Various extensions (thousands!)


Model-based Clustering

Padhraic Smyth, UCI


Mixtures of Sequences, Curves, ...

Generative model:
- select a component ck for individual i
- generate data according to p(Di | ck)
- p(Di | ck) can be very general, e.g., sets of sequences, spatial patterns, etc.

Note: given p(Di | ck), we can define an EM algorithm
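The generative story can be written down directly: choose a component, then run its finite-state model until the end state. The two "browsing style" components, their transition probabilities, and the fixed "home" start state below are invented for illustration:

```python
import random

weights = [0.7, 0.3]  # mixture weights over components c_1, c_2
components = [        # each component: a first-order chain with an end state
    {"home": {"search": 0.8, "end": 0.2},
     "search": {"search": 0.5, "end": 0.5}},
    {"home": {"buy": 0.6, "end": 0.4},
     "buy": {"end": 1.0}},
]

def sample_individual(rng):
    """Select a component c_k for individual i, then generate D_i from it."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    trans, state, seq = components[k], "home", ["home"]
    while True:
        nxt = rng.choices(list(trans[state]), list(trans[state].values()))[0]
        if nxt == "end":
            return k, seq
        seq.append(nxt)
        state = nxt

rng = random.Random(0)
for _ in range(3):
    print(sample_individual(rng))
```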

Example Mixtures of SFSMs

- Simple model for traversal on a Web site
- (equivalent to first-order Markov with an end state)
- Generative model for large sets of Web users
- different behaviors <=> mixture of SFSMs
- EM algorithm is quite simple: weighted counts

WebCanvas: Cadez, Heckerman, et al., KDD 2000


Discussion

- What is data mining? Hard to pin down; who cares?
- Textbook statistical ideas with a new focus on algorithms
- Lots of new ideas too

Privacy and Data Mining

Ronny Kohavi, ICML 1998

Analyzing Hospital Discharge Data

David Madigan Rutgers University

Comparing Outcomes Across Providers

- Florence Nightingale wrote in 1863:

"In attempting to arrive at the truth, I have applied everywhere for information, but in scarcely an instance have I been able to obtain hospital records fit for any purposes of comparison... I am fain to sum up with an urgent appeal for adopting some uniform system of publishing the statistical records of hospitals."

Data

- Data of various kinds are now available, e.g., data concerning all Medicare/Medicaid hospital admissions in standard format; UB-92 covers >95% of all admissions nationally
- Considerable interest in using these data to compare providers (hospitals, physician groups, physicians, etc.)
- In Pennsylvania, large corporations such as Westinghouse and Hershey Foods are a motivating force and use the data to select providers.

Fields include:

SYSID, YEAR, QUARTER, PAF, HREGION, MAID, PTSEX, ETHNIC, RACE, PSEUDOID, AGE, AGECAT, PRIVZIP, MKTSHARE, COUNTY, STATE, ADTYPE, ADSOURCE, ADHOUR, ADMDX, ADDOW, DCSTATUS, LOS, DCHOUR, DCDOW, ECODE, PDX, SDX1-SDX8, PPX, SPX1-SPX5, PPXDOW, SPX1DOW-SPX5DOW, REFID, ATTID, OPERID, PAYTYPE1-PAYTYPE3, ESTPAYER, NAIC, OCCUR1, OCCUR2, BILLTYPE, DRGHOSP, PCMU, DRGHC4, CANCER1, CANCER2, MDCHC4, MQSEV, MQNRSP, PROFCHG, TOTALCHG, NONCVCHG, ROOMCHG, ANCLRCHG, DRUGCHG, EQUIPCHG, SPECLCHG, MISCCHG, APRMDC, APRDRG, APRSOI, APRROM, MQGCLUST, MQGCELL

Pennsylvania Health Care Cost Containment Council, 2000-1, n ≈ 800,000

Risk Adjustment

- Discharge data like these allow for comparisons of, e.g., mortality rates for the CABG procedure across hospitals.
- Some hospitals accept riskier patients than others; a fair comparison must account for such differences.
- PHC4 (and many other organizations) use indirect standardization
- http://www.phc4.org


Hospital Responses


p-value computation

- n = 463; suppose the actual number of deaths is 40
- e = 29.56
- p-value

p-value < 0.05
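One common way to get such a p-value is a Poisson tail probability with the expected count as the mean; the slide does not state the exact model PHC4 uses, so treat this as an assumed approximation:

```python
from math import exp

def poisson_sf(x, lam):
    """P(X >= x) for X ~ Poisson(lam), via the pmf recurrence."""
    term = exp(-lam)  # P(X = 0)
    cdf = 0.0
    for k in range(x):
        cdf += term
        term *= lam / (k + 1)
    return 1.0 - cdf

# n = 463, observed deaths = 40, expected e = 29.56
p = poisson_sf(40, 29.56)
print(p < 0.05)  # consistent with the slide's "p-value < 0.05": True
```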

Concerns

- Ad-hoc groupings of strata
- Adequate risk adjustment for outcomes other than mortality? Sensitivity analysis? Hopeless?
- Statistical testing versus estimation
- Simpson's paradox:

Hospital A:

Risk Cat.   N     Rate   Actual   Expected
Low         800   1%     8        8 (1%)
High        200   8%     16       10 (5%)

SMR = 24/18 = 1.33; p-value = 0.07

Hospital B:

Risk Cat.   N     Rate   Actual   Expected
Low         200   1%     2        2 (1%)
High        800   8%     64       40 (5%)

SMR = 66/42 = 1.57; p-value = 0.0002
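The SMRs above are just actual over expected deaths, with expected counts built from reference rates (taken here as 1% for low-risk and 5% for high-risk patients, as in the parenthesized figures):

```python
strata = {  # hospital: [(N, actual deaths) for low-risk, high-risk strata]
    "A": [(800, 8), (200, 16)],
    "B": [(200, 2), (800, 64)],
}
reference_rates = [0.01, 0.05]  # expected death rate per risk stratum

for hospital, rows in strata.items():
    actual = sum(deaths for _, deaths in rows)
    expected = sum(n * r for (n, _), r in zip(rows, reference_rates))
    print(hospital, actual, round(expected, 1), round(actual / expected, 2))
# A 24 18.0 1.33
# B 66 42.0 1.57
```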

Hierarchical Model

- Patients → physicians → hospitals
- Build a model using data at each level and estimate quantities of interest

Bayesian Hierarchical Model

MCMC via WinBUGS

Goldstein and Spiegelhalter, 1996

Discussion

- Markov chain Monte Carlo and compute power enable hierarchical modeling
- Software is a significant barrier to the widespread application of better methodology
- Are these data useful for the study of disease?