# Data Mining - PowerPoint PPT Presentation


## Data Mining


Slides: 16
Provided by: csWr
Transcript and Presenter's Notes


1
8. Evaluation Methods
• Errors and Error Rates
• Precision and Recall
• Similarity
• Cross Validation
• Various Presentations of Evaluation Results
• Statistical Tests

2
How to evaluate/estimate error
• Resubstitution: one data set is used for both training and testing
• Holdout (training and testing): 2/3 of the data for training, 1/3 for testing
• Leave-one-out: used if the data set is small
• Cross validation: 10-fold (why 10?)
• m repetitions of 10-fold CV

3
Error and Error Rate
• Mean and Median
• mean: $\bar{x} = \frac{1}{n}\sum_i x_i$
• weighted mean: $\left(\sum_i w_i x_i\right) / \sum_i w_i$
• median (of the sorted values): $x_{(n+1)/2}$ if $n$ is odd, else $(x_{n/2} + x_{(n/2)+1})/2$
• Error: disagreement between $y$ and $\hat{y}$ (predicted)
• 1 if they disagree, 0 otherwise (0-1 loss $l_{01}$)
• Other definitions depend on the output of the predictor, such as quadratic loss $l_2$ or absolute loss
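
As a minimal sketch of the definitions on this slide, the statistics and the per-instance losses can be written in plain Python. The function names are illustrative choices, not taken from the slides, and the quadratic and absolute losses assume a numeric predictor output.

```python
def mean(xs):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)


def median(xs):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # x_{(n+1)/2} in 1-based indexing
    return (s[mid - 1] + s[mid]) / 2   # (x_{n/2} + x_{n/2+1}) / 2


def zero_one_loss(y, y_hat):
    """1 if the prediction disagrees with the true value, 0 otherwise."""
    return 0 if y == y_hat else 1


def quadratic_loss(y, y_hat):
    """Squared difference, for numeric outputs."""
    return (y - y_hat) ** 2


def absolute_loss(y, y_hat):
    """Absolute difference, for numeric outputs."""
    return abs(y - y_hat)


print(median([3, 1, 4, 1, 5]))      # sorted: [1, 1, 3, 4, 5], middle value 3
print(median([1, 2, 3, 4]))         # (2 + 3) / 2 = 2.5
print(zero_one_loss("yes", "no"))   # 1
```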

4
• Error estimation
• Error rate: $e = \text{Errors}/N$, where $N$ is the total number of instances
• Accuracy: $A = 1 - e$
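
The error rate and accuracy above can be sketched as follows, counting errors with the 0-1 loss. The function names and the toy labels are made up for illustration.

```python
def error_rate(y_true, y_pred):
    """Fraction of instances where the prediction disagrees with the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    errors = sum(1 for y, y_hat in zip(y_true, y_pred) if y != y_hat)
    return errors / len(y_true)


def accuracy(y_true, y_pred):
    """Accuracy is the complement of the error rate: A = 1 - e."""
    return 1 - error_rate(y_true, y_pred)


# Toy example: 2 disagreements out of 5 instances.
y_true = ["a", "b", "a", "a", "b"]
y_pred = ["a", "a", "a", "b", "b"]
print(error_rate(y_true, y_pred))
print(accuracy(y_true, y_pred))
```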

5
Precision and Recall

| Observed \ Predicted | Positive | Negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |

• False negative and false positive
• Types of errors for $k$ classes: $k^2 - k$
• $k = 3$: $3^2 - 3 = 6$; $k = 2$: $2^2 - 2 = 2$
• Precision (wrt the retrieved)
• $P = TP/(TP + FP)$
• Recall (wrt the total relevant)
• $R = TP/(TP + FN)$
• Precision Recall (PR) and PR gain
• PR gain $= (PR - PR_0)/PR_0$
• Accuracy
• $A = (TP + TN)/(TP + TN + FP + FN)$
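
A small sketch of how the confusion-matrix counts lead to precision, recall and accuracy. The helper names and the toy labels are illustrative assumptions, and the sketch does not guard against empty denominators (for example, no predicted positives).

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN, treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for y, y_hat in zip(y_true, y_pred):
        if y_hat == positive and y == positive:
            tp += 1
        elif y_hat == positive:
            fp += 1   # predicted positive, observed negative
        elif y == positive:
            fn += 1   # predicted negative, observed positive
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + tn + fp + fn)


def num_error_types(k):
    """Number of off-diagonal cells in a k-by-k confusion matrix."""
    return k * k - k


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive=1)
print(tp, fp, fn, tn)                      # 3 1 1 3
print(precision(tp, fp), recall(tp, fn))   # 0.75 0.75
print(num_error_types(3))                  # 6
```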

6
Similarity or Dissimilarity Measures
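• Distance (dissimilarity) measures (Triangle Inequality)
• Euclidean
• City-block, or Manhattan
• Cosine: $\cos(p_i, p_j) = \sum_k p_{ik} p_{jk} \,/\, \sqrt{\sum_k p_{ik}^2 \sum_k p_{jk}^2}$
• Inter-clusters and intra-clusters
• $D_{min} = \min \lVert p_i - p_j \rVert$, over pairs of data points
• $D_{max} = \max \lVert p_i - p_j \rVert$
• Centroid methods
• $D_{avg} = \frac{1}{n_i n_j} \sum \sum \lVert p_i - p_j \rVert$
• $D_{mean} = \lVert m_i - m_j \rVert$, between the two means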
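
The measures listed above can be sketched in plain Python. This assumes points are tuples of numbers, that clusters are lists of such points, and that each pair in the inter-cluster measures takes one point from each cluster; the function names are illustrative.

```python
import math
from itertools import product


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine(p, q):
    """Cosine of the angle between p and q (a similarity, not a distance)."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def d_min(ci, cj, dist=euclidean):
    return min(dist(p, q) for p, q in product(ci, cj))


def d_max(ci, cj, dist=euclidean):
    return max(dist(p, q) for p, q in product(ci, cj))


def d_avg(ci, cj, dist=euclidean):
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))


def d_mean(ci, cj, dist=euclidean):
    return dist(centroid(ci), centroid(cj))


ci = [(0, 0), (0, 2)]
cj = [(3, 0), (3, 2)]
print(d_min(ci, cj))    # closest pair is 3 apart
print(d_mean(ci, cj))   # centroids (0, 1) and (3, 1) are 3 apart
```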

7
k-Fold Cross Validation
• Cross validation
• 1 fold for testing, the rest for training
• rotate until every fold has been used for testing
• calculate the average
• m k-fold cross validation
• reshuffle the data, repeat XV m times
• what is a suitable k?
• Model complexity
• use of XV
• tree complexity, training/testing error rates

8
Presentations of Evaluation Results
Results are usually about time, space, trend, and average case
• Box-plot
• Whiskers (min, max)
• Box: confidence interval
• Graphical equivalent of a t-test
• Learning (happy) curves
• Accuracy increases over X
• Its opposite (or error) decreases over X

9
Statistical Tests
• Null hypothesis and alternative hypothesis
• Type I and Type II errors
• Student's t test: comparing two means
• Paired t test: comparing two means
• Chi-Square test
• Contingency table

10
Null Hypothesis
• Null hypothesis ($H_0$)
• No difference between the test statistic and the actual value of the population parameter
• E.g., $H_0: \mu = \mu_0$
• Alternative hypothesis ($H_1$)
• It specifies the parameter value(s) to be accepted if $H_0$ is rejected.
• E.g., $H_1: \mu \neq \mu_0$ (two-tailed test)
• Or $H_1: \mu > \mu_0$ (one-tailed test)

11
Type I, II errors
• Type I errors ($\alpha$)
• Rejecting a null hypothesis when it is true (FP)
• Type II errors ($\beta$)
• Accepting a null hypothesis when it is false (FN)
• Power: $1 - \beta$
• Costs of different errors
• A life-saving medicine that is cheap and has no side effects appears to be effective ($H_0$: non-effective)
• Type I error: it is judged effective when it is not; not costly
• Type II error: it is judged non-effective when it is effective; very costly

12
Test using Student's t Distribution
• Using the t distribution to test the difference between two population means is appropriate if
• The population standard deviations are not known
• The samples are small ($n < 30$)
• The populations are assumed to be approx. normal
• The two unknown standard deviations are assumed equal: $\sigma_1 = \sigma_2$
• $H_0: (\mu_1 - \mu_2) = 0$, $H_1: (\mu_1 - \mu_2) \neq 0$
• Check the difference of the estimated means, normalized by the pooled standard error of that difference
• degrees of freedom (df) and significance level p
• $df = n_1 + n_2 - 2$
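
A sketch of the two-sample t statistic under the equal-variance assumption, using only the standard library. The pooled-variance formula is the standard textbook form rather than something spelled out on the slide, and the function name and sample values are made up. The resulting t is then compared with the critical value from a t table for the given df and significance level.

```python
import math
from statistics import mean, variance


def pooled_t(sample1, sample2):
    """Return (t, df) for the difference of two means, assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1), variance(sample2)   # sample variances (n - 1 divisor)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(sample1) - mean(sample2)) / std_err
    return t, df


a = [1, 2, 3, 4, 5]
b = [2, 3, 4, 5, 6]
t, df = pooled_t(a, b)
print(t, df)   # means 3 and 4, pooled variance 2.5, standard error 1, so t is about -1 with df = 8
```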

13
Paired t test
• With paired observations, use the paired t test
• Now $H_0: \mu_d = 0$ and $H_1: \mu_d \neq 0$
• Check the estimated mean of the differences
• The t values in the previous and current cases are calculated differently.
• Both are two-tailed tests; p = 1% means 0.5% on each side
• Excel can do that for you!
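
For comparison with the two-sample case, here is a sketch of the paired t statistic: the test is run on the per-pair differences, with n - 1 degrees of freedom. This is the standard formula rather than one given on the slide, and the function name and measurements are invented for illustration.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t, df) for paired observations, testing whether the mean difference is 0."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    std_err = stdev(diffs) / math.sqrt(n)   # standard error of the mean difference
    return mean(diffs) / std_err, n - 1


before = [10, 12, 9, 11]
after = [12, 14, 10, 14]
t, df = paired_t(before, after)
print(t, df)   # differences are [2, 2, 1, 3]; t is about 4.9 with df = 3
```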

[Figure: rejection region]
14
Chi-Square Test (the goodness-of-fit)
• Testing a null hypothesis that the population distribution for a random variable follows a specified form.
• The chi-square statistic is calculated
• degrees of freedom: $df = k - m - 1$
• $k$: number of data categories
• $m$: number of parameters estimated
• $m = 0$ for uniform, 1 for Poisson, 2 for normal
• The expected frequency in each cell should be at least 5
• One-tailed test
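
|     | C1  | C2  | Σ  |
|-----|-----|-----|----|
| I-1 | A11 | A12 | R1 |
| I-2 | A21 | A22 | R2 |
| Σ   | C1  | C2  | N  |

$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} (A_{ij} - E_{ij})^2 / E_{ij}$

[Figure: rejection region]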
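
A sketch of the chi-square statistic for a contingency table with row totals R_i, column totals C_j and grand total N. Two pieces are assumptions not stated on the slide: the expected counts are taken as E_ij = R_i * C_j / N, and the degrees of freedom as (rows - 1) * (columns - 1), both the standard choices for a test of independence. The function name and the counts are made up.

```python
def chi_square_contingency(table):
    """Return (statistic, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # E_ij = R_i * C_j / N
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


table = [[10, 20],
         [20, 10]]
stat, df = chi_square_contingency(table)
print(stat, df)   # every expected count is 15, so the statistic is 4 * 25 / 15, about 6.67, with df = 1
```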