Data Mining - PowerPoint PPT Presentation

Slides: 16
Provided by: csWr
Learn more at: http://www.cs.wright.edu
Transcript and Presenter's Notes

Title: Data Mining


1
8. Evaluation Methods
  • Errors and Error Rates
  • Precision and Recall
  • Similarity
  • Cross Validation
  • Various Presentations of Evaluation Results
  • Statistical Tests

2
How to evaluate/estimate error
  • Resubstitution
    • one data set is used for both training and testing
  • Holdout (training and testing)
    • 2/3 for training, 1/3 for testing
  • Leave-one-out
    • used if the data set is small
  • Cross validation
    • 10-fold, why 10?
    • m × 10-fold CV
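
The holdout and leave-one-out schemes listed above can be sketched in a few lines of Python. This is only an illustration: the function names, the fixed seed and the shuffling step are assumptions, not part of the slides.

```python
import random


def holdout_split(data, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the data, then split it into training and test parts."""
    items = list(data)
    random.Random(seed).shuffle(items)  # seeded only to make the sketch repeatable
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


def leave_one_out(data):
    """Yield (training, test) pairs in which each test set is a single instance."""
    items = list(data)
    for i in range(len(items)):
        yield items[:i] + items[i + 1:], [items[i]]


train, test = holdout_split(range(9))
print(len(train), len(test))
```

Resubstitution needs no split at all: the same list is passed as both the training and the test set.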

3
Error and Error Rate
  • Mean and Median
    • mean: $\bar{x} = \frac{1}{n} \sum_i x_i$
    • weighted mean: $\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}$
    • median: $x_{(n+1)/2}$ if $n$ is odd, else $(x_{n/2} + x_{n/2+1})/2$
  • Error: disagreement between $y$ and $\hat{y}$ (predicted)
    • 1 if they disagree, 0 otherwise (0-1 loss $l_{01}$)
    • Other definitions depending on the output of a predictor, such as quadratic loss $l_2$ and absolute loss
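
As a rough sketch, the summary statistics and loss functions on this slide could be written as follows. The function names are chosen here for illustration, and the median assumes the values are sorted first.

```python
def mean(xs):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    """Weighted mean: sum of w_i * x_i divided by sum of w_i."""
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)


def median(xs):
    """Middle value of the sorted data, or the average of the two middle values."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


def zero_one_loss(y, y_pred):
    """1 if the prediction disagrees with the true value, 0 otherwise."""
    return 0 if y == y_pred else 1


def quadratic_loss(y, y_pred):
    """Squared difference between true and predicted value."""
    return (y - y_pred) ** 2


def absolute_loss(y, y_pred):
    """Absolute difference between true and predicted value."""
    return abs(y - y_pred)
```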

4
  • Error estimation
    • Error rate $e = \text{Errors}/N$, where $N$ is the total number of instances
    • Accuracy $A = 1 - e$
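
A small sketch of the error rate and accuracy under 0-1 loss. The function names and the toy labels are illustrative assumptions.

```python
def error_rate(y_true, y_pred):
    """Fraction of instances whose prediction disagrees with the true label."""
    errors = sum(1 for y, p in zip(y_true, y_pred) if y != p)
    return errors / len(y_true)


def accuracy(y_true, y_pred):
    """Accuracy is the complement of the error rate."""
    return 1 - error_rate(y_true, y_pred)


# Toy labels, made up for the example: two of four predictions are wrong.
print(error_rate([1, 0, 1, 1], [1, 1, 1, 0]))
```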

5
Precision and Recall
Observed \ Predicted   +ve   -ve
+ve                    TP    FN
-ve                    FP    TN
  • False negative and false positive
  • Types of errors for k classes: $k^2 - k$
    • k = 3: $3^2 - 3 = 6$; k = 2: $2^2 - 2 = 2$
  • Precision (wrt the retrieved)
    • $P = TP/(TP + FP)$
  • Recall (wrt the total relevant)
    • $R = TP/(TP + FN)$
  • Precision-Recall (PR) and PR gain
    • PR gain $= (PR - PR_0)/PR_0$
  • Accuracy
    • $A = (TP + TN)/(TP + TN + FP + FN)$
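
The confusion-matrix quantities on this slide can be sketched as below. The function names, the choice of 1 as the positive label and the toy data are assumptions for illustration.

```python
def confusion_counts(observed, predicted, positive=1):
    """Count TP, FN, FP and TN for a two-class problem."""
    tp = fn = fp = tn = 0
    for o, p in zip(observed, predicted):
        if o == positive and p == positive:
            tp += 1
        elif o == positive:
            fn += 1  # observed positive, predicted negative
        elif p == positive:
            fp += 1  # observed negative, predicted positive
        else:
            tn += 1
    return tp, fn, fp, tn


def precision(tp, fp):
    """P = TP / (TP + FP): share of the retrieved that are relevant."""
    return tp / (tp + fp)


def recall(tp, fn):
    """R = TP / (TP + FN): share of the relevant that are retrieved."""
    return tp / (tp + fn)


def accuracy_from_counts(tp, fn, fp, tn):
    """A = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def error_types(k):
    """Number of distinct misclassification types for k classes: k^2 - k."""
    return k * k - k


observed = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(observed, predicted)
print(precision(tp, fp), recall(tp, fn), accuracy_from_counts(tp, fn, fp, tn))
```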

6
Similarity or Dissimilarity Measures
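  • Distance (dissimilarity) measures (Triangle Inequality)
    • Euclidean
    • City-block, or Manhattan
  • Cosine: $\cos(p_i, p_j) = \frac{\sum_k p_{ik} p_{jk}}{\sqrt{\sum_k p_{ik}^2 \sum_k p_{jk}^2}}$
  • Inter-clusters and intra-clusters
  • Single linkage vs. complete linkage
    • $D_{min} = \min \lVert p_i - p_j \rVert$, two data points
    • $D_{max} = \max \lVert p_i - p_j \rVert$
  • Centroid methods
    • $D_{avg} = \frac{1}{n_i n_j} \sum \sum \lVert p_i - p_j \rVert$
    • $D_{mean} = \lVert m_i - m_j \rVert$, two means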
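
A minimal sketch of the point-level and cluster-level measures above, assuming points are equal-length tuples of numbers and that the cluster distances use the Euclidean norm. All function names are chosen here for illustration.

```python
import math
from itertools import product


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine(p, q):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))


def d_min(ci, cj, dist=euclidean):
    """Single linkage: smallest distance between a point of ci and a point of cj."""
    return min(dist(p, q) for p, q in product(ci, cj))


def d_max(ci, cj, dist=euclidean):
    """Complete linkage: largest distance between a point of ci and a point of cj."""
    return max(dist(p, q) for p, q in product(ci, cj))


def d_avg(ci, cj, dist=euclidean):
    """Average of all pairwise distances between the two clusters."""
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def d_mean(ci, cj, dist=euclidean):
    """Distance between the two cluster means."""
    return dist(centroid(ci), centroid(cj))
```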

7
k-Fold Cross Validation
  • Cross validation
    • 1 fold for testing, the rest for training
    • rotate until every fold has been used for testing
    • calculate the average
  • m × k-fold cross validation
    • reshuffle the data, repeat XV m times
    • what is a suitable k?
  • Model complexity
    • use of XV
    • tree complexity, training/testing error rates
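
A sketch of m × k-fold cross validation over instance indices. It assumes the common convention in which each fold serves exactly once as the test set while the remaining folds are used for training. The names k_fold_indices, cross_validate and evaluate are hypothetical; evaluate stands for any user-supplied function that trains on the training indices and returns an error rate on the test indices.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(n, k, evaluate, m=1):
    """Average evaluate(train, test) over m reshuffled runs of k-fold CV."""
    errors = []
    for run in range(m):
        # A different seed per run reshuffles the data before each repetition.
        for train, test in k_fold_indices(n, k, seed=run):
            errors.append(evaluate(train, test))
    return sum(errors) / len(errors)
```

With k equal to the number of instances this reduces to leave-one-out.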

8
Presentations of Evaluation Results
Results are usually about time, space, trend,
average case
  • Box-plot
    • Whiskers (min, max)
    • Box: confidence interval
    • Graphical equivalent of t-test
  • Learning (happy) curves
    • Accuracy increases over X
    • Its opposite (error) decreases over X

9
Statistical Tests
  • Null hypothesis and alternative hypothesis
  • Type I and Type II errors
  • Student's t test: comparing two means
  • Paired t test: comparing two means
  • Chi-Square test
  • Contingency table

10
Null Hypothesis
  • Null hypothesis (H0)
    • No difference between the test statistic and the actual value of the population parameter
    • E.g., H0: $\mu = \mu_0$
  • Alternative hypothesis (H1)
    • It specifies the parameter value(s) to be accepted if H0 is rejected.
    • E.g., H1: $\mu \neq \mu_0$ (two-tailed test)
    • Or H1: $\mu > \mu_0$ (one-tailed test)

11
Type I, II errors
  • Type I errors ($\alpha$)
    • Rejecting a null hypothesis when it is true (FP)
  • Type II errors ($\beta$)
    • Accepting a null hypothesis when it is false (FN)
  • Power $= 1 - \beta$
  • Costs of different errors
    • A life-saving medicine that is cheap and has no side effects appears to be effective (H0: non-effective)
    • Type I error: it is judged effective when it is not; not costly
    • Type II error: it is judged non-effective when it is effective; very costly

12
Test using Student's t Distribution
  • Using the t distribution to test the difference between two population means is appropriate if
    • The population standard deviations are not known
    • The samples are small ($n < 30$)
    • The populations are assumed to be approx. normal
    • The two unknown standard deviations are equal, $\sigma_1 = \sigma_2$
  • H0: $(\mu_1 - \mu_2) = 0$, H1: $(\mu_1 - \mu_2) \neq 0$
  • Check the difference of the estimated means, normalized by its standard error (computed from the common population variance)
  • degrees of freedom and p, the level of significance
    • $df = n_1 + n_2 - 2$
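
A hedged sketch of the two-sample t statistic under the conditions listed above. It assumes the standard pooled-variance form, which the slide does not spell out; the function name pooled_t is chosen here for illustration.

```python
import math


def pooled_t(sample1, sample2):
    """Return (t, df) for two independent samples with a common variance."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    ss1 = sum((x - m1) ** 2 for x in sample1)  # sum of squared deviations
    ss2 = sum((x - m2) ** 2 for x in sample2)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df  # estimate of the common variance
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of m1 - m2
    return (m1 - m2) / se, df
```

The resulting t is then compared with the critical value of the t distribution for df degrees of freedom at the chosen significance level.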

13
Paired t test
  • With paired observations, use the paired t test
  • Now H0: $\mu_d = 0$ and H1: $\mu_d \neq 0$
  • Check the estimated mean of the differences
  • The t values in the previous and current cases are calculated differently.
  • Both are 2-tailed tests; p = 1% means 0.5% on each side
  • Excel can do that for you!
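
For comparison with the two-sample case, a sketch of the paired t statistic, assuming the standard form in which t is the mean of the pairwise differences divided by its standard error. The names paired_t, before and after are hypothetical.

```python
import math


def paired_t(before, after):
    """Return (t, df) for paired observations, using differences after - before."""
    d = [a - b for b, a in zip(before, after)]
    n = len(d)
    mean_d = sum(d) / n
    sd = math.sqrt(sum((x - mean_d) ** 2 for x in d) / (n - 1))  # sample std. dev.
    return mean_d / (sd / math.sqrt(n)), n - 1
```

Here the degrees of freedom are n - 1, where n is the number of pairs, rather than n1 + n2 - 2.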

[Figure: rejection region]
14
Chi-Square Test (the goodness-of-fit)
  • Testing a null hypothesis that the population distribution for a random variable follows a specified form.
  • The chi-square statistic is calculated
  • degrees of freedom $df = k - m - 1$
    • k: number of data categories
    • m: number of parameters estimated
    • m = 0 for uniform, 1 for Poisson, 2 for normal
  • The expected frequency in each cell should be at least 5
  • One-tail test

        C1     C2     Σ
I-1     A11    A12    R1
I-2     A21    A22    R2
Σ       C1     C2     N

$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} (A_{ij} - E_{ij})^2 / E_{ij}$

[Figure: rejection region]
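
A sketch of the chi-square statistic for both uses on this slide. For the contingency table it assumes the usual expected count under independence, E_ij = R_i * C_j / N, which the slide does not state explicitly. The function names and the sample counts in the usage line are made up for illustration.

```python
def chi_square_table(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # assumed E_ij
            stat += (observed - expected) ** 2 / expected
    return stat


def goodness_of_fit(observed, expected, m=0):
    """Return (chi-square, df) with df = k - m - 1 for k categories."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - m - 1


print(chi_square_table([[10, 20], [20, 10]]))
```

The statistic is compared with the upper-tail critical value of the chi-square distribution, which is why the test is one-tailed.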