Data Mining (and machine learning) - DM Lecture 3: Basic Statistics and Coursework 1
David Corne and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com


1
Data Mining (and machine learning)
  • DM Lecture 3: Basic Statistics and Coursework 1

2
Communities and Crime
  • Here is an interesting dataset

3
(No Transcript)
4
  • -- state: US state (by number)
  • -- county: numeric code for county
  • -- community: numeric code for community
  • -- communityname: community name
  • -- fold: fold number for non-random 10-fold cross validation
  • -- population: population for community (numeric - decimal)
  • -- householdsize: mean people per household (numeric - decimal)
  • -- racepctblack: percentage of population that is African American (numeric - decimal)
  • -- racePctWhite: percentage of population that is Caucasian (numeric - decimal)
  • -- racePctAsian: percentage of population that is of Asian heritage (numeric - decimal)
  • -- racePctHisp: percentage of population that is of Hispanic heritage (numeric - decimal)
  • -- agePct12t21: percentage of population that is 12-21 in age (numeric - decimal)
  • -- agePct12t29: percentage of population that is 12-29 in age (numeric - decimal)
  • -- agePct16t24: percentage of population that is 16-24 in age (numeric - decimal)
  • -- agePct65up: percentage of population that is 65 and over in age (numeric - decimal)
  • -- numbUrban: number of people living in areas classified as urban (numeric - decimal)
  • -- pctUrban: percentage of people living in areas classified as urban (numeric - decimal)
  • -- medIncome: median household income (numeric - decimal)
  • -- pctWWage: percentage of households with wage or salary income in 1989 (numeric - decimal)

5
Mining the CC data
  • Let's do some basic preprocessing and mining of these data, to start to see whether we can find any patterns that will predict certain levels of violent crime.

6
etc. - about 2,000 instances
7
First some sensible preprocessing
  • The first 5 fields are (probably) not useful for prediction - they are more like ID fields for the record. So, let's remove them.
  • There are many cases of missing data here too - let's remove any field which has any missing data in it at all. This is OK for the CC data, which still leaves 100 fields.

8
First some sensible preprocessing
  • I downloaded the data. First I converted it to a space-separated form, rather than comma-separated, because I prefer it that way. I wrote an awk script to do this, called cs2ss.awk, here:
  • http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/cs2ss.awk
  • I did that with this command line on a Unix machine:
  • awk -f cs2ss.awk < communities.data > commss.txt
  • placing the new version in commss.txt.
  • Then, I wanted to remove the first 5 fields, and remove any field in which any record contained missing values. I wrote an awk script for that too:
  • http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/fixcommdata.awk
  • and did this:
  • awk -f fixcommdata.awk < commss.txt > commssfixed.txt
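Neither script is reproduced in the slides; only the links above are given. As a rough idea of what they might contain (sketches only, not the actual cs2ss.awk / fixcommdata.awk, and assuming missing values appear as "?", as in the UCI version of this dataset):

    # cs2ss-sketch.awk -- convert comma-separated records to space-separated
    BEGIN { FS = ","; OFS = " " }
    { $1 = $1; print }

    # fixcommdata-sketch.awk -- drop the first 5 fields, plus any field that
    # contains a missing value ("?") in any record. Needs two passes, so it
    # reads the file twice instead of from stdin:
    #   awk -f fixcommdata-sketch.awk commss.txt commss.txt > commssfixed.txt
    NR == FNR { for (j = 6; j <= NF; j++) if ($j == "?") bad[j] = 1; next }
    {
        out = ""
        for (j = 6; j <= NF; j++)
            if (!(j in bad)) out = out (out == "" ? "" : " ") $j
        print out
    }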

9
Normalisation
  • The fields in these data happen to be already min-max normalised into [0,1]. I wondered whether it would also be good to z-normalise the fields. So I wrote an awk script for z-normalisation, and produced a version that had that done:
  • http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/znorm.awk
  • awk -f znorm.awk < commssfixed.txt > commssfixedznorm.txt
  • In these data, the class value is numeric, between 0 and 1, indicating (already normalised) the relative amount of violent crime in the community in question. To make it easier to find patterns and relationships, I produced new versions of each dataset where the class value was either 0 (low) or 1 (high): 0 in the cases where it had been < 0.4, and 1 otherwise. I used an awk script for that too, and did some renaming of files, and ended up with:
  • commssfixed.txt - two-class
  • commssfixedznorm.txt - two-class, z-normalised
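Again, the scripts themselves are only linked, not shown. A minimal sketch of the two steps described here (z-normalising every non-class field, then thresholding the class at 0.4), assuming the class is the last field and using the divide-by-n standard deviation:

    # znorm-sketch.awk -- z-normalise every field except the last (class) one.
    # Two passes over the file: awk -f znorm-sketch.awk data.txt data.txt
    NR == FNR {
        for (j = 1; j < NF; j++) { sum[j] += $j; sumsq[j] += $j * $j }
        n++; next
    }
    {
        for (j = 1; j < NF; j++) {
            mean = sum[j] / n
            std  = sqrt(sumsq[j] / n - mean * mean)
            printf "%s ", (std > 0 ? ($j - mean) / std : 0)
        }
        print $NF
    }

    # twoclass-sketch.awk -- replace the numeric crime value (last field)
    # with 0 if it was < 0.4, and 1 otherwise
    { $NF = ($NF < 0.4 ? 0 : 1); print }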

10
Now, I wonder how good 1-NN is at predicting the
class for these data?
  • If only using fields 20-30 to work out the distance values, the answer is:
  • Unchanged data (in this case, already min-max normalised to [0,1]): 81.1%
  • Z-normalised: 81.5%
  • But note that 81% of the data is class 0 - so if you always guess 0, your accuracy will be 81.0%.
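The script behind these numbers is not shown, and neither is the exact evaluation protocol. A leave-one-out sketch in the same awk style (an assumption, not the lecture's method; it measures squared Euclidean distance over all non-class fields, and restricting to fields 20-30 only changes the inner loop bounds):

    # 1nn-sketch.awk -- leave-one-out 1-NN accuracy. Space-separated data,
    # 0/1 class label assumed to be in the last field. O(n^2), which is
    # still quick for roughly 2,000 records.
    {
        for (j = 1; j < NF; j++) x[NR, j] = $j
        cls[NR] = $NF
        nf = NF
    }
    END {
        correct = 0
        for (i = 1; i <= NR; i++) {
            best = -1
            for (k = 1; k <= NR; k++) {
                if (k == i) continue
                d = 0
                for (j = 1; j < nf; j++) { diff = x[i, j] - x[k, j]; d += diff * diff }
                if (best < 0 || d < best) { best = d; pred = cls[k] }
            }
            if (pred == cls[i]) correct++
        }
        printf "leave-one-out 1-NN accuracy: %.1f%%\n", 100 * correct / NR
    }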

11
Now let's look at the data in more detail - some
histograms of the fields
  • Here is the distribution of values in field 6 for class 0 - it is a 5-bin distribution.
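The histogram itself was an image on the slide. A small sketch of how such per-class 5-bin counts could be produced (assuming the two-class file, values still in [0,1], and the class label in the last field; the variable F is a name invented here for the field number):

    # hist-sketch.awk -- 5-bin histogram of field F, split by class (last field)
    # Usage: awk -v F=6 -f hist-sketch.awk commssfixed.txt
    {
        bin = int($F * 5); if (bin == 5) bin = 4   # clamp the value 1.0 into the top bin
        count[$NF, bin]++
    }
    END {
        for (b = 0; b < 5; b++)
            printf "bin [%.1f,%.1f): class 0 = %d, class 1 = %d\n",
                   b / 5, (b + 1) / 5, count[0, b], count[1, b]
    }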

12
Let's look at the distributions of field 6 for
class 0 and class 1 together (% of pop that is
Hispanic)

13
Field 7 (% of pop aged 12-21)
14
Field 8 (% of pop aged 12-29)
15
Field 9 (% of pop aged 16-24)
16
Field 10 (% of pop aged > 65)
17
Which two fields seem most useful for
discriminating between classes 0 and 1?
18
Fields 6 and 7
  • Maybe we will get better 1-NN results using only the important fields? 2 is (most often) too small a number of fields, but anyway...
  • I produced versions of the dataset that had only fields 6, 7 and 100 (these two, and the class field). I then calculated 1-NN accuracy for these. Results:
  • Unchanged version, fields 6 and 7: 70.8% (was 81.1%)
  • Z-normalised, fields 6 and 7: 70.9% (was 81.5%)
  • Not very successful! But I didn't expect that. Working with several more of the important fields would quite possibly give better accuracies, but may take too much time to demonstrate, or do in your assignment.
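Producing those reduced datasets is a one-line job in the same style (a sketch; field 100 is the class field in the fixed data, and the output filename is just illustrative):

    # pickfields-sketch.awk -- keep only fields 6, 7 and the class field (100)
    # Usage: awk -f pickfields-sketch.awk commssfixed.txt > commss67.txt
    { print $6, $7, $100 }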

19
Coursework 1
  • You will do what we just did, for four datasets.
  • For each one:
  • Download it (of course), then do some simple preparation
  • Produce a version of the data set where each non-class field is min-max normalised (for the Communities and Crime dataset, do z-normalisation instead) - see the sketch after this list
  • Convert into a two-class dataset - do this for both the original and normalised cases.
  • Calculate the accuracy of 1-nearest-neighbour classification for your dataset - do this for both original and normalised versions.
  • Generate 5-bin histograms of the distribution of the first five fields, for each of the two classes.
  • Write 100-200 words describing how the distributions differ between the two classes, and describing what you think are the two most important fields for discriminating between the classes.
  • Produce a reduced dataset (two versions: original and normalised) which contains only three fields: the two you considered most important, and the class field.
  • Repeat step 4, but this time for the reduced datasets.
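For reference, here is a sketch of min-max normalisation in the same awk style as the earlier scripts. It is only an illustration (you may use whatever tools you like), and it assumes the class is the last field and is left unchanged:

    # minmax-sketch.awk -- min-max normalise every field except the last (class).
    # Two passes over the file: awk -f minmax-sketch.awk data.txt data.txt > out.txt
    NR == FNR {
        for (j = 1; j < NF; j++) {
            v = $j + 0
            if (!(j in lo) || v < lo[j]) lo[j] = v
            if (!(j in hi) || v > hi[j]) hi[j] = v
        }
        next
    }
    {
        for (j = 1; j < NF; j++) {
            range = hi[j] - lo[j]
            printf "%s ", (range > 0 ? ($j - lo[j]) / range : 0)
        }
        print $NF
    }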

20
What you email me:
  • One brief paragraph that tells me how you did it / what tools you used. This will not affect the marks - I would just like to know.
  • For step 4: a table that tells me the answers for each dataset, followed by a paragraph that attempts to explain any differences in performance between the normalised and original versions, or explains why performance is similar.
  • For steps 5 and 6: 1 page per dataset - on each page, the 10 histograms, and the discussion (step 7).
  • For step 8: same as step 4.
  • That must all be done within 6 sides of A4.
21
You should know what Z-normalisation is, so here
is a brief lecture on basic statistics, including
that
22
Fundamental Statistics Definitions
  • A Population is the total collection of all items/individuals/events under consideration
  • A Sample is that part of a population which has been observed or selected for analysis
  • E.g. all students is a population. Students at HWU is a sample; this class is a sample, etc.
  • A Statistic is a measure which can be computed to describe a characteristic of the sample (e.g. the sample mean)
  • The reason for doing this is almost always to estimate (i.e. make a good guess at) things about that characteristic in the population

23
E.g.
  • This class is a sample from the population of students at HWU
  • (it can also be considered as a sample of other populations - like what?)
  • One statistic of this sample is your mean weight. Suppose that is 65 kg. I.e. this is the sample mean.
  • Is 65 kg a good estimate for the mean weight of the population?
  • Another statistic: suppose 10% of you are married. Is this a good estimate for the proportion that are married in the population?

24
Some Simple Statistics
  • The Mean (average) is the sum of the values in a sample divided by the number of values (in symbols, see below)
  • The Median is the midpoint of the values in a sample (50% above, 50% below) after they have been ordered (e.g. from the smallest to the largest)
  • The Mode is the value that appears most frequently in a sample
  • The Range is the difference between the smallest and largest values in a sample
  • The Variance is a measure of the dispersion of the values in a sample - how closely the observations cluster around the mean of the sample
  • The Standard Deviation is the square root of the variance of a sample
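In symbols, for sample values x_1, ..., x_n the mean is (the variance and standard deviation are written out on the next two slides):

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i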

25
Statistical moments
  • The m-th moment about the mean (µ) of a sample is given below, where n is the number of items in the sample.
  • The first moment (m = 1) is 0!
  • The second moment (m = 2) is the variance
  • (and the square root of the variance is the standard deviation)
  • The third moment can be used in tests for skewness
  • The fourth moment can be used in tests for kurtosis
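The formula itself was shown as an image on the slide; in standard notation, for sample values x_1, ..., x_n with mean µ:

    \mu_m = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^m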

26
Variation and Standard Deviation
  • The variance of a sample is the 2nd moment about the mean (written out below), where n is the number of items in the sample.
  • (The square root of the variance is the standard deviation.)
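With the same notation as the previous slide:

    s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2, \qquad s = \sqrt{s^2}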

27
Distributions / Histograms
A Normal (aka Gaussian) distribution (image from
Mathworld)
28
Normal or Gaussian distributions
  • tend to be everywhere
  • Given a typical numeric field in a typical
    dataset, it is common that most values are
    centred around a particular value (the mean), and
    the proportion with larger or smaller values
    tends to tail off.

29
Normal or Gaussian distributions
  • We just saw this - fields 7-10 were Normal-ish
  • Heights, weights, times (e.g. for 100m sprint, for lengths of software projects), measurements (e.g. length of a leaf, waist measurement, coursework marks, level of protein A in a blood sample, ...) all tend to be Normally distributed. Why??

30
Sometimes distributions are uniform
Uniform distributions. Every possible value
tends to be equally likely
31
This figure is from http://mathworld.wolfram.com/Dice.html
One die: a uniform distribution of possible totals. But look what happens as soon as the value is a sum of things. The more things, the more Gaussian the distribution. Are measurements (etc.) usually the sum of many factors?
32
Probability Distributions
  • If a population (e.g. field of a dataset) is
    expected to match a standard probability
    distribution then a wealth of statistical
    knowledge and results can be brought to bear on
    its analysis
  • Many standard statistical techniques are based on
    the assumption that the underlying distribution
    of a population is Normal (Gaussian)
  • Usually this assumption works fine

33
A closer look at the normal distribution
This is the ND with mean µ and std σ - its density function is written out below
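The density itself (the standard formula, standing in for the Mathworld image on the slide):

    f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}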
34
More than just a pretty bell shape
  • Suppose the mean of your sample is 1.8, and suppose the std of your sample is 0.12
  • Theory tells us that if a population is Normal, the sample std is a fairly good guess at the population std

So, we can say with some confidence, for example,
that 99.7% of the population lies between 1.44
and 2.16
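This is the usual "three sigma" property of the Normal distribution - about 99.7% of values lie within three standard deviations of the mean, and here:

    1.8 - 3 \times 0.12 = 1.44, \qquad 1.8 + 3 \times 0.12 = 2.16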
35
Remember, MUCH of science relies on making
guesses about populations
  • The CLT helps us make the guesses reasonable
    rather than crazy.
  • Assuming a normal dist, the stats of a sample tell us lots about the stats of the population

And, assuming a normal dist helps us detect errors
and outliers - how?
36
Z-normalisation (or z-score normalisation)
  • Given any collection of numbers (e.g. the values
    of a particular field in a dataset) we can work
    out the mean and the standard deviation.
  • Z-score normalisation means converting the
    numbers into units of standard deviation.
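In symbols (the standard z-score, where µ and σ are the field's mean and standard deviation):

    z = \frac{x - \mu}{\sigma}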

37
Simple z-normalisation example
38
Simple z-normalisation example
Mean = 11.12, STD = 5.93
39
Simple z-normalisation example
Mean = 11.12, STD = 5.93
Subtract the mean, so that these are centred around zero
40
Simple z-normalisation example
Mean = 11.12, STD = 5.93
Subtract the mean, so that these are centred around zero
Divide each value by the std - we now see how usual or unusual each value is
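The actual values on these slides were in images, but the arithmetic is easy to reproduce for a hypothetical value from this field - say 17.05 (an invented number, used only to show the calculation):

    z = \frac{17.05 - 11.12}{5.93} = \frac{5.93}{5.93} = 1.0

i.e. that value would sit exactly one standard deviation above the mean.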
41
The take-home lesson (for those new to statistics)
  • Your data contains 100 values for x, and you
    have good reason to believe that x is normally
    distributed.
  • Thanks to the Central Limit Theorem, you can
  • Make a lot of good estimates about the statistics
    of the population
  • Make justified conclusions about two
    distributions being different (e.g. the
    distribution of field X for class 1, and the
    distribution of field X for class 2)
  • Maybe find outliers and spot other problems in
    the data

42
Next week: back to baskets -- a classic Data
Mining Algorithm!
43
The Central Limit Theorem is this:

As more and more samples are taken from a population, the distribution of the sample means conforms to a normal distribution. The average of the samples more and more closely approximates the average of the entire population. A very powerful and useful theorem! The normal distribution is such a common and useful distribution that additional statistics have been developed to measure how closely a population conforms to it, and to test for divergence from it due to skewness and kurtosis.
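In symbols (a standard statement of the theorem, added for reference): if X_1, ..., X_n are independent samples from a population with mean µ and variance σ², then for large n the sample mean is approximately Normal:

    \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \;\approx\; N\!\left(\mu, \frac{\sigma^2}{n}\right)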