Lab 3 - PowerPoint PPT Presentation

Slides: 38
Provided by: drtehy

Title: Lab 3


1
Lab 3

2
Data Understanding
3
Data Understanding
  • data mining methodology (CRISP-DM)
  • finding initial relationships
  • gaining a first understanding of the existing data
  • spotting interesting patterns

4
COLLECT INITIAL DATA
  • involves identifying the relevant attributes or factors.

5
Example
  • You want to play outdoor golf with your friends.
  • What are the relevant attributes or factors to consider?

6
  • Four possible relevant attributes or factors to consider when deciding
    whether to play or not:
  • outlook,
  • temperature,
  • humidity,
  • wind.

8
DESCRIBE DATA
  • The metadata view of Clementine shows
  • the type of each attribute (Range, Set, Flag)
  • its values
  • the total number of samples (sample size)
  • the total number of attribute types

9
DESCRIBE DATA
10
DESCRIBE DATA
  • There are 14 instances (examples).
  • Outlook can be sunny, overcast, or rain.
  • Wind can be true or false.
  • Temperature and humidity have numeric values (type Range).
  • Play can be yes or no.
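
The description step above can be sketched in Python. This is only an illustration, not Clementine's own logic: the rule for choosing Range, Set, or Flag (numeric, more than two values, exactly two values) is an assumption, and the four SAMPLE rows are hypothetical stand-ins for the 14-instance file.

```python
# Hypothetical rows for illustration only; not the lab's 14-instance data file.
SAMPLE = [
    {"outlook": "sunny", "temperature": 85, "humidity": 85, "wind": "false", "play": "no"},
    {"outlook": "overcast", "temperature": 83, "humidity": 86, "wind": "false", "play": "yes"},
    {"outlook": "rain", "temperature": 70, "humidity": 96, "wind": "false", "play": "yes"},
    {"outlook": "rain", "temperature": 65, "humidity": 70, "wind": "true", "play": "no"},
]


def describe(rows):
    """Return {attribute: (type, values)} using Range/Set/Flag as type labels."""
    summary = {}
    for name in rows[0]:
        column = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in column):
            # Numeric attribute: report the observed minimum and maximum.
            summary[name] = ("Range", (min(column), max(column)))
        else:
            distinct = sorted(set(column))
            # Assumed rule: exactly two distinct values is a Flag, otherwise a Set.
            kind = "Flag" if len(distinct) == 2 else "Set"
            summary[name] = (kind, distinct)
    return summary


if __name__ == "__main__":
    print("instances:", len(SAMPLE))
    for attribute, (kind, values) in describe(SAMPLE).items():
        print(attribute, kind, values)
```

Inferring the type from the observed values keeps the sketch short; a real tool would normally let the analyst override the inferred type.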

11
EXPLORE DATA
  • Data exploration uses visualisation (distribution, scatter plot).
  • It provides the first patterns or correlations among attributes.
  • It shows
  • the distribution of attributes or factors,
  • pairs of numeric attributes plotted against the target attribute.
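
A distribution of one attribute over the target attribute can be approximated with a simple count. The sketch below is a text-only stand-in for the distribution chart; the ROWS data and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical rows for illustration only.
ROWS = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "rain", "play": "no"},
]


def distribution(rows, attribute, target):
    """Count how often each (attribute value, target value) pair occurs."""
    return Counter((row[attribute], row[target]) for row in rows)


def bars(counts):
    """Render the counts as text bars, one line per (value, target) pair."""
    return [
        f"{value:<10}{label:<5}{'#' * n}"
        for (value, label), n in sorted(counts.items())
    ]


if __name__ == "__main__":
    for line in bars(distribution(ROWS, "outlook", "play")):
        print(line)
```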

12
Distribution
13
hypothesis
  • By exploring the data, you can form a hypothesis
  • (If outlook = overcast, then play = yes).
  • But you cannot fully explain all of the relationships yet.
  • This is why we introduce modeling.
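
A hypothesis of this form can be checked against the data by counting how many rows it covers and how many of those it predicts correctly. The sketch below assumes a rule with a single condition; the ROWS data and the check_rule helper are hypothetical.

```python
# Hypothetical rows for illustration only.
ROWS = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "rain", "play": "no"},
]


def check_rule(rows, condition, conclusion):
    """Return (covered, correct) for the rule 'if condition then conclusion'.

    covered: number of rows that satisfy the condition.
    correct: number of covered rows that also satisfy the conclusion.
    """
    attribute, value = condition
    target, expected = conclusion
    covered = [row for row in rows if row[attribute] == value]
    correct = [row for row in covered if row[target] == expected]
    return len(covered), len(correct)


if __name__ == "__main__":
    covered, correct = check_rule(ROWS, ("outlook", "overcast"), ("play", "yes"))
    print(f"rule covers {covered} rows, {correct} of them correctly")
```

A rule that is correct on every covered row still says nothing about the rows it does not cover, which is the gap that modeling is meant to close.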

15
C5.0
16
IRIS.txt (Example)
17
Source: http://en.wikipedia.org/wiki/Sepal
18
Iris
  • Iris-setosa (Source: http://www.badbear.com/signa/signa.pl?Iris-setosa)
  • Iris-versicolor (Source: http://en.wikipedia.org/wiki/Iris_versicolor)
  • Iris-virginica (Source: http://plants.usda.gov/java/profile?symbol=IRVI)

19
Distribution
20
Scatter
21
Scatter
22
Decision Rules
23
Quality data
  • Data are of high quality if they are fit for their intended uses in
    operations, decision making, and planning.
  • Poor-quality data may lead to uninteresting patterns.

24
Data Quality Team
  • The data quality team should screen the data carefully to ensure that
    the quality and quantity of the data being gathered meet the business
    objectives, which is needed for a successful data mining project.

25
data analysts and business analysts
  • Olson says that the data quality team should consist of data analysts
    and business analysts.

26
Data analysts
  • should be good at data architecture
  • should know how to apply tools to navigate large volumes of data
  • in order to find patterns relevant to data quality.
  • Source: http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/schemas.htm

27
Business analysts
  • should understand business best practices and current business
    processes
  • should also find patterns relevant to data quality

28
DATA CLEANING
  • The data cleaning process deals with large volumes of data.
  • Some values may be missing in the databases or files.

29
Exercise 1
30
Patients.txt (Possible Attribute Name)
  • Variable Name   Description
  • PATNO           Patient Number
  • GENDER          Gender
  • VISIT           Visit Date
  • HR              Heart Rate
  • SBP             Systolic Blood Pres.
  • DBP             Diastolic Blood Pres.
  • DX              Diagnosis Code
  • AE              Adverse Event

31
Credit Approval
  • A1
  • A2
  • A3
  • A4
  • A5
  • A6
  • A7
  • A8
  • A9
  • A10
  • A11
  • A12
  • A13
  • A14
  • A15
  • A16

32
Auto
  • 1. symboling
  • 2. normalized-losses
  • 3. make
  • 4. fuel-type
  • 5. aspiration
  • 6. num-of-doors
  • 7. body-style
  • 8. drive-wheels
  • 9. engine-location
  • 10. wheel-base
  • 11. length
  • 12. width
  • 13. height
  • 14. curb-weight
  • 15. engine-type
  • 16. num-of-cylinders
  • 17. engine-size
  • 18. fuel-system
  • 19. bore

33
Patients.txt
  • Variable Name   Description             Type        Valid Values
  • PATNO           Patient Number          Character   Numerals
  • GENDER          Gender                  Character   'M' or 'F'
  • VISIT           Visit Date              MMDDYY10    Any valid date
  • HR              Heart Rate              Numeric     40 to 100
  • SBP             Systolic Blood Pres.    Numeric     80 to 200
  • DBP             Diastolic Blood Pres.   Numeric     60 to 120
  • DX              Diagnosis Code          Character   1 to 3 digits
  • AE              Adverse Event           Character   '0' or '1'
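
The valid values in the table above can be turned into a record check. The sketch below is one possible reading, not a prescribed solution to the exercise: it assumes each record arrives as a dictionary of strings, and it assumes that MMDDYY10 corresponds to dates written as mm/dd/yyyy. The invalid_fields helper is hypothetical.

```python
import re
from datetime import datetime

# Numeric ranges copied from the table above: (low, high), both inclusive.
RANGES = {"HR": (40, 100), "SBP": (80, 200), "DBP": (60, 120)}


def invalid_fields(record):
    """Return the names of the attributes whose values are not valid."""
    bad = []
    if not record.get("PATNO", "").isdigit():
        bad.append("PATNO")
    if record.get("GENDER") not in ("M", "F"):
        bad.append("GENDER")
    try:
        # Assumption: MMDDYY10 dates look like 01/15/2004.
        datetime.strptime(record.get("VISIT", ""), "%m/%d/%Y")
    except ValueError:
        bad.append("VISIT")
    for name, (low, high) in RANGES.items():
        try:
            in_range = low <= float(record.get(name, "")) <= high
        except ValueError:
            in_range = False  # Missing or non-numeric value.
        if not in_range:
            bad.append(name)
    if not re.fullmatch(r"\d{1,3}", record.get("DX", "")):
        bad.append("DX")
    if record.get("AE") not in ("0", "1"):
        bad.append("AE")
    return bad


if __name__ == "__main__":
    # Hypothetical record for illustration only.
    record = {"PATNO": "001", "GENDER": "X", "VISIT": "13/40/2004", "HR": "210",
              "SBP": "120", "DBP": "80", "DX": "5", "AE": "0"}
    print(invalid_fields(record))
```

Returning the list of offending attribute names, rather than a single true/false flag, makes it easier to tally which attributes cause most of the quality problems.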

34
Credit Approval
  • A1: b, a.
  • A2: continuous.
  • A3: continuous.
  • A4: u, y, l, t.
  • A5: g, p, gg.
  • A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
  • A7: v, h, bb, j, n, z, dd, ff, o.
  • A8: continuous.
  • A9: t, f.
  • A10: t, f.
  • A11: continuous.
  • A12: t, f.
  • A13: g, p, s.
  • A14: continuous.
  • A15: continuous.
  • A16: +, - (class attribute)

35
Auto
  • 1. symboling: -3, -2, -1, 0, 1, 2, 3.
  • 2. normalized-losses: continuous from 65 to 256.
  • 3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar,
    mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth,
    porsche, renault, saab, subaru, toyota, volkswagen, volvo.
  • 4. fuel-type: diesel, gas.
  • 5. aspiration: std, turbo.
  • 6. num-of-doors: four, two.
  • 7. body-style: hardtop, wagon, sedan, hatchback, convertible.
  • 8. drive-wheels: 4wd, fwd, rwd.
  • 9. engine-location: front, rear.
  • 10. wheel-base: continuous from 86.6 to 120.9.
  • 11. length: continuous from 141.1 to 208.1.
  • 12. width: continuous from 60.3 to 72.3.
  • 13. height: continuous from 47.8 to 59.8.
  • 14. curb-weight: continuous from 1488 to 4066.
  • 15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
  • 16. num-of-cylinders: eight, five, four, six, three, twelve, two.

36
Credit Approval (Missing Values)
  • Missing Attribute Values
  • 37 cases (5%) have one or more missing values. The missing values
    from particular attributes are:
  • A1: 12
  • A2: 12
  • A4: 6
  • A5: 6
  • A6: 9
  • A7: 9
  • A14: 13

37
Auto (Missing Values)
  • Missing Attribute Values (denoted by "?")
  • Attribute number: number of instances missing a value
  • 2: 41
  • 6: 2
  • 19: 4
  • 20: 4
  • 22: 2
  • 23: 2
  • 26: 4
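
Counts like the ones above can be reproduced by scanning the data file for the "?" marker. The sketch below assumes a comma-separated file with one instance per line and no header row; the count_missing helper and the sample lines are hypothetical.

```python
import csv
from collections import Counter


def count_missing(lines, marker="?"):
    """Count missing values in comma-separated lines.

    Returns (per_attribute, rows_with_missing), where per_attribute maps the
    1-based attribute number to the number of instances missing that value.
    """
    per_attribute = Counter()
    rows_with_missing = 0
    for row in csv.reader(lines):
        missing = [i for i, value in enumerate(row, start=1)
                   if value.strip() == marker]
        per_attribute.update(missing)
        if missing:
            rows_with_missing += 1
    return per_attribute, rows_with_missing


if __name__ == "__main__":
    # Hypothetical lines for illustration only; a real run would pass an open file.
    sample = ["3,?,alfa-romero,gas", "1,164,audi,?", "2,?,audi,gas", "0,110,bmw,gas"]
    per_attribute, rows_with_missing = count_missing(sample)
    for number in sorted(per_attribute):
        print(f"attribute {number}: {per_attribute[number]}")
    print("instances with at least one missing value:", rows_with_missing)
```

Because csv.reader accepts any iterable of lines, the same function should work on an open file object as well as on the in-memory sample.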