Aspects of Data - PowerPoint PPT Presentation

Provided by: CraigAS7
Learn more at: http://www.mscs.mu.edu

Title: Aspects of Data


1
Aspects of Data
  • MSCS282 Data Mining
  • Craig A. Struble, Ph.D.
  • Dept. of Math, Stat, and Comp. Sci.
  • Marquette University

2
Overview
  • About Data
  • Concepts, Instances, Attributes
  • Introduction to WEKA
  • Next time

3
Goals
  • Understand the structure and form of data
  • Learn standard terminology
  • Understand data format used by WEKA
  • Recognize typical problems in data

4
KDD
  • [Figure: KDD process diagram. Labels: Data, Cleaning / Integration,
    Data Warehouse, Selection / Transformation, Prepared data, Data Mining,
    Patterns, Evaluation / Visualization, Knowledge, Knowledge Base]

5
What kinds of data can we mine?
  • Financial data (e.g. stock prices)
  • Transactional data (e.g. business)
  • Market baskets
  • Scientific data
  • Text
  • Images
  • Sound

6
Data Transformation
  • Data must be transformed into a format
    appropriate for the algorithms used
  • Single table, multiple tables
  • Phase space reconstruction
  • Typical tools
  • Programs (in Perl, Java, etc.)
  • Database (Oracle, DB2, etc.)
  • Other specialized software
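
One common reading of "phase space reconstruction" is time-delay embedding, which turns a single series into a table of fixed-length rows that attribute-value learners can use. The sketch below assumes that reading; the function name, the parameters dim and delay, and the sample prices are all hypothetical and not taken from the slides.

```python
def delay_embed(series, dim, delay):
    """Return delay vectors (x[i], x[i+delay], ..., x[i+(dim-1)*delay])."""
    if dim < 1 or delay < 1:
        raise ValueError("dim and delay must be positive")
    span = (dim - 1) * delay  # distance from first to last coordinate
    return [
        tuple(series[i + k * delay] for k in range(dim))
        for i in range(len(series) - span)
    ]

# Hypothetical series, e.g. daily closing prices.
prices = [10, 11, 13, 12, 15, 16]
rows = delay_embed(prices, dim=3, delay=1)
for row in rows:
    print(row)
```

Each printed tuple is one instance in a single flat table; the series itself is unchanged.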

7
Challenges of Real Data
  • Missing values
  • Inaccurate values
  • Duplicate instances
  • Correcting these problems is part of data
    cleaning
  • Too much data
  • Data selection and reduction
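
As a rough illustration of two of the problems listed above, the sketch below drops exact duplicate instances and flags instances with missing values. It assumes instances are stored as tuples and that a missing value is encoded as None or "?" (the marker ARFF files use); the sample rows are made up.

```python
def clean(instances):
    """Drop exact duplicates and report instances with missing values."""
    seen = set()
    unique = []
    for inst in instances:
        if inst not in seen:  # keep the first occurrence only
            seen.add(inst)
            unique.append(inst)
    incomplete = [
        inst for inst in unique
        if any(v is None or v == "?" for v in inst)
    ]
    return unique, incomplete

raw = [
    ("sunny", 85, "no"),
    ("sunny", 85, "no"),     # duplicate instance
    ("rainy", "?", "yes"),   # missing temperature
    ("overcast", 83, "yes"),
]
unique, incomplete = clean(raw)
print(len(unique), "unique instances;", len(incomplete), "with missing values")
```

Inaccurate values are harder: detecting them usually needs domain knowledge, so they are not handled here.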

8
Data in this Class
  • Data is an (abstract) representation of
    observations
  • Data may be organized
  • Relational, object oriented, semi-structured
  • We will consider relational data

9
Relational Data
  • A relation is a function associating two or more
    values
  • In database terms, a relation is often defined by
    a table

10
Concepts
  • Data usually represents a concept, an idea or
    pattern to be mined
  • Concepts are generally equivalent to relations
  • In machine learning, the pattern is called a
    concept description
  • A concept description is just a fitted model that
    accurately describes the data
  • We will use the term model instead of concept
    description
  • For example, suppose we could create a model of
    the SisterOf concept from our data.

11
Organizing Concepts
  • Most data mining techniques require data in a
    single table
  • Denormalization
  • Frequently these single tables are stored in a
    single flat file
  • Flat file mining vs. database mining
  • Attribute value learning

12
Organizing Concepts
SELECT Query1.Person1, Query1.Gender,
       Query1.Parent1, Query1.Parent2,
       Query2.Person2, Query2.Gender,
       Query2.Parent1, Query2.Parent2,
       SisterOf.[SisterOf?]
FROM Query2 INNER JOIN
     (Query1 INNER JOIN SisterOf
      ON Query1.Person1 = SisterOf.Person1)
     ON Query2.Person2 = SisterOf.Person2;
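
The same denormalization can be tried outside Access. The sketch below uses Python's built-in sqlite3 module with a hypothetical schema: one person table standing in for Query1 and Query2, and a sister_of table listing the pairs. The table names, column names, and people are assumptions for illustration, and the SisterOf? flag column is left out.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (name TEXT PRIMARY KEY, gender TEXT,
                         parent1 TEXT, parent2 TEXT);
    CREATE TABLE sister_of (person1 TEXT, person2 TEXT);
    INSERT INTO person VALUES
        ('Ann',  'female', 'Pat', 'Sam'),
        ('Bob',  'male',   'Pat', 'Sam'),
        ('Cara', 'female', 'Lee', 'Kim');
    -- Ann is Bob's sister (hypothetical data).
    INSERT INTO sister_of VALUES ('Bob', 'Ann');
""")

# Join each listed pair with both people's attributes, giving one flat row
# per instance of the concept.
rows = con.execute("""
    SELECT p1.name, p1.gender, p1.parent1, p1.parent2,
           p2.name, p2.gender, p2.parent1, p2.parent2
    FROM sister_of AS s
    JOIN person AS p1 ON p1.name = s.person1
    JOIN person AS p2 ON p2.name = s.person2
""").fetchall()

for row in rows:
    print(row)
con.close()
```
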
13
Instances
  • Each concept consists of instances
  • An instance is an individual, independent example of the concept
  • Positive instances represent the concept
  • Negative instances do not represent the concept

14
Instances
  • In the denormalized SisterOf table, only positive
    instances are included
  • All other combinations of people are assumed to
    be negative instances
  • Example of the closed world assumption
  • Only positive instances are specified
  • All other possible instances are negative
  • Assumes all possible cases covered
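
A minimal sketch of the closed world assumption, using hypothetical people and a single listed SisterOf pair: every ordered pair that is not listed as positive is labelled negative.

```python
from itertools import permutations

people = ["Ann", "Bob", "Cara"]   # hypothetical individuals
positives = {("Bob", "Ann")}      # the only listed SisterOf pair

# Closed world: every ordered pair not listed is labelled negative.
labelled = [
    (first, second, (first, second) in positives)
    for first, second in permutations(people, 2)
]
negatives = [pair for pair in labelled if not pair[2]]
print(len(labelled), "instances,", len(negatives), "assumed negative")
```

If the list of positives is incomplete, some of these "negative" instances are really unknown, which is why the assumption only holds when all possible cases are covered.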

15
Attributes
  • Attributes characterize instances
  • Also called features
  • An attribute that defines the class label for
    classification is often called the goal attribute or
    response variable

16
Something to Ponder
  • Non-goal attributes may not uniquely characterize
    an instance
  • Can we fit a model perfectly to this data?
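
The data table for this slide did not survive in the transcript, so the sketch below uses made-up instances. It checks whether any two instances share all non-goal attribute values but differ in the goal attribute (assumed here to be the last field). When that happens, no model that predicts the goal from those attributes alone can fit the data perfectly.

```python
from collections import defaultdict

def conflicting(instances):
    """Map each attribute tuple to its goal values; keep those with several."""
    labels = defaultdict(set)
    for *attributes, goal in instances:
        labels[tuple(attributes)].add(goal)
    return {attrs: seen for attrs, seen in labels.items() if len(seen) > 1}

data = [
    ("sunny", "high", "no"),
    ("sunny", "high", "yes"),    # same attributes, different goal value
    ("rainy", "normal", "yes"),
]
clashes = conflicting(data)
print(clashes)
```
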

17
Attribute Types
  • Numeric/Continuous
  • Real or integer valued
  • 23.96, 10, -5, etc.
  • Interval
  • Ordered values with fixed and equal units
  • E.g. temperatures, years
  • Some math operations don't make sense
  • Sum of two years: 1939 + 1945 = 3884
  • What do we do about adding B.C. and A.D. years?

18
Attribute Types
  • Numeric/Continuous
  • Ratio
  • The measurement scheme defines a meaningful zero
    point.
  • E.g. measuring the distance between two points,
    the angle between two lines, the cost of an item, etc.
  • All mathematical operations make sense.
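
To make the interval versus ratio distinction concrete, the sketch below compares two temperatures in Celsius (interval: the zero point is arbitrary) and in Kelvin (ratio: zero is meaningful). The temperatures are arbitrary illustrative values.

```python
def celsius_to_kelvin(celsius):
    return celsius + 273.15

a_c, b_c = 10.0, 20.0                                      # interval scale
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)  # ratio scale

# Differences agree on both scales...
diff_c = b_c - a_c
diff_k = b_k - a_k
# ...but the ratio depends on where zero sits.
ratio_c = b_c / a_c
ratio_k = b_k / a_k
print(f"difference: {diff_c:.1f} vs {diff_k:.1f}")
print(f"ratio: {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

The Celsius ratio suggests "twice as hot", which has no physical meaning; only the Kelvin ratio does.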

19
Attribute Types
  • Nominal
  • Discrete values
  • Colonial, Bungalow, Cape Cod, Victorian, etc.
  • Ordinal
  • Discrete values with a notion of order but not
    distance
  • small < medium < large < extra large
  • Boolean
  • Two values only
  • Yes and No, True or False, etc.
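
One way to handle these types in code, sketched with the example values from the slide: ordinal values get a rank that is used only for ordering, while nominal values support only equality and membership tests. The names SIZE_ORDER, RANK, and at_least are hypothetical.

```python
SIZE_ORDER = ["small", "medium", "large", "extra large"]
RANK = {size: i for i, size in enumerate(SIZE_ORDER)}

def at_least(size, threshold):
    """Ordinal comparison: valid because only the order of ranks is used."""
    return RANK[size] >= RANK[threshold]

shirts = ["large", "small", "extra large", "medium"]
print(sorted(shirts, key=RANK.get))
print([s for s in shirts if at_least(s, "large")])
# Note: RANK["large"] - RANK["small"] is not a meaningful distance.

styles = {"Colonial", "Bungalow", "Cape Cod", "Victorian"}  # nominal
print("Bungalow" in styles)  # only equality / membership tests make sense
```
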

20
Attribute Types
  • Identify the attribute type of each attribute

21
Relational Data Mining
  • It is possible to mine data without
    denormalization
  • Inductive Logic Programming (ILP)
  • Currently, too slow to use on large data sets
  • Can mine data with infinite relations
  • I.e. relations that are recursively defined

22
Introduction to WEKA
  • Java implementation of several machine learning
    algorithms
  • We'll be using WEKA this semester for data mining
  • Installed on studsys
  • Make sure /usr/local/weka/weka-3-2-3/weka.jar
    is in your CLASSPATH
  • Use JDK version 1.3 or better

23
WEKA
  • This is the startup window
  • Select Explorer for the GUI
  • Start with: java weka.gui.GUIChooser

24
WEKA Explorer
  • Open data files with Open file
  • Sample data sets are available in
    /usr/local/weka/weka-3-2-3/data

25
WEKA Explorer
  • After opening the weather.arff file
  • Select the Classify tab

26
WEKA Classification
  • Click to select classifier
  • Click to choose goal attribute

27
WEKA Classification
  • Validation results show up here
  • Right-click a result in the list to view the decision tree

28
Decision Tree
29
ARFF File Format
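
The slide's example file did not survive in the transcript. As a stand-in, the sketch below embeds a short, abridged ARFF text in the style of weather.arff and reads it with a deliberately tiny parser. The excerpt and the parse_arff helper are illustrative assumptions, not WEKA's own reader; the parser ignores quoting, sparse data, and other features of the real format.

```python
ARFF_TEXT = """\
% Hypothetical, abridged excerpt in the style of weather.arff
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,?,yes
"""

def parse_arff(text):
    """Tiny ARFF reader: comments, @relation, @attribute, @data only."""
    relation, attributes, rows, in_data = None, [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, attributes, len(rows))
```

The header names the relation and declares each attribute with its type (a nominal value list in braces, or a numeric type); each line after @data is one instance, with "?" marking a missing value.
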
30
Next time
  • Data mining with decision trees