Title: Aspects of Data
1. Aspects of Data
- MSCS282 Data Mining
- Craig A. Struble, Ph.D.
- Dept. of Math, Stat, and Comp. Sci.
- Marquette University
2. Overview
- About Data
- Concepts, Instances, Attributes
- Introduction to WEKA
- Next time
3. Goals
- Understand the structure and form of data
- Learn standard terminology
- Understand data format used by WEKA
- Recognize typical problems in data
4. KDD
- [Figure: the KDD process]
- Steps shown: Cleaning, Integration, Selection, Transformation, Data Mining, Evaluation, Visualization
- Items shown: Data, Data Warehouse, Prepared data, Patterns, Knowledge, Knowledge Base
5. What kinds of data can we mine?
- Financial data (e.g. stock prices)
- Transactional data (e.g. business)
- Market baskets
- Scientific data
- Text
- Images
- Sound
6. Data Transformation
- Data must be transformed into a format appropriate for the algorithms used
- Single table, multiple tables
- Phase space reconstruction
- Typical tools
- Programs (in Perl, Java, etc.)
- Database (Oracle, DB2, etc.)
- Other specialized software
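As one illustration of transforming data into a single table, the sketch below applies time-delay embedding, a common form of phase space reconstruction, to a short series. The function name `delay_embed`, its parameters, and the sample values are hypothetical and are not taken from the course materials.

```python
def delay_embed(series, dim, tau=1):
    """Turn a 1-D series into rows of `dim` values spaced `tau` steps apart."""
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive")
    span = (dim - 1) * tau  # distance from the first to the last value in a row
    return [
        tuple(series[i + k * tau] for k in range(dim))
        for i in range(len(series) - span)
    ]


prices = [10, 11, 13, 12, 15]  # made-up values for illustration
table = delay_embed(prices, dim=3)
for row in table:
    print(row)
```

Each output row is one instance with `dim` attributes, so the result can be handed to an algorithm that expects a single table.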
7. Challenges of Real Data
- Missing values
- Inaccurate values
- Duplicate instances
- Correcting these problems is part of data cleaning
- Too much data
- Data selection and reduction
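A minimal sketch of two of the cleaning steps listed above: dropping duplicate instances and setting aside instances with missing values. The rows are made up, and marking a missing value with `None` is an assumption of this sketch. Inaccurate values usually need domain knowledge to detect and are not handled here.

```python
rows = [
    ("sunny", 85, "no"),
    ("sunny", 85, "no"),      # duplicate instance
    ("rainy", None, "yes"),   # missing value, marked with None
    ("overcast", 83, "yes"),
]


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept


def has_missing(row):
    """Return True if any attribute value in the row is missing."""
    return any(value is None for value in row)


unique_rows = drop_duplicates(rows)
complete_rows = [row for row in unique_rows if not has_missing(row)]
print(len(rows), len(unique_rows), len(complete_rows))
```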
8. Data in this Class
- Data is an (abstract) representation of observations
- Data may be organized
- Relational, object-oriented, semi-structured
- We will consider relational data
9. Relational Data
- A relation is a function associating two or more values
- In database terms, a relation is often defined by a table
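One way to picture this is to store a relation as a set of tuples, one tuple per table row, and to ask whether a given combination of values belongs to it. The names and pairs below are hypothetical, and `holds` is a helper invented for this sketch.

```python
# A binary relation stored as a set of (person1, person2) tuples.
# Each tuple corresponds to one row of a two-column table.
sister_of = {("Ann", "Bob"), ("Ann", "Cat"), ("Cat", "Ann"), ("Cat", "Bob")}


def holds(relation, *values):
    """Return True if the given values are associated by the relation."""
    return values in relation  # `values` is already a tuple


print(holds(sister_of, "Ann", "Bob"))
print(holds(sister_of, "Bob", "Ann"))
```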
10. Concepts
- Data usually represents a concept, an idea or pattern to be mined
- Concepts are generally equivalent to relations
- In machine learning, the pattern is called a concept description
- A concept description is just a fitted model that accurately describes the data
- We will use the term model instead of concept description
- For example, suppose we could create a model of the SisterOf concept from our data.
11. Organizing Concepts
- Most data mining techniques require data in a single table
- Denormalization
- Frequently these single tables are stored in a single flat file
- Flat file mining vs. database mining
- Attribute value learning
12. Organizing Concepts
SELECT Query1.Person1, Query1.Gender,
       Query1.Parent1, Query1.Parent2,
       Query2.Person2, Query2.Gender,
       Query2.Parent1, Query2.Parent2,
       SisterOf.SisterOf?
FROM Query2 INNER JOIN
     (Query1 INNER JOIN SisterOf
      ON Query1.Person1 = SisterOf.Person1)
     ON Query2.Person2 = SisterOf.Person2
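The same denormalization can be sketched outside the database. The code below joins a table of people with a table of SisterOf pairs into one flat row per pair, in the spirit of the query above. The people, their attribute values, and the function name `denormalize` are hypothetical.

```python
# Hypothetical normalized data: one table of people, one of sister pairs.
people = {
    "Ann": {"gender": "F", "parent1": "Dee", "parent2": "Ed"},
    "Bob": {"gender": "M", "parent1": "Dee", "parent2": "Ed"},
}
sister_of = [("Ann", "Bob", "yes")]  # (person1, person2, label)


def denormalize(people, sister_of):
    """Build one flat row per pair, like an inner join on both person columns."""
    rows = []
    for person1, person2, label in sister_of:
        # Inner-join behaviour: skip pairs that refer to unknown people.
        if person1 not in people or person2 not in people:
            continue
        p1, p2 = people[person1], people[person2]
        rows.append((
            person1, p1["gender"], p1["parent1"], p1["parent2"],
            person2, p2["gender"], p2["parent1"], p2["parent2"],
            label,
        ))
    return rows


flat = denormalize(people, sister_of)
print(flat)
```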
13. Instances
- Each concept consists of instances
- Individual, independent example of concept
- Positive instances represent the concept
- Negative instances do not represent the concept
14. Instances
- In the denormalized SisterOf table, only positive instances are included
- All other combinations of people are assumed to be negative instances
- Example of the closed world assumption
- Only positive instances are specified
- All other possible instances are negative
- Assumes all possible cases are covered
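Under the closed world assumption, the negative instances can be generated rather than stored. The sketch below lists every ordered pair of people and treats any pair not given as positive as negative. The three people and the positive pairs are hypothetical.

```python
from itertools import permutations

people = ["Ann", "Bob", "Cat"]  # hypothetical universe of people
positives = {("Ann", "Bob"), ("Ann", "Cat"), ("Cat", "Ann"), ("Cat", "Bob")}

# Closed world assumption: any ordered pair not listed as positive is negative.
negatives = [pair for pair in permutations(people, 2) if pair not in positives]
print(negatives)
```

If some true sister pair were missing from `positives`, it would be wrongly labeled negative, which is why the assumption requires all positive cases to be covered.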
15. Attributes
- Attributes characterize instances
- Also called features
- Attributes that define a class label for classification are often called goal attributes or response variables
16. Something to Ponder
- Non-goal attributes may not uniquely characterize an instance
- Can we fit a model perfectly to this data?
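One way to explore the question is to look for instances that agree on every non-goal attribute but disagree on the goal attribute. If any exist, no model that predicts the goal deterministically from the non-goal attributes alone can fit all of them. The data and the function name `conflicting_groups` are made up for this sketch.

```python
from collections import defaultdict


def conflicting_groups(instances):
    """Return non-goal attribute combinations that map to more than one goal value.

    Each instance is a tuple whose last element is the goal attribute.
    """
    labels = defaultdict(set)
    for *features, goal in instances:
        labels[tuple(features)].add(goal)
    return {feats: goals for feats, goals in labels.items() if len(goals) > 1}


data = [
    ("sunny", "hot", "no"),
    ("sunny", "hot", "yes"),   # same non-goal attributes, different goal value
    ("rainy", "mild", "yes"),
]
conflicts = conflicting_groups(data)
print(conflicts)
```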
17. Attribute Types
- Numeric/Continuous
- Real or integer valued
- 23.96, 10, -5, etc.
- Interval
- Ordered values with fixed and equal units
- E.g. temperatures, years
- Some math operations don't make sense
- Sum of two years: 1939 + 1945 = 3884
- What do we do about adding B.C. and A.D. years?
18. Attribute Types
- Numeric/Continuous
- Ratio
- The measurement scheme defines a meaningful zero point.
- E.g. measuring the distance between two points, angle between two points, cost of an item, etc.
- All mathematical operations make sense.
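A short sketch contrasting interval and ratio attributes. Differences of interval values (years, Celsius temperatures) are meaningful, but ratios only become meaningful on a scale with a true zero, such as Kelvin. The temperature values are made up for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scale temperature to the ratio-scale Kelvin value."""
    return celsius + 273.15


# Interval scale: differences are meaningful.
elapsed_years = 1945 - 1939
warm_c, cool_c = 20.0, 10.0
difference_c = warm_c - cool_c

# Dividing Celsius values gives 2.0, but 20 C is not "twice as hot" as 10 C,
# because 0 C is an arbitrary zero point.
naive_ratio = warm_c / cool_c

# Ratio scale: Kelvin has a meaningful zero, so the ratio is meaningful.
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(elapsed_years, difference_c, naive_ratio, round(ratio_k, 3))
```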
19. Attribute Types
- Nominal
- Discrete values
- Colonial, Bungalow, Cape Cod, Victorian, etc.
- Ordinal
- Discrete values with a notion of order but not distance
- small < medium < large < extra large
- Boolean
- Two values only
- Yes and No, True or False, etc.
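The difference between nominal and ordinal attributes shows up in which comparisons are allowed. In this sketch, ordinal values are compared through their position in an ordered list, while nominal values support only equality tests. The helper `size_rank` is hypothetical; the value lists come from the examples above.

```python
SIZES = ["small", "medium", "large", "extra large"]  # ordinal: order matters
STYLES = {"Colonial", "Bungalow", "Cape Cod", "Victorian"}  # nominal: no order


def size_rank(size):
    """Map an ordinal value to its position; only the order is meaningful."""
    return SIZES.index(size)


# Ordinal values can be compared and sorted by rank.
print(size_rank("small") < size_rank("large"))
print(sorted(["large", "small", "extra large"], key=size_rank))

# Nominal values can only be tested for equality or membership.
print("Bungalow" in STYLES, "Bungalow" == "Victorian")
```

The rank difference between two sizes should not be read as a distance: nothing says that "large" is as far from "medium" as "medium" is from "small".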
20. Attribute Types
- Identify the attribute type of each attribute
21. Relational Data Mining
- It is possible to mine data without denormalization
- Inductive Logic Programming (ILP)
- Currently, too slow to use on large data sets
- Can mine data with infinite relations
- I.e. relations that are recursively defined
22. Introduction to WEKA
- Java implementation of several machine learning algorithms
- We'll be using WEKA this semester for data mining
- Installed on studsys
- Make sure /usr/local/weka/weka-3-2-3/weka.jar is in your CLASSPATH
- Use JDK version 1.3 or later
23. WEKA
- This is the startup window
- Select Explorer for the GUI interface
- Start with
- java weka.gui.GUIChooser
24. WEKA Explorer
- Open data files with Open file
- Sample data sets are available in
/usr/local/weka/weka-3-2-3/data
25. WEKA Explorer
- After opening the weather.arff file
- Select the Classify tab
26. WEKA Classification
Click to select classifier
Click to choose goal attribute
27. WEKA Classification
Validation results show up here.
Right click result in list to view decision tree.
28. Decision Tree
29. ARFF File Format
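The slide's own example did not survive extraction, so the following is only a sketch of the general ARFF layout: an `@relation` line, one `@attribute` line per attribute (either `numeric` or a brace-enclosed list of nominal values), then `@data` followed by comma-separated instances, with `?` marking a missing value. The function `to_arff` and the rows are hypothetical, written in the style of the weather data set.

```python
def to_arff(relation, attributes, rows):
    """Serialize rows into ARFF text.

    `attributes` is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            lines.append(f"@attribute {name} {{{', '.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF writes a missing value as a question mark.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("humidity", "numeric"),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]
rows = [
    ("sunny", 85, 85, "FALSE", "no"),
    ("rainy", None, 86, "FALSE", "yes"),  # illustrative row with a missing value
]
arff_text = to_arff("weather", attributes, rows)
print(arff_text)
```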
30. Next time
- Data mining with decision trees