Data Mining

1
Data Mining

Edward, Hong Zhang
CS Dept, SUNY, Albany
CSI 668, March 20, 2001

2
Presentation Outline

Motivation
Background (KDD Process)
What Is Data Mining?
Why Data Mining?
The Data Mining Process

Data Mining Algorithms
Data Mining Research Trend
Existing Systems for Data Mining
Conclusions

3
Motivation: "Necessity Is the Mother of Invention"

Data explosion problem
Automated data collection tools, increasingly
cheap storage devices, and mature database
technology have led to tremendous amounts of
data stored in databases, data warehouses, and
other information repositories.
We are drowning in data, but starving for
knowledge!
Data is everywhere
Understanding and using data: an imminent task!
Solution: knowledge discovery (data warehousing
and data mining)

4
Evolution of Database Technology

1960s-1970s
Data collection, database creation, IMS and
network DBMS.
1970s-1980s
Relational data model, relational DBMS
implementation.
1980s-1990s
RDBMS, advanced data models (extended-relational,
OO, deductive, etc.) and application-oriented
DBMS (spatial, scientific, engineering, etc.).
1990s-present
Data mining and data warehousing, multimedia
databases, and Web-based database technology.

5
Background

Knowledge Discovery (KD)
The process of finding general
patterns/principles that summarize/explain a set
of "observations".
Knowledge Discovery in Databases (KDD)
Very Large DataBases (VLDB) have become the
industry standard, making it impossible for human
beings to mine the data "by hand" to look for
interesting patterns. Automated tools are
therefore needed to help extract these
patterns.

6
Background (Cont.)

Knowledge discovery in databases (KDD)
consists of three steps:
Data Integration (Data Warehousing)
Collecting the target data observations from
the different data sources, removing noise from
the observations, and integrating them into an
appropriate format.
Data Mining (will be covered in detail)
Applying a concrete algorithm to find useful
and novel patterns in the integrated data.

7
Background (Cont.)

Pattern Evaluation
Interpreting mined patterns, evaluating them
according to usefulness/interestingness criteria,
and possibly using visualization tools to aid in
understanding the patterns graphically.
See KDD process graph below

8
Data Mining: The KDD Process
Data mining: the core of the knowledge discovery
process.
[Figure: KDD process, from bottom to top: Databases
-> Data Cleaning and Data Integration -> Data
Warehouse -> Selection -> Task-relevant Data ->
Data Mining -> Pattern Evaluation -> Knowledge]
9
What Is Data Mining?

Data Mining (knowledge discovery in databases)
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information (knowledge) or patterns from data in
large databases, data warehouses, or other
information repositories.
What is not data mining?
(Deductive) query processing.
Expert systems or Machine Learning/statistical
programs
Online Analytical Processing (OLAP)
Software Agents
10
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by
the contributing disciplines: Database, OLAP, and
High Performance Computing; Visualization; Machine
Learning (AI); Pattern Recognition; Statistics
Modeling; Information Science]
11
Why Data Mining? Potential Applications

Database analysis and decision support
systems (DSS)
Market analysis and management
Target marketing, customer relationship
management, market basket analysis, cross
selling, market segmentation.
Risk analysis and management
Forecasting, customer retention, improved
underwriting, quality control, competitive
analysis.
Text mining (text databases, documents), keyword
search and analysis.
DNA sequence analysis and gene expression.

12
Data Mining and Business Intelligence
[Figure: stack of layers with increasing potential
to support business decisions, listed top to bottom:
Making Decisions
Useful Pattern, Visualization Techniques
Data Mining
Data Exploration: Statistical Analysis, Querying
and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers,
Database Systems, OLTP
Roles shown alongside, top to bottom: End User,
Business Analyst, Data Analyst, DBA]
13
Why Data Mining? Potential Applications (Cont.)

Internet: Web Surf-Aid (Web mining)
IBM Surf-Aid applies data mining algorithms to
Web access logs for market-related pages to
discover customer preferences and behavior,
analyze the effectiveness of Web marketing,
improve Web site organization, etc.
Sports
IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain
competitive advantage for the New York Knicks and
the Miami Heat.

14
The Data Mining Process

[Figure: the data mining system. Labels: data set,
historical training data, training, data mining
algorithm, evaluation, model, score model, new
data, prediction, results/patterns]
15
Examples of Discovered Patterns

Association rules: find rules between
different attributes
98% of AOL users also have eBay accounts
Classification: classify data based on the
values in a classifying attribute
People with age < 40 and salary > 40,000
trade on-line
Clustering: group data to form new classes
Users A and B access similar URLs, so they belong
to the same group, which has similar user profiles.
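As a rough illustration of how an association rule such as "AOL users also have eBay accounts" might be scored, the sketch below computes the two usual measures, support and confidence. The function name `rule_stats` and the five toy user records are hypothetical and are not taken from the presentation.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Records that satisfy the left-hand side of the rule.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Records that satisfy both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data: each set lists the services one user has.
users = [
    {"aol", "ebay"},
    {"aol", "ebay"},
    {"aol"},
    {"ebay"},
    {"aol", "ebay", "yahoo"},
]

support, confidence = rule_stats(users, {"aol"}, {"ebay"})
```

Here confidence plays the role of the percentage quoted on the slide: the share of records matching the left-hand side that also match the right-hand side.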

16
Are All the Discovered Patterns Interesting?

A data mining system/query may generate thousands
of patterns; not all of them are interesting.
Suggested approach: query-based, focused mining
Interestingness measures: a pattern is
interesting if it is
easily understood by humans
valid on new or test data with some degree of
certainty
potentially useful
novel, or validates some hypothesis that a user
seeks to confirm

17
How Can We Find All and Only Interesting Patterns?

Find all the interesting patterns: completeness.
Can a data mining system find all the interesting
patterns?
Search only interesting patterns: optimization.
Can a data mining system find only the
interesting patterns?
Approaches
First generate all the patterns and then filter
out the uninteresting ones.
Generate only the interesting patterns: mining
query optimization

18
Data Mining Algorithms

Four common DM algorithm types
The k-Nearest Neighbor Algorithm (KNN)
Artificial Neural Network (ANN)
Rule Induction
Decision Trees

19
The k-Nearest Neighbor Algorithm (KNN)

A technique that classifies each record in a
dataset based on a combination of the classes of
the k record(s) most similar to it in a
historical dataset.
Uses the entire training database as the model.
Find the nearest data point and do the same thing
as you did for that record.
[Figure: labelled training points scattered around
the query point xq]
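A minimal sketch of the basic idea, assuming Euclidean distance as the similarity measure and a simple majority vote as the "combination of classes" (the slide does not fix either choice). The function name `knn_classify` and the six sample records are hypothetical.

```python
from collections import Counter
import math


def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest training records."""
    # The "model" is simply the full list of (features, label) pairs.
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D historical records with "+" / "-" class labels.
history = [
    ((1.0, 1.0), "+"),
    ((1.5, 2.0), "+"),
    ((2.0, 1.5), "+"),
    ((6.0, 6.0), "-"),
    ((6.5, 7.0), "-"),
    ((7.0, 6.5), "-"),
]

label_near_plus = knn_classify(history, (2.0, 2.0), k=3)
label_near_minus = knn_classify(history, (6.0, 7.0), k=3)
```

Note that nothing is "trained": every prediction scans the whole historical dataset, which is why the model is as large as the training database.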
20
The k-Nearest Neighbor Algorithm (KNN) (Cont.)

Distance-weighted nearest neighbor algorithm.
Weight the contribution of each of the k
neighbors according to their distance to the
query point xq, giving greater weight to closer
neighbors.
Advantages
Calculate the mean values of the k nearest
neighbors.
Robust to noisy data by averaging k-nearest
neighbors.
Very easy to implement.
Disadvantages
Huge models (the entire training database)
More difficult to use in production.
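The distance-weighted variant can be sketched for numeric targets as a weighted mean of the k nearest values. The inverse-square weight 1/d^2 used here is one common choice and is an assumption; the slide only says that closer neighbors get greater weight. The function name and sample records are hypothetical.

```python
import math


def weighted_knn_predict(training, query, k=3):
    """Predict a numeric value as the inverse-square-distance weighted
    mean of the k nearest neighbors' target values."""
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], query))[:k]
    weighted_sum = 0.0
    total_weight = 0.0
    for features, target in nearest:
        d = math.dist(features, query)
        if d == 0.0:
            return target  # exact match: use that record's value directly
        w = 1.0 / d ** 2  # closer neighbors receive larger weights
        weighted_sum += w * target
        total_weight += w
    return weighted_sum / total_weight


# Hypothetical 1-D records: (features, numeric target).
history = [((1.0,), 10.0), ((2.0,), 20.0), ((4.0,), 40.0), ((9.0,), 90.0)]

prediction = weighted_knn_predict(history, (2.5,), k=3)
```

With these records the plain mean of the three nearest targets would be about 23.3, while the weighted prediction is pulled toward 20.0, the value of the closest neighbor.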

21
Artificial Neural Networks (ANN)

Non-linear predictive models that learn through
training and loosely resemble biological neural
networks in structure.
Inputs are transformed through a network of
simple processors.
Each processor combines (weighted) inputs and
produces an output value.

22
Artificial neural networks (Cont.)
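[Figure: a single neuron. The input vector
x = (x0, x1, ..., xn) and the weight vector
w = (w0, w1, ..., wn) are combined into a weighted
sum (Σ); a term μk is subtracted (annotated
"Learning Rate" on the slide); the activation
function f then produces the output y.]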

The n-dimensional input vector x is mapped into
variable y by means of the scalar product and a
nonlinear function mapping
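A small sketch of that mapping for one processing unit: a scalar product of inputs and weights followed by a nonlinear function. The sigmoid activation, the optional `bias` parameter, and the sample vectors are assumptions made for illustration; the slide does not say which activation function is used.

```python
import math


def neuron_output(x, w, bias=0.0):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))


x = [1.0, 0.5, -1.0]  # hypothetical input vector
w = [0.4, -0.2, 0.1]  # hypothetical weight vector
y = neuron_output(x, w)
```

Training a network amounts to adjusting the entries of `w` (and the bias) so that outputs like `y` match the known answers in the training data.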

23
Multi-Layer Perceptron
[Figure: input vector xi -> input nodes -> hidden
nodes -> output nodes -> output vector]
24
Artificial Neural Network evaluation

Advantages
Prediction accuracy is generally high
Robust: still works when training examples contain
errors
Disadvantages
Key problem: difficult to understand
The neural network model is difficult
to understand
No intuitive understanding of results
Long training time
Although prediction is very quick after training,
the training process itself is
time-consuming
Significant pre-processing of data is often required

25
Rule Induction

Rule Induction (rule-based prediction)
We first generate a set of rules from a data
warehouse,
then use them to predict values for new data
items.
It works much better on larger (and real) data
sets, not just on samples of data.
Two phases
Rule discovery: analyze a historical database
and generate a set of rules by automatic
discovery.
Prediction: apply the rules to a new data set
and match the rules to make predictions.

26
Rule Induction Example
Training Set
27
Rule Induction Example (Cont.)

4 attributes
Outlook: sunny, overcast, rainy (3 cases)
Temperature: hot, mild, cool (3 cases)
Humidity: high, normal (2 cases)
Windy: true, false (2 cases)
1 outcome class (N: no class, P: have class)
In total there are 3 × 3 × 2 × 2 = 36 possible
combinations, of which 14 are present in the
set of input examples.
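The count of possible attribute combinations can be checked by enumerating them. This is only a sanity-check sketch; the dictionary layout and variable names are hypothetical.

```python
from itertools import product

# Attribute domains as listed on the slide.
attributes = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": [True, False],
}

# Every possible (outlook, temperature, humidity, windy) tuple.
combinations = list(product(*attributes.values()))
```

`len(combinations)` is the product of the domain sizes, so the 14 training examples cover well under half of the input space.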

28
Rule Induction Example (Cont.)

Some rules induced from the above dataset
Classification rules
If outlook = sunny and humidity = high then
class = N
If outlook = rainy and windy = true then
class = N
If outlook = overcast then class = P
Association rules
If temperature = cool then humidity = normal
If windy = false and class = N then
outlook = sunny and humidity = high
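The prediction phase can be sketched by encoding the three classification rules as data and matching them against a new record. The rule representation, the first-match policy, the `predict` helper, and the two sample days are assumptions for illustration; the slides do not say how conflicts or uncovered records are handled.

```python
# Each rule: (conditions that must all hold, predicted class).
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "N"),
    ({"outlook": "rainy", "windy": True}, "N"),
    ({"outlook": "overcast"}, "P"),
]


def predict(record, rules=RULES, default=None):
    """Return the class of the first rule whose conditions all match."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return default  # no rule covers this record


# Hypothetical new data items.
new_day = {"outlook": "rainy", "temperature": "mild",
           "humidity": "high", "windy": True}
uncovered = {"outlook": "sunny", "temperature": "cool",
             "humidity": "normal", "windy": False}
```

`predict(new_day)` matches the second rule, while `predict(uncovered)` falls through to the default because these three rules do not cover every combination.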

29
What is a decision tree?

A decision tree is a flow-chart-like tree
structure.
Internal node denotes a test on an attribute
Branch represents an outcome of the test
All tuples in branch have the same value for the
tested attribute.
Leaf node represents class label or class label
distribution.
A series of nested if/then rules
Understandable!

30
A Sample Decision Tree
The same training set as in the Rule Induction
example
Outlook
  sunny -> humidity
    high -> N
    normal -> P
  overcast -> P
  rain -> windy
    true -> N
    false -> P
31
Another Example for DT
If x = 1 and y = 0 then class = a
If x = 0 and y = 1 then class = a
If x = 0 and y = 0 then class = b
If x = 1 and y = 1 then class = b
32
Another Example for DT
Credit Analysis
salary < 20000?
  yes -> education in graduate?
    yes -> accept
    no -> reject
  no -> accept
33
Decision-Tree Classification Methods

The basic top-down decision tree generation
approach usually consists of two phases:
Tree construction
At start, all the training examples are at the
root.
Partition examples recursively based on selected
attributes.
Tree pruning
Aims to remove tree branches that may lead to
errors when classifying test data (training data
may contain noise, statistical fluctuations, ...)

34
How to construct a tree?

Algorithm
Greedy algorithm
Make the locally optimal choice at each step:
select the best attribute for each tree node.
Top-down recursive divide-and-conquer manner
From root to leaf
Split a node into several branches
For each branch, recursively run the algorithm
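The greedy, top-down recursion can be sketched as follows. Two things here are assumptions rather than statements from the slides: "best attribute" is taken to mean highest information gain (as in ID3), and the training rows are assumed to be the classic 14-example weather data set, since the presentation's own training table is not reproduced in this transcript.

```python
import math
from collections import Counter

COLUMNS = ["outlook", "temperature", "humidity", "windy"]
# Assumed classic weather data: four attribute values, then the class.
ROWS = [
    ("sunny", "hot", "high", False, "N"),
    ("sunny", "hot", "high", True, "N"),
    ("overcast", "hot", "high", False, "P"),
    ("rainy", "mild", "high", False, "P"),
    ("rainy", "cool", "normal", False, "P"),
    ("rainy", "cool", "normal", True, "N"),
    ("overcast", "cool", "normal", True, "P"),
    ("sunny", "mild", "high", False, "N"),
    ("sunny", "cool", "normal", False, "P"),
    ("rainy", "mild", "normal", False, "P"),
    ("sunny", "mild", "normal", True, "P"),
    ("overcast", "mild", "high", True, "P"),
    ("overcast", "hot", "normal", False, "P"),
    ("rainy", "mild", "high", True, "N"),
]


def entropy(rows):
    """Shannon entropy of the class labels (last field of each row)."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())


def partition(rows, col):
    """Group rows by their value in column `col`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row)
    return groups


def gain(rows, col):
    """Information gain obtained by splitting `rows` on column `col`."""
    parts = partition(rows, col).values()
    return entropy(rows) - sum(len(p) / len(rows) * entropy(p) for p in parts)


def build(rows, cols):
    """Return a class label (leaf) or {attribute name: {value: subtree}}."""
    labels = Counter(row[-1] for row in rows)
    if len(labels) == 1 or not cols:
        return labels.most_common(1)[0][0]
    best = max(cols, key=lambda c: gain(rows, c))  # greedy local choice
    rest = [c for c in cols if c != best]
    # One branch per observed value; recurse on each partition.
    branches = {v: build(sub, rest) for v, sub in partition(rows, best).items()}
    return {COLUMNS[best]: branches}


tree = build(ROWS, [0, 1, 2, 3])
```

On this data the recursion picks outlook at the root, then humidity under sunny and windy under rainy, and never needs temperature.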

35
How to prune a tree?

A decision tree constructed using the training
data may have too many branches/leaf nodes.
Caused by noise, overfitting
May result in poor accuracy for unseen samples
Prune the tree: merge a subtree into a leaf node.
Using a set of data different from the training
data.
At a tree node, if the accuracy without splitting
is higher than the accuracy with splitting,
replace the subtree with a leaf node, label it
using the majority class.
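One way to sketch this pruning rule is a bottom-up pass over the tree using held-out records. The node layout (`attr`, `majority`, `branches`), the idea of storing each node's majority class at construction time, and the toy tree and validation records are all assumptions made for illustration.

```python
def classify(node, record):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(node, dict):
        # Unseen attribute values fall back to the node's majority class.
        node = node["branches"].get(record[node["attr"]], node["majority"])
    return node


def accuracy(node, records):
    """Fraction of records whose class the (sub)tree predicts correctly."""
    return sum(classify(node, r) == r["class"] for r in records) / len(records)


def prune(node, validation):
    """Prune bottom-up, in place, using held-out records reaching this node."""
    if not isinstance(node, dict) or not validation:
        return node
    # First prune each child using only the records routed to it.
    for value, child in node["branches"].items():
        subset = [r for r in validation if r[node["attr"]] == value]
        node["branches"][value] = prune(child, subset)
    # Replace the subtree with a majority-class leaf if the leaf is more accurate.
    if accuracy(node["majority"], validation) > accuracy(node, validation):
        return node["majority"]
    return node


# Hypothetical overfit tree: the "windy" split under "sunny" fits noise.
tree = {
    "attr": "outlook", "majority": "P",
    "branches": {
        "overcast": "P",
        "sunny": {
            "attr": "windy", "majority": "N",
            "branches": {True: "P", False: "N"},
        },
    },
}

# Hypothetical held-out records, separate from the training data.
validation = [
    {"outlook": "sunny", "windy": True, "class": "N"},
    {"outlook": "sunny", "windy": False, "class": "N"},
    {"outlook": "sunny", "windy": True, "class": "N"},
    {"outlook": "overcast", "windy": False, "class": "P"},
]

pruned = prune(tree, validation)
```

In this toy case the windy subtree is collapsed into the leaf N, while the root split on outlook is kept because it is more accurate than a single leaf.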

36
How to use a tree?

Directly
test the attribute value of unknown sample
against the tree.
A path is traced from root to a leaf which holds
the label
Indirectly
decision tree is converted to classification
rules
one rule is created for each path from the root
to a leaf
IF-THEN is easier for humans to understand
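Both uses can be sketched on the sample weather tree. The nested-dict encoding, the helper names, and the exact branch labels are assumptions: the tree is transcribed from the flattened labels of the sample decision tree slide and cross-checked against the classification rules given earlier.

```python
# Sample tree as nested dicts: {attribute: {value: subtree or class label}}.
TREE = {
    "outlook": {
        "sunny": {"humidity": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain": {"windy": {"true": "N", "false": "P"}},
    }
}


def classify(tree, sample):
    """Direct use: trace a path from the root to a leaf."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[sample[attribute]]
    return tree


def to_rules(tree, conditions=()):
    """Indirect use: emit one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        test = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {test} THEN class = {tree}"]
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(to_rules(subtree, conditions + ((attribute, value),)))
    return rules


rules = to_rules(TREE)
label = classify(TREE, {"outlook": "sunny", "humidity": "normal", "windy": "false"})
```

The tree has five leaves, so `rules` holds five IF-THEN statements, one per path.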

37
Decision tree for a covering algorithm
38
Data Mining Algorithm Summary

KNN
Quick and easy
Models tend to be very large
ANN
Difficult to interpret
Can require significant amounts of time to train
Rule Induction
Understandable
Need to limit calculations

Decision Trees
Understandable
Relatively fast
Other DM Technologies
Genetic Algorithms
Rough sets
Bayesian networks
Mixture models
Many more...

39
Data Mining Research Trend

Text mining: text databases and information
retrieval
Multimedia data mining
OLAM (OLAP Mining)
Web mining (Data Mining and WWW)
E-commerce
Information retrieval (search)
Network management

40
Why Mine the Web?

Web: a huge, widely distributed, highly
heterogeneous, semi-structured,
hypertext/hypermedia, interconnected, evolving
information repository.
The Web is a huge collection of documents plus
Hyper-link information
Access and usage information
Enormous wealth of information on Web
Financial information (e.g. stock quotes)
Book/CD/Video stores (e.g. Amazon)
Restaurant information (e.g. Zagats)
Car prices (e.g. Carpoint)
Lots of data on user access patterns
Web logs contain sequence of URLs accessed by
users

41
Why is Web Mining Different?

Huge: The Web is a huge collection of documents
except for
Hyper-link information
Access and usage information
Dynamic: The Web is very dynamic
New pages are constantly being generated
Unstructured: Complexity of Web pages is far
greater than that of a text document collection
Challenge: Develop new Web mining algorithms and
adapt traditional data mining algorithms to
Exploit hyper-links and access patterns
Be incremental

42
Types of Web Mining
43
Web Mining Applications

E-commerce (Infrastructure)
Generate user profiles
Targeted advertising
Fraud detection
Similar image retrieval
Information retrieval (Search) on the Web
Automated generation of topic hierarchies
Web knowledge bases
Extraction of schema for XML documents
Network Management
Performance management
Fault management

44
Existing Systems for Data Mining

IBM: Intelligent Miner
SAS Institute: Enterprise Miner
Silicon Graphics: MineSet
Integral Solutions Ltd.: Clementine
Information Discovery Inc.: Data Mining Suite
DBMiner Technology Inc.: DBMiner
Rutgers: DataMine; GMD: Explora; Univ. Munich:
VisDB

45
Microsoft OLE DB for Data Mining

Microsoft OLE, OLE DB, OLE DB for OLAP and OLE DB
for Data Mining
OLE DB for DM standardization: July 1999 to March
2000
Microsoft SQL Server 2000 Analysis Manager
Analysis Manager consists of OLAP and Data Mining
Data mining: two modules (classification/prediction
and clustering)
OLE DB for DM: data mining providers (such as
association modules and other classification or
clustering modules)

46
Research Progress for Data Mining in the Last
Decade

Multi-dimensional data analysis: data warehouse
and OLAP (on-line analytical processing)
Association, correlation, and causality analysis
Classification: scalability and new approaches
Clustering and outlier analysis
Sequential patterns and time-series analysis
Text mining, Web mining and Weblog analysis
Spatial, multimedia, scientific data analysis
Data preprocessing and database compression
Data visualization and visual data mining

47
Conclusions

Knowledge Discovery in Databases (KDD)
Data warehouse: an industry trend
DW stores a huge amount of subject-oriented,
cleansed, integrated, consolidated, time-related
data.
Data mining: a rich, promising, young field with
broad applications and many challenging research
issues. Good science: leading position in the
research community

48
Conclusions (Cont.)

Data mining tasks: characterization, association,
classification, clustering, prediction, sequence
and pattern analysis, etc.
Data mining Algorithms
The k-Nearest Neighbor Algorithm (KNN)
Artificial Neural Network (ANN)
Rule Induction
Decision Trees
Research progress and trend in Data Mining

49
Future Work

Theoretical foundations of data mining.
Implementation and new data mining methodologies
A set of well-tuned, standard mining operators.
Data and knowledge visualization tools.
Integration of multiple data mining strategies.
Data mining in advanced information systems
Spatial, multimedia, Web-mining
Data mining applications
content browsing, query optimization,
multi-resolution model, etc.
Social issues: a threat to security and privacy.
