Themes in this session - PowerPoint PPT Presentation

Slides: 43

Provided by: L250

Transcript and Presenter's Notes

Title: Themes in this session

1
Lecture 2

Themes in this session
Knowledge discovery in databases
Data mining
Multidimensional analysis and OLAP

2
Knowledge discovery in databases
3
What is Knowledge?

Data
symbols representing properties of events and
their environments
Information
is contained in descriptions, provides the
answers to a number of basic questions
Knowledge
basic know-how that facilitates action
Understanding
achieved through diagnosis and prescription
Wisdom
judgement of what is efficient and effective

4
Characteristics of discovered knowledge

non-trivial
valid
novel
potentially useful
understandable
An aggregated measure is interestingness
validity
novelty
usefulness
simplicity

5
A more formal definition of knowledge

Pattern
A pattern is an expression E in a language L
describing facts in a subset F_E of F. E is called
a pattern if it is simpler than the enumeration
of all the facts in F_E
Knowledge
A pattern E ∈ L is called knowledge if, for some
user-specified threshold i ∈ M_I,
I(E, F, C, N, U, S) > i,
where C = validity, N = novelty, U = usefulness,
S = simplicity
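
The threshold test in this definition can be sketched in a few lines. The slide does not say how the interestingness function I combines its components, so the weighted mean below is purely an assumption, and the names PatternScores, interestingness and is_knowledge are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PatternScores:
    """Scores in [0, 1] for one discovered pattern (hypothetical container)."""
    validity: float    # C
    novelty: float     # N
    usefulness: float  # U
    simplicity: float  # S


def interestingness(scores, weights=(0.25, 0.25, 0.25, 0.25)):
    """Assumed aggregate: a weighted mean of the four component scores."""
    parts = (scores.validity, scores.novelty, scores.usefulness, scores.simplicity)
    return sum(w * p for w, p in zip(weights, parts))


def is_knowledge(scores, threshold):
    """A pattern counts as knowledge if its interestingness exceeds the threshold."""
    return interestingness(scores) > threshold
```

The threshold is supplied by the user, so the same pattern may count as knowledge for one analyst and not for another.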

6
What is KDD?

Knowledge Discovery in Databases involves the
extraction of implicit, previously unknown and
potentially useful information from data.
KDD is a process
involves the extraction, organisation and
presentation of discovered information
KDD is effected by a human-centred system
is in itself a knowledge intensive task
consisting of complex interactions between a
human and a (large) database.

7
Overview of the analyst's tasks
[Diagram: nodes Goals, Insight, Queries, Analyses, DB, Output, Dataset; arrows labelled formulates, generates, gains, enriches]
8
Characteristics of the KDD process

highly iterative
protracted over time
numerous sub-tasks
highly complex
numerous input systems

9
A description of the KDD process
[Diagram: the phases of the KDD process: goal formulation, data discovery, task discovery, data cleaning, model development, data analysis, output generation]
10
Goal formulation

Based on a means-ends chain extending into the
workings of the organisation
Formulate a goal for improving the operations of
the business
Decide what one needs to know in order to fulfil
this goal and perform the business activity in a
better manner
On the basis of what one needs to know formulate
goals for how to discover this information by
using the KDD process
Revise all of the goals above, if needed, on the
basis of iterative discovery

11
Data discovery

Try and understand the domain in order to
determine which entities are relevant to the
discovery process
Check the coverage and content of the data
sift through the source data to see what is
available
sift through the source data to see what is not
available
Determine the quality of the data
Determine the structure of the data

12
Task discovery

Find means stipulated by the ends contained in
the knowledge discovery goals
Find out what the real requirements on the tasks
and the performance of these tasks are
Refine the requirements and choice of tasks until
you're sure you're setting about answering the
correct questions

13
Data cleaning

Ensure the quality of the data that will be used
in the KDD process
Eliminate data quality problems in the data such
as
inconsistencies due to differences between
various data sources
missing data
different forms of data representation
data incompatibility

14
Model development

Involves activities concerned with forming a
basic hypothesis which can satisfy the knowledge
discovery goals
Select the parameters for the model
formulate measures that can be used to quantify
achievement of the goal (outcome variable or
dependent variable)
select a set of independent variables which are
deemed to have relevance to the outcome variables
Segment the data
find possible relevant subsets in the population
Choose an analysis model which fits the problem
domain
NOTE: This whole phase demands background
knowledge of the domain

15
Data analysis

Involves activities aimed at determining the
rules/reasons governing the behaviour of those
entities focused on by the knowledge discovery
goal
specify the chosen model
use some form of formal expression
fit the model to the data
perform initial adjustments to some of the
parameters
evaluate the model
check the soundness of the model against the data
refine the model
modify the model on the basis of its
discrepancies with the evidence presented by the
data

16
Output generation

Reports of findings in the analysis
Action suggestions on the basis of the findings
Models for use in similar analysis scenarios
Monitoring mechanisms which observe the variables
covered in the analysis and trigger
notifications when certain conditions are noted
in the data.
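
A monitoring mechanism of this kind can be as small as a predicate applied to incoming values. The sketch below is an assumption about how one might look; the function name monitor and the alert format are hypothetical.

```python
def monitor(readings, condition):
    """Return a notification for each (name, value) reading that meets the condition."""
    return [
        f"ALERT: {name}={value}"
        for name, value in readings
        if condition(name, value)
    ]


# Example: notify when a monitored stock level drops below 10.
alerts = monitor([("stock", 5), ("stock", 50)], lambda name, value: value < 10)
```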

17
Developing KDD applications

Purpose: an application to answer a key business
question
a labour intensive initial discovery of knowledge
by someone who understands the domain as well as
the specific data analysis techniques needed
encoding of the discovered knowledge within a
specific problem solving architecture
application of the knowledge in the context of a
real world task by a well understood class of
end-users
Installation of analysis, monitoring, and
reporting mechanisms as a base for continual
evaluation of data

18
Data mining
19
What is data mining?

Rather formal definition
Data mining involves fitting models to, and
observing patterns from, observed data through
the application of specific algorithms.
Less formally
Data analysis in order to explain an aspect of a
complex reality by expressing it as an
understandable simplification

20
Goals for data mining

Prediction
involves using some variables or fields in the
database to predict unknown or future values of
other variables of interest
Description
focuses on finding human interpretable patterns
describing the data

21
Rationale for data mining

Dramatic increase in the amount of data available
(the data explosion)
Increasing competition in the world market
The low relative value of easily discovered
information
Increasing cleverness
Emergence of new enabling technology

22
Enabling factors for data mining

Increased data storage ability
Increased data gathering ability
Increased processing power
The introduction of new computationally intensive
methods of machine learning

23
Background to data mining

Inductive learning
supervised learning
unsupervised learning
Statistics
Machine learning
Differences between DM and ML
DM finds understandable knowledge, ML improves
the performance of an agent
DM is concerned with large, real-world databases,
ML with smaller data sets
ML is a broader field, not only learning by
example

24
Data mining algorithms

Specific mix of three components
The model
function
representational form
parameters from the data
The model evaluation (preference) criterion
preference of one set of models or set of
parameters over another
based on goodness-of-fit function
The search method
a method for finding particular models and
parameters
Given data, family of models, preference
criterion
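
The three components can be made concrete with a deliberately small example. This is a sketch under assumptions: the model is a straight line, the preference criterion is the sum of squared errors, and the search method is an exhaustive grid search. None of these choices is prescribed by the slides.

```python
def model(x, slope, intercept):
    """The model: a straight line with two parameters."""
    return slope * x + intercept


def sse(params, data):
    """The preference criterion: sum of squared errors (lower is better)."""
    slope, intercept = params
    return sum((y - model(x, slope, intercept)) ** 2 for x, y in data)


def grid_search(data, slopes, intercepts):
    """The search method: try every candidate parameter pair, keep the best."""
    candidates = [(s, i) for s in slopes for i in intercepts]
    return min(candidates, key=lambda params: sse(params, data))
```

Swapping any one component (a tree instead of a line, likelihood instead of squared error, gradient descent instead of a grid) yields a different algorithm from the same mix.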

25
Primary operations in data mining

A number of basic operations can be used for
prediction and description
Classification
Regression
Clustering
Summarisation
Dependency modelling
Change and deviation detection

26
Classification

Learning a function that maps (classifies) a data
item into one of several predefined classes
In supervised learning it is the user that
defines the classes.
The classification is applied in the form of one
or more attributes that denote the class of the
data item.
These classifying attributes are known as
predicted attributes. A combination of values for
the predicted attributes defines a class
Other attributes of the data item are known as
predicting attributes
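
As an illustration, the sketch below learns a mapping from numeric predicting attributes to a predefined class label. It uses a nearest-centroid rule, which is just one simple classifier chosen for brevity; the slides do not name a particular method, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(examples):
    """examples: list of (predicting_attributes, class_label) pairs."""
    groups = defaultdict(list)
    for features, label in examples:
        groups[label].append(features)
    # One centroid per class: the mean of each predicting attribute.
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }


def classify(centroids, features):
    """Assign the class whose centroid is closest to the data item."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))
```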

27
Regression

A common statistical technique for modelling the
relationship between two or more variables
Learning a function which maps a data item to a
real-valued prediction variable
Simple linear regression uses the straight line
model Y = β0 + β1X + ε, where Y is the
prediction variable (dependent variable) and X is
the predictive variable (independent variable)
Multiple regression involves more than two
variables and uses the model
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε,
where Y is the prediction variable
and X1, ..., Xn are the predictive variables
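
For the simple linear case, the intercept and slope of the straight-line model can be estimated by ordinary least squares. The slides do not say which fitting method is intended, so least squares is an assumption here, and the function name is hypothetical.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates (intercept, slope) for a straight-line model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # estimate of the slope coefficient
    intercept = mean_y - slope * mean_x  # estimate of the intercept
    return intercept, slope


def predict(x, intercept, slope):
    """Real-valued prediction for a new data item."""
    return intercept + slope * x
```

The residuals, the differences between each observed Y and its prediction, play the role of the error term in the model.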

28
Clustering

A common descriptive task for determining a
finite set of categories or clusters to describe
the data
Categories may be mutually exclusive and
exhaustive, or consist of richer representations
such as hierarchical or overlapping categories
A cluster is a group of objects grouped together
because of their similarity or proximity. Data
units in a cluster are homogeneous and
differ significantly from those in other clusters
Correlations and functions of distance between
elements are used in defining the clusters
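
A distance-based clustering can be sketched with a one-dimensional k-means loop. The choice of k-means, the absolute difference as the distance function and the caller-supplied starting centres are all assumptions made for illustration.

```python
def kmeans_1d(values, centres, iterations=10):
    """Group numeric values around the given starting centres."""
    centres = list(centres)
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for v in values:
            # Assign each value to the nearest centre.
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [
            sum(c) / len(c) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
    return centres, clusters
```

This produces mutually exclusive, exhaustive clusters; hierarchical or overlapping categories need other methods.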

29
Summarisation

Methods for finding a compact description for a
subset of data
Often relies on statistical methods such as the
calculation of means and standard deviations
Is often applied to interactive exploratory data
analysis and automated report generation.
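
A compact description of one field of a data subset can be computed with Python's standard statistics module. The record layout and the function name summarise are assumptions made for this sketch.

```python
import statistics


def summarise(rows, field):
    """Count, mean and sample standard deviation of one numeric field."""
    values = [row[field] for row in rows]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # needs at least two values
    }
```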

30
Dependency modelling

Consists of finding a model which describes
significant dependencies between variables
There are two levels of dependency in dependency
models
The structural level specifies which variables
are locally dependent on each other
The quantitative level specifies the strengths of
the dependencies using some numerical scale
Often in the form: x% of all records containing
items A and B also contain items D and E
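
The quantitative level of such a rule, the share of records containing the first item set that also contain the second, can be computed directly. This is a sketch assuming each record is a set of items; the function name rule_confidence is hypothetical.

```python
def rule_confidence(records, antecedent, consequent):
    """Fraction of records containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    matching = [r for r in records if antecedent <= set(r)]
    if not matching:
        return None  # the rule cannot be evaluated on this data
    hits = sum(1 for r in matching if consequent <= set(r))
    return hits / len(matching)
```

For example, if two of the three records that contain A and B also contain D and E, the function returns roughly 0.67.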

31
Change and deviation detection

Focuses on discovering the most significant
changes in the data from previously measured or
normative values
Often used on a long time series of records in
order to discover trends
Often used to discover sequential patterns
occurring over extended time periods
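
One simple way to flag deviations from previously measured values is to compare new observations with the mean and spread of a baseline. The three-standard-deviation rule used below is a common convention, not something the slides specify, and the function name is hypothetical.

```python
import statistics


def deviations(baseline, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # needs at least two baseline values
    return [v for v in new_values if abs(v - mean) > threshold * spread]
```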

32
Problems and issues in data mining

Limited information
Noise and missing values
Uncertainty
Size of databases
Irrelevance of certain fields
Updates to databases

33
Multidimensional analysis and OLAP
34
OLAP vs OLTP

OLTP servers handle mission-critical production
data accessed through simple queries
usually handles queries of an automated nature
OLTP applications consist of a large number of
relatively simple transactions.
Most often contains data organised on the basis
of logical relations between normalised tables
OLAP servers handle management-critical data
accessed through an iterative analytical
investigation
usually handles queries of an ad-hoc nature
supports more complex and demanding transactions
contains logically organised data in multiple
dimensions

35
What is OLAP?

Definition: The dynamic synthesis, analysis and
consolidation of large volumes of
multidimensional data.
Flexible information synthesis
Multiple data dimensions/consolidation paths
Dynamic data analysis
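
Consolidation along a chosen set of dimensions can be sketched as a roll-up over fact records. The fact layout (region, product, year, sales) is invented for illustration, and a real OLAP server would do this over far larger volumes with precomputed aggregates.

```python
from collections import defaultdict


def roll_up(facts, dimensions, measure):
    """Sum a measure over all facts, grouped by the chosen dimensions."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


# Hypothetical fact table with three dimensions and one measure.
facts = [
    {"region": "North", "product": "A", "year": 2001, "sales": 10.0},
    {"region": "North", "product": "B", "year": 2001, "sales": 20.0},
    {"region": "South", "product": "A", "year": 2002, "sales": 5.0},
]
by_region = roll_up(facts, ["region"], "sales")
by_region_year = roll_up(facts, ["region", "year"], "sales")
```

Choosing a different list of dimensions gives a different consolidation path over the same data.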

36
Codd's four data models for data analysis