Themes in this session - PowerPoint PPT Presentation

Provided by: L250

Transcript and Presenter's Notes

Title: Themes in this session


1
Lecture 2
  • Themes in this session
  • Knowledge discovery in databases
  • Data mining
  • Multidimensional analysis and OLAP

2
Knowledge discovery in databases
3
What is Knowledge?
  • Data
  • symbols representing properties of events and
    their environments
  • Information
  • is contained in descriptions, provides the
    answers to a number of basic questions
  • Knowledge
  • basic know-how, which allows action
  • Understanding
  • achieved through diagnosis and prescription
  • Wisdom
  • judgement of what is efficient and effective

4
Characteristics of discovered knowledge
  • non-trivial
  • valid
  • novel
  • potentially useful
  • understandable
  • An aggregated measure is interestingness
  • validity
  • novelty
  • usefulness
  • simplicity

5
A more formal definition of knowledge
  • Pattern
  • A pattern is an expression E in a language L
    describing facts in a subset F_E of F. E is called
    a pattern if it is simpler than the enumeration
    of all the facts in F_E
  • Knowledge
  • A pattern E ∈ L is called knowledge if, for some
    user-specified threshold i ∈ M_I,
    I(E, F, C, N, U, S) > i
  • where C = validity, N = novelty, U = usefulness,
    S = simplicity
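
The threshold test in this definition can be sketched in a few lines. The slide does not say how the interestingness measure I combines its components, so the unweighted mean below is purely an assumption, as are the class name, the function names and the example scores.

```python
from dataclasses import dataclass


@dataclass
class PatternScores:
    """Scores in [0, 1] for one pattern E (hypothetical values)."""
    validity: float    # C
    novelty: float     # N
    usefulness: float  # U
    simplicity: float  # S


def interestingness(s: PatternScores) -> float:
    # Assumption: I is an unweighted mean; the slide leaves I undefined.
    return (s.validity + s.novelty + s.usefulness + s.simplicity) / 4


def is_knowledge(s: PatternScores, threshold: float) -> bool:
    """A pattern counts as knowledge if its interestingness exceeds the threshold."""
    return interestingness(s) > threshold


strong = PatternScores(validity=0.9, novelty=0.7, usefulness=0.8, simplicity=0.6)
weak = PatternScores(validity=0.9, novelty=0.1, usefulness=0.2, simplicity=0.4)
print(is_knowledge(strong, 0.5), is_knowledge(weak, 0.5))
```

A valid but unsurprising pattern (the second one) falls below the threshold even though its validity is high, which is the point of aggregating several criteria.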

6
What is KDD?
  • Knowledge Discovery in Databases involves the
    extraction of implicit, previously unknown and
    potentially useful information from data.
  • KDD is a process
  • involves the extraction, organisation and
    presentation of discovered information
  • KDD is effected by a human-centred system
  • is in itself a knowledge intensive task
    consisting of complex interactions between a
    human and a (large) database.

7
Overview of the analyst's tasks
[Diagram. Nodes: Goals, Insight, Queries, Analyses, DB, Output, Dataset. Arrow labels: gains, formulates, enriches, generates.]
8
Characteristics of the KDD process
  • highly iterative
  • protracted over time
  • numerous sub-tasks
  • highly complex
  • numerous input systems

9
A description of the KDD process
[Diagram of the process stages, listed here in the order of the slides that follow: Goal formulation, Data discovery, Task discovery, Data cleaning, Model development, Data analysis, Output generation.]
10
Goal formulation
  • Based on a means-ends chain extending into the
    workings of the organisation
  • Formulate a goal for improving the operations of
    the business
  • Decide what one needs to know in order to fulfil
    this goal and perform the business activity in a
    better manner
  • On the basis of what one needs to know formulate
    goals for how to discover this information by
    using the KDD process
  • Revise all of the goals above if needed on the
    basis of iterative discovery

11
Data discovery
  • Try to understand the domain in order to
    determine which entities are relevant to the
    discovery process
  • Check the coverage and content of the data
  • sift through the source data to see what is
    available
  • sift through the source data to see what is not
    available
  • Determine the quality of the data
  • Determine the structure of the data

12
Task discovery
  • Find means stipulated by the ends contained in
    the knowledge discovery goals
  • Find out what the real requirements on the tasks
    and the performance of these tasks are
  • Refine the requirements and choice of tasks until
    you're sure you're setting about answering the
    correct questions

13
Data cleaning
  • Ensure the quality of the data that will be used
    in the KDD process
  • Eliminate data quality problems in the data such
    as
  • inconsistencies due to differences between
    various data sources
  • missing data
  • different forms of data representation
  • data incompatibility
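
A minimal sketch of how the problems listed above might be handled for a small merged dataset. The records, the country lookup table and the two date formats are all made up for illustration; real sources would need their own rules.

```python
from datetime import datetime

# Hypothetical customer records merged from two sources.
records = [
    {"id": 1, "country": "SE", "joined": "2001-03-15"},
    {"id": 2, "country": "Sweden", "joined": "15/03/2001"},  # other representation
    {"id": 3, "country": None, "joined": "2001-04-02"},      # missing data
]

# Assumed mapping that reconciles the two sources' country spellings.
COUNTRY_CODES = {"se": "SE", "sweden": "SE"}


def normalise_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(record):
    """Bring one record into a single, consistent representation."""
    country = record["country"]
    code = COUNTRY_CODES.get(country.lower()) if country else None
    return {"id": record["id"], "country": code,
            "joined": normalise_date(record["joined"])}


cleaned = [clean(r) for r in records]
# Keep only records with no missing values after cleaning.
complete = [r for r in cleaned if all(v is not None for v in r.values())]
print(complete)
```

Dropping incomplete records is only one option; imputing or flagging missing values may suit the discovery goal better.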

14
Model development
  • Involves activities concerned with forming a
    basic hypothesis which can satisfy the knowledge
    discovery goals
  • Select the parameters for the model
  • formulate measures that can be used to quantify
    achievement of the goal (outcome variable or
    dependent variable)
  • select a set of independent variables which are
    deemed to have relevance to the outcome variables
  • Segment the data
  • find possible relevant subsets in the population
  • Choose an analysis model which fits the problem
    domain
  • NOTE: This whole phase demands background
    knowledge of the domain

15
Data analysis
  • Involves activities aimed at determining the
    rules/reasons governing the behaviour of those
    entities focused on by the knowledge discovery
    goal
  • specify the chosen model
  • use some form of formal expression
  • fit the model to the data
  • perform initial adjustments to some of the
    parameters
  • evaluate the model
  • check the soundness of the model against the data
  • refine the model
  • modify the model on the basis of its
    discrepancies with the evidence presented by the
    data

16
Output generation
  • Reports of findings in the analysis
  • Action suggestions on the basis of the findings
  • Models for use in similar analysis scenarios
  • Monitoring mechanisms which observe the variables
    covered in the analysis and trigger
    notifications when certain conditions are noted
    in the data.

17
Developing KDD applications
  • Purpose: an application to answer a key business
    question
  • a labour intensive initial discovery of knowledge
    by someone who understands the domain as well as
    the specific data analysis techniques needed
  • encoding of the discovered knowledge within a
    specific problem solving architecture
  • application of the knowledge in the context of a
    real world task by a well understood class of
    end-users
  • Installation of analysis, monitoring, and
    reporting mechanisms as a base for continual
    evaluation of data

18
Data mining
19
What is data mining?
  • Rather formal definition
  • Data mining involves fitting models to, and
    observing patterns from, observed data through
    the application of specific algorithms.
  • Less formally
  • Data analysis in order to explain an aspect of a
    complex reality by expressing it as an
    understandable simplification

20
Goals for data mining
  • Prediction
  • involves using some variables or fields in the
    database to predict unknown or future values of
    other variables of interest
  • Description
  • focuses on finding human interpretable patterns
    describing the data

21
Rationale for data mining
  • Dramatic increase in the amount of data available
    (the data explosion)
  • Increasing competition in the world market
  • The low relative value of easily discovered
    information
  • Increasing cleverness
  • Emergence of new enabling technology

22
Enabling factors for data mining
  • Increased data storage ability
  • Increased data gathering ability
  • Increased processing power
  • The introduction of new computationally intensive
    methods of machine learning

23
Background to data mining
  • Inductive learning
  • supervised learning
  • unsupervised learning
  • Statistics
  • Machine learning
  • Differences between DM and ML
  • DM finds understandable knowledge, ML improves
    the performance of an agent
  • DM is concerned with large, real-world databases,
    ML with smaller data sets
  • ML is a broader field, not only learning by
    example

24
Data mining algorithms
  • Specific mix of three components
  • The model
  • function
  • representational form
  • parameters from the data
  • The model evaluation (preference) criterion
  • preference of one set of models or set of
    parameters over another
  • based on goodness-of-fit function
  • The search method
  • a method for finding particular models and
    parameters
  • Given data, family of models, preference
    criterion
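
The three components can be made concrete with a deliberately small example: a one-parameter model, a sum-of-squared-errors criterion and a grid search. The data points and the grid of candidate slopes are assumptions chosen for illustration, not part of the lecture.

```python
# Toy data lying roughly on y = 2x (made up for illustration).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def model(slope, x):
    """The model: representational form y = slope * x, one parameter."""
    return slope * x


def squared_error(slope):
    """The evaluation criterion: goodness of fit as a sum of squared errors."""
    return sum((y - model(slope, x)) ** 2 for x, y in data)


def grid_search(candidates):
    """The search method: evaluate every candidate and keep the best one."""
    return min(candidates, key=squared_error)


candidates = [i / 10 for i in range(0, 41)]  # slopes 0.0, 0.1, ..., 4.0
best_slope = grid_search(candidates)
print(best_slope, squared_error(best_slope))
```

Swapping any one component (a different model family, a different criterion, or a smarter search such as gradient descent) gives a different algorithm built from the same three parts.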

25
Primary operations in data mining
  • A number of basic operations can be used for
    prediction and description
  • Classification
  • Regression
  • Clustering
  • Summarisation
  • Dependency modelling
  • Change and deviation detection

26
Classification
  • Learning a function that maps (classifies) a data
    item into one of several predefined classes
  • In supervised learning it is the user that
    defines the classes.
  • The classification is applied in the form of one
    or more attributes that denote the class of the
    data item.
  • These classifying attributes are known as
    predicted attributes. A combination of values for
    the predicted attributes defines a class
  • Other attributes of the data item are known as
    predicting attributes
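
As an illustration of predicting and predicted attributes, here is a sketch of a 1-nearest-neighbour classifier. This is one simple technique chosen for brevity, not one the slides prescribe; the training items and class names are hypothetical.

```python
import math

# Hypothetical training items: predicting attributes (age, income in
# thousands) paired with a user-defined predicted attribute (the class).
training = [
    ((25, 20), "standard"),
    ((30, 25), "standard"),
    ((45, 80), "premium"),
    ((50, 95), "premium"),
]


def classify(item):
    """Map a data item to the class of its nearest training item."""
    nearest = min(training, key=lambda pair: math.dist(pair[0], item))
    return nearest[1]


print(classify((28, 22)), classify((48, 90)))
```

The classes are fixed in advance by whoever labelled the training items, which is what makes this supervised learning.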

27
Regression
  • A common statistical technique for modelling the
    relationship between two or more variables
  • Learning a function which maps a data item to a
    real-valued prediction variable
  • Simple linear regression uses the straight line
    model Y = β0 + β1X + ε, where Y is the
    prediction variable (dependent variable) and X is
    the predictive variable (independent variable)
  • Multiple regression involves more than two
    variables and uses the model
    Y = β0 + β1X1 + β2X2 + ... + βnXn + ε,
    where Y is the prediction variable
    and X1 ... Xn are the predictive variables
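
A minimal sketch of fitting the straight-line model by ordinary least squares, which is the usual estimation method although the slide does not name one. The function name and the data are assumptions; the data lie exactly on a line so the fitted intercept (b0) and slope (b1) are easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly Y = 1 + 2X, with no error term
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```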

28
Clustering
  • A common descriptive task for determining a
    finite set of categories or clusters to describe
    the data
  • Categories may be mutually exclusive and
    exhaustive, or consist of richer representations
    such as hierarchical or overlapping categories
  • A cluster is a group of objects grouped together
    because of their similarity or proximity. Data
    units in a cluster are homogeneous and differ
    significantly from those in other clusters
  • Correlations and functions of distance between
    elements are used in defining the clusters
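
One common way to form clusters from a distance function is k-means, sketched here for one-dimensional data. The slides do not name a specific algorithm, so this choice, the starting centres and the values are all assumptions.

```python
def k_means_1d(values, centres, iterations=10):
    """Alternate between assigning values to the nearest centre and
    moving each centre to the mean of its assigned values."""
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Keep the old centre if a cluster ends up empty.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters


values = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centres, clusters = k_means_1d(values, centres=[0.0, 5.0])
print(centres, clusters)
```

This produces mutually exclusive and exhaustive clusters; hierarchical or overlapping categories need other methods.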

29
Summarisation
  • Methods for finding a compact description for a
    subset of data
  • Often relies on statistical methods such as the
    calculation of means and standard deviations
  • Are often applied to interactive exploratory data
    analysis and automated report generation.
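
A compact description of a subset can be as simple as a count, a mean and a standard deviation. The sketch below uses Python's standard `statistics` module; the order values are made up.

```python
import statistics

# Hypothetical subset of data: order values for one customer segment.
order_values = [12.0, 15.0, 11.0, 14.0, 13.0]

summary = {
    "count": len(order_values),
    "mean": statistics.mean(order_values),
    "stdev": statistics.stdev(order_values),  # sample standard deviation
}
print(summary)
```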

30
Dependency modelling
  • Consists of finding a model which describes
    significant dependencies between variables
  • There are two levels of dependency in dependency
    models
  • The structural level specifies which variables
    are locally dependent on each other
  • The quantitative level specifies the strengths of
    the dependencies using some numerical scale
  • Often in the form: x% of all records containing
    items A and B also contain items D and E
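
The share of records containing A and B that also contain D and E is usually called the confidence of the rule in association-rule mining (a term the slide does not use). A minimal sketch over a hypothetical set of transactions:

```python
# Hypothetical records, each a set of items.
transactions = [
    {"A", "B", "D", "E"},
    {"A", "B", "D", "E"},
    {"A", "B", "C"},
    {"A", "D"},
    {"A", "B", "D", "E", "F"},
]


def confidence(antecedent, consequent):
    """Fraction of records containing the antecedent that also
    contain the consequent."""
    matching = [t for t in transactions if antecedent <= t]
    if not matching:
        return 0.0
    return sum(1 for t in matching if consequent <= t) / len(matching)


rule_confidence = confidence({"A", "B"}, {"D", "E"})
print(f"{rule_confidence:.0%} of records with A and B also contain D and E")
```

Here the structural level is the rule itself (A and B imply D and E) and the quantitative level is the computed fraction.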

31
Change and deviation detection
  • Focuses on discovering the most significant
    changes in the data from previously measured or
    normative values
  • Often used on a long time series of records in
    order to discover trends
  • Often used to discover sequential patterns
    occurring over extended time periods
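
A simple way to flag deviations from previously measured values is to compare each new value with the baseline mean and standard deviation. The three-standard-deviation cut-off and the data are assumptions for illustration; the slides do not fix a method.

```python
import statistics

baseline = [100, 102, 98, 101, 99]  # previously measured values (made up)
new_values = [100, 97, 130, 103]

mean = statistics.mean(baseline)
spread = statistics.stdev(baseline)


def deviations(values, k=3.0):
    """Return the values lying more than k standard deviations
    from the baseline mean."""
    return [v for v in values if abs(v - mean) > k * spread]


flagged = deviations(new_values)
print(flagged)
```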

32
Problems and issues in data mining
  • Limited information
  • Noise and missing values
  • Uncertainty
  • Size of databases
  • Irrelevance of certain fields
  • Updates to databases

33
Multidimensional analysis and OLAP
34
OLAP vs OLTP
  • OLTP servers handle mission-critical production
    data accessed through simple queries
  • usually handles queries of an automated nature
  • OLTP applications consist of a large number of
    relatively simple transactions.
  • Most often contains data organised on the basis
    of logical relations between normalised tables
  • OLAP servers handle management-critical data
    accessed through an iterative analytical
    investigation
  • usually handles queries of an ad-hoc nature
  • supports more complex and demanding transactions
  • contains logically organised data in multiple
    dimensions

35
What is OLAP?
  • Definition: The dynamic synthesis, analysis and
    consolidation of large volumes of
    multidimensional data.
  • Flexible information synthesis
  • Multiple data dimensions/consolidation paths
  • Dynamic data analysis

36
Codd's four data models for data analysis
  • Categorical data models
  • Exegetical data models
  • Contemplative data models
  • Formulaic data models

37
Dimensionality revisited
38
OLAP Tool evaluation criteria (1-6)
  • Multidimensional conceptual view
  • Transparency
  • Accessibility
  • Consistent reporting performance
  • Client-Server architecture
  • Generic dimensionality

39
OLAP Tool evaluation criteria (7-12)
  • Dynamic Sparse Matrix handling
  • Multi-user support
  • Unrestricted cross-dimensional analysis
  • Intuitive data manipulation
  • Flexible reporting
  • Unlimited dimensions and aggregation levels

40
Functionality of OLAP tools
  • Drill-down
  • Drill-up
  • Roll-up or consolidation
  • Slicing and dicing by pivoting
  • Drill-through
  • Drill-across
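
Roll-up, drill-down and slicing can be sketched on a tiny in-memory fact table. The dimensions, the data and the helper functions `aggregate` and `select_slice` are hypothetical; real OLAP tools expose these operations through their own interfaces.

```python
from collections import defaultdict

# Hypothetical fact table: (region, city, product, quarter, sales).
DIMS = ("region", "city", "product", "quarter")
facts = [
    ("North", "Oslo", "Tea", "Q1", 10),
    ("North", "Oslo", "Coffee", "Q1", 20),
    ("North", "Bergen", "Tea", "Q1", 5),
    ("South", "Rome", "Tea", "Q1", 7),
    ("South", "Rome", "Tea", "Q2", 8),
]


def aggregate(rows, by):
    """Consolidate sales along the chosen dimensions."""
    idx = [DIMS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def select_slice(rows, dim, value):
    """Keep only the rows where one dimension has a fixed value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]


# Roll-up: consolidate cities, products and quarters into regions.
by_region = aggregate(facts, by=("region",))
# Drill-down: open up the North region to city level.
north_by_city = aggregate(select_slice(facts, "region", "North"), by=("city",))
# Slice: fix quarter = Q1, then view sales by product.
q1_by_product = aggregate(select_slice(facts, "quarter", "Q1"), by=("product",))
print(by_region, north_by_city, q1_by_product)
```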

41
An OLAP answer set
42
Different forms of OLAP
  • True OLAP
  • ROLAP (relational OLAP)
  • MOLAP (multidimensional OLAP)