Data Mining Query Languages - PowerPoint PPT Presentation

About This Presentation
Title:

Data Mining Query Languages

Description:

Data Mining Query Languages Kristen LeFevre April 19, 2004 With Thanks to Zheng Huang and Lei Chen Outline Introduce the problem of querying data mining models ... – PowerPoint PPT presentation

Number of Views:639
Avg rating:3.0/5.0
Slides: 31
Provided by: researchC2
Category:
Tags: data | languages | mining | query

less

Transcript and Presenter's Notes

Title: Data Mining Query Languages


1
Data Mining Query Languages
  • Kristen LeFevre
  • April 19, 2004
  • With Thanks to Zheng Huang and Lei Chen

2
Outline
  • Introduce the problem of querying data mining
    models
  • Overview of three different solutions and their
    contributions
  • Topic for Discussion What would an ideal
    solution support?

3
Problem Description
  • You guys are armed with two powerful tools
  • Database management systems
  • Efficient and effective data mining algorithms
    and frameworks
  • Generally, this work asks
  • How can we merge the two?
  • How can we integrate data mining more closely
    with traditional database systems, particularly
    querying?

4
Three Different Answers
  • DMQL A Data Mining Query Language for Relational
    Databases (Han et al, Simon Fraser University)
  • Integrating Data Mining with SQL Databases OLE
    DB for Data Mining (Netz et al, Microsoft)
  • MSQL A Query Language for Database Mining
    (Imielinski Virmani, Rutgers University)

5
Some Common Ground
  • Create and manipulate data mining models through
    a SQL-based interface (Command-driven data
    mining)
  • Abstract away the data mining particulars
  • Data mining should be performed on data in the
    database (should not need to export to a
    special-purpose environment)
  • Approaches differ on what kinds of models should
    be created, and what operations we should be able
    to perform

6
DMQL
  • Commands specify the following
  • The set of data relevant to the data mining task
    (the training set)
  • The kinds of knowledge to be discovered
  • Generalized relation
  • Characteristic rules
  • Discriminant rules
  • Classification rules
  • Association rules

7
DMQL
  • Commands Specify the following
  • Background knowledge
  • Concept hierarchies based on attribute
    relationships, etc.
  • Various thresholds
  • Minimum support, confidence, etc.

8
DMQL
  • Syntax
  • use database ltdatabase_namegt
  • use hierarchy lthierarchy_namegt for ltattributegt
  • ltrule_specgt
  • related to ltattr_or_agg_listgt
  • from ltrelation(s)gt
  • where ltconditionsgt
  • order by ltorder listgt
  • with ltkinds ofgt threshold ltthreshold_valuegt
    for ltattribute(s)gt

9
DMQL
  • Syntax ltrule_specgt
  • find classification rules as ltrule_namegt
  • according to ltattributesgt
  • Find association rules as ltrule_namegt
  • generalize data into ltrelation_namegt
  • others

10
DMQL
  • use database Hospital
  • find association rules as Heart_Health
  • related to Salary, Age, Smoker, Heart_Disease
  • from Patient_Financial f, Patient_Medical m
  • where f.ID m.ID and m.age gt 18
  • with support threshold .05
  • with confidence threshold .7

11
DMQL
  • DMQL provides a display in command to view
    resulting rules, but no advanced way to query
    them
  • Suggests that a GUI interface might aid in the
    presentation of these results in different forms
    (charts, graphs, etc.)

12
MSQL
  • Focus on Association Rules
  • Seeks to provide a language both to selectively
    generate rules, and separately to query the rule
    base
  • Expressive rule generation language, and
    techniques for optimizing some commands

13
MSQL
  • Get-Rules and Select-Rules Queries
  • Get-Rules operator generates rules over elements
    of argument class C, which satisfy conditions
    described in the where clause
  • Project Body, Consequent, confidence, support
  • GetRules(C) as R1
  • into ltrulebase_namegt
  • where ltcondsgt
  • sql-group-by clause
  • using-clause

14
MSQL
  • ltcondsgt may contain a number of conditions,
    including
  • restrictions on the attributes in the body or
    consequent
  • rule.body HAS (Job Doctor
  • rule1.consequent IN rule2.body
  • rule.consequent IS Age
  • pruning conditions (restrict by support,
    confidence, or size)
  • Stratified or correlated subqueries

in, has, and is are rule subset, superset, and
equality respectively
15
MSQL
  • GetRules(Patients)
  • where Body has Age
  • and Support gt .05 and Confidence gt .7
  • and not exists ( GetRules(Patients)
  • Support gt .05 and Confidence gt .7
  • and R2.Body HAS R1.Body)

Retrieve all rules with descriptors of the form
Age x in the body, except when there is a
rule with equal or greater support and confidence
with a rule containing a superset of the
descriptors in the body
16
MSQL
  • GetRules(C) R1
  • where ltpruning-condsgt
  • and not exists ( GetRules(C) R2
  • where ltsame pruning-condsgt
  • and R2.Body HAS R1.Body)

correlated
GetRules(C) R1 where ltpruning-condsgt and
consequent is (X) and consequent in
(SelectRules(R2) where consequent is
(X)
stratified
17
MSQL
  • Nested Get-Rules Queries and their optimization
  • Stratified (non-corrolated) queries are evaluated
    bottom-up. The subquery is evaluated first,
    and replaced with its results in the outer query.
  • Correlated queries are evaluated either top-down
    or bottom-up (like loop-unfolding), and there
    are rules for choosing between the two options

18
MSQL
GetRules(Patients) where Body has Age and
Support gt .05 and Confidence gt .7 and not exists
( GetRules(Patients) Support gt .05 and
Confidence gt .7 and R2.Body HAS R1.Body)
19
MSQL
Top-Down Evaluation
GetRules(Patients) where Body has Age and
Support gt .05 and Confidence gt .7
For each rule produced by the outer, evaluate the
inner
not exists ( GetRules(Patients) Support gt .05
and Confidence gt .7 and R2.Body HAS
R1.Body)
20
MSQL
Bottom-Up Evaluation
not exists ( GetRules(Patients) Support gt .05
and Confidence gt .7 and R2.Body HAS
R1.Body)
For each rule produced by the inner, evaluate the
outer
GetRules(Patients) where Body has Age and
Support gt .05 and Confidence gt .7
21
MSQL
  • Choosing between the two
  • In general, evaluate the expression with more
    restrictive conditions first
  • Heuristic rules
  • Evaluate the query with higher support threshold
    first
  • Next consider confidence threshold
  • A (length x) expression is in general more
    restrictive than (length gt x), which is more
    restrictive than (length lt x)
  • Body IS (constant expression) is more
    restrictive than Body HAS, which is more
    restrictive than Body IN
  • Next consider Consequent IN expressions
  • Descriptors of for (A a) are more restrictive
    than wildcards such as (A )

22
OLE DB for DM
  • An extension to the OLE DB interface for
    Microsoft SQL Server
  • Seeks to support the following ideas
  • Define a model by specifying the set of
    attributes to be predicted, the attributes used
    for the prediction, and the algorithm
  • Populate the model using the training data
  • Predict attributes for new data using the
    populated model
  • Browse the mining model (not fully addressed
    because it varies a lot by model type)

None of the others seemed to support this
23
OLE DB for DM
  • Defining a Mining Model
  • Identify the set of data attributes to be
    predicted, the set of attributes to be used for
    prediction, and the algorithm to be used for
    building the model
  • Populating the Model
  • Pull the information into a single rowset using
    views, and train the model using the data and
    algorithm specified
  • Supports complex objects, so rowset may be
    hierarchical (see paper for more complex examples)

24
OLE DB for DM
  • Using the mining model to predict
  • Defines a new operator prediction join. A model
    may be used to make predictions on datasets by
    taking the prediction join of the mining model
    and the data set.

25
OLE DB for DM
  • CREATE MINING MODEL Heart_Health Prediction
  • ID Int Key,
  • Age Int,
  • Smoker Int,
  • Salary Double discretized,
  • HeartAttack Int PREDICT, Prediction column
  • USING Decision_Trees_101

Identifies the source columns for the training
data, the column to be predicted, and the data
mining algorithm.
26
OLE DB for DM
  • INSERT INTO Heart_Health Prediction
  • (ID, Age, Smoker, Salary)
  • SELECT ID, Age, Smoker, Salary FROM
    Patient_Medical M, Patient_Financial F
  • WHERE M.ID F.ID

The INSERT represents using a tuple for training
the model (not actually inserting it into the
rowset).
27
OLE DB for DM
  • SELECT t.ID,
  • Heart_Health Prediction.HeartAttack
  • FROM Heart_Health Prediction
  • PREDICTION JOIN (
  • SELECT ID, Age, Smoker, Salary
  • FROM Patient_Medical M, Patient_Financial F
  • WHERE M.ID F.ID) as t
  • ON Heart_Health Prediction.Age t.Age AND
    Heath_Health Prediction.Smoker t.Smoker AND
    Heart_Health Prediction.Salary t.Salary

Prediction join connects the model and an actual
data table to make predictions
28
Key Ideas
  • Important to have an API for creating and
    manipulating data mining models
  • The data is already in the DBMS, so it makes
    sense to do the data mining where the data is
  • Applications already use SQL, so a SQL extension
    seems logical

29
Key Ideas
  • Need a method for defining data mining models,
    including algorithm specification, specification
    of various parameters, and training set
    specification (DMQL, MSQL, ODBDM)
  • Need a method of querying the models (MSQL)
  • Need a way of using the data mining model to
    interact with other data in the database, for
    purposes such as prediction (ODBDM)

30
Discussion Topic What Functionality would and
Ideal Solution Support?
Write a Comment
User Comments (0)
About PowerShow.com