Title: Delivering Business Intelligence With Microsoft SQL Server 2005 - Part 2
1Delivering Business Intelligence With Microsoft
SQL Server 2005 - Part 2
2Summary of Topics
- Topic 1 What Is Data Mining?
- Topic 2 SQL Server Data Mining Architecture Overview
- Topic 3 Tasks Accomplished by Data Mining
- Topic 4 Steps for Data Mining
- Topic 5 Data Mining Algorithms
- Topic 6 Data Mining Structure Model
3Topic 1 What Is Data Mining?
- Data mining works in a manner similar to our
topographic map. It makes connections within the
data that may not be readily apparent to the
human observer. These connections make the data
easier to understand. Using this new
understanding, we can efficiently analyze the
data and set direction, so our organizations do
not tumble over a cliff.
4Making Connections
5Making Connections -- Planning a route using a
topographic map
6Making Predictions
- Data mining provides us with a similar
capability. We can use the patterns found in our
data to make predictions of what will happen
next. Data mining may find patterns in the way
our clients make use of our services. Based on
these patterns, we can predict which clients may
need additional services in the future. Data
mining may find patterns in the valid data
entered for a particular screen. Based on these
patterns, we can predict whether a newly entered
set of data is accurate or contains a data entry
error.
7Basic Data Mining Terminology
- The data mining industry uses a lot of terms
fairly loosely, so it's helpful for us to define
a few of these terms early on. This is not an
exhaustive list of data mining terms, only the
relevant ones for our discussion.
- Algorithm: The programmatic technique used to
identify the relationships or patterns in the data.
- Model: The definition of the relationship
identified by the algorithm, which generally
takes the form of a set of rules, a decision
tree, a set of equations, or a set of associations.
- Case: The collection of attributes and
relationships (variables) that are associated
with an individual object, usually a customer.
The case is also known as an observation.
8Basic Data Mining Terminology
- Case set: A group of cases that share the same
attributes. Think of a case set as a table with
one row per unique object (like customer). It's
possible to have a nested case set when one row
in the parent table, like customer, joins to
multiple rows in the nested table, like
purchases. The case set is also known as an
observation set.
- Dependent variable(s) (or predicted attribute or
predict column): The variable the algorithm will
build a model to predict or classify.
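The difference between a flat case set and a nested case set can be pictured with plain data structures. Below is a minimal Python sketch; all field names (CustomerID, Age, Purchases, and so on) are illustrative, not taken from the slides:

```python
# Flat case set: one row per unique object (here, customer).
flat_case_set = [
    {"CustomerID": 1, "Age": 42, "YearsRenting": 3},
    {"CustomerID": 2, "Age": 55, "YearsRenting": 1},
]

# Nested case set: one parent row (customer) joined to
# multiple rows in the nested table (purchases).
nested_case_set = [
    {
        "CustomerID": 1,
        "Age": 42,
        "Purchases": [
            {"Product": "Model Kit", "Amount": 24.99},
            {"Product": "Paint Set", "Amount": 12.50},
        ],
    },
]
```

In Analysis Services, the nested table relationship is declared in the mining structure rather than embedded in the rows, but the shape of the data is the same.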
9Basic Data Mining Terminology
- Independent variable(s) (or predictive attribute
or input column): The variable with descriptive
information used to build the model. The
algorithm creates a model that uses combinations
of independent variables to define a grouping or
predict the dependent variable.
10Basic Data Mining Terminology
- Discrete or Continuous Variables: Numeric columns
can contain continuous or discrete values. A
column in the Employee table called Salary that
contains the actual salary values is a continuous
variable. You can add a column to the table
during data preparation called SalaryRange,
containing integers to represent encoded salary
ranges (1 = 0 to 25,000; 2 = between 25,000
and 50,000; and so on). This is a discrete
numeric column. Early data mining and statistical
analysis tools required the conversion of strings
to numeric values like the encoded salary ranges.
Most tools, including most of the SQL Server data
mining algorithms, allow the use of character
descriptions as discrete values. The string "0 to
25,000" is easier to understand than the number
1. Discrete variables are also known as
categorical. This distinction between discrete
and continuous is important to the underlying
algorithms in data mining, although its
significance is less obvious to those of us who
are not statisticians.
11Basic Data Mining Terminology
- Regression: A statistical technique that creates
a best-fit formula based on a dataset. The
formula can be used to predict values based on
new input variables. In linear regression, the
formula is the equation for a line.
- Deviation: A measure of how well the regression
formula fits the actual values in the dataset
from which it was created.
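As a concrete sketch of these two terms, the following Python snippet (purely illustrative, not from the slides) computes a least-squares best-fit line and then measures its root-mean-square deviation from the data it was fit to:

```python
def linear_fit(points):
    """Least-squares best-fit line y = a*x + b for (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def deviation(points, a, b):
    """Root-mean-square deviation of the fitted line from the data."""
    n = len(points)
    return (sum((y - (a * x + b)) ** 2 for x, y in points) / n) ** 0.5

data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0)]
a, b = linear_fit(data)
dev = deviation(data, a, b)
```

A small deviation means the regression formula tracks the actual values closely; a large one means the formula is a poor summary of the dataset.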
12Basic Data Mining Terminology
- Mining Structure: A Microsoft data mining term
used as a name for the definition of a case set
in Analysis Services. The mining structure is
essentially a metadata layer on top of a data
source view that includes additional data
mining-related flags and column properties, like
the field that identifies a column as input,
predict, both, or ignore. A mining structure can
be used as the basis for multiple mining models.
- Mining Model: The specific application of an
algorithm to a particular mining structure. You
can build several mining models with different
parameters or different algorithms from the same
mining structure.
13Topic 2 SQL Server Data Mining Architecture
Overview
14SQL Server Data Mining Architecture Overview
- Point A in the Architecture Figure shows how data
mining models are built using the dimensional
engine, leveraging its ability to load data and
quickly perform the base statistical calculations
like sums, averages, and counts. The data mining
server can also easily pull case data from both
relational and Analysis Services databases, as seen
at point B in the Architecture Figure.
15SQL Server Data Mining Architecture Overview
- Point C in Architecture Figure shows how the
typical developer will first experience data
mining by creating a BI Studio Analysis Services
project and then using the Data Mining Wizard to
create a new data mining structure and an initial
data mining model. The mining structure is a new
construct that provides a metadata layer allowing
several mining models to work with the same input
data. Each mining model in a mining structure can
have different algorithms and parameters. The
wizard provides model building guidance with auto
selection and adjustment of variables based on
the algorithm selected. The wizard also helps you
create case sets, including complex, nested
queries.
16Build, Deploy, and Process
- Most of the functions in the Data Mining Designer
work with the actual model as it exists in
Analysis Services. This means once the wizard is
complete, the developer must build and deploy the
project (which includes processing the model
cubes) before any more progress can be made.
Building the project writes the metadata out to
project files in the development environment. The
actual model does not come into being until the
project is deployed to an Analysis Services
instance. At that point, BI Studio creates a
database for the project in Analysis Services. It
writes out the model structure metadata and the
definition of each model. Finally, it creates a
cube for each model and processes the models,
inserting the training data so the algorithm can
calculate the rules, correlations, and other
relationships. Until the project is deployed and
a model is processed, it cannot be viewed in the
viewers.
18Accessing the Mining Models
- As you see at Point D in the Architecture Figure,
Data Mining eXtensions (DMX) to the SQL language is
at the core of all the Microsoft data mining APIs.
As the name suggests, DMX is an extension to SQL
designed to create, train, modify, and query data
mining models. DMX was introduced with SQL Server
2000 as part of the OLE DB for Data Mining APIs.
It has been enhanced in SQL Server 2005 with
additional options for data sources and more
flexible SELECT functionality. An easy way to
begin learning about DMX is to use the Mining
Model Prediction tab in the Data Mining Designer
and examine the syntax it generates for DMX
queries. The code can be copied to a DMX query
window in SQL Studio for further exploration.
Although DMX is an extension to SQL, queries are
submitted to the Analysis Services server; that's
where the data mining services are.
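To give a feel for the syntax the Mining Model Prediction tab generates, here is a sketch of a DMX prediction query held in a Python string. The model, data source, and table names (CreditRiskModel, MyDataSource, NewCustomers) are hypothetical; PREDICTION JOIN, OPENQUERY, and PredictProbability are standard DMX constructs:

```python
# Hypothetical DMX prediction query of the kind the Data Mining
# Designer generates; copy this style of statement into a DMX
# query window to experiment.
dmx_query = """
SELECT
  t.CustomerID,
  [CreditRiskModel].[CreditRisk],
  PredictProbability([CreditRiskModel].[CreditRisk]) AS RiskProbability
FROM [CreditRiskModel]
PREDICTION JOIN
  OPENQUERY([MyDataSource],
    'SELECT CustomerID, CompanySize, AnnualRevenue FROM NewCustomers') AS t
ON  [CreditRiskModel].[CompanySize]   = t.CompanySize
AND [CreditRiskModel].[AnnualRevenue] = t.AnnualRevenue
"""
```

The PREDICTION JOIN matches each new case against the trained model, and the SELECT list mixes source columns with predicted columns and prediction functions.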
19Accessing the Mining Models
- When the development environment submits its DMX
commands and queries to Analysis Services, it
uses the XML for Analysis (XMLA) APIs. In fact,
because Analysis Services is a native XMLA
server, the data mining models are available to
any application as a web service by using SOAP
protocols. It is still possible to access the
server with old OLE DB APIs, or ADO and ADO.NET.
20Integration Services and Data Mining
- Integration Services can play a major role in the
data mining process, as shown in the
Architecture Figure. Many of the standard
transforms used for data cleansing and data
integration are particularly valuable for
building the training and test case sets. Besides
the obvious transforms, like Data Conversion and
Derived Column, transforms like Percentage
Sampling, Row Sampling, Conditional Split,
Lookup, and Merge Join are powerful components
the data miner can use to build a set of packages
to prepare the case sets for the data mining
process. This is shown at point E in the
Architecture Figure. In addition to the standard
tasks, there are two Integration Services tasks
that directly interact with the data mining
models, shown at point F in the Architecture Figure.
21Integration Services and Data Mining
- The Data Mining Model Training destination task
is the target for a training set used to train
(or re-train) an existing mining model. This
capability is perfect for the ongoing re-training
required to keep certain mining models
current (recommendation models, for example). The
Data Mining Query task is specifically designed
to do prediction joins against a model in the IS
pipeline, once the model has been trained and
tested. This could be used to add scores to the
Customer table or identify significant data
anomalies during the nightly ETL process. It
could also be used in a real-time mode to flag
transactions that were potentially fraudulent.
22Additional Features
- There are several additional features that
will be important for certain applications. Most
of these can be found at point G in the
Architecture Figure and are listed briefly here:
- Extensibility: Microsoft has provided a set of
COM APIs that allow developers to integrate
additional data mining algorithms into the data
mining engine. They can integrate custom viewers
into the data mining designer as well. Someone
could even create a new viewer for an existing
Microsoft algorithm.
- Analysis Management Objects (AMO): AMO is a new
API for managing the creation and maintenance of
data mining objects, including creating,
processing, backing up, restoring, and securing.
23Additional Features
- Stored procedures and user-defined functions: A
developer can create what are essentially stored
procedures or user-defined functions and load
them as managed assemblies into Analysis
Services. This allows clients to work with large
mining models through the intermediate layer of
the server-based managed assembly.
- Text mining: It is possible to do some
interesting data mining on unstructured text
data, like the text in HTML files in a set of web
directories, or even text fields in a database.
For example, use the Integration Services Term
Extraction transformation to build a dictionary
of terms found in the text files. Then use data
mining to create classification rules to
categorize the documents according to the terms
they contain. This is essentially a data mining
application, not a new data mining algorithm, but
it has value in certain areas.
24Architecture Summary
- The previous slides show how data mining fits
into the overall SQL Server 2005 architecture,
and how well data mining has been
integrated into the SQL Server environment. It
should be clear by now that data mining is a
serious element of the BI toolset. Data mining
and application developers will have a lot more
to learn before they are proficient with all the
data mining components. Fortunately, Microsoft's
documentation is heavily weighted toward the
development community. Additional help should be
easy to find. A good place to start is a web site
called www.sqlserverdatamining.com, maintained by
the SQL Server data mining development team.
25Topic 3 Tasks Accomplished by Data Mining
- Classification
- Regression
- Segmentation
- Association
- Sequence Analysis
- Probability Predictions
26Classification
- We may need to classify a potential customer as
someone who is likely to need our services or
someone who is not. If we know how to classify an
individual, an entity, or a thing, we can make
more intelligent decisions as we deal with that
individual, entity, or thing.
27Example of Classification using Tree Induction
- Customer renting property > 2 years?
  - No: Rent property
  - Yes: Customer age > 45?
    - No: Rent property
    - Yes: Buy property
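The induced tree on this slide can be read directly as nested if-tests. The following minimal Python sketch applies those rules to a single case; the field names (years_renting, age) are illustrative only:

```python
def classify(customer):
    """Apply the slide's decision tree to one customer case.

    Rule 1: renting <= 2 years -> Rent property.
    Rule 2: renting > 2 years and age <= 45 -> Rent property.
    Rule 3: renting > 2 years and age > 45 -> Buy property.
    """
    if customer["years_renting"] <= 2:
        return "Rent property"
    if customer["age"] <= 45:
        return "Rent property"
    return "Buy property"
```

A trained Microsoft Decision Trees model is, in essence, a machine-generated version of exactly this kind of rule cascade, with the split points chosen from the training data.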
28Example of Classification using Neural Induction
29Regression
- Classification is used to predict the value for a
discrete attribute, meaning an attribute that has
one of a set number of distinct values.
Regression, on the other hand, is used to predict
a continuous value.
- To predict a continuous value, regression looks
at trends likely to continue and repeat over time.
- Like classification, regression also looks at
relationships between the value being predicted
and other continuous values available in the data.
30Regression - Value Prediction
- Linear regression attempts to fit a straight line
through a plot of the data, such that the line is
the best representation of the average of all
observations at that point in the plot.
- The problem is that the technique only works well
with linear data and is sensitive to the presence
of outliers (i.e., data values that do not
conform to the expected norm).
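The outlier sensitivity is easy to demonstrate. In this illustrative Python sketch (not from the slides), a single wild value drags the least-squares slope far away from the trend of the other points:

```python
def slope(points):
    """Least-squares slope of the best-fit line through (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

clean = [(1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0), (5, 5.0)]
with_outlier = clean[:-1] + [(5, 25.0)]  # one wildly wrong value

print(slope(clean))         # 1.0
print(slope(with_outlier))  # 5.0 -- one outlier quintuples the slope
```

This is why outlier screening during data preparation matters before handing data to a regression algorithm.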
31Segmentation
- Segmentation is the "divide and conquer" approach
to data analysis. Segmentation divides data into
groups with similar characteristics. We can then
analyze the characteristics of each group for
insights.
- Applications of database segmentation include
customer profiling, direct marketing, and cross
selling.
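Segmentation algorithms group cases without being told the groups in advance. The following Python sketch is a deliberately minimal one-dimensional k-means, just to show the assign-then-recenter idea; real clustering algorithms (including Microsoft Clustering) work over many attributes at once, and the sample ages are invented:

```python
def kmeans_1d(values, centers, iterations=10):
    """Minimal 1-D k-means: assign each value to its nearest center,
    then move each center to the mean of its assigned group."""
    for _ in range(iterations):
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

ages = [22, 25, 27, 44, 47, 49, 71, 73]
segments = kmeans_1d(ages, centers=[20, 50, 80])
```

The three final centers summarize three age segments that were never labeled in the input; that is the "unforeseen groups" idea behind segmentation.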
32Example of Database Segmentation using a Scatter
plot
33Association
- Association requires that we have some type of
grouping in our data. Multiple items could be
grouped together in a single sales transaction.
Multiple services could be provided to a single
family unit.
- Classification, segmentation, and association can
sometimes run together. They seem similar. The
difference is this: With classification, we pick
the attribute that defines the grouping and the
algorithm determines what most distinguishes
those groupings. With segmentation, the algorithm
itself creates the grouping, based on what it
determines as distinguishing characteristics.
With association, the underlying data already
contains some type of grouping. The algorithm
makes determinations about items likely to be in
the group, rather than attributes they are likely
to have in common.
34Sequence Analysis
- Sequence analysis is often applied to navigation
on a website. A user is on Page A. The user is
then most likely to navigate to Page B, with a
somewhat smaller probability of navigating to
Page C.
- Sequence analysis finds patterns between events
such that the presence of one set of items is
followed by another set of items in a database of
events over a period of time.
- For example, it can be used to understand
long-term customer buying behaviour.
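The page-navigation example above boils down to estimating transition probabilities from observed sequences. This illustrative Python sketch (page names and sessions invented) counts the transitions in a small web log and turns them into probabilities:

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate P(next page | current page) from observed sequences."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: n / sum(following.values())
               for nxt, n in following.items()}
        for page, following in counts.items()
    }

logs = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
]
probs = transition_probabilities(logs)
# From Page A, B is observed twice as often as C.
```

The Microsoft Sequence Clustering algorithm builds on this same idea of state transitions, additionally grouping cases with similar sequences into clusters.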
35Probability Predictions
- Probability prediction mathematically analyzes
what has occurred in the past and determines what
is most probable to occur if present trends continue.
36Topic 4 Steps For Data Mining
- Problem Definition
- Data Preparation
- Training
- Validation
- Deploy
37Problem Definition
- The first step for a data mining solution is to
define the problem we are trying to solve. Data
mining is great at finding patterns and
relationships that are not apparent to human
eyes. However, it cannot solve problems unless we
point that analytical power in a specific
direction. - We need to define the business goal we are
working toward. Perhaps we want to do a better
job at identifying perspective customers who are
a bad credit risk. Perhaps we want to identify
customers who are likely to switch to our
competitor's products or services. Perhaps we
want to predict future sales to put in place the
capacity to meet those needs.
38Problem Definition
- Once the business goal is defined, we determine
what we need data mining to do for us. We select
which of the five data mining tasks we are
looking to accomplish:
- Classification
- Regression
- Segmentation
- Association
- Sequence Analysis
- Our choice of task influences our choice of data
mining algorithm, to be discussed in Topic 5.
39Data Preparation
- When we learned about OLAP cube design, one of
our main concerns was where the data to populate
the cube was going to come from.
- With data mining, we have the same concern: where
are we going to get the data to be fed into our
data mining algorithms?
40Training
- With algorithms selected and our data in place,
we begin the process of working with the data
mining model. You can think of a data mining
model as the combination of one or more data
mining algorithms with some real, live data. The
model applies the algorithms to the data and
creates the classifications, associations,
regression formulas, and so forth that we use to
solve our business problem.
41Training
- This step is known as training the data mining
model. We provide our data to the data mining
algorithms, so they can be trained on a
particular aspect of our organization. They
"learn" what is unique about our data, as well as
the operations and entities, customers, products,
and sales transactions that produced that data.
42Validation
- We spoke previously in this chapter about
providing a data mining algorithm with past data,
so it could find the classifications,
associations, and so forth contained within. In
actuality, we often divide our past data into two
sets. One set of data is used to train the data
model. The second set is used to validate the
model after it is trained. Let's return to our
credit risk example for an illustration.
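The train/validate division described above can be sketched in a few lines of Python. This is illustrative only; Analysis Services and Integration Services provide their own splitting mechanisms (such as the Percentage Sampling transform):

```python
import random

def split_cases(cases, train_fraction=0.7, seed=42):
    """Shuffle the case set and split it into a training subset
    and a validation subset (a common 70/30 split)."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # deterministic for the demo
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

cases = list(range(100))
train, validate = split_cases(cases)
```

The model is trained only on the first subset; the second subset, which the model has never seen, is used to check how well its predictions hold up.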
43Deploy
- Once we test the data mining model and determine
which data mining algorithm gives us the most
accurate information, we can put it to work. We
place the model where it can be accessed by the
front-end tools that will put it to work. We use
the Data Mining Extensions (DMX) query language
to feed new records to our algorithm and have it
generate information.
44Topic 5 Data Mining Algorithms
- SQL Server 2005 provides us with seven data
mining algorithms. Most of these algorithms
perform several different tasks. Having a
detailed understanding of the inner workings of
each algorithm is unnecessary. Instead, this
section provides a brief explanation of each to
give you some background information. More
important is the knowledge of what each can be
used to accomplish.
45Data Mining Algorithms
- Microsoft Decision Trees
- Microsoft Naïve Bayes
- Microsoft Clustering
- Microsoft Association
- Microsoft Sequence Clustering
- Microsoft Time Series
- Microsoft Neural Network
46 5.1 Microsoft Decision Trees
- Function: The Microsoft Decision Trees
algorithm is one of the easiest algorithms to
understand because it creates a tree structure
during its training process. (You probably
already guessed that from the name.) The tree
structure is then used to provide predictions and
analysis.
47The Microsoft Decision Trees Algorithm
48Tasks
- The main purpose of the Microsoft Decision Trees
algorithm is Classification.
- It can also be used for Regression and Association.
49 5.2 Microsoft Naïve Bayes
- Microsoft's SQL Product Manager, Donald Farmer,
claims that because there is a Naïve Bayes
algorithm, there must be a "deeply cynical" Bayes
algorithm out there somewhere in data mining
land. I guess it is needed to bring balance to
data mining's version of the "force." We will try
not to be too naïve as we explore the benefits
and shortcomings of this algorithm.
50Function
- The Naïve Bayes algorithm looks at each attribute
of the entity in question and determines how that
attribute, on its own, affects the attribute we
are looking to predict. Figure 5.2 shows a Naïve
Bayes algorithm being used to predict whether a
customer is a good credit risk. One by one, the
Naïve Bayes algorithm takes a single attribute of
a customer (size of company, annual revenue, and
so forth) and looks at the training data to
determine its effect on credit risk.
51Figure 5.2 The Naïve Bayes algorithm
52Function
- In our diagram, 57% of companies with a size
attribute of small are bad credit risks. Only 14%
of companies with a size attribute of large are
bad credit risks. In this particular example, it
looks pretty cut and dried: we should never
extend credit to small companies and we should
always extend credit to large companies.
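The per-attribute counting behind those percentages can be sketched directly. In this illustrative Python snippet (the training rows and attribute names are invented, not Figure 5.2's actual data), each class is scored by its prior probability times the likelihood of each observed attribute value, which is the "naive" independence assumption:

```python
def naive_bayes_class_probs(training, attributes):
    """Score each class for a new case: prior P(class) times
    P(attr=value | class) for each attribute, then normalize."""
    classes = {row["risk"] for row in training}
    scores = {}
    for c in classes:
        rows = [r for r in training if r["risk"] == c]
        score = len(rows) / len(training)          # prior P(class)
        for attr, value in attributes.items():
            matching = sum(1 for r in rows if r[attr] == value)
            score *= matching / len(rows)          # P(attr=value | class)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

training = [
    {"size": "small", "risk": "bad"},
    {"size": "small", "risk": "bad"},
    {"size": "small", "risk": "good"},
    {"size": "large", "risk": "good"},
    {"size": "large", "risk": "good"},
    {"size": "large", "risk": "bad"},
]
probs = naive_bayes_class_probs(training, {"size": "small"})
```

With these invented rows, a small company scores a two-thirds probability of being a bad risk; adding more attributes just multiplies in more per-attribute likelihoods, each considered on its own.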
53Tasks
- The Naïve Bayes algorithm can only be used for
Classification.
54 5.3 Microsoft Clustering
- The Microsoft Clustering algorithm builds
clusters of entities as it processes the training
data set. This is shown in Figure 5.3. Once the
clusters are created, the algorithm analyzes the
makeup of each cluster. It looks at the values of
each attribute for the entities in the cluster.
- Cluster analysis lets the data miner assemble
data into unforeseen groups containing similar
characteristics. Also known as "segmentation,"
this type of data mining is probably the most
widely used.
55Figure 5.3 The Microsoft Clustering Algorithm
56Tasks
- The main purpose of the Microsoft Clustering
algorithm is Segmentation.
- It can also be used for Regression and
Classification.
- Classification techniques include Tree Induction
and Neural Induction (see the examples that follow).
57Database Segmentation
- The aim is to partition a database into an
unknown number of segments, or clusters, of
similar records.
- It uses unsupervised learning to discover
homogeneous sub-populations in a database to
improve the accuracy of the profiles.
58Example of Database Segmentation using a Scatter
plot
59Example of Classification using Tree Induction
60Example of Classification using Neural Induction
61 5.4 Microsoft Association
- The Microsoft Association algorithm creates its
own sets of entities, and then determines how
often those sets occur in the test data set. This
is shown in Figure 5.4. In this set, we are
looking at groupings of products purchased
together in a single purchase transaction. To
simplify our example, we are only looking at
products in the World War II product subtype.
62Figure 5.4 The Microsoft Association algorithm
63Tasks
- The Microsoft Association algorithm can only be
used for Association.
64 5.5 Microsoft Sequence Clustering
- The Microsoft Sequence Clustering algorithm
examines the test data set to identify
transitions from one state to another. The test
data set contains data, such as a web log showing
navigation from one page to another or perhaps
routing and approval data showing the path taken
to approve each request. The algorithm uses the
test data set to determine, as a ratio, how many
times each of the possible paths is taken.
65Tasks
- The main purpose of the Microsoft Sequence
Clustering algorithm is Sequence Analysis and
Segmentation.
- It can also be used for Regression and
Classification.
66 5.6 Microsoft Time Series
- The Microsoft Time Series algorithm starts with
time-related data in the test data set. In Figure
5.6, this is sales data for each month. To
simplify things, we are only looking at data for
two products in our example.
67Figure 5.6 The Microsoft Time Series algorithm
68Tasks
- The Microsoft Time Series algorithm can only be
used for Regression.
69 5.7 Microsoft Neural Network
- Neural networks were developed in the 1960s to
model the way human neurons function. Microsoft
has created the Microsoft Neural Network
algorithm, so we can use neural networks for such
mundane activities as predicting product sales.
Of course, predicting product sales might not
seem so mundane if your future employment is
dependent on being correct.
70Tasks
- The main purpose of the Microsoft Neural Network
algorithm is:
- Classification
- Regression
71Topic 6 Data Mining Structure Model
- The data mining structure enables us to define
the data we will use, either from a relational
database or from an OLAP cube, as the source for
the mining. It also lets us define one or more
data mining algorithms to apply to that data.
72Data Columns
- The first step in doing any mining is to select
the data being mined. We select a data source:
either relational data or an OLAP cube. Once this
is done, we need to select a table if this is a
relational data source, or a dimension if this is
an OLAP data source. Finally, we must select the
table columns or dimension attributes to be used
as the data columns for our data mining.
73Data Columns
- Although an initial table or dimension must be
selected for data mining, it is possible to
select additional related tables or dimensions
for inclusion. These are known as nested tables.
If a relational database is being used, the
initial table and the nested table must be linked
by a foreign key relationship (or a chain of
foreign key relationships). If a cube is being
used, the initial dimension and the nested
dimension must be linked through the fact table.
74Data Mining Model
- The data mining model combines the data columns
with a data mining algorithm. Additionally, we
must determine how each data column should be
used by the data mining algorithm. This process
determines how the data mining algorithm
functions and what it predicts.
75Data Column Usage
- Each column in the data mining structure must
have one of the following usages:
- Key: The key is the unique identifier for a
table or a dimension. The key is used only to
identify records or members; it is not used by
the data mining algorithm to predict values.
- Input: Input columns are used by the data mining
algorithm when making a prediction. The input
values are the values that influence the
prediction made by the algorithm.
76Data Column Usage
- Predict: A predict column is a data column whose
value is being predicted by the data mining
algorithm. This column can also be used as an
input column. If data column A and data column B
are both predict, then data column A will serve
as an input when the algorithm is predicting data
column B, and vice versa.
77Data Column Usage
- Predict Only: A predict-only column is a data
column whose value is being predicted by the data
mining algorithm. This column cannot be used as
an input column when predicting other columns.
- Ignore: This data column is not used by the data
mining algorithm.
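The interaction between these usages can be made concrete with a small Python sketch. The column names and usage assignments below are illustrative, not from the slides; the point is which columns become available as inputs when each target is predicted:

```python
# Hypothetical usage flags for a mining structure's columns.
USAGE = {
    "CustomerID":    "key",
    "CompanySize":   "input",
    "AnnualRevenue": "input",
    "CreditRisk":    "predict",
    "ChurnRisk":     "predict",
    "Notes":         "ignore",
}

def inputs_for(target):
    """Columns usable as inputs when predicting `target`:
    every 'input' column, plus any other 'predict' column.
    'key', 'ignore', and 'predict_only' columns never qualify."""
    return sorted(
        col for col, usage in USAGE.items()
        if col != target and usage in ("input", "predict")
    )

print(inputs_for("CreditRisk"))
```

Because CreditRisk and ChurnRisk are both predict columns, each serves as an input when the other is being predicted, exactly as the slide describes.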
78Using Cube Slice As a Training Set
- Recall that data must be divided into training
and testing data sets. The cube slice enables us
to specify an expression that divides the cube
into two parts. The portion of the data that
satisfies the expression is fed into the data
mining algorithm for training. The rest of the
data becomes the testing data set.
79Defining a Data Mining Model
- Steps
- Define the business need
- Select the data mining algorithms
- Determine the data source
80Mining Model Viewer
- Now that we have several data mining models
created and trained, let's look at how we begin
to analyze the information. We do this by using
the data mining model viewers, which are found on
the Mining Model Viewer tab on the Data Mining
Design tab.
- Each data mining algorithm has its own set of
viewers, enabling us to examine the trained data
mining algorithm. These viewers present the
mining information graphically for ease of
understanding. If desired, we can also use the
Microsoft Mining Content Viewer to look at the
raw data underlying the graphical presentations.
81Mining Model Viewer
- Microsoft Tree Viewer
- Microsoft Mining Content Viewer: used to see the
detailed data underlying the diagrams
82The SQL Server Management Studio Windows
- Registered Servers Window
- Object Explorer Window
- Summary Window
- Query Windows
83Query Windows
- One of the functions of the SQL Server Management
Studio is the creation and execution of queries.
These queries may be Transact-SQL for relational
databases, MDX or XMLA for Analysis Services
databases, DMX for data mining, or SQL Mobile for
hand-held devices. Each type of query has its own
specific type of query window. We discuss how to
use these query windows throughout the rest of
this book.
84 MDX Definition
- The Multidimensional Expressions (MDX) language
provides the programming language for OLAP cube
navigation and over 50 mathematical functions for
calculating cube measures. It can be called and
used in the relational database language of
Transact-SQL.
85DMX Definition
- The Data Mining Extensions (DMX) language is a
new language in SQL Server 2005, which provides
the commands to easily set up data mining. It can
be called and used in the relational database
language of Transact-SQL.
86XMLA Definition
- XML for Analysis Services (XMLA) is an open,
XML-based standard protocol for interacting with
Microsoft SQL Server 2005 Analysis Services data
over an HTTP connection, such as an intranet or
the Internet. XMLA uses the Simple Object Access
Protocol (SOAP).
87Mining Accuracy Chart
- Before we put a data mining model to use in a
production environment, we need to ensure the
model is making predictions with the desired
accuracy. This is done using the Mining Accuracy
Chart tab in the Business Intelligence
Development Studio. The Mining Accuracy Chart tab
provides three tools for determining the accuracy
of our mining models: the Lift Chart, the Profit
Chart, and the Classification Matrix.
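Of the three tools, the Classification Matrix is the simplest to picture: it counts how often each actual value was predicted as each possible value. This illustrative Python sketch (the actual/predicted labels are invented) builds one:

```python
from collections import Counter

def classification_matrix(actual, predicted):
    """Count (actual, predicted) pairs: a classification (confusion)
    matrix comparing model output against known test values."""
    return Counter(zip(actual, predicted))

actual    = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad",  "bad", "good", "good"]
matrix = classification_matrix(actual, predicted)
```

Counts on the diagonal (actual equals predicted) are correct classifications; off-diagonal counts show exactly which kinds of mistakes the model makes, which is often more useful than a single accuracy number.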
88Deployment
- As you see in Figure 6.1, the deployment step in
the Business Dimensional Lifecycle is where the
three parallel development tracks come back
together. This is the great unveiling of the
DW/BI system to the business community. The
quality of this first impression will strongly
influence the acceptance of the system, and you
get only one shot at it. Like any big event,
there are a lot of details that must fall into
place in order for the show to be successful.
89Figure 6.1 The Deployment step in the Business
Development Lifecycle
90Deployment
- What needs to be tested in the DW/BI system
before deployment?
- How and where should you perform that testing?
- What's the easiest way to deploy the completely
new DW/BI system, and how can you safely test and
deploy changes to an existing system?
- Why is DW/BI documentation important, and what
documentation does your team need to write?
- What kind of support and training will your users
need?