Introduction to SPSS Modeler (1): Data Preprocessing
1
Introduction to SPSS Modeler (1) Data Preprocessing
  • Department of Computer and Information Science
  • Fordham University

2
Working with SPSS Modeler
  • Working with data in SPSS Modeler is a three-step process:
  • Read data into SPSS Modeler.
  • Run the data through a series of manipulations.
  • Send the data to a destination.
  • This sequence of operations is known as a data stream because the data flows record by record from the source, through each manipulation, and finally to the destination: either a model or a type of data output (a script-style sketch of the pattern follows).
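
Streams are built visually in Modeler, but the same read-manipulate-output pattern maps onto any script-based tool. As an illustration only, a minimal pandas sketch of the three steps (file and column names are hypothetical):

```python
import pandas as pd

# Step 1: read data from an external source (hypothetical file name)
df = pd.read_csv("customers.csv")

# Step 2: run the data through a series of manipulations
df = df[df["age"] >= 18]              # Select-style operation: keep adults only
df["income_k"] = df["income"] / 1000  # Derive-style operation: income in thousands

# Step 3: send the data to a destination
df.to_csv("customers_clean.csv", index=False)
```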

3
Streams, Outputs, Models Manager
  (Screenshot of the SPSS Modeler window, with the stream canvas, project window, palettes, and nodes labeled.)
4
Stream Canvas
  • Streams are created by drawing diagrams of data
    operations relevant to your problem on the main
    canvas in the interface. Each operation is
    represented by an icon or node, and the nodes are
    linked together in a stream representing the flow
    of data through each operation.
  • You can work with multiple streams at one time in
    SPSS Modeler, either in the same stream canvas or
    by opening a new stream canvas. During a session,
    streams are stored in the Streams manager, at the
    upper right of the SPSS Modeler window.

5
Nodes Palette
  • Most of the data and modeling tools in IBM SPSS
    Modeler reside in the Nodes Palette, across the
    bottom of the window below the stream canvas.
  • To add nodes to the canvas, double-click icons
    from the Nodes Palette or drag and drop them onto
    the canvas. You then connect them to create a
    stream, representing the flow of data.

6
Modeler Managers
  • At the top right of the window is the managers pane, which has three tabs.
  • Streams tab: open, rename, save, and delete the streams created in a session.
  • Outputs tab: display, save, rename, and close the tables, graphs, and reports produced by stream operations.
  • Models tab: the most powerful of the manager tabs. It contains all model nuggets (the models generated in SPSS Modeler) for the current session. These models can be browsed directly from the Models tab or added to the stream in the canvas.

7
Project Pane
  • On the lower right side of the window is the project pane, used to create and manage data mining projects (groups of files related to a data mining task). There are two ways to view the projects you create:
  • The Classes view
  • The CRISP-DM view (recommended)

8
Create a Stream
  • To build a stream that will create a model, we need at least three elements:
  • A source node that reads in data from some external source.
  • A source or Type node that specifies field properties, such as measurement level (the type of data that the field contains) and the role of each field as a target or input in modeling.
  • A modeling node that generates a model nugget when the stream is run.

9
Source Nodes
  • Source nodes enable you to import data stored in
    a number of formats, including flat files, IBM
    SPSS Statistics (.sav), SAS, Microsoft Excel, and
    ODBC-compliant relational databases. You can also
    generate synthetic data using the User Input node.

10
Excel Source Node
  • The Excel source node enables you to import data
    from any version of Microsoft Excel.

11
Excel Source Node
  • File type. Select the Excel file type that you
    are importing.
  • Import file. Specifies the name and location of
    the spreadsheet file to import.
  • Choose worksheet. Specifies the worksheet to
    import, either by index or by name.
  • By index. Specify the index value for the
    worksheet you want to import, beginning with 0
    for the first worksheet, 1 for the second
    worksheet, and so on.
  • By name. Specify the name of the worksheet you want to import. Click the ellipsis button (...) to choose from the list of available worksheets.

12
Excel Source Node
  • Range on worksheet. You can import data beginning
    with the first non-blank row or with an explicit
    range of cells.
  • Range starts on first non-blank row. Locates the
    first non-blank cell and uses this as the upper
    left corner of the data range.
  • Explicit range of cells. Enables you to specify an explicit range by row and column. For example, to specify the Excel range A1:D5, you can enter A1 in the first field and D5 in the second (or alternatively, R1C1 and R5C4). All rows in the specified range are returned, including blank rows.
  • On blank rows. If more than one blank row is
    encountered, you can choose whether to Stop
    reading, or choose Return blank rows to continue
    reading all data to the end of the worksheet,
    including blank rows.
  • First row has column names. Indicates that the
    first row in the specified range should be used
    as field (column) names. If not selected, field
    names are generated automatically.
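
For comparison only, the same import options expressed as a pandas sketch; pandas is not SPSS Modeler's API, and the file and sheet names here are hypothetical:

```python
import pandas as pd

# Choose worksheet by index (0 = first sheet) or by name
df = pd.read_excel("sales.xlsx", sheet_name=0)
df = pd.read_excel("sales.xlsx", sheet_name="Q1 Data")

# Explicit range A1:D5, with the first row used as column names:
# columns A through D, one header row, then four data rows
df = pd.read_excel("sales.xlsx", usecols="A:D", header=0, nrows=4)

# If the first row has no column names, let them be generated automatically
df = pd.read_excel("sales.xlsx", usecols="A:D", header=None, nrows=5)
```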

13
Type Node Field Ops
  • Field properties can be specified in a source
    node or in a separate Type node. The
    functionality is similar in both nodes.
  • The Type node should be connected to the source node.

14
Measurement Level
  • Measurement level (formerly known as "data type" or "usage type") describes the usage of the data fields. The measurement level can be specified on the Types tab of a source or Type node. For example, you may want to set the measurement level for an integer field with values of 1 and 0 to Flag. This usually indicates that 1 = True and 0 = False.
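
A rough pandas analogue of declaring a 0/1 integer field as a flag (field name is hypothetical):

```python
import pandas as pd

# A 0/1 integer field treated as a flag: 1 = True, 0 = False
df = pd.DataFrame({"churned": [1, 0, 0, 1]})
df["churned"] = df["churned"].astype(bool)
print(df["churned"].tolist())  # [True, False, False, True]
```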

15
Measurement Level
  • Default
  • Data whose storage type and values are unknown (for example, because they have not yet been read) are displayed as <Default>.
  • Continuous
  • Used to describe numeric values, such as a range of 0–100 or 0.75–1.25. A continuous value can be an integer, real number, or date/time.
  • Categorical
  • Used for string values when an exact number of
    distinct values is unknown. This is an
    uninstantiated data type, meaning that all
    possible information about the storage and usage
    of the data is not yet known.
  • Once data have been read, the measurement level
    will be Flag, Nominal, or Typeless, depending on
    the maximum number of members for nominal fields
    specified in the Stream Properties dialog box.

16
Measurement Level
  • Flag
  • Used for data with two distinct values that indicate the presence or absence of a trait, such as true and false, Yes and No, or 0 and 1. The values used may vary, but one must always be designated as the "true" value and the other as the "false" value. Data may be represented as text, integer, real number, date, time, or timestamp.
  • Nominal
  • Used to describe data with multiple distinct values, each treated as a member of a set, such as small/medium/large. Nominal data can have any storage: numeric, string, or date/time.

17
Measurement Level
  • Ordinal
  • Used to describe data with multiple distinct
    values that have an inherent order, e.g. salary
    categories or satisfaction rankings.
  • The order is defined by the natural sort order of the data elements: e.g. 1, 3, 5 for integers, or HIGH, LOW, NORMAL (ascending alphabetically) for strings.
  • Typeless
  • Used for data that does not conform to any of the
    above types, for fields with a single value, or
    for nominal data where the set has more members
    than the defined maximum. It is also useful for
    cases in which the measurement level would
    otherwise be a set with many members (such as an
    account number).
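
As an illustrative aside, the nominal/ordinal distinction corresponds roughly to unordered versus ordered categoricals in pandas (values here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Nominal: a set of distinct values with no inherent order
df["size_nominal"] = df["size"].astype("category")

# Ordinal: the same values with an explicit inherent order
df["size_ordinal"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True)

# Order-aware operations now work on the ordinal field
print(df["size_ordinal"].min())  # small
```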

18
Auto Data Prep Field Ops
  • Automated Data Preparation (ADP) handles the task of preparing data for analysis:
  • analyzing your data
  • identifying fixes
  • screening out fields that are problematic or not likely to be useful
  • deriving new attributes when appropriate
  • and improving performance through intelligent screening techniques
  • Using ADP enables you to make your data ready for
    model building quickly and easily, without
    needing prior knowledge of the statistical
    concepts involved.

19
Auto Data Prep Field Ops
  • When ADP prepares a field for analysis, it creates a new field containing the adjustments or transformations, rather than replacing the existing values and properties of the old field. The old field is not used in further analysis; its role is set to None.
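
A sketch of the same keep-the-original convention in pandas (the field name and the log transform are hypothetical choices, not what ADP necessarily applies):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [20_000, 45_000, 1_200_000]})

# Write the transformation to a new field; the original stays untouched
# and is simply excluded from further analysis
df["income_transformed"] = np.log1p(df["income"])
print(df)
```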

20
Data Audit Node - Output
  • The Data Audit node provides a comprehensive
    first look at the data you bring into IBM SPSS
    Modeler, presented in an easy-to-read matrix that
    can be sorted and used to generate full-size
    graphs and a variety of data preparation nodes.

21
Perform Data Audit
22
Statistics and Charts
23
Data Quality
  • The Quality tab in the audit report displays
    information about outliers, extremes, and missing
    values.
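
A minimal pandas sketch of an audit-style first look, assuming a small hypothetical dataset:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, 31, np.nan, 42, 300],
                   "city": ["NY", "LA", None, "NY", "SF"]})

# First look at every field: summary statistics, like the audit matrix
print(df.describe(include="all"))

# Quality view: percentage of complete (non-missing) values per field
print((df.notna().mean() * 100).round(1))
```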

24
Missing Values SuperNode
  • After specifying an impute method for one or more fields, to generate a Missing Values SuperNode, from the menus choose:
  • Generate > Missing Values SuperNode
  • Within the SuperNode, a combination of model nugget, Filler, and Filter nodes is used as appropriate. To understand how it works, you can edit the SuperNode and click Zoom In, and you can add, edit, or remove specific nodes within the SuperNode to fine-tune the behavior.
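
The Filler-style imputation inside such a SuperNode can be sketched in pandas as follows (the impute methods and field names are hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 42, 31],
                   "city": ["NY", None, "NY", "LA"]})

# Impute a continuous field with its mean,
# and a categorical field with its mode (most frequent value)
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```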

25
(No Transcript)
26
Generate Filter Node
  • Alternatively, you can generate a Select or
    Filter node to remove fields or records with
    missing values. For example, you can filter any
    fields with a quality percentage below a
    specified threshold.
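
A pandas sketch of both options, assuming a hypothetical 50% completeness threshold:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [np.nan, np.nan, np.nan, 4]})

# Filter-style: drop fields whose completeness falls below the threshold
threshold = 0.50
df = df.loc[:, df.notna().mean() >= threshold]  # keeps "a", drops "b"

# Select-style: drop records that still contain missing values
df = df.dropna()
```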

27
(No Transcript)
28
Outlier and Extreme SuperNode
  • Outliers and extreme values can be handled in a similar manner. Specify the action you want to take for each field (coerce, discard, or nullify) and generate a SuperNode to apply the transformations.

29
Outliers and Extreme Values
  • Standard deviation from the mean
  • Detects outliers and extremes based on the number
    of standard deviations from the mean. For
    example, if you have a field with a mean of 100
    and a standard deviation of 10, you could specify
    3.0 to indicate that any value below 70 or above
    130 should be treated as an outlier.
  • Interquartile range
  • Detects outliers and extremes based on the interquartile range, which is the range within which the two central quartiles fall (between the 25th and 75th percentiles). For example, based on the default setting of 1.5, the lower threshold for outliers would be Q1 - 1.5 × IQR and the upper threshold would be Q3 + 1.5 × IQR. Note that using this option may slow performance on large datasets.
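
Both detection rules are straightforward to state in code; a pandas sketch with hypothetical values:

```python
import pandas as pd

s = pd.Series([90, 95, 100, 105, 110, 250])

# Standard-deviation rule: mean +/- 3 standard deviations
mean, sd = s.mean(), s.std()
lo_sd, hi_sd = mean - 3 * sd, mean + 3 * sd

# Interquartile-range rule with the default multiplier of 1.5
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lo_iqr, hi_iqr = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lo_iqr) | (s > hi_iqr)])  # flags 250 as an outlier
```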

30
Handling Outliers and Extreme Values
  • Action
  • Coerce
  • Replaces outliers and extreme values with the
    nearest value that would not be considered
    extreme.
  • Example: if an outlier is defined to be anything above or below three standard deviations, then all outliers would be replaced with the highest or lowest value within this range (see the sketch after this list).
  • Discard
  • Discards records with outlying or extreme values
    for the specified field.
  • Nullify
  • Replaces outliers and extremes with the null or
    system-missing value.
  • Coerce outliers / discard extremes
  • Discards extreme values only.
  • Coerce outliers / nullify extremes
  • Nullifies extreme values only.
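
A pandas sketch of the three basic actions, assuming hypothetical thresholds of 70 and 130:

```python
import pandas as pd
import numpy as np

s = pd.Series([60.0, 95.0, 100.0, 105.0, 110.0, 250.0])
lo, hi = 70.0, 130.0  # thresholds from either detection rule above

coerced   = s.clip(lower=lo, upper=hi)           # Coerce: nearest in-range value
discarded = s[(s >= lo) & (s <= hi)]             # Discard: drop the records
nullified = s.mask((s < lo) | (s > hi), np.nan)  # Nullify: set to missing
```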

31
(No Transcript)
32
Zoom In SuperNode
33
Reset Field in Type Node
34
(No Transcript)
35
Excel Export Node
  • The Excel export node outputs data in Microsoft
    Excel format (.xls). Optionally, you can choose
    to automatically launch Excel and open the
    exported file when the node is executed.
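
For comparison, a pandas sketch of the export step; note that current pandas writes .xlsx (via the openpyxl engine) rather than the legacy .xls format the node produces, and the file name is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"customer": ["A", "B"], "spend": [120.5, 98.0]})

# Export the stream's data to an Excel workbook
df.to_excel("scored_customers.xlsx", index=False)
```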

36
(No Transcript)
37
Binning Node Field Ops
  • The Binning node enables you to automatically
    create new nominal fields based on the values of
    one or more existing continuous (numeric range)
    fields.

38
Binning Techniques
  • Using the Binning node, you can automatically generate bins (categories) using the following techniques (a sketch of the first two follows the list):
  • Fixed-width binning
  • Tiles (equal count or sum)
  • Mean and standard deviation
  • Ranks
  • Optimized relative to a categorical "supervisor"
    field
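
The first two techniques have close pandas analogues: pd.cut for fixed-width bins and pd.qcut for tiles. A sketch with hypothetical values:

```python
import pandas as pd

s = pd.Series([5, 12, 19, 27, 33, 48, 51, 60])

# Fixed-width binning: 4 bins of equal width
fixed = pd.cut(s, bins=4)

# Tiles (equal count): 4 bins holding roughly equal numbers of records
tiles = pd.qcut(s, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(pd.DataFrame({"value": s, "fixed_width": fixed, "tile": tiles}))
```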

39
Select Bin Fields
40
Bin Values
41
Binning Result
42
Filter Node Field Ops
  • You can rename or exclude fields at any point in a stream with a Filter node.
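
A pandas sketch of renaming and excluding fields (field names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"cust_id": [1, 2], "nm": ["Ann", "Bo"], "tmp": [0, 1]})

# Rename one field and exclude another from the rest of the stream
df = df.rename(columns={"nm": "name"}).drop(columns=["tmp"])
```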

43
(No Transcript)