Introduction to SPSS Modeler (1): Data Preprocessing
1
Introduction to SPSS Modeler (1) Data Preprocessing
  • Department of Computer and Information Science
  • Fordham University

2
Working with SPSS Modeler
  • Working with data in SPSS Modeler is a three-step process:
  • Read data into SPSS Modeler.
  • Run the data through a series of manipulations.
  • Send the data to a destination.
  • This sequence of operations is known as a data stream because the data flows record by record from the source, through each manipulation, and finally to the destination: either a model or a type of data output (a script-style sketch of the pattern follows).
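
Streams are built visually in Modeler, but the same read-manipulate-output pattern maps onto any script-based tool. As an illustration only, a minimal pandas sketch of the three steps (file and column names are hypothetical):

```python
import pandas as pd

# Step 1: read data from an external source (hypothetical file name)
df = pd.read_csv("customers.csv")

# Step 2: run the data through a series of manipulations
df = df[df["age"] >= 18]              # Select-style operation: keep adults only
df["income_k"] = df["income"] / 1000  # Derive-style operation: income in thousands

# Step 3: send the data to a destination
df.to_csv("customers_clean.csv", index=False)
```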

3
Streams, Outputs, Models Manager
  (Screenshot of the SPSS Modeler window, with the stream canvas, project window, palettes, and nodes labeled.)
4
Stream Canvas
  • Streams are created by drawing diagrams of data
    operations relevant to your problem on the main
    canvas in the interface. Each operation is
    represented by an icon or node, and the nodes are
    linked together in a stream representing the flow
    of data through each operation.
  • You can work with multiple streams at one time in
    SPSS Modeler, either in the same stream canvas or
    by opening a new stream canvas. During a session,
    streams are stored in the Streams manager, at the
    upper right of the SPSS Modeler window.

5
Nodes Palette
  • Most of the data and modeling tools in IBM SPSS
    Modeler reside in the Nodes Palette, across the
    bottom of the window below the stream canvas.
  • To add nodes to the canvas, double-click icons
    from the Nodes Palette or drag and drop them onto
    the canvas. You then connect them to create a
    stream, representing the flow of data.

6
Modeler Managers
  • At the top right of the window is the managers pane, which has three tabs.
  • Streams tab: open, rename, save, and delete the streams created in a session.
  • Outputs tab: display, save, rename, and close the tables, graphs, and reports produced by stream operations.
  • Models tab: the most powerful of the manager tabs. It contains all model nuggets (the models generated in SPSS Modeler) for the current session. These models can be browsed directly from the Models tab or added to the stream in the canvas.

7
Project Pane
  • On the lower right side of the window is the project pane, used to create and manage data mining projects (groups of files related to a data mining task). There are two ways to view the projects you create:
  • The Classes view
  • The CRISP-DM view (recommended)

8
Create a Stream
  • To build a stream that will create a model, we need at least three elements:
  • A source node that reads in data from some external source.
  • A source or Type node that specifies field properties, such as measurement level (the type of data that the field contains) and the role of each field as a target or input in modeling.
  • A modeling node that generates a model nugget when the stream is run.

9
Source Nodes
  • Source nodes enable you to import data stored in
    a number of formats, including flat files, IBM
    SPSS Statistics (.sav), SAS, Microsoft Excel, and
    ODBC-compliant relational databases. You can also
    generate synthetic data using the User Input node.

10
Excel Source Node
  • The Excel source node enables you to import data
    from any version of Microsoft Excel.

11
Excel Source Node
  • File type. Select the Excel file type that you
    are importing.
  • Import file. Specifies the name and location of
    the spreadsheet file to import.
  • Choose worksheet. Specifies the worksheet to
    import, either by index or by name.
  • By index. Specify the index value for the
    worksheet you want to import, beginning with 0
    for the first worksheet, 1 for the second
    worksheet, and so on.
  • By name. Specify the name of the worksheet you want to import. Click the ellipsis button (...) to choose from the list of available worksheets.

12
Excel Source Node
  • Range on worksheet. You can import data beginning
    with the first non-blank row or with an explicit
    range of cells.
  • Range starts on first non-blank row. Locates the
    first non-blank cell and uses this as the upper
    left corner of the data range.
  • Explicit range of cells. Enables you to specify an explicit range by row and column. For example, to specify the Excel range A1:D5, you can enter A1 in the first field and D5 in the second (or alternatively, R1C1 and R5C4). All rows in the specified range are returned, including blank rows.
  • On blank rows. If more than one blank row is
    encountered, you can choose whether to Stop
    reading, or choose Return blank rows to continue
    reading all data to the end of the worksheet,
    including blank rows.
  • First row has column names. Indicates that the
    first row in the specified range should be used
    as field (column) names. If not selected, field
    names are generated automatically.
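
For comparison only, the same import options expressed as a pandas sketch; pandas is not SPSS Modeler's API, and the file and sheet names here are hypothetical:

```python
import pandas as pd

# Choose worksheet by index (0 = first sheet) or by name
df = pd.read_excel("sales.xlsx", sheet_name=0)
df = pd.read_excel("sales.xlsx", sheet_name="Q1 Data")

# Explicit range A1:D5, with the first row used as column names:
# columns A through D, one header row, then four data rows
df = pd.read_excel("sales.xlsx", usecols="A:D", header=0, nrows=4)

# If the first row has no column names, let them be generated automatically
df = pd.read_excel("sales.xlsx", usecols="A:D", header=None, nrows=5)
```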

13
Type Node Field Ops
  • Field properties can be specified in a source
    node or in a separate Type node. The
    functionality is similar in both nodes.
  • The Type node should be connected to the source node.

14
Measurement Level
  • Measurement level (formerly known as "data type" or "usage type") describes the usage of the data fields. The measurement level can be specified on the Types tab of a source or Type node. For example, you may want to set the measurement level for an integer field with values of 1 and 0 to Flag. This usually indicates that 1 = True and 0 = False.
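
A rough pandas analogue of declaring a 0/1 integer field as a flag (field name is hypothetical):

```python
import pandas as pd

# A 0/1 integer field treated as a flag: 1 = True, 0 = False
df = pd.DataFrame({"churned": [1, 0, 0, 1]})
df["churned"] = df["churned"].astype(bool)
print(df["churned"].tolist())  # [True, False, False, True]
```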

15
Measurement Level
  • Default
  • Data whose storage type and values are unknown (for example, because they have not yet been read) are displayed as <Default>.
  • Continuous
  • Used to describe numeric values, such as a range of 0–100 or 0.75–1.25. A continuous value can be an integer, real number, or date/time.
  • Categorical
  • Used for string values when an exact number of
    distinct values is unknown. This is an
    uninstantiated data type, meaning that all
    possible information about the storage and usage
    of the data is not yet known.
  • Once data have been read, the measurement level
    will be Flag, Nominal, or Typeless, depending on
    the maximum number of members for nominal fields
    specified in the Stream Properties dialog box.

16
Measurement Level
  • Flag
  • Used for data with two distinct values that indicate the presence or absence of a trait, such as true and false, Yes and No, or 0 and 1. The values used may vary, but one must always be designated as the "true" value and the other as the "false" value. Data may be represented as text, integer, real number, date, time, or timestamp.
  • Nominal
  • Used to describe data with multiple distinct values, each treated as a member of a set, such as small/medium/large. Nominal data can have any storage: numeric, string, or date/time.

17
Measurement Level
  • Ordinal
  • Used to describe data with multiple distinct
    values that have an inherent order, e.g. salary
    categories or satisfaction rankings.
  • The order is defined by the natural sort order of the data elements: e.g. 1, 3, 5 for integers, or HIGH, LOW, NORMAL (ascending alphabetically) for strings.
  • Typeless
  • Used for data that does not conform to any of the
    above types, for fields with a single value, or
    for nominal data where the set has more members
    than the defined maximum. It is also useful for
    cases in which the measurement level would
    otherwise be a set with many members (such as an
    account number).
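
As an illustrative aside, the nominal/ordinal distinction corresponds roughly to unordered versus ordered categoricals in pandas (values here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Nominal: a set of distinct values with no inherent order
df["size_nominal"] = df["size"].astype("category")

# Ordinal: the same values with an explicit inherent order
df["size_ordinal"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True)

# Order-aware operations now work on the ordinal field
print(df["size_ordinal"].min())  # small
```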

18
Auto Data Prep Field Ops
  • Automated Data Preparation (ADP) handles the task of preparing data for analysis:
  • analyzing your data
  • identifying fixes
  • screening out fields that are problematic or not likely to be useful
  • deriving new attributes when appropriate
  • and improving performance through intelligent screening techniques
  • Using ADP enables you to make your data ready for
    model building quickly and easily, without
    needing prior knowledge of the statistical
    concepts involved.

19
Auto Data Prep Field Ops
  • When ADP prepares a field for analysis, it creates a new field containing the adjustments or transformations, rather than replacing the existing values and properties of the old field. The old field is not used in further analysis; its role is set to None.
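
A sketch of the same keep-the-original convention in pandas (the field name and the log transform are hypothetical choices, not what ADP necessarily applies):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [20_000, 45_000, 1_200_000]})

# Write the transformation to a new field; the original stays untouched
# and is simply excluded from further analysis
df["income_transformed"] = np.log1p(df["income"])
print(df)
```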

20
Data Audit Node - Output
  • The Data Audit node provides a comprehensive
    first look at the data you bring into IBM SPSS
    Modeler, presented in an easy-to-read matrix that
    can be sorted and used to generate full-size
    graphs and a variety of data preparation nodes.

21
Perform Data Audit
22
Statistics and Charts
23
Data Quality
  • The Quality tab in the audit report displays
    information about outliers, extremes, and missing
    values.
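
A minimal pandas sketch of an audit-style first look, assuming a small hypothetical dataset:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, 31, np.nan, 42, 300],
                   "city": ["NY", "LA", None, "NY", "SF"]})

# First look at every field: summary statistics, like the audit matrix
print(df.describe(include="all"))

# Quality view: percentage of complete (non-missing) values per field
print((df.notna().mean() * 100).round(1))
```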

24
Missing Values SuperNode
  • After specifying an impute method for one or more fields, to generate a Missing Values SuperNode, from the menus choose:
  • Generate > Missing Values SuperNode
  • Within the SuperNode, a combination of model nugget, Filler, and Filter nodes is used as appropriate. To understand how it works, you can edit the SuperNode and click Zoom In, and you can add, edit, or remove specific nodes within the SuperNode to fine-tune the behavior.
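
The Filler-style imputation inside such a SuperNode can be sketched in pandas as follows (the impute methods and field names are hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 42, 31],
                   "city": ["NY", None, "NY", "LA"]})

# Impute a continuous field with its mean,
# and a categorical field with its mode (most frequent value)
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```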

25
(No Transcript)
26
Generate Filter Node
  • Alternatively, you can generate a Select or
    Filter node to remove fields or records with
    missing values. For example, you can filter any
    fields with a quality percentage below a
    specified threshold.
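
A pandas sketch of both options, assuming a hypothetical 50% completeness threshold:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [np.nan, np.nan, np.nan, 4]})

# Filter-style: drop fields whose completeness falls below the threshold
threshold = 0.50
df = df.loc[:, df.notna().mean() >= threshold]  # keeps "a", drops "b"

# Select-style: drop records that still contain missing values
df = df.dropna()
```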

27
(No Transcript)
28
Outlier and Extreme SuperNode
  • Outliers and extreme values can be handled in a similar manner. Specify the action you want to take for each field (coerce, discard, or nullify) and generate a SuperNode to apply the transformations.

29
Outliers and Extreme Values
  • Standard deviation from the mean
  • Detects outliers and extremes based on the number
    of standard deviations from the mean. For
    example, if you have a field with a mean of 100
    and a standard deviation of 10, you could specify
    3.0 to indicate that any value below 70 or above
    130 should be treated as an outlier.
  • Interquartile range
  • Detects outliers and extremes based on the interquartile range, which is the range within which the two central quartiles fall (between the 25th and 75th percentiles). For example, based on the default setting of 1.5, the lower threshold for outliers would be Q1 - 1.5 × IQR and the upper threshold would be Q3 + 1.5 × IQR. Note that using this option may slow performance on large datasets.
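
Both detection rules are straightforward to state in code; a pandas sketch with hypothetical values:

```python
import pandas as pd

s = pd.Series([90, 95, 100, 105, 110, 250])

# Standard-deviation rule: mean +/- 3 standard deviations
mean, sd = s.mean(), s.std()
lo_sd, hi_sd = mean - 3 * sd, mean + 3 * sd

# Interquartile-range rule with the default multiplier of 1.5
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lo_iqr, hi_iqr = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lo_iqr) | (s > hi_iqr)])  # flags 250 as an outlier
```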

30
Handling Outliers and Extreme Values
  • Action
  • Coerce
  • Replaces outliers and extreme values with the
    nearest value that would not be considered
    extreme.
  • Example: if an outlier is defined to be anything above or below three standard deviations, then all outliers would be replaced with the highest or lowest value within this range (see the sketch after this list).
  • Discard
  • Discards records with outlying or extreme values
    for the specified field.
  • Nullify
  • Replaces outliers and extremes with the null or
    system-missing value.
  • Coerce outliers / discard extremes
  • Discards extreme values only.
  • Coerce outliers / nullify extremes
  • Nullifies extreme values only.
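
A pandas sketch of the three basic actions, assuming hypothetical thresholds of 70 and 130:

```python
import pandas as pd
import numpy as np

s = pd.Series([60.0, 95.0, 100.0, 105.0, 110.0, 250.0])
lo, hi = 70.0, 130.0  # thresholds from either detection rule above

coerced   = s.clip(lower=lo, upper=hi)           # Coerce: nearest in-range value
discarded = s[(s >= lo) & (s <= hi)]             # Discard: drop the records
nullified = s.mask((s < lo) | (s > hi), np.nan)  # Nullify: set to missing
```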

31
(No Transcript)
32
Zoom In SuperNode
33
Reset Field in Type Node
34
(No Transcript)
35
Excel Export Node
  • The Excel export node outputs data in Microsoft
    Excel format (.xls). Optionally, you can choose
    to automatically launch Excel and open the
    exported file when the node is executed.
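
For comparison, a pandas sketch of the export step; note that current pandas writes .xlsx (via the openpyxl engine) rather than the legacy .xls format the node produces, and the file name is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"customer": ["A", "B"], "spend": [120.5, 98.0]})

# Export the stream's data to an Excel workbook
df.to_excel("scored_customers.xlsx", index=False)
```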

36
(No Transcript)
37
Binning Node Field Ops
  • The Binning node enables you to automatically
    create new nominal fields based on the values of
    one or more existing continuous (numeric range)
    fields.

38
Binning Techniques
  • Using the Binning node, you can automatically generate bins (categories) using the following techniques (a sketch of the first two follows the list):
  • Fixed-width binning
  • Tiles (equal count or sum)
  • Mean and standard deviation
  • Ranks
  • Optimized relative to a categorical "supervisor"
    field
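
The first two techniques have close pandas analogues: pd.cut for fixed-width bins and pd.qcut for tiles. A sketch with hypothetical values:

```python
import pandas as pd

s = pd.Series([5, 12, 19, 27, 33, 48, 51, 60])

# Fixed-width binning: 4 bins of equal width
fixed = pd.cut(s, bins=4)

# Tiles (equal count): 4 bins holding roughly equal numbers of records
tiles = pd.qcut(s, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(pd.DataFrame({"value": s, "fixed_width": fixed, "tile": tiles}))
```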

39
Select Bin Fields
40
Bin Values
41
Binning Result
42
Filter Node Field Ops
  • You can rename or exclude fields at any point in a stream with a Filter node.
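
A pandas sketch of renaming and excluding fields (field names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"cust_id": [1, 2], "nm": ["Ann", "Bo"], "tmp": [0, 1]})

# Rename one field and exclude another from the rest of the stream
df = df.rename(columns={"nm": "name"}).drop(columns=["tmp"])
```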

43
(No Transcript)