How do we mine data - PowerPoint PPT Presentation

Slides: 38
Provided by: kictI

Transcript and Presenter's Notes

Title: How do we mine data


1
How do we mine data?
  • The process of data mining is described as a
    process of model building.
  • Five main steps to data mining
  • 1. Data Preparation
  • 2. Defining a study
  • Reading the data and building a model
  • Understanding the model
  • 3. Data Mining
  • 4. Analysis of Results.
  • 5. Assimilation of Knowledge.

2
Step(1) Data Preparation
  • It is considered as the heart of data mining.
  • Example: if you want to find out who will respond
    to a direct marketing process, you need data
    about customers who have previously responded to
    mailers.
  • Example: if you have their names and addresses,
    you should know that this type of information is
    unique to a customer and therefore not the best
    data to be mined.
  • Information like city and state are descriptive
    information. Demographic information is more
    valuable such as age, income, interests,
    household type...

3
Data Preparation Issues
  • (1) Data Cleaning
  • Consistency problem: a column containing a list
    of soft drinks may have the values Pepsi,
    Coca Cola and Cola. These values refer to
    the same drink (soft drink), but the computer
    does not know that they are the same.
  • Stale data problem: a database has to be
    continually updated, because people may move and
    their addresses change. An old address that is no
    longer correct is often referred to as stale.
  • Typographical errors: words are frequently
    misspelled or typed incorrectly.

4
  • (2) Missing Values
  • Some data mining techniques require rows of data
    to be complete in order to mine the data. If too
    many values are missing in a data set, it becomes
    hard to extract any useful information or to make
    predictions.
  • (3) Data Derivation
  • The most interesting data may have to be derived
    from existing columns, for example with functions
    such as MAX or SUM.

5
  • (4) Merging Data
  • Data are stored in the form of tables. Merging
    data can be achieved in a number of ways such as
    SQL statements or export of the data into a file.

6
Effort Required for Each Data Mining Process Step
[Figure: bar chart of effort (vertical axis, scale 10 to 70) for each step: Defining a study; Data Preparation; Data Mining; Analysis of Results and Knowledge Assimilation]
7
The Data Mining Process Begins and Ends with the
Business Objectives
[Figure: process flow: Database, Select, Selected Data, Preprocess, Preprocessed Data, Transform, Transformed Data, Mine, Extracted Information, Analyze and Assimilate, Assimilated Knowledge]
The Data Mining Process is an Iterative Process
8
Example of Data on patient recovery from severe
back pain
9
Data Preparation
  • Getting at your data
  • It is not a straightforward task if the data is
    stored in many places.
  • Example: data about patients, doctors, hospitals,
    insurance, ... may be stored in different
    databases.
  • Even if the data is in one relational database,
    the data is likely to be stored in multiple
    tables.

10
Ways to Access Data for Data Mining
  • Accessing Data Warehouses
  • Accessing Data through Relational Databases (by
    creating a view on the database side, which is a
    way to make multiple tables appear as one).
  • Accessing Data through Conversion Utilities (if
    the data is stored in a different format than
    what the tool supports)
  • Accessing Data Using Query Tools (to join tables
    and create files).
  • Accessing Data from Flat Files (very fast to
    read, have to be created from somewhere,
    difficult to manipulate).
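
As a rough illustration of the view-based access described above, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The table names, column names and rows are hypothetical and chosen only for the example; a real project would create the view in whatever database holds the data.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE visits (visit_id INTEGER PRIMARY KEY,
                         patient_id INTEGER, doctor TEXT);
    INSERT INTO patients VALUES (1, 54), (2, 37);
    INSERT INTO visits VALUES (10, 1, 'Dr. A'), (11, 1, 'Dr. B'), (12, 2, 'Dr. A');

    -- The view makes the two tables appear as one flat table.
    CREATE VIEW mining_view AS
        SELECT v.visit_id, p.patient_id, p.age, v.doctor
        FROM visits AS v
        JOIN patients AS p ON p.patient_id = v.patient_id;
""")

# A mining tool could now read the view as if it were a single table.
rows = conn.execute("SELECT * FROM mining_view ORDER BY visit_id").fetchall()
for row in rows:
    print(row)
```

The view stores no data of its own, so it stays in step with the underlying tables.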

11
Data Preparation - Stage 1 - Data Selection
  • Goal: identify the available data sources and
    extract the data that is needed for preliminary
    analysis in preparation for further mining.
  • Data selection will vary with the business
    objectives.
  • For each of the selected variables, associated
    semantic information (metadata) is needed to
    understand what the variable means.
  • Metadata must include business definitions of
    the data, clear descriptions of data types,
    potential values, original source system, data
    formats and other characteristics.

12
Types of Variables
  • (1) Categorical: the possible values are finite
    and differ in kind.
  • (a) Nominal variables name the kind of object to
    which they refer. There is no order among the
    possible values. Examples: marital status
    (married, single, divorced, unknown); gender
    (male, female); educational level (university,
    college, high school).
  • (b) Ordinal variables have an order among the
    possible values. Example: customer credit rating
    (good, regular, poor).
  • (2) Quantitative: there is a measurable
    difference between the possible values.
  • (a) Continuous (real numbers). Income, average
    number of purchases.
  • (b) Discrete (integers). Number of employees,
    time of year (month, season, quarter).

13
  • Active variables: the variables selected for data
    mining are called active variables in the sense
    that they are actively used to distinguish
    segments, make predictions or perform some other
    data mining operations.
  • Supplementary variables: these variables are not
    used in data mining analysis but are useful in
    helping to visualize and explain the results.

14
Example
  • The data came from a database of 15,000
    customers whose supermarket purchases had been
    tracked for three years.
  • From this database, only those who had purchased
    orange juice more than 25 times in the last three
    years were selected. The list of items purchased
    in each supermarket visit was called a basket.
  • A few variables were used to describe each
    basket: householdID, date of purchase, basket
    contents, basket value, product quantity, and
    promotion ID...
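
The selection step in this example could be sketched roughly as follows. The record layout, the function name select_households and the toy data are assumptions made for illustration; only the idea of keeping households above a purchase-count threshold comes from the slide.

```python
from collections import Counter

# Hypothetical basket records: (household_id, items in the basket).
baskets = [
    ("H1", ["orange juice", "bread"]),
    ("H1", ["orange juice"]),
    ("H2", ["milk"]),
    ("H2", ["orange juice", "eggs"]),
    ("H1", ["orange juice", "milk"]),
]

def select_households(baskets, item, min_purchases):
    """Return households that bought `item` in more than `min_purchases` baskets."""
    counts = Counter(hid for hid, items in baskets if item in items)
    return {hid for hid, n in counts.items() if n > min_purchases}

# The slide uses a threshold of 25; 2 is used here only because the toy data is tiny.
selected = select_households(baskets, "orange juice", min_purchases=2)
print(selected)
```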

15
Data Preparation - Stage 2 - Data Preprocessing
  • Goal: to ensure the quality of the selected data.
  • Clean and well-understood data is a clear
    prerequisite for successful data mining.
  • This is often the most problematic phase, because
    most operational data has never been captured or
    modeled for data mining purposes.
  • It includes
  • a general review of the structure of the data and
  • some measuring of its quality using some
    statistical and visualization methods.
  • Representative sampling of the selected data is a
    useful technique as large data volumes would
    otherwise make the review process very time
    consuming.
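
One simple way to draw such a sample is uniform random sampling without replacement, sketched below. This is only one possible reading of "representative sampling"; the function name, the 1% fraction and the fixed seed are assumptions, and stratified sampling may be more appropriate when some groups are rare.

```python
import random

def representative_sample(records, fraction, seed=0):
    """Draw a simple random sample without replacement."""
    rng = random.Random(seed)  # fixed seed so the review is repeatable
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)

# Stand-in for 15,000 selected rows; real rows would be records, not integers.
records = list(range(15000))
sample = representative_sample(records, fraction=0.01)
print(len(sample))
```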

16
  • A scatterplot is a graphical tool that represents
    the relationship between two or more continuous
    variables.

[Figure: scatterplot of Income (0k to 150k) against Age (20 to 60)]
17
Boxplot diagrams are very useful for comparing the
center (average) or spread (deviation) of two or
more variables
[Figure: boxplots of Income (0k to 150k) for Women and Men, marking the Extreme Upper Value, Upper Quartile (75%), Median Value, Lower Quartile (25%) and Extreme Lower Value]
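
The five values a boxplot marks can be computed directly, as in the sketch below. The income figures are invented for illustration, and statistics.quantiles uses its default quartile method, which may differ slightly from the convention a particular plotting tool applies.

```python
import statistics

def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

# Hypothetical incomes for the two groups being compared.
incomes_women = [28_000, 35_000, 41_000, 52_000, 60_000, 75_000, 90_000]
incomes_men = [30_000, 33_000, 45_000, 58_000, 64_000, 82_000, 140_000]

summary_women = five_number_summary(incomes_women)
summary_men = five_number_summary(incomes_men)
print("women:", summary_women)
print("men:  ", summary_men)
```

Comparing the two summaries side by side gives the same information as the boxplots: the medians compare the centers, and the distance between the quartiles compares the spreads.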
18
Noisy Data
  • Outlier: one or more variables have values that
    are significantly out of line with what is
    expected for those variables.
  • An outlier may mark a genuine maximum or minimum,
    but it may also be no more than invalid data.
  • One kind of outlier may be the result of a human
    error (Example: age 654 or negative income). It
    should either be corrected (if possible) or
    dropped from the analysis.
  • Another kind of outlier is created when changes
    in operational systems have not been reflected in
    the data mining environment. For example, new
    product codes introduced in operational systems
    show up initially as outliers. In this case you
    have to update the metadata.
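
A minimal sketch of detecting the first kind of outlier with plausibility rules follows. The limits (age between 0 and 120, non-negative income) and the field names are assumptions for the example; in practice the limits would come from the metadata.

```python
def find_invalid(records, rules):
    """Return (row index, field, value) for every value that violates its rule."""
    problems = []
    for i, rec in enumerate(records):
        for field, is_valid in rules.items():
            value = rec.get(field)
            # Missing values are handled separately, so skip None here.
            if value is not None and not is_valid(value):
                problems.append((i, field, value))
    return problems

# Hypothetical plausibility limits; real limits come from the metadata.
rules = {
    "age": lambda v: 0 <= v <= 120,
    "income": lambda v: v >= 0,
}
records = [
    {"age": 34, "income": 52_000},
    {"age": 654, "income": 41_000},
    {"age": 51, "income": -1_000},
]
problems = find_invalid(records, rules)
print(problems)
```

The function only reports the suspicious values; whether each one is corrected or dropped remains a decision for the analyst.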

19
Missing Values
  • Missing values include values that are not
    present in the selected data and those invalid
    values that we may delete during noise detection.
  • Values may be missing because of human error,
    because the information was not available at the
    time of input or because the data was selected
    across heterogeneous sources.
  • One way is to eliminate the observations that
    have missing values. (Easy, but it has the
    drawback of losing valuable data, especially if
    the data to be mined is small or if fraud
    detection or quality control is the objective).
  • Another solution is to drop the whole variable
    from the analysis.
  • Another solution is to replace the missing value
    with its most likely value. For quantitative
    variables, this most likely value could be the
    mean or mode.
  • For categorical variables it could be the mode or
    a newly created value for the variable called
    unknown.
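
The replacement strategies listed above could be sketched as follows. The function name impute, the use of None to mark a missing value, and the sample columns are assumptions for illustration.

```python
import statistics

def impute(values, strategy):
    """Replace None entries according to `strategy`: 'mean', 'mode' or 'unknown'."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    elif strategy == "unknown":
        fill = "unknown"  # newly created category for categorical variables
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [30, None, 40, 50]                          # quantitative variable
status = ["married", None, "single", "married"]    # categorical variable

ages_filled = impute(ages, "mean")
status_mode = impute(status, "mode")
status_unknown = impute(status, "unknown")
print(ages_filled, status_mode, status_unknown)
```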

20
  • A more sophisticated approach for both
    quantitative and categorical variables is to use
    a predictive model (will be discussed later) to
    predict the most likely value for a variable on
    the basis of the values of the other variables in
    the observation.

21
Data Qualification Issues
  • You would not mine a field like CustomerID,
    FirstName, LastName or Address, because these are
    unique fields and there are no patterns to find
    in unique fields.

22
Data Quality Issues - Examples
  • This study is supposed to show only one patient
    record for each patient.

The fact that there are 33 records for one patient
means that we have redundant data in this study,
and it must be cleaned.
23
  • There are inconsistencies and misspellings in the
    values that should read 0-2 Weeks.
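
A cleaning step for such a field might look like the sketch below. The list of misspelled variants is hypothetical; in practice it would be collected while reviewing the distinct values found in the column.

```python
import re

CANONICAL = "0-2 Weeks"
# Hypothetical variants found during review, stored in normalized form.
KNOWN_VARIANTS = {"0-2 weeks", "0-2 week", "0-2 weeeks", "0-2 wks"}

def normalize(value):
    """Collapse spacing and case, then map known variants to the canonical label."""
    key = re.sub(r"\s*-\s*", "-", value.strip().lower())
    key = re.sub(r"\s+", " ", key)
    return CANONICAL if key in KNOWN_VARIANTS else value

raw = ["0-2 Weeks", "0 - 2 weeks", "0-2  Weeeks", "3-4 Weeks"]
cleaned = [normalize(v) for v in raw]
print(cleaned)
```

Values that do not match a known variant are returned unchanged, so the step never silently rewrites a legitimate category.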

24
  • Simple graphical tools (histograms and pie
    charts) can quickly plot the contribution made by
    each value for the categorical variable and
    therefore help to identify distribution skews and
    invalid or missing values.
  • When dealing with quantitative variables, the
    data analyst may be interested in such measures
    as maxima and minima, mean (average), mode (most
    frequently occurring value), median (midpoint
    value), and several statistical measures of
    central tendency (tendency for values to cluster
    around the mean).

25
  • Variance in Defining Terms
  • You may ask: what makes a person an occasional
    smoker versus a frequent smoker?
  • If two hospitals vary in their definition, your
    data is skewed. (Example: five days a week or
    more than 12 times a week).

26
  • Skewed distributions often indicate outliers.
  • Example: a histogram may show that most of the
    people in a target group have low incomes and
    only a few are high earners. This may be the
    result of poor data collection, i.e. the group
    may consist mainly of retired people.

27
  • Data Preparation involves finding the answers to
    several questions, including
  • How do you create the table?
  • How do you mine data that is not in the right
    form?
  • How do you handle data that is not entirely clean?

28
Binning - Examples
  • The field was already binned before you mined it.
  • When you have fields that are a range of numbers,
    it is often best to bin them or define them in
    categories.
  • Most data mining tools will offer ways to bin
    data.
  • How many bins should you have? It depends on the
    data distribution.

29
Data Derivation - Examples
  • There are two fields, Weight and WeightLastYear.
    Another field that might be interesting for data
    mining is one that shows the change in a
    patient's weight, which can be derived by taking
    the difference between the two columns.
  • It can be derived using built in functions,
    SQL...
  • Deriving the name of a state from the area code.
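
The weight example can be sketched in a few lines. Weight and WeightLastYear are the fields named on the slide; the PatientID field, the new WeightChange field and the sample values are assumptions for illustration.

```python
# Hypothetical patient rows; only Weight and WeightLastYear come from the slide.
patients = [
    {"PatientID": 1, "Weight": 82.0, "WeightLastYear": 85.5},
    {"PatientID": 2, "Weight": 70.0, "WeightLastYear": 68.0},
]

def add_weight_change(rows):
    """Return new rows with a derived WeightChange = Weight - WeightLastYear."""
    return [{**r, "WeightChange": r["Weight"] - r["WeightLastYear"]} for r in rows]

derived = add_weight_change(patients)
for row in derived:
    print(row["PatientID"], row["WeightChange"])
```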

30
Data Preparation - Stage 3 - Data Transformation
  • During data transformation, the preprocessed data
    is transformed to produce the analytical data
    model.
  • The techniques used can range from simple data
    format conversions to complex statistical data
    reduction tools (for example, converting between
    US and European formats, or date of birth to
    age).
  • Data reduction is another transformation
    technique, in which we combine several existing
    variables into one new variable. For example,
    income, ZIP code and level of education can be
    combined to estimate the attractiveness of the
    prospect.
  • Data reduction --> a smaller and more manageable
    set for further analysis, but (1) it is not easy
    to determine which variables can be combined,
    (2) combining variables will cause some loss of
    information, and (3) the final result will be
    all the more difficult to interpret.

31
  • Many techniques (like neural networks) can accept
    only numeric input in the 0.0 to 1.0 or -1.0 to
    1.0 range. In these cases, continuous parameter
    values must be scaled so that all have the same
    order of magnitude.
  • Discretization: a technique to convert a
    quantitative variable into a categorical
    variable by dividing the values of the input
    variable into buckets. (Income of 0-9999 --> 1,
    10000-19999 --> 2, ...).
  • One-of-N transformation: converts a categorical
    variable to a numeric representation. (The 4
    values of the variable TypeOfCar could be
    represented by 1000, 0100, 0010 and 0001).
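
The three transformations on this slide could be sketched as follows. Min-max scaling is used here as one common way to reach the 0.0 to 1.0 range, which is an assumption since the slide does not name a scaling method. The function names and the four car types are also hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly into the 0.0 to 1.0 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, width=10_000):
    """Map 0-9999 to bucket 1, 10000-19999 to bucket 2, and so on."""
    return int(value // width) + 1

def one_of_n(value, categories):
    """Encode a categorical value as a list with a single 1."""
    return [1 if value == c else 0 for c in categories]

incomes = [0, 50_000, 100_000]
car_types = ["sedan", "coupe", "van", "truck"]  # hypothetical TypeOfCar values

scaled = min_max_scale(incomes)
buckets = [discretize(v) for v in (9_999, 10_000, 19_999)]
encoded = one_of_n("coupe", car_types)
print(scaled, buckets, encoded)
```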

32
Step(2) Defining a Study (Business Objective
Determination)
  • To ensure that there is a real, critical business
    issue that is worth solving.
  • The only way to find out whether a data mining
    solution is really needed is to properly define
    the business objectives.
  • Ill-defined projects are not likely to succeed or
    result in added value.
  • It requires the collaboration of the business
    analyst with domain knowledge and the data
    analyst who can begin to translate the objectives
    into a data mining application.

33
Step (3) Data Mining
  • The objective is to apply the selected data
    mining algorithm or algorithms to the
    preprocessed data.
  • What happens during the data mining step will
    vary with the kind of application that is under
    development.
  • In the case of data segmentation, one or two runs
    of the algorithm may be sufficient to clear this
    step and move into analysis of the results.
  • In the case of developing a predictive model,
    there will be a cyclical process where the models
    are repeatedly trained and retrained on sample
    data before being tested against the real
    database.

34
  • One difficulty in predictive modeling is that of
    overtraining, where the model predicts well on
    the training data but performs poorly on the real
    test data (i.e. the model learns the detailed
    patterns of the training data but cannot
    generalize well when confronted with new
    observations from the test data set).

35
  • All results were extensively cross-validated
    using a technique that is sometimes called
    10-fold cross-validation. The entire database was
    divided into 10 equal parts. The models were then
    trained on only nine-tenths of the database and
    tested on the remaining one-tenth, which had been
    held out. This process was repeated until each of
    the other tenths had also been used for testing.
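
The splitting scheme described above can be sketched as follows. Shuffling the rows before splitting and the fixed seed are assumptions; the slide only says the database was divided into 10 equal parts. Model training and scoring are left out, since they depend on the technique being validated.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # assumption: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Train on the other k-1 folds, test on the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(100, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Each row is used for testing exactly once and for training in the other nine rounds.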

36
Step (4) Analysis of Results
  • After mining the data --> Have we found something
    that is interesting, valid and actionable?
  • Data mining is different from traditional
    statistical analysis.
  • With statistics, the answer is generally yes or
    no (i.e. the hypothesis is correct or incorrect).
  • With data mining, if it is done well, the results
    either suggest the answer or at least point the
    team in the direction of another avenue of
    research.

37
Examples of rules output
  • If purchases OJ in large (> 12 oz) cans > 58% of
    the time Then Predict Loyal.
  • If primary brand is Brand X Then Predict
    Vulnerable.
  • If buys at warehouse stores > 11% of the
    time Then Predict Vulnerable.
  • If buys > 24.26 ounces per shopping trip on
    average AND average price per ounce >
    0.10 Then Predict Loyal.
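
Rules of this kind can be applied to a customer record as in the sketch below. Several things here are assumptions rather than part of the slide: the field names, reading the 58 and 11 thresholds as percentages of the time, and the policy that rules are tried in the listed order with the first match winning.

```python
# Each rule is (condition, prediction); field names are hypothetical.
RULES = [
    (lambda c: c["large_can_share"] > 0.58, "Loyal"),
    (lambda c: c["primary_brand"] == "Brand X", "Vulnerable"),
    (lambda c: c["warehouse_share"] > 0.11, "Vulnerable"),
    (lambda c: c["avg_ounces_per_trip"] > 24.26
               and c["avg_price_per_ounce"] > 0.10, "Loyal"),
]

def predict(customer):
    """Return the prediction of the first matching rule, or None if none match."""
    for condition, label in RULES:
        if condition(customer):
            return label
    return None

customer_a = {"large_can_share": 0.70, "primary_brand": "Brand Y",
              "warehouse_share": 0.05, "avg_ounces_per_trip": 20.0,
              "avg_price_per_ounce": 0.08}
customer_b = {"large_can_share": 0.10, "primary_brand": "Brand X",
              "warehouse_share": 0.05, "avg_ounces_per_trip": 20.0,
              "avg_price_per_ounce": 0.08}
print(predict(customer_a), predict(customer_b))
```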