Data Mining and Knowledge Acquisition

Description:

Data Mining and Knowledge Acquisition, Chapter 7: Data Mining Overview and Exam Questions, 2014/2015 Summer. Outline: Methodology - Overview, Introduction ...

Slides: 113
Provided by: Jiaw266

1
Data Mining and Knowledge Acquisition
Chapter 7: Data Mining Overview and Exam Questions
  • 2014/2015 Summer

2
Outline
  • Methodology - Overview
  • Introduction
  • Data Description and Preprocessing
  • OLAP
  • Clustering
  • Classification
  • Numerical Prediction - Regression
  • Frequent Pattern Mining
  • Recent BIS Exams
  • Unclassified Questions

3
Methodology and Overview
  • KDD Methodology
  • Functionalities

4
KDD Methodology
  • Methodology
  • Problem definition
  • Data set selection
  • Preprocessing and transformations
  • Functionalities
  • Classification/numerical prediction
  • Clustering
  • Frequent Pattern Mining
  • Association
  • Sequential analysis
  • others

5
KDD Methodology (cont.)
  • Algorithms
  • For classification you can use
  • Decision trees: ID3, C4.5, and CHAID are algorithms
  • For clustering you can use
  • Partitioning methods: k-means, k-medoids
  • Hierarchical: AGNES
  • Probabilistic: EM is an algorithm
  • Presenting results
  • Back transformations
  • Reports
  • Taking action

6
Data Description
  • Single variables
  • Categorical - Ordinal, nominal
  • Frequency plots, tables, Pie charts
  • Continuous - interval, ratio
  • Five-number summary, central tendency, spread
  • Examine the probability distribution
  • For two variables
  • Both categorical
  • Cross tabulation
  • One categorical the other continuous
  • Both are continuous
  • Correlation coefficient, scatter plots
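
The summaries listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the `heights` and `weights` values are made up, and the quartiles use the "inclusive" interpolation rule of the standard library, which is only one of several conventions.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two continuous variables."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical sample: heights (cm) and weights (kg) of five people.
heights = [150, 160, 165, 170, 180]
weights = [50, 58, 63, 68, 80]

print(five_number_summary(heights))   # single continuous variable
print(pearson_r(heights, weights))    # two continuous variables, close to 1 here
```

For two categorical variables the analogous summary is a cross tabulation of counts rather than a correlation coefficient.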

7
Preprocessing
  • Missing values
  • Inconsistencies
  • Redundant data
  • Outliers
  • Data transformations
  • Data reduction
  • Attribute elimination
  • Attribute combination
  • Sampling
  • Histograms

8
Functionalities
  • Styles of Data Mining
  • Descriptive - OLAP
  • Classification
  • Numerical Prediction
  • Clustering
  • Frequent Pattern Mining

9
Two basic styles of data mining
  • Descriptive
  • Cross tabulations, OLAP, attribute-oriented
    induction, clustering, association
  • Predictive
  • Classification, numerical prediction
  • Difference between classification and numerical
    prediction
  • Questions answered by these styles
  • Supervised vs. unsupervised

10
Descriptive - OLAP
  • Concept of data cube
  • Fact table
  • Measures, calculated measures
  • Keys
  • Dimensions
  • Schemas
  • Star, snowflake
  • Concept hierarchies
  • Set grouping, such as price or age
  • Parent-child
  • Attributes not suitable for concept hierarchies

11
Classification
  • Methods
  • Decision trees
  • Neural networks
  • Bayesian
  • k-NN or memory-based reasoning
  • Advantages and disadvantages
  • Given a problem, which data processing techniques
    are required?
  • Given a problem, which classification method or
    algorithm is more appropriate?

12
Classification (cont.)
  • Accuracy of the model
  • Measures for classification/numerical prediction
  • How to better estimate
  • Holdout, cross-validation, bootstrapping
  • How to improve
  • Bagging, boosting
  • For unbalanced classes
  • What to do with models
  • Lift charts

13
Numerical Prediction
  • Learning is supervised
  • Output variable is continuous
  • Methods
  • Regression
  • Simple
  • Multiple
  • Most methods for classification can be used for
    numerical prediction as well
  • Accuracy
  • Root mean square error, mean absolute deviation
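
The two accuracy measures named in the last bullet can be sketched as follows. The `actual` and `predicted` values are hypothetical, chosen only to show that one large error raises the root mean square error more than the mean absolute deviation.

```python
import math

def rmse(actual, predicted):
    """Root mean square error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mean_absolute_deviation(actual, predicted):
    """Mean absolute deviation: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical observed values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 13.0]

print(rmse(actual, predicted))
print(mean_absolute_deviation(actual, predicted))
```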

14
Clustering
  • Distance measures
  • Dissimilarity or similarity
  • For different type of variables
  • Ordinal, binary, nominal, ratio, interval
  • Why the data need to be transformed
  • Partitioning methods
  • k-means, k-medoids
  • Advantages and disadvantages
  • Hierarchical
  • Density based
  • Probabilistic

15
Frequent Pattern Mining
  • Association analysis
  • Apriori or FP-Growth
  • How to measure the strength of rules
  • Support and confidence
  • Other measures of interestingness; critique of
    support and confidence
  • Multiple levels
  • Constraints
  • Sequential pattern mining
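
Support, confidence, and lift for a rule can be computed directly from a transaction list. The sketch below uses a made-up five-transaction basket data set; it is an illustration of the definitions, not an Apriori or FP-Growth implementation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimated P(rhs | lhs) for the rule lhs -> rhs."""
    both = support(transactions, set(lhs) | set(rhs))
    return both / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Confidence divided by the support of rhs; 1 means independence."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
print(lift(transactions, {"bread"}, {"milk"}))
```

In this toy data the rule bread -> milk has fairly high support and confidence, yet its lift is below 1, which is the usual critique of relying on support and confidence alone.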

16
Outline
  • Methodology - Overview
  • Introduction
  • Data Description and Preprocessing
  • OLAP
  • Clustering
  • Classification
  • Numerical Prediction - Regression
  • Frequent Pattern Mining
  • Recent BIS Exams
  • Unclassified Questions

17
Introduction
  • Defining problems
  • Given a short description of an environment,
    define data mining problems fitting different
    functionalities, and possible preprocessing
    problems peculiar to the environment
  • Basic functionalities
  • Given a short description of a data mining
    problem, with which functionality is the problem
    solved?

18
Big University Library
  • 1. Suppose that a data warehouse for
    Big-University Library consists of the following
    three dimensions: users, books, and time. Each
    dimension has four levels, not including the all
    level. There are three measures. You are asked to
    perform a data mining study on that warehouse (25
    points).
  • Define three data mining problems on that
    warehouse involving association, classification,
    and clustering functionalities respectively.
    Clearly state the importance of each problem.
    What is the advantage of the data being organized
    as OLAP cubes compared to relational table
    organisation?

19
Big University Library (cont.)
  • In the data preprocessing stage of KDD:
  • What are the reasons for missing values, and how
    do you handle them?
  • What are the possible data inconsistencies?
  • Do you apply any discretization?
  • Do you apply any data transformations?
  • Do you apply any data reduction strategies?

20
Big University Library (cont.)
  • Define your target and input variables in
    classification. Which classification techniques
    and algorithms do you use in solving the
    classification problem? Support your answer.
  • Define your variables, indicating their
    categories, in clustering. Which clustering
    techniques and algorithms do you use in solving
    the clustering problem? Support your answer.
  • Describe the association task in detail,
    specifying the algorithm and any interestingness
    measures or constraints.

21
Data mining on MIS
  • A data warehouse for the MIS department consists
    of the following four dimensions: student,
    course, instructor, and semester. Each dimension
    has five levels, including the all level. There
    are two measures: count and average grade. At the
    lowest level, average grade is the actual grade
    of a student. You are asked to perform a data
    mining study on that warehouse (25 points).

22
Data mining on MIS (cont.)
  • Define three data mining problems on that
    warehouse involving association, classification,
    and clustering functionalities respectively.
    Clearly state the importance of each problem.
    What is the advantage of the data being organized
    as OLAP cubes compared to relational table
    organisation?
  • In the data preprocessing stage of KDD:
  • What are the reasons for missing values, and how
    do you handle them?
  • What are the possible data inconsistencies?
  • Do you apply any discretization?
  • Do you apply any data transformations?
  • Do you apply any data reduction strategies?

23
Data mining on MIS (cont.)
  • Define your target and input variables in
    classification. Which classification techniques
    and algorithms do you use in solving the
    classification problem? Support your answer.
  • Define your variables, indicating their
    categories, in clustering. Which clustering
    techniques and algorithms do you use in solving
    the clustering problem? Support your answer.
  • Describe the association task in detail,
    specifying the algorithm and any interestingness
    measures or constraints.

24
Outline
  • Methodology - Overview
  • Introduction
  • Data Description and Preprocessing
  • OLAP
  • Clustering
  • Classification
  • Numerical Prediction - Regression
  • Frequent Pattern Mining
  • Recent BIS Exams
  • Unclassified Questions

25
Data Description
  • How to describe single variables: categorical
    and continuous
  • How to describe the association between two
    variables
  • Both continuous
  • Both categorical
  • One continuous, one categorical

26
Preprocessing
  • What to do as preprocessing?
  • Which techniques are applied?
  • For what reason?

27
MIS 542 Midterm 2011/2012 Fall PCA
  • 5. (10 points) Consider two continuous variables
    X and Y. Generate data sets
  • a) where PCA (principal component analysis)
    cannot reduce the dimensionality from two to one
  • b) where, although the two variables are related
    (a functional relationship exists between these
    two variables), PCA is not able to reduce the
    dimensionality from two to one
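
One way to explore case b) is to compare the share of variance carried by each principal component for a linear and a nonlinear relationship. The sketch below is an illustration under stated assumptions, not a model answer: the two data sets are invented, and the eigenvalues of the 2x2 covariance matrix are computed in closed form instead of with a PCA library.

```python
import math

def pca_variance_ratios(xs, ys):
    """Share of total variance captured by each of the two principal components."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]].
    center = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = center + radius, center - radius
    return lam1 / (lam1 + lam2), lam2 / (lam1 + lam2)

xs = [i / 10 for i in range(-20, 21)]
linear = [2 * x + 1 for x in xs]   # Y is a linear function of X
parabola = [x * x for x in xs]     # Y is a function of X, but not a linear one

print(pca_variance_ratios(xs, linear))    # second component carries almost nothing
print(pca_variance_ratios(xs, parabola))  # both components carry substantial variance
```

PCA only removes linear redundancy, so the symmetric parabola, where X and Y are uncorrelated, leaves two components of comparable size even though Y is fully determined by X.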

28
MIS 542 Final 2011/2012 Fall outliers
  • 1. (20 points) Give two examples of outliers:
  • a) where outliers are useful and essential
    patterns to be mined
  • b) where outliers are useless, stemming from
    error or noise

29
MIS 542 Final 2011/2012 Fall transformations
  • 2 (20 points) Considering the classification
    methods we cover in class, describe two distinct
    reasons why continuous input variables have to be
    normalized for classification problems (each
    reason 10 points).

30
Outline
  • Methodology - Overview
  • Introduction
  • Data Description and Preprocessing
  • OLAP
  • Clustering
  • Classification
  • Numerical Prediction - Regression
  • Frequent Pattern Mining
  • Recent BIS Exams
  • Unclassified Questions

31
OLAP
  • Concept of data cube
  • Fact table
  • Measures, calculated measures
  • Keys
  • Dimensions
  • Schemas
  • Star, snowflake
  • Concept hierarchies
  • Set grouping, such as price or age
  • Parent-child
  • Attributes not suitable for concept hierarchies

32
Data warehouse for library
  • A data warehouse is constructed for the library
    of a university to be used as a multi-purpose
    DSS. Suppose this warehouse consists of the
    following dimensions: user, books, and time
    (time_ID, year, quarter, month, week, academic
    year, semester, day). Week is considered not to
    be less than month in the hierarchy. Each
    academic semester starts and ends at the
    beginning and end of a week respectively. Hence,
    week < semester.
  • Describe concept hierarchies for the three
    dimensions. Construct meaningful attributes for
    each of the dimension tables above. Describe at
    least two meaningful measures in the fact table.
    Each dimension can be looked at its ALL level as
    well.
  • What is the total number of cuboids for the
    library cube?
  • Describe three meaningful OLAP queries and write
    an SQL expression for one of them.

33
Big University
  • 2. (Han, page 100, 2.4) Suppose that the data
    warehouse for the Big-University consists of the
    following dimensions: student, course,
    instructor, and semester, and two measures: count
    and average_grade. At the lowest conceptual level
    (for a given student, instructor, course, and
    semester) the average_grade measure stores the
    actual grade of the student. At higher conceptual
    levels average_grade stores the average grade for
    the given combination. (For example, when student
    is MIS, semester is 2005 all terms, course is MIS
    541, and instructor is Ahmet Ak, average_grade is
    the average of the students' grades in that
    course by that instructor over all semesters in
    2005.)

34
Big University (cont.)
  • a) Draw a snowflake schema diagram for that
    warehouse.
  • What are the concept hierarchies for the
    dimensions?
  • b) What is the total number of cuboids?
35
MIS 542 Final 2005/2006 Spring olap
  • 1. The MIS department wants to revise its
    academic strategies for the following ten years.
    Relevant questions are: What portion of the
    courses are required or elective? What is the
    full-time and part-time distribution of
    instructors? What is the course load of
    instructors? What percent of technical or
    managerial courses are taught by part-time
    instructors? How have all these things
36
MIS 542 Final S06 1 cont.
  • changed over the years? You can add similar
    strategic questions of your own. Do not consider
    the student aspects of the problem for the time
    being. Design an OLAP schema to be used as a
    strategic tool. You are free to decide the
    dimensions and the fact table. Describe the
    concept hierarchies, virtual dimensions, and
    calculated members. Finally, show OLAP operations
    to answer three such strategic questions.
37
MIS 54 Final 2012/2013 Hospital
  • 2. (20 pts) Suppose that a data warehouse for a
    hospital consists of the dimensions time, doctor,
    and patient, and the two measures count and
    charge, where charge is the fee a doctor charges
    a patient for a visit.
  • Design a warehouse with a star schema:
  • a) Fact table: design the fact table.
  • b) Dimension tables: for each dimension, show a
    reasonable concept hierarchy.
  • c) State two questions that can be answered by
    that OLAP cube.
  • d) Show drill-down and roll-up operations related
    to one of these questions.

38
Human resource cube
  • 1. (25 points) In an organization, a data
    warehouse is to be designed for evaluating the
    performance of employees. To evaluate the
    performance of an employee, a survey
    questionnaire consisting of a set of questions on
    a 5-point Likert scale is answered by other
    employees in the same company at specified times.
    That is, the performance of employees is rated by
    other employees.
  • Each employee has a set of characteristics,
    including department, education, etc. Each survey
    is conducted at a particular date and applied to
    some of the employees. Questions are aimed at
    evaluating broad categories of performance such
    as motivation, cooperation ability, etc.
  • Typically, a question in a survey, aiming to
    measure a specific attitude of an employee, is
    evaluated by another employee (rated from 1 to
    5). Data is available at the question level.

39
Human resource cube (cont.)
  • Cube design: a star schema
  • Fact table: the fact table should contain one
    calculated member. What are the measures and
    keys?
  • Dimension tables: Employee and Time are the two
    essential dimensions; include Survey and Question
    dimensions as well. For each dimension, show a
    concept hierarchy.
  • State three questions that can be answered by
    that OLAP cube.
  • Show drill-down and roll-up operations related to
    these questions.

40
MIS Midterm 2008/2009 Spring Shipment
  • 1. (20 points) Consider a shipment company
    responsible for shipping items from one location
    to another on predetermined due dates. Design a
    star schema OLAP cube for this problem to be used
    by managers for decision-making purposes. The
    dimensions are time, item to be shipped, person
    responsible for shipping the item, and location.
    For each of these dimensions determine three
    levels in the concept hierarchy. Design the fact
    table with appropriate measures and keys (include
    two measures and at least one calculated member
    in the fact table).
  • Show one drill-down and one roll-up operation.
  • Show the SQL query of one of the cuboids.

41
Outline
  • Clustering

42
Outline
  • Methodology - Overview
  • Introduction
  • Data Description and Preprocessing
  • OLAP
  • Clustering
  • Classification
  • Numerical Prediction - Regression
  • Frequent Pattern Mining
  • Recent BIS Exams
  • Unclassified Questions

43
Comparing clustering methods
  • Clustering methods
  • Partitioning, hierarchical, density based,
    model-based (probabilistic, EM)
  • Compare clustering methods
  • Output
  • Interpretation
  • Sensitivity to outliers
  • Speed of computation

44
Clustering
  • Construct simple data sets showing the
    inadequacies of k-means clustering (20 points)
  • This algorithm is not suitable even for spherical
    clusters of different sizes
  • What are the advantages and disadvantages of
    using k-means?
45
Clustering
  • Consider a delivery center location decision
    problem in a city where a set of related products
    are to be delivered to markets located in the
    city. Design an algorithm for this location
    selection problem by extending an algorithm we
    cover in class. State clearly the algorithm and
    its extensions for this particular problem.

46
Clustering preferences
  • Consider a popular song competition. There are N
    competitors A1, A2, ..., AN. The number of voters
    is very large: a substantial fraction of the
    population of the country. Each voter is able to
    rank the competitors from best to worst, e.g. for
    voter 1 (A4 > A2 > A3 > A1), meaning that there
    are four competitors and A4 is the best for voter
    1, A1 being the worst. Suppose preference data is
    available for a sample of n voters at the
    beginning of the competition.
  • Develop a distance measure between the
    preferences of two voters i and j.
  • Suppose you have the k-means algorithm available
    in a package. Describe how you can use the
    k-means algorithm to cluster voters according
    to their preferences.
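
One possible starting point for this question, offered as a sketch and not as the expected answer: measure the distance between two voters by the number of competitor pairs they order differently (the Kendall tau distance), and encode each voter as a vector of rank positions so that a k-means package working on numeric vectors can be applied. The voter rankings below are hypothetical.

```python
from itertools import combinations

def rank_vector(preference, competitors):
    """Position (1 = best) of each competitor in one voter's ranking."""
    position = {c: i + 1 for i, c in enumerate(preference)}
    return [position[c] for c in competitors]

def kendall_distance(pref_i, pref_j):
    """Number of competitor pairs that the two voters order differently."""
    pos_i = {c: k for k, c in enumerate(pref_i)}
    pos_j = {c: k for k, c in enumerate(pref_j)}
    return sum(
        (pos_i[a] - pos_i[b]) * (pos_j[a] - pos_j[b]) < 0
        for a, b in combinations(pref_i, 2)
    )

competitors = ["A1", "A2", "A3", "A4"]
voter1 = ["A4", "A2", "A3", "A1"]   # the ranking used in the question
voter2 = ["A4", "A3", "A2", "A1"]   # hypothetical: swaps A2 and A3
voter3 = ["A1", "A3", "A2", "A4"]   # hypothetical: exact reverse of voter 1

print(kendall_distance(voter1, voter2))
print(kendall_distance(voter1, voter3))
print(rank_vector(voter1, competitors))
```

The rank vectors give k-means a Euclidean space to work in; whether Euclidean distance on ranks is an acceptable stand-in for the pairwise distance is a design choice that an answer would need to justify.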

47
MIS 542 Final 2005/2006 Spring
  • 3. a) Describe how to modify the k-means
    algorithm so as to handle categorical variables
    (binary, ordinal, nominal).
  • b) What is a disadvantage of the agglomerative
    hierarchical clustering method in the case of
    large data? Suggest a way of eliminating this
    disadvantage while retaining the advantages of
    agglomerative methods.

48
MIS 542 Midterm 2007/2008 Spring
  • Generate a data set of two continuous variables X
    and Y. Consider clustering based on density.
  • When clustered with one variable (either X or Y)
    there is one cluster
  • When clustered with both variables there are
    two clusters

49
MIS 542 Final 2011/2012 Fall
  • 3. a) (10 points) Generate data sets for two
    clustering problems with two continuous
    variables. There are two natural clusters under
    the notion of density-based clustering, but the
    quality of these clusters is low for a
    partitioning approach based on dissimilarity,
    such as k-means.
  • 3. b) (10 points) Consider the advantages and
    disadvantages of partitioning and hierarchical
    agglomerative clustering approaches. Design a
    method for combining the two approaches to
    improve clustering quality. (In the end there
    are hierarchies of clusters.)

50
MIS Midterm 2011/2012 Fall
  • 6. (25 points) A retail company wants to segment
    its customers. The following variables are
    available for each customer: age, income, gender,
    number of children, occupation, house owner, and
    whether the customer has a car or not. There are
    6 categories of goods sold by the company, and
    total purchases from each category are available
    for each customer. In addition, average
    inter-purchase time is also included in the
    database.

51
MIS Midterm 2011/2012 Fall
  • a) What are the types and scales of these
    variables?
  • b) If your tool has only the k-means algorithm,
    which of these variables are more suitable for
    the segmentation problem?
  • c) What data transformations are to be applied?
  • d) How do you reduce number of variables used in
    the analysis?
  • e) If you want to include categorical variables
    into your clustering, how would you treat them?

52
Midterm 2011/2012 Fall
  • In Questions 3-5 artificial data sets are
    generated for given situations.
  • 3. (10 points) Consider a data set of two
    continuous variables X and Y. There are two
    clusters (k = 2).
  • Considering the advantages and disadvantages of
    the partitioning methods k-means and k-medoids,
    generate a two-dimensional data set that
  • a) (5 points) produces almost the same clusters
    by k-medoids and k-means
  • b) (5 points) produces different clusters by
    k-medoids and k-means
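
The difference between the two methods comes from how a cluster is represented: k-means uses the mean, while k-medoids uses an actual data point. The following sketch, with an invented cluster and one invented outlier, shows how a single extreme point moves the mean but not the medoid; it does not run either clustering algorithm.

```python
import math

def mean_point(points):
    """Centroid used by k-means: the coordinate-wise average."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def medoid(points):
    """Representative used by k-medoids: the data point with the
    smallest total Euclidean distance to all other points."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_distance)

cluster = [(1, 1), (1, 2), (2, 1), (2, 2)]   # hypothetical compact cluster
with_outlier = cluster + [(20, 20)]          # same cluster plus one outlier

print(mean_point(cluster), mean_point(with_outlier))
print(medoid(with_outlier))
```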

53
Outline
  • Methodology - Overview
  • Introduction
  • Data Description and Preprocessing
  • OLAP
  • Clustering
  • Classification
  • Numerical Prediction - Regression
  • Frequent Pattern Mining
  • Recent BIS Exams
  • Unclassified Questions

54
Outline
  • Classification
  • General
  • Decision trees
  • Neural networks
  • Bayesian
  • K-NN
  • Accuracy measures

55
Information gain
  • Consider a data set of two attributes A and B. A
    is continuous, whereas B is categorical, having
    two values, y and n, which can be considered as
    the class of each observation. When attribute A
    is discretized into two equal-width intervals, no
    information is provided about the class attribute
    B, but when it is discretized into three
    equal-width intervals, there is perfect
    information about B. Construct a simple data set
    obeying these characteristics.
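
A candidate data set can be checked numerically by computing the information gain of A about B under each discretization. The sketch below is one possible construction, not the only one: the six values of `A` and the labels in `B` are invented so that the middle third of the range is class y and the outer thirds are class n.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def information_gain(values, labels, k):
    """Entropy of the labels minus the weighted entropy within each bin."""
    bins = equal_width_bins(values, k)
    n = len(labels)
    remainder = 0.0
    for b in set(bins):
        subset = [lab for bi, lab in zip(bins, labels) if bi == b]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical data: class y only in the middle third of the range of A.
A = [0.25, 0.75, 1.25, 1.75, 2.25, 2.75]
B = ["n", "n", "y", "y", "n", "n"]

print(information_gain(A, B, 2))   # two intervals: each half mirrors the whole
print(information_gain(A, B, 3))   # three intervals: every interval is pure
```

With two intervals each half has the same class proportions as the whole data set, so the gain is zero; with three intervals every bin is pure, so the gain equals the full entropy of B.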

56
Decision tree
  • 2. a) Construct a data set that generates the
    tree shown below. In addition, the following
    conditions are satisfied:

57
MIS 541 2012/2013 Final
  • 1. (20 pts) Consider a decision tree in which
    each node has only two branches and the attribute
    selection measure is entropy. Bearing in mind
    that each candidate input attribute may have more
    than two distinct values, how do you modify the
    ID3 algorithm to handle such a constraint on the
    number of branches of the tree?

58
MIS 542 Final 2005/2006 Spring
  • 2. Given the training data set with missing
    values:

A (size)  B (color)  C (shape)  Class
small     yellow     round      A
big       yellow     round      A
big       yellow     red        A
small     red        round      A
small     black      round      B
big       black      cube       B
big       yellow     cube       B
big       black      round      B
small     yellow     cube       B
59
MIS 542 Final 2005/2006 Spring (cont.)
  • a) Apply the C4.5 algorithm to construct a
    decision tree.
  • b) Given the new inputs X (size = small, color
    missing, shape = round) and Y (size = big, color
    = yellow, shape missing), what is the prediction
    of the tree for X and Y?
  • c) How do you classify the new data points given
    in part b) using Bayesian classification?
  • d) Analyse the possibility of pruning the tree.
    You can use the normal approximation to the
    binomial distribution even though the number of
    observations is low. The z value for the upper
    confidence limit at c = 25% is 0.69.
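
For part d), the slide does not spell out the estimate to use. Assuming the standard pessimistic error estimate used in textbook presentations of C4.5 pruning, the upper confidence limit on the error rate at a node is

```latex
% f : observed error rate at the node (misclassified / N)
% N : number of training instances reaching the node
% z : 0.69 for the confidence level c = 25%
e = \frac{f + \dfrac{z^{2}}{2N}
          + z\sqrt{\dfrac{f}{N} - \dfrac{f^{2}}{N} + \dfrac{z^{2}}{4N^{2}}}}
         {1 + \dfrac{z^{2}}{N}}
```

Under this assumption, a subtree is a candidate for pruning when the estimate e for the node treated as a single leaf is no larger than the instance-weighted average of the estimates of its children.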

60
MIS 542 Final S06 neural networks
  • 4. Consider a classification problem with two
    classes, C1 and C2. There are two numerical
    input variables, X1 and X2, taking values
    between 0 and infinity. All observations are of
    class C1 if they are above the curve X2 = 1/X1 (a
    hyperbola). All other observations are of class
    C2. Describe how multilayer perceptrons can
    separate such a boundary using as few hidden
    nodes as possible.

61
MIS 542 Midterm S08 2 csass,f,cat,pm
  • Consider a classification problem with two
    continuous variables X and Y and a categorical
    output with two distinct values C1 and C2.
  • Generate data sets such that
  • A) Decision trees are appropriate for
    classification
  • B) Decision trees are not appropriate for
    classification but a perceptron can classify the
    data successfully
  • C) Even a single perceptron is not enough to
    classify the data
  • D) How do you incorporate a perceptron into
    decision trees so that cases in B and C can be
    classified by a hybrid approach of DTs and
    perceptrons?

62
Final 2010/2011 Spring
  • 2. (30 pt.) Consider a prediction problem, e.g.
    predicting weight using height (a continuous
    variable) as input, solved by neural networks.
    Methods such as backpropagation try to minimize
    the prediction error, but it is claimed that the
    magnitude of the error depends on the weight: a
    prediction error of 0.5 for a baby with a short
    height should not be treated the same as for an
    adult with a height of 2.00 meters.
  • a) Make a scatter plot of such a hypothetical
    data set for a two variable problem.
  • b) Plot the prediction error on another graph
  • c) Do you need to modify the back propagation
    algorithm so as to handle such a situation? If so
    explain your modification.

63
Final 2011/2012 Fall pverf,tt,mg
  • 4. Illustrate the overfitting of neural networks
    for the following cases by generating data sets.
  • a) (10 points) For a binary classification
    problem with two continuous inputs.
  • b) (10 points) For a numerical prediction problem
    (output being continuous) with one continuous
    input variable.

64
Midterm 2011/2012 Fall
  • 4. (10 points) Consider a classification problem
    solved by a decision tree. Consider a categorical
    input variable A having two distinct values. The
    output variable B has two distinct classes as
    well. At a particular node of the tree there are
    N data objects. Generate a partitioning of the
    data by input variable A for the following cases:
  • a) A does not provide any information: it does
    not decrease entropy at all (zero information
    gain).
  • b) A provides perfect information: it decreases
    entropy as much as possible (maximum information
    gain).

65
MIS 541 2012/2013 Final
  • 5. (20 pts) Consider a classification problem
    solved by k-NN. Suppose in your data set all
    inputs are continuous variables. Why do you need
    to apply data transformations? What data
    transformation is applied? Suppose the variables
    are to be weighted after the transformations.
    Devise a method for determining optimal weights
    for the variables as well as the optimal k value,
    considering that k-NN is a supervised learning
    method.
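
The need for a transformation can be illustrated with a small k-NN sketch in which one input has a much larger numeric range than the other. Everything here is an assumption made for illustration: the age and salary values, the junior/senior labels, and the choice of min-max scaling (z-score scaling would be an equally reasonable choice).

```python
import math
from collections import Counter

def fit_min_max(rows):
    """Per-column minimum and range, learned from the training rows."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return lows, spans

def scale(row, lows, spans):
    """Map each value into [0, 1] using the training minimum and range."""
    return [(v - lo) / span for v, lo, span in zip(row, lows, spans)]

def knn_predict(train_x, train_y, query, k):
    """Majority class among the k training rows nearest to the query."""
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: (age in years, salary in currency units).
train_x = [(25, 50000), (28, 52000), (55, 51000), (60, 53000)]
train_y = ["junior", "junior", "senior", "senior"]
query = (57, 50000)

raw_prediction = knn_predict(train_x, train_y, query, k=1)

lows, spans = fit_min_max(train_x)
scaled_x = [scale(row, lows, spans) for row in train_x]
scaled_prediction = knn_predict(scaled_x, train_y, scale(query, lows, spans), k=1)

print(raw_prediction, scaled_prediction)
```

Without scaling, salary differences dominate the distance and the 57-year-old is matched to a 25-year-old. Because k-NN is supervised, k and any per-variable weights could be chosen in the same spirit by comparing accuracy on held-out or cross-validated data.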

66
MIS 541 2012/2013 Final
  • 5. (20 pts) The following table consists of
    training data from an employee database.
  • The predicted variable is Status. Age, Salary,
    and Department are inputs.
  • Design a multilayer feedforward neural network
    for the given data. Label the nodes in the
    input, hidden, and output layers. Describe how
    you encode the input and output variables, and
    specify the parameters of the network that can
    be changed by the backpropagation algorithm.

67
Department  Status  Age    Salary
Sales       Senior  31-35  46K-50K
Sales       Junior  26-30  26K-30K
Sales       Junior  31-35  31K-35K
Systems     Junior  21-25  46K-50K
Systems     Senior  31-35  66K-70K
Systems     Junior  26-30  46K-50K
Systems     Senior  41-45  66K-70K
Marketing   Senior  36-40  46K-50K
Marketing   Junior  31-35  41K-45K
Secretary   Senior  46-50  36K-40K
Secretary   Junior  26-30  26K-30K

68
Accuracy measures
  • For class balance or imbalance problems
  • Output variables with an ordinal scale
  • How do you modify the accuracy measure for an
    ordinal output variable with three different
    values?
  • Give an example of such a variable

69
Outline
  • Methodology - Overview
  • Introduction
  • Data Description and Preprocessing
  • OLAP
  • Clustering
  • Classification
  • Numerical Prediction - Regression
  • Frequent Pattern Mining
  • Recent BIS Exams
  • Unclassified Questions

70
BIS 541 2012/2013 Final II
  • 5. Based on a sample of 30 observations, the
    population regression model is
  • $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
  • The least squares estimate of the intercept is
    10.0.
  • The sums of the values of the dependent and
    independent variables are 450 and 150
    respectively.
  • The estimated variance of the dependent variable
    is 25; the variance of the residuals is 4.
  • a) What is the least squares estimate of the
    slope coefficient? Interpret the figure.
  • b) What are the values of SSR and SSE?
  • c) Find and interpret the coefficient of
    determination.
  • d) Test the null hypothesis that the explanatory
    variable X does not have a significant effect on
    Y at a confidence level of 95%. The critical
    value is $F_{0.05}(1, 28) = 4.20$.
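
The quantities asked for follow from a few standard identities, sketched below with the numbers given in the question. Two assumptions are made that the question does not state: the variance of the dependent variable is the sample variance with divisor n - 1, and the residual variance is SSE divided by n - 2.

```python
n = 30
b0 = 10.0                     # least squares intercept (given)
sum_y, sum_x = 450.0, 150.0   # sums of dependent and independent values (given)
var_y = 25.0                  # variance of Y, assumed to use divisor n - 1
var_resid = 4.0               # residual variance, assumed to be SSE / (n - 2)

y_bar, x_bar = sum_y / n, sum_x / n
b1 = (y_bar - b0) / x_bar     # the fitted line passes through (x_bar, y_bar)

sst = (n - 1) * var_y         # total sum of squares
sse = (n - 2) * var_resid     # error (residual) sum of squares
ssr = sst - sse               # regression sum of squares
r_squared = ssr / sst         # coefficient of determination
f_stat = (ssr / 1) / (sse / (n - 2))   # compare with the critical value 4.20

print(b1, ssr, sse, r_squared, f_stat)
```

Under these assumptions the F statistic is far above the critical value, so the null hypothesis of no effect would be rejected; with a different variance convention the sums of squares change accordingly.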

71
BIS 541 2013/2014 Final
  • 4. Based on a sample of 50 observations, the
    population regression model to predict the number
    of automobile sales (dependent variable) based on
    advertisement placements (independent variable)
    is
  • $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
  • The least squares estimate of the slope is 2.0.
  • The average of the values of the independent
    variable is 50. The sum of the values of the
    dependent variable is 5390.
  • The total sum of squares for the dependent
    variable is 9000. The variance of the residuals
    is 40.

72
BIS 541 2013/2014 Final
  • a) What is the least squares estimate of the
    intercept coefficient? Interpret the figure.
  • b) Interpret the slope coefficient.
  • c) What are the values of SSR and SSE?
  • d) Find and interpret the coefficient of
    determination.

73
MIS 214 Midterm 2012/2015 Summer
  • 5. (20 pt) An analyst wants to estimate the
    dependence of the quantity demanded of a product
    (Y) on its price (X1) and the price of its
    substitute (X2) using linear regression, based on
    a large sample of data obtained from 50 weeks.
  • Fill in the missing parts in the following
    regression outputs (from a to l; this is the
    letter l).
  • Do not report the s but you may need their
    values.
  • Do not write on this table.
  • R-square: f
  • Adjusted R-square: g
  • Standard error of regression: h

            SS   d.f.  MS   F  p-value
Regression  a    c     d    e
Error       b    d     2.5
Total       400  e

74
MIS 214 Final 2013/2014 Spring
  • 1. (20 pt) For the following four scenarios, each
    having two cases denoted by I and II, draw
    scatter plots of X (explanatory variable) and Y
    (dependent variable), showing the population
    regression model drawn as a line or curve as
    well. Use around 20-25 hypothetical points.
    Unless otherwise stated, the assumptions of least
    squares hold. In I and II the population slope
    and intercept are the same.
  • a) In II the variance of the error is higher than
    in I.
  • b) In II the coefficient of determination is
    higher than in I.
  • c) In II the spread of X is higher than in I.
  • d) In II the variance of the error term increases
    with higher values of X. In I, the error variance
    is homoscedastic.

75
Outline
  • Methodology - Overview
  • Introduction
  • Data Description and Preprocessing
  • OLAP
  • Clustering
  • Classification
  • Numerical Prediction - Regression
  • Frequent Pattern Mining
  • Recent BIS Exams
  • Unclassified Questions

76
Exercise
  • a) Suppose A → B and B → C are strong rules.
  • Does this imply that A → C is also a strong rule?
  • b) Suppose A → C and B → C are strong rules.
  • Does this imply that A AND B → C is also a strong
    rule?
  • c) Suppose A → B and A → C are strong rules.
  • Does this imply that A → B AND C is also a
    strong rule?
  • d) Suppose A → B AND C is a strong rule. Does
    this imply that A → B and A → C are strong rules?
  • e) Suppose A AND B → C is a strong rule. Does
    this imply that A → C and B → C are strong rules?

77
Exercise
  • a) Suppose {A, B, C} is a frequent 3-itemset.
    Does it imply that {A, B} and {A, C} are frequent
    2-itemsets?
  • b) Suppose {A, B}, {A, C}, and {B, C} are
    frequent 2-itemsets. Does it imply that
    {A, B, C} is a frequent 3-itemset?
  • c) Suppose {A, B} is a frequent 2-itemset. Does
    it imply that A → B and B → A are strong rules?

78
Associations
  • In a particular database, A → C and B → C are
    strong association rules based on the
    support-confidence measure. A and B are
    independent items. Does this imply that
    A AND B → C is also a strong rule based on the
    lift measure? A, B, and C are items in a
    transaction database.
  • If A → B and B → C are strong, is A → C a strong
    rule?
  • If A → B and A → C are strong, is B → C a strong
    rule?

79
MIS 542 midterm S06 association constraint
  • The price of each item in a store is nonnegative.
    For the following cases indicate the type of
    constraint (such as monotone, antimonotone,
    tough, strongly convertible, or succinct):
  • a) Containing at least one Nintendo Game.
  • b) The average price of items is between 100 and
    500.

80
BIS 541 2012/2013 Final II
  • 4. Questions about constraint-based
    association rule mining
  • The price of each item is nonnegative. For the
    following cases indicate the type of constraint
    (monotonic, anti-monotonic or none):
  • a) the sum of prices of items is less than or
    equal to 10
  • b) the average price of items is less than or
    equal to 20
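
A brute-force check on a small item universe can help build intuition for these constraint types. The sketch below is an assumption-laden illustration: the prices, item names and helper functions are hypothetical. A failed check is a genuine counterexample, while a passed check only shows the property holds on this particular universe.

```python
from itertools import combinations

# Hypothetical nonnegative prices; item names are illustrative only.
prices = {"a": 2, "b": 5, "c": 9, "d": 30, "e": 60}


def sum_le_10(itemset):
    return sum(prices[i] for i in itemset) <= 10


def avg_le_20(itemset):
    return sum(prices[i] for i in itemset) / len(itemset) <= 20


def all_itemsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    ordered = sorted(items)
    for k in range(1, len(ordered) + 1):
        for combo in combinations(ordered, k):
            yield frozenset(combo)


def is_anti_monotone(constraint, items):
    """If a set satisfies the constraint, so must each non-empty subset."""
    for s in all_itemsets(items):
        if constraint(s):
            for sub in all_itemsets(s):
                if not constraint(sub):
                    return False
    return True


def is_monotone(constraint, items):
    """If a set satisfies the constraint, so must each superset."""
    universe = list(all_itemsets(items))
    for s in universe:
        if constraint(s):
            for t in universe:
                if s < t and not constraint(t):
                    return False
    return True


items = list(prices)
print("sum <= 10:", is_anti_monotone(sum_le_10, items), is_monotone(sum_le_10, items))
print("avg <= 20:", is_anti_monotone(avg_le_20, items), is_monotone(avg_le_20, items))
```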

81
MIS 214 Final 2013/2015 Spring
  • (15 pt) Given that L4 = {(1,2,3,4), (2,4,5,6)}, where
    1, 2, ..., 6 are IDs of items.
  • a) Write an L3 consisting of five 3-itemsets
  • b) Write a C3 of seven 3-itemsets
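
When working on such a question, it can be useful to list the 3-itemsets that the Apriori property forces to be frequent, since every subset of a frequent itemset is itself frequent. The sketch below does only that enumeration; the variable names are illustrative.

```python
from itertools import combinations

# Frequent 4-itemsets as given in the question.
L4 = [(1, 2, 3, 4), (2, 4, 5, 6)]

# Every 3-subset of a frequent 4-itemset must itself be frequent.
required_3_itemsets = sorted(
    {subset for itemset in L4 for subset in combinations(itemset, 3)}
)

for subset in required_3_itemsets:
    print(subset)
print(len(required_3_itemsets), "3-itemsets are implied by L4")
```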

82
Outline
  • Methodology - Overview
  • Introduction
  • Data Description Preprocessing
  • OLAP
  • Clustering
  • Classification
  • Numerical Prediction - Regression
  • Frequent Pattern Mining
  • Recent BIS Exams
  • Unclassified Questions

83
BIS 541 2011/2012 Final
  • 1. For each of the following problems identify
    the relevant data mining tasks
  • a) A weather analyst is interested in
    calculating the likely change in temperature for
    the coming days.
  • b) A marketing analyst is looking for
    groups of customers so as to apply different CRM
    strategies for each group.
  • c) A medical doctor must decide whether a
    set of symptoms is an indication of a particular
    disease.
  • d) An educational psychologist would like to
    identify exceptional students to suggest them for
    special educational programs.

84
BIS 541 2011/2012 Final
  • 2. Develop a data warehouse for an insurance
    company using a fact constellation schema. The
    company holds insurance premiums paid by its
    customers for different types of policies as well
    as the payments made to its customers in case of
    accidents. There are two fact tables, for
    premiums and payments respectively. The
    dimensions are customer, time, policy and accident;
    some are shared by the two fact tables.
  • a) design the fact tables' keys and measures
  • b) design the dimension tables and their concept
    hierarchies
  • c) show one roll-up and one drill-down operation

85
BIS 541 2011/2012 Final
  • 3. Consider a customer segmentation problem to
    be solved with the k-means algorithm. The following
    variables are available in the dataset: gender,
    member card information, total spending in TL and
    education level.
  • a) What are the scales of these variables?
  • b) How would you transform the data before applying
    clustering?
  • c) How do you find the similarity/dissimilarity
    between two customers?
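
One possible way to prepare such mixed-scale variables for a distance-based method is sketched below. Everything specific here is an assumption made for illustration: the sample customers, treating member card information as a yes/no flag, the ordering in `EDUCATION_ORDER`, and the choice of min-max scaling with Euclidean distance.

```python
import math

# Hypothetical customers; values and the education ordering are assumed.
customers = [
    {"gender": "F", "card": True, "spending": 1200.0, "education": "university"},
    {"gender": "M", "card": False, "spending": 300.0, "education": "high school"},
    {"gender": "F", "card": True, "spending": 5000.0, "education": "graduate"},
]

EDUCATION_ORDER = ["primary", "high school", "university", "graduate"]


def encode(customer, min_spend, max_spend):
    """Map one customer to a numeric vector with every component in [0, 1]."""
    gender = 1.0 if customer["gender"] == "F" else 0.0          # nominal, binary
    card = 1.0 if customer["card"] else 0.0                     # nominal, binary
    spend = (customer["spending"] - min_spend) / (max_spend - min_spend)  # ratio
    rank = EDUCATION_ORDER.index(customer["education"])         # ordinal
    education = rank / (len(EDUCATION_ORDER) - 1)
    return [gender, card, spend, education]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


spends = [c["spending"] for c in customers]
vectors = [encode(c, min(spends), max(spends)) for c in customers]

print(euclidean(vectors[0], vectors[1]))
print(euclidean(vectors[0], vectors[2]))
```

Scaling matters because an unscaled spending value in TL would otherwise dominate every distance.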

86
BIS 541 2011/2012 Final
  • 4. Construct a particular node of a decision tree.
    There are 6 data points at that node. The output
    is a categorical variable with two distinct
    values. Generate a data set of three variables,
    one being the output (Y) and the others inputs
    (X1 and X2), such that X1 reduces the information
    gain as much as possible whereas X2 does not
    reduce the information gain at all.
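
Candidate data sets for this question can be checked numerically. The sketch below reads the question as asking for an X1 with the largest possible information gain and an X2 with zero gain, which is an interpretation rather than something the slide states; the six rows and the helper names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of `labels` minus the weighted entropy after splitting on `feature`."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical node with 6 data points and a two-valued output.
Y = ["yes", "yes", "yes", "no", "no", "no"]
X1 = ["a", "a", "a", "b", "b", "b"]   # separates the classes perfectly
X2 = ["p", "q", "q", "p", "q", "q"]   # each branch keeps the 50/50 class mix

print(information_gain(X1, Y))
print(information_gain(X2, Y))
```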

87
BIS 541 2011/2012 Final
  • 1. Generate two different data sets of two
    continuous input variables X1 and X2 for a
    clustering problem.
  • a) that would give almost the same set of
    clustering results when solved by k-means and
    k-medoids
  • b) that would give different sets of clusters
    when solved by k-means and k-medoids
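
The two methods differ mainly in how a cluster is represented: k-means uses the mean of the points, k-medoids uses an actual data point. The sketch below compares only these representatives for one cluster, with and without an outlier; it is not a full implementation of either algorithm, and the points are made up for illustration.

```python
import math


def centroid(points):
    """Mean of the points, the cluster representative used by k-means."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def medoid(points):
    """Data point with the smallest total distance to all the others."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Hypothetical (X1, X2) points forming one compact cluster.
compact = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (1.5, 1.5)]
# The same cluster with a single far-away observation added.
with_outlier = compact + [(20.0, 20.0)]

print(centroid(compact), medoid(compact))
print(centroid(with_outlier), medoid(with_outlier))
```

On the compact data the two representatives coincide, while the outlier pulls the centroid away and leaves the medoid in place.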

88
BIS 541 2011/2012 Final
  • 2. Develop a data warehouse for holding the academic
    performance of a university's faculty members.
    The dimensions are time (here the academic year is
    important, but the day of publication is too
    detailed), faculty member and paper. For an article
    published by a faculty member in a particular
    paper, the number of citations received and the impact
    factor of that paper are important. Papers can be
    journal articles or conference proceedings; journals
    can be in SCI or SSCI, and each such journal or
    conference has a prestige factor, a continuous
    variable.
  • a) design the fact table keys and measures
  • b) design the dimension tables and their concept
    hierarchies
  • c) describe in words five different types of
    queries that can be answered by the OLAP cube
  • d) show two roll-up and two drill-down operations

89
BIS 541 2011/2012 Final
  • 3. Generate data sets for a supervised learning
    problem solved by neural networks.
  • a) There are two continuous independent
    variables X1 and X2 and a class variable with two
    different values such as yes and no. On the same
    artificially generated dataset illustrate the
    concept of overfitting by neural networks.
  • b) Illustrate the behavior of training and test
    errors as the complexity of the network increases.

90
BIS 541 2011/2012 Final
  • 4. Consider a classification problem to be solved
    by the k-NN method. The output is whether the
    customer will buy a product or not. The inputs
    are income, age, education level of the customer
    and profession of the customer (having here
    distinct values).
  • a) Describe the data transformations needed in
    the preprocessing step to prepare the data set
    to be classified by k-NN.
  • b) How do the data transformations differ
    from those needed to solve the same problem with
    neural networks?

91
BIS 541 2012/2013 Final II
  • 1. For each of the following problems identify
    the relevant data mining tasks with a brief
    explanation
  • a) A weather analyst is interested in
    whether the temperature will be up or down for
    the coming day.
  • b) An insurance analyst intends to group
    policy holders according to characteristics of
    customers and policies.
  • c) A medical researcher is looking for
    symptoms that occur together among a
    large set of patients.
  • d) An educational program director would like
    to determine the likely GPA of applicants to an MA
    program from their ALES scores, undergraduate
    GPAs and entrance exam scores.

92
BIS 541 2012/2013 Final II
  • 2. Develop a data warehouse for a weather bureau
    that has many probes located all over a large
    region, using a star schema. These probes collect
    basic weather data such as temperature, air
    pressure and humidity every hour. All the data
    is sent to a central station to be processed.
  • a) design the fact table keys and measures
  • b) design the dimension tables and their concept
    hierarchies
  • c) state two questions that can be answered by
    querying the warehouse.
  • d) show one roll-up and one drill-down operation
    about one of these questions

93
BIS 541 2012/2013 Final II
  • 3. Evaluate the four classification methods
    decision trees, neural networks, Bayesian
    classification and k-NN in terms of
  • a) accuracy
  • b) speed of model development and use
  • c) understandability and interpretability of
    output
  • d) handling of outliers if not handled in the
    preprocessing step

94
BIS 541 2012/2013 Final II
  • 4. Questions about constraint-based
    association rule mining
  • The price of each item is nonnegative. For the
    following cases indicate the type of constraint
    (monotonic, anti-monotonic or none):
  • a) the sum of prices of items is less than or
    equal to 10
  • b) the average price of items is less than or
    equal to 20

95
BIS 541 2012/2013 Final II
  • 5. Based on a sample of 30 observations, the
    following population regression model is estimated:
  • $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
  • The least squares estimate of the intercept is 10.0.
  • The sums of the values of the dependent and independent
    variables are 450 and 150 respectively.
  • The estimated variance of the dependent variable is 25;
    the variance of the residuals is 4.
  • a) What is the least squares estimate of the slope
    coefficient? Interpret the figure.
  • b) What are the values of SSR and SSE?
  • c) Find and interpret the coefficient of
    determination.
  • d) Test the null hypothesis that the explanatory
    variable X does not have a significant effect on
    Y at a confidence level of 95%. The critical value is
    $F_{\alpha=0.05}(1, 28) = 4.20$.
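
The arithmetic behind a question like this can be laid out step by step. The sketch below rests on two assumptions that the slide does not state: the variance of the dependent variable is the sample variance with divisor n - 1, and the residual variance uses divisor n - 2. Under other conventions the numbers change.

```python
# Givens from the question.
n = 30
b0 = 10.0                      # least squares estimate of the intercept
sum_y, sum_x = 450.0, 150.0
var_y = 25.0                   # assumed divisor: n - 1
var_resid = 4.0                # assumed divisor: n - 2
f_critical = 4.20

# The fitted line passes through (x_bar, y_bar): b0 = y_bar - b1 * x_bar.
y_bar, x_bar = sum_y / n, sum_x / n
b1 = (y_bar - b0) / x_bar

# Sums of squares under the stated assumptions.
sst = var_y * (n - 1)          # total sum of squares
sse = var_resid * (n - 2)      # error (residual) sum of squares
ssr = sst - sse                # regression sum of squares
r_squared = ssr / sst

# F statistic for H0: slope = 0, with 1 and n - 2 degrees of freedom.
f_stat = (ssr / 1) / (sse / (n - 2))
reject_h0 = f_stat > f_critical

print(b1, sst, sse, ssr, round(r_squared, 4), f_stat, reject_h0)
```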

96
BIS 541 2013/2014 Final
  • 1. For each of the following problems identify
    the relevant data mining tasks with a brief
    explanation
  • a) A financial analyst is interested in
    whether the stock market index will be up or
    down for the coming day.
  • b) Cities in Turkey are grouped according to
    their voting characteristics after the election
    of the President of the Republic.
  • c) A security specialist is interested in
    determining whether mail messages are spam or not
    by looking at the words occurring in the messages.
  • d) A medical doctor is interested in what
    symptoms (binary variables) occur together for a
    specific type of cancer.

97
BIS 541 2013/2014 Final
  • 2. Evaluate the four clustering methods k-means,
    k-medoids, hierarchical and model-based
    (probabilistic) in terms of
  • a) handling of non-spherical shapes
  • b) speed of model development
  • c) understandability and interpretability of
    output
  • d) sensitivity to outliers.
  • In each of these aspects mention only the
    remarkable methods (you need not mention all
    methods in all aspects)

98
BIS 541 2013/2014 Final
  • 3. Develop a data warehouse for the election of
    the president of the republic, using a star schema.
    There are many poll stations (sandik) located all over
    the country. Each poll
    station has valid votes for each of the three
    candidates, invalid votes and the total number of
    voters. Each poll station has a set of
    location-related variables such as district, city and some
    characteristics of cities. There is no time
    dimension in this version of the problem.

99
BIS 541 2013/2014 Final
  • a) design a warehouse with a star schema: fact table
    keys and measures, and at least two calculated
    measures.
  • b) design the dimension tables and their concept
    hierarchies
  • c) state two questions that can be answered by
    querying the warehouse.
  • d) show one roll-up and one drill-down operation
    about one of these questions

100
BIS 541 2013/2014 Final
  • 4. Based on a sample of 50 observations, the
    following population regression model is estimated
    to predict the number of automobile sales (dependent
    variable) from advertisement placements
    (independent variable):
  • $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
  • The least squares estimate of the slope is 2.0.
  • The average of the values of the independent variable is
    50. The sum of the values of the dependent variable is
    5390.
  • The total sum of squares for the dependent variable is
    9000. The variance of the residuals is 40.

101
BIS 541 2013/2014 Final
  • a) What is the least squares estimate of the intercept
    coefficient? Interpret the figure.
  • b) Interpret the slope coefficient.
  • c) What are the values of SSR and SSE?
  • d) Find and interpret the coefficient of
    determination.
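
For the automobile sales question (sample of 50, slope 2.0, mean of X 50, sum of Y 5390, total sum of squares 9000, residual variance 40), the quantities asked for follow from a few identities. The sketch below assumes that the residual variance uses divisor n - 2, which the slides do not state; the example input `x_new` is purely illustrative.

```python
# Givens from the question.
n = 50
b1 = 2.0                 # least squares estimate of the slope
x_bar = 50.0             # average of the independent variable
sum_y = 5390.0
sst = 9000.0             # total sum of squares of the dependent variable
var_resid = 40.0         # assumed divisor: n - 2

# Intercept from the fact that the fitted line passes through (x_bar, y_bar).
y_bar = sum_y / n
b0 = y_bar - b1 * x_bar

# Decomposition SST = SSR + SSE under the stated assumption.
sse = var_resid * (n - 2)
ssr = sst - sse
r_squared = ssr / sst

# Illustrative prediction for a hypothetical number of placements.
x_new = 60.0
predicted_sales = b0 + b1 * x_new

print(round(b0, 2), sse, ssr, round(r_squared, 4), round(predicted_sales, 2))
```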

102
Outline
  • Methodology - Overview
  • Introduction
  • Data Description Preprocessing
  • OLAP
  • Clustering
  • Classification
  • Numerical Prediction - Regression
  • Frequent Pattern Mining
  • Recent BIS Exams
  • Unclassified Questions

103
  • 5. (25 points) Consider a data set representing
    the interactions among a set of people. The
    degree of interaction is a positive real number;
    high values can be interpreted as the two
    members being closely related (they have close
    interactions such as heavy telephone calls or
    mail traffic between them). In other words, rather
    than including the coordinates of variables
    directly, the similarity/dissimilarity matrix is
    given. This is a symmetric matrix. Develop an
    algorithm for clustering similar objects into
    the same clusters. Assume that the number of clusters (k)
    is given.
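
One family of methods that works directly on a similarity matrix is agglomerative clustering. The sketch below uses average-link merging and stops at k clusters; it is only one possible approach, not necessarily the intended answer, and the function name and the example matrix are hypothetical.

```python
def cluster_from_similarity(sim, k):
    """Average-link agglomerative clustering on a symmetric similarity matrix.

    Starts with one cluster per object and repeatedly merges the two clusters
    with the highest average pairwise similarity until k clusters remain.
    """
    clusters = [[i] for i in range(len(sim))]

    def average_similarity(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = [
            (p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        p, q = max(
            pairs, key=lambda pq: average_similarity(clusters[pq[0]], clusters[pq[1]])
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # q > p, so index p stays valid
    return [sorted(c) for c in clusters]


# Hypothetical interaction strengths among five people (diagonal unused).
sim = [
    [0.0, 9.0, 8.0, 1.0, 0.5],
    [9.0, 0.0, 7.0, 0.5, 1.0],
    [8.0, 7.0, 0.0, 1.0, 1.0],
    [1.0, 0.5, 1.0, 0.0, 6.0],
    [0.5, 1.0, 1.0, 6.0, 0.0],
]

result = cluster_from_similarity(sim, 2)
print(result)
```

A k-medoids variant would also work here, since it needs only pairwise dissimilarities, for example the maximum similarity minus each entry.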

104
  • 3. (25 points) Consider a data set of two
    continuous variables X and Y. X is right skewed
    and Y is left skewed. Both represent measures
    about the same quantity (sales categories, exam
    grades, etc.)
  • a) Draw typical distributions of X and Y
    separately.
  • b) Draw box plots of X and Y separately.
  • c) Draw q-plots (quantile) of X and Y
    separately.
  • d) Draw q-q plot of X and Y.

105
  • 4. (25 points) A strategy for clustering
    high-dimensional data of continuous variables is:
    first apply principal components to reduce the
    dimensionality of the data set, then apply
    clustering on the reduced form of the data.
    Discuss the drawback(s) of this approach.

106
MIS 541 2012/2013 Final
  • 1. (20 pts) Consider a data set of two continuous
    variables X and Y. Both have the same mean and both
    have no skewness (symmetric); X has a higher
    variance than Y. Both represent measures about the
    same quantity (sales categories, exam grades, etc.)
  • a) Draw typical distributions of X and Y on
    the same graph.
  • b) Draw box plots of X and Y separately.

107
MIS 541 2012/2013 Final
  • 2. (20 pts) Illustrate with plots of two
    continuous inputs and a binary class that
    one-hidden-layer neural networks are enough to classify
    convex class boundaries, whereas two hidden layers are
    enough to capture even non-convex class boundaries.

108
MIS 541 2012/2013 Final
  • 3. (20 pts) Consider association rules X → Y where
    X is a categorical variable with more than two
    values and Y is originally continuous but
    discretized into categories. Give example
    variables for X and Y. Illustrate that confidence
    as an interestingness measure may be misleading.
    Suggest a modification to the classical
    confidence so as to eliminate its drawback for
    this type of variables.
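
A small numeric example shows how confidence can mislead when one category of Y is very common. The sketch below uses lift, confidence divided by the overall frequency of the consequent, as one well-known correction; the question may have a different modification in mind. The records, variable choices and function names are hypothetical.

```python
# Hypothetical records: (education level, discretized income band).
records = (
    [("primary", "low")] * 38 + [("primary", "high")] * 2
    + [("high school", "low")] * 28 + [("high school", "high")] * 2
    + [("university", "low")] * 24 + [("university", "high")] * 6
)


def confidence(x, y, data):
    """Share of records with X = x that also have Y = y."""
    with_x = [r for r in data if r[0] == x]
    return sum(r[1] == y for r in with_x) / len(with_x)


def lift(x, y, data):
    """Confidence divided by the overall share of Y = y."""
    p_y = sum(r[1] == y for r in data) / len(data)
    return confidence(x, y, data) / p_y


# The rule university -> low looks strong by confidence alone...
print(confidence("university", "low", records))
# ...but its lift is below 1: "low" is rarer among university graduates
# than in the data as a whole.
print(lift("university", "low", records))
```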

109
MIS 541 2012/2013 Final
  • 4. (20 pts) The price of each item is nonnegative.
    For the following cases indicate the type of
    constraint (monotone, anti-monotone, tough,
    strongly convertible or succinct):
  • a) the sum of prices of items is less than or
    equal to 10
  • b) the average price of items is less than or
    equal to 20

110
Midterm 2008/2009 Spring
  • 2. (20 points) Consider a classification problem in which
    customers taking consumer credits from a
    bank are classified into three risk groups. The
    input variables are age discretized into 4
    groups, income into 4 groups, education into 4
    groups, gender, the number of months the customer has
    been dealing with the bank, the average delay of
    payments in months, and the current value of the
    account balance. The output variable has 3
    categories (risky, normal or highly risky),
    calculated by some procedure and provided to the
    data miner. Design an encoding scheme for the
    input and output variables so that the problem
    can be solved by a neural network. Show a typical
    topology of a feedforward network architecture.

111
Midterm 2008/2009 Spring
  • 3. (20 points) Consider a classification problem
    solved by a decision tree. There are two categorical
    input variables A and B having two distinct
    values each. The output variable C has two
    distinct classes. Suppose the dataset is suitable
    for using decision trees. Does the order of
    selection of variables affect the
    classification error? Support your answer by
    generating data sets pictorially. (The stopping
    condition is that either a pure class is obtained or
    no variables remain to be tested.)

112
Midterm 2008/2009 Spring
  • 4. (20 points) Principal components analysis is used for
    dimensionality reduction and may then be followed by
    cluster analysis, say for segmentation purposes.
    Consider a problem with two continuous variables.
    Using scatter plots:
  • a) Generate a data set where PCA reduces the
    dimensionality from two to one
  • b) Generate a data set where, although there is a
    relation between the two variables, PCA
    is not able to reduce the dimensionality to one
  • c) Generate a data set where there are natural
    clusters and PCA can reduce the dimensionality
  • d) Generate a data set where there are natural
    clusters but PCA is not the appropriate method
    for reducing the dimensionality
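
Whether PCA can drop from two dimensions to one can be judged from the share of variance carried by the first principal component. The sketch below computes that share from the closed-form eigenvalues of the 2x2 covariance matrix for two generated data sets, a straight line and a circle; the data and the function name are illustrative assumptions, and cases c) and d) are not covered.

```python
import math


def first_component_share(xs, ys):
    """Share of total variance captured by the first principal component."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of the covariance matrix [[sxx, sxy], [sxy, syy]].
    largest = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return largest / (sxx + syy)


# a) Points on a straight line: one component carries all the variance.
line_x = [float(i) for i in range(10)]
line_y = [2 * x + 1 for x in line_x]

# b) Points on a circle: the variables are related, but not linearly.
angles = [2 * math.pi * i / 12 for i in range(12)]
circle_x = [math.cos(a) for a in angles]
circle_y = [math.sin(a) for a in angles]

print(first_component_share(line_x, line_y))
print(first_component_share(circle_x, circle_y))
```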