Data Mining for an Educational Web-based System

- Behrouz Minaei
- Department of Computer Science and Engineering
- Thesis Proposal
- January 30th 2004

Topics

- Statement of problem
- Classification (Predicting student performance)
- Clustering (Ensembles of multiple clusterings)
- Additional Proposed work
- Tentative Schedule

Statement of problem

- Statement of problem
- LON-CAPA
- Data Mining
- Data Preprocessing
- Contributions
- G. Albertelli, B. Minaei-Bidgoli, W.F. Punch, G. Kortemeyer, and E. Kashy, Concept Feedback In Computer-Assisted Assignments, Proceedings of the (IEEE/ASEE) Frontiers in Education conference, 2002 Boston
- M. Hall, J. Parker, B. Minaei-Bidgoli, G. Albertelli, G. Kortemeyer, and E. Kashy, Gathering and Timely Use of Feedback from Individualized On-line Work with an Open-Source CMS, submitted to (IEEE/ASEE) FIE 2004 Frontiers in Education, Oct. 2004 Savannah
- Classification (Predicting student performance)
- Clustering (Ensembles of multiple clusterings)
- Additional proposed work
- Tentative Schedule

LON-CAPA

- This research is a part of the latest online educational system developed at Michigan State University (MSU), the Learning Online Network with Computer-Assisted Personalized Approach (LON-CAPA).
- Learning Content Management System
- 9 high schools, 2 community colleges, and 17 universities nationwide
- Assessment System
- Online assessment with immediate feedback and multiple tries
- Different students get different versions of the same problem
- Different options, graphs, images, numbers, or formulas
- Open-Source and Free (GPL, runs on Linux)

LON-CAPA Data

- Three kinds of growing data sets
- Educational resources: web pages, demonstrations, simulations, individualized problems, quizzes, and examinations.
- Information about users who create, modify, assess, or use these resources.
- Data about how students use and access the educational materials

MSU Fall 2003

- 40 courses used LON-CAPA at MSU
- Total student enrollment approximately 3,067 (out of 13,400 total global student-users)
- Disciplines included Advertising, Biochemistry, Biology, Chemistry, Finance, Geology, Math, Physics, Plant Biology, Statistics for Psychology

Statement of problem

- LON-CAPA collects data for every single access to the resources in both the activity log and the student database
- Logs are not only huge but also distributed and specific to a web-based educational system (LON-CAPA)
- Intelligent automated tools are needed to discover relevant, useful, and interesting patterns
- Apply the discovered rules to produce a more intelligent system

Knowledge Discovery Process

- Data integration: removing inconsistencies
- Data cleansing: correcting errors, handling missing values
- Discretization: transforming continuous values to categorical ones
- Feature selection: keeping the more relevant features
- Mining process: rule discovery
- Post-processing
- Large set of rules -> simplify
- 1) More comprehensible, 2) More interesting
- Use a combination of objective and subjective approaches
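The discretization step can be illustrated with a minimal sketch. Equal-width binning is assumed here purely for illustration; the slides do not say which discretization method is used, and the function name and sample values are hypothetical.

```python
def equal_width_bins(values, n_bins, labels=None):
    """Map each continuous value to a categorical bin of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    result = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        result.append(labels[idx] if labels else idx)
    return result

# Hypothetical total time (minutes) spent on a problem by six students
times = [2.0, 5.5, 9.0, 14.0, 30.0, 32.0]
time_levels = equal_width_bins(times, 3, labels=["low", "medium", "high"])
```

Equal-width bins are easy to explain to instructors but are sensitive to outliers; equal-frequency bins would be an alternative under the same interface.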

Data Mining Tasks

- Classification
- The goal is to predict the class variable based on the feature values of samples; avoid overfitting
- Clustering (unsupervised learning)
- Association Analysis
- Find the binary relationship among the data items
- Any feature variable can occur both in the antecedent and in the consequent of a rule.

Contributions (1)

- Our claim is that data mining can help to design a better and more intelligent educational web-based environment

Can help instructors design the course more effectively and detect anomalies

Can help students use the resources more efficiently

Contributions (2)

Contributions (3)

Can find association rules between students' educational activities

Can be used to identify those students who are at risk, especially in very large classes

Can help instructors predict the approaches that students will take for some types of problems

Predicting student performance

- Statement of problem
- Classification
- Combination of Classifiers
- Weighting the features
- Using a Genetic Algorithm to find the best set of weights
- B. Minaei-Bidgoli, W.F. Punch, Using Genetic Algorithms for Data Mining Optimization in an Educational Web-based System, GECCO 2003, 2252-2263, July 2003 Chicago.
- B. Minaei-Bidgoli, D.A. Kashy, G. Kortemeyer, W.F. Punch, Predicting Student Performance: An Application of Data Mining Methods with an Educational Web-based System, (IEEE/ASEE) FIE 2003 Frontiers in Education, Nov. 2003 Boulder
- Clustering (Ensembles of multiple clusterings)
- Proposed work
- Tentative Schedule

Data Set PHY183 SS02

- 227 students
- 12 Homework sets
- 184 Problems
- 80 MB activity log
- 26 MB useful data
- 220,000 transactions
- Extracted Features

- Total number of correct answers. (Success rate)
- Success at the first try
- Number of attempts to get answer
- Time spent until correct
- Total time spent on the problem
- Participating in the communication mechanisms
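As a rough illustration of how such features could be derived from transactions, the sketch below aggregates a tiny submission log per student. The record layout (student, problem, seconds since first view, correctness flag) and all names are assumptions made for this example, not the actual LON-CAPA log format.

```python
from collections import defaultdict

# Hypothetical transaction format: (student, problem, seconds_since_first_view, is_correct)
log = [
    ("s1", "p1", 40, False),
    ("s1", "p1", 95, True),
    ("s1", "p2", 30, True),
    ("s2", "p1", 20, False),
    ("s2", "p1", 60, False),
]

def extract_features(transactions):
    """Aggregate per-student features from submission records."""
    by_pair = defaultdict(list)
    for student, problem, t, ok in transactions:
        by_pair[(student, problem)].append((t, ok))

    features = defaultdict(lambda: {"solved": 0, "first_try": 0,
                                    "attempts": 0, "time_to_correct": 0})
    for (student, _problem), subs in by_pair.items():
        subs.sort()  # chronological order
        f = features[student]
        f["attempts"] += len(subs)  # counts every submission for the problem
        for i, (t, ok) in enumerate(subs):
            if ok:
                f["solved"] += 1              # number of problems answered correctly
                f["first_try"] += (i == 0)    # correct on the very first submission
                f["time_to_correct"] += t     # time until the first correct answer
                break
    return dict(features)

features = extract_features(log)
```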

Class Labels (3 possibilities)

2-Classes

3-Classes

9-Classes

Classifiers

- Non-Tree Classifiers (Using MATLAB)
- Bayesian Classifier
- 1NN
- kNN
- Multi-Layer Perceptron
- Parzen Window
- Combination of Multiple Classifiers (CMC)
- Genetic Algorithm (GA), Optimizer
- Decision Tree-Based Software
- C5.0 (RuleQuest << C4.5 << ID3)
- CART (Salford-systems)
- QUEST (Univ. of Wisconsin)
- CRUISE (uses an unbiased variable selection technique)
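The slides do not state how the Combination of Multiple Classifiers (CMC) merges the individual outputs. One common choice, assumed here only for illustration, is simple majority voting over the labels predicted for a sample; the function name and the sample votes are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the labels predicted by several classifiers for one sample.

    Ties are broken in favor of the label that appears first in the list.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

# Hypothetical outputs of five classifiers for one student (2-class case)
votes = ["pass", "fail", "pass", "pass", "fail"]
combined = majority_vote(votes)
```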

Fitness/Evaluation Function

- 5 classifiers
- Multi-Layer Perceptron: 2 minutes
- Bayesian Classifier
- 1NN
- kNN
- Parzen Window
- CMC: 3 seconds
- Divide data into training and test sets (10-fold cross-validation)
- Fitness function: performance achieved by the classifier
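A minimal sketch of such a fitness function follows: the fitness of a feature-weight vector is taken as the mean 10-fold cross-validation accuracy of a classifier. A 1NN classifier with a weighted Euclidean distance is assumed here; how the weights actually enter each classifier is not specified on the slides, and the synthetic data and all names are hypothetical.

```python
import math
import random

def weighted_distance(a, b, weights):
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def one_nn_predict(train, sample, weights):
    """Return the label of the nearest training sample."""
    nearest = min(train, key=lambda item: weighted_distance(item[0], sample, weights))
    return nearest[1]

def cv_fitness(data, weights, k=10, seed=0):
    """Mean k-fold cross-validation accuracy of 1NN under the given feature weights."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    accuracies = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        hits = sum(one_nn_predict(train, x, weights) == y for x, y in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / k

# Hypothetical data: feature 0 separates the two classes, feature 1 is pure noise
rng = random.Random(1)
data = [((c + rng.uniform(-0.2, 0.2), rng.uniform(0, 10)), c)
        for c in (0, 1) for _ in range(30)]
fitness_informed = cv_fitness(data, weights=[1.0, 0.0])
fitness_uninformed = cv_fitness(data, weights=[0.0, 1.0])
```

A weight vector that emphasizes the informative feature scores higher, which is the signal a genetic algorithm can exploit when searching for good weights.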

Individual Representation

- The GA Toolbox supports binary, integer and floating-point chromosome representations.
- Chrom = crtrp(N, FieldDR) creates a random real-valued matrix of size N x d, where N is the number of individuals (200) and FieldDR is a matrix of size 2 x d that contains the boundaries of each variable of an individual.
- FieldDR = 0 0 0 0 0 0 (lower bound)
            1 1 1 1 1 1 (upper bound)
- Chrom = 0.23 0.17 0.95 0.38 0.06 0.26
          0.35 0.09 0.43 0.64 0.20 0.54
          0.50 0.10 0.09 0.65 0.68 0.46
          0.21 0.29 0.89 0.48 0.63 0.89
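For readers without MATLAB, the population creation described above can be approximated in a few lines. This is a rough analogue written for illustration, not the toolbox's implementation; the function name is hypothetical, and uniform sampling within the bounds is assumed.

```python
import random

def create_real_population(n_individuals, field_dr, seed=None):
    """Rough analogue of crtrp: uniform random values within per-variable bounds.

    field_dr is a 2 x d list: row 0 holds lower bounds, row 1 holds upper bounds.
    """
    rng = random.Random(seed)
    lower, upper = field_dr
    return [[rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
            for _ in range(n_individuals)]

field_dr = [[0.0] * 6,   # lower bounds, one per feature weight
            [1.0] * 6]   # upper bounds
chrom = create_real_population(200, field_dr, seed=42)
```

Each row of chrom is one individual, that is, one candidate vector of six feature weights.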

Results of using GA

GA Optimization Results

Features importance

Contribution of the classification

- A new approach to evaluating student usage of web-based instruction
- An approach that is easily adaptable to different types of courses, different population sizes, and different attributes to be analyzed
- Rigorous application of known classifiers as a means of analyzing and comparing use and performance of students who have taken a technical course that was partially/completely administered via the web

Clustering

- Statement of problem
- Classification (Predicting student performance)
- Clustering (Ensembles of multiple clusterings)
- B. Minaei-Bidgoli, A. Topchy and W.F. Punch, Ensembles of Partitions via Data Resampling, Proc. Intl. Conf. on Information Technology, ITCC/IEEE 2004, in press
- B. Minaei-Bidgoli, A. Topchy and W.F. Punch, Effect of the Resampling Methods on Clustering Ensemble Efficacy, to be submitted to Intl. Conf. on Machine Learning Models, Technologies and Applications, 2004
- A. Topchy, B. Minaei-Bidgoli, A.K. Jain, W.F. Punch, Adaptive Clustering Ensembles, submitted to Intl. Conf. on Pattern Recognition, ICPR 2004
- Proposed work
- Tentative Schedule

Motivation

- Combinations of classifiers have proved very effective in the supervised learning framework, e.g. bagging and boosting algorithms
- In LON-CAPA, the course and student data are distributed
- Distributed data mining requires efficient algorithms capable of integrating the solutions obtained from multiple sources of data and features
- Ensembles of clusterings can provide novel, robust, and stable solutions

Taxonomy of Clustering Combination Approaches

Resampling Methods

- Bootstrapping (Sampling with replacement)
- Create an artificial list by randomly drawing N elements from the original list of N elements. Some elements will be picked more than once.
- Statistically, on average about 37% of the elements are repeated
- Subsampling (Sampling without replacement)
- Control over the size of the subsample
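The two resampling schemes can be sketched as follows. For a bootstrap sample of size N, the expected share of distinct elements is 1 - (1 - 1/N)^N, which approaches 1 - 1/e (about 63%), so roughly 37% of the draws are repeats. The data and function names below are illustrative assumptions.

```python
import random

def bootstrap_sample(items, rng):
    """Draw len(items) elements with replacement."""
    return [rng.choice(items) for _ in items]

def subsample(items, size, rng):
    """Draw `size` distinct elements without replacement."""
    return rng.sample(items, size)

rng = random.Random(0)
points = list(range(1000))  # stand-ins for data-point indices

boot = bootstrap_sample(points, rng)
repeat_fraction = 1 - len(set(boot)) / len(boot)  # share of draws that are repeats

sub = subsample(points, 200, rng)  # subsample size is chosen freely
```

With subsampling, each partition in an ensemble can be built from a much smaller data set, which is the "control over the size" noted above.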

Related work on bootstrap partitioning

- Estimate the number of clusters
- (Jain & Moreau 1987), (Fridlyand & Dudoit 2001)
- Clustering validity/reliability
- (Jain & Moreau 1987), (Fischer & Buhmann 2003)
- Find a measure for clustering stability
- (Ben-Hur et al., 2002)
- Clustering combination
- (Fridlyand & Dudoit 2001), (Fischer & Buhmann 2003)
- (Monti et al., 2003)

Experiment Data sets

Two-spiral and Halfrings data sets

Halfrings: 400 patterns (100-300)

2-Spirals: 200 patterns (100-100)

Bootstrap results on Iris

Subsampling on Halfrings

Subsampling results on Galaxy/Star

Error Rate for Individual Clustering

Summary of the best results of Bootstrap

Additional Proposed work

- Statement of problem
- Classification (Predicting student performance)
- Clustering (Ensembles of multiple clusterings)
- Additional Proposed work
- Association Analysis
- Dynamic mining
- Tentative Schedule

A sequence-based clustering

- The problem
- Given students' browsing data and course contents, find clusters of learners with similar behavior
- The order of browsed pages matters
- P -> P -> R1 -> R2 -> P -> A
- R1 -> R2 -> P -> A
- P -> A
- P -> P -> P -> P -> P -> P -> P -> P -> P -> A
- P -> P -> P -> P -> P
- R3 -> R2 -> P
- Cluster students based on a similarity function
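The slides leave the similarity function open. One order-sensitive candidate, assumed here for illustration only, is a normalized edit (Levenshtein) distance between page sequences; the page labels are taken from the example sequences above, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two page sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Similarity in [0, 1]; 1 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest

s1 = ["P", "P", "R1", "R2", "P", "A"]
s2 = ["R1", "R2", "P", "A"]
s3 = ["P", "A"]
```

Under this measure s1 is closer to s2 than to s3, because s2 differs from s1 only by the two leading page views. A pairwise similarity matrix built this way could feed any clustering method that accepts precomputed similarities.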

Web usage mining

- Many techniques have been investigated in e-commerce and CRM
- Some can be adapted, some cannot
- The goals are different
- The user model is different
- Analyze students' interactions with LON-CAPA and take actions accordingly. This is path traversal pattern mining, similar to web sequential pattern mining or web log mining.

Association Analysis postprocessing

- Association rule mining studies the frequency of items occurring together in a given set of data.
- Solving the discretization problem for continuous features
- Post-analysis of the discovered knowledge in terms of interestingness, usefulness, and so on. What is useful or interesting is domain dependent, so LON-CAPA instructors/authors need to be consulted
- Strategic use of data and discovered knowledge
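Two standard objective measures used when post-processing association rules are support and confidence. The sketch below computes both for one rule over a few discretized student records; the records and item names are invented for illustration and are not taken from LON-CAPA data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Hypothetical discretized student records (one set of items per student)
records = [
    {"first_try=high", "forum=yes", "grade=pass"},
    {"first_try=high", "forum=no", "grade=pass"},
    {"first_try=low", "forum=no", "grade=fail"},
    {"first_try=low", "forum=yes", "grade=pass"},
]
rule_support = support(records, {"first_try=high", "grade=pass"})
rule_confidence = confidence(records, {"first_try=high"}, {"grade=pass"})
```

Objective measures like these only rank rules; whether a highly ranked rule is actually useful remains the subjective, domain-dependent judgment mentioned above.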

Dynamic mining LON-CAPA Examples

You are about to start a test. Other students similar to you, who succeeded in this test, have also accessed Section 5 of Chapter 3. You did not. Would you like to access it now before attempting the test? Yes / No

Based on the time you spent on the problem (Circular Motion), it seems that you are not thinking about the problem. It is better to see the following pages and then submit your answers: Motion in 2 Dimensions, Force and Motion, Momentum and Collisions

Someone answered the question you posted on the Bulletin Board yesterday. Would you like to read it now? Yes / No

The degree of difficulty of problem 3 in homework set 5 jumped above 90% in the first 5 hours of student access. There might be something wrong in the design of the problem. Would you like to revise it now? Yes / No

Conclusion: Enhancing web-based learning

- LON-CAPA servers track students' activities in large logs
- The knowledge discovered could be analyzed and evaluated by knowledge experts (off-line mining)
- Integrated mining: the patterns discovered are fed back to the system, which then seamlessly and transparently behaves more intelligently.
- We could pass the data through a magic box to find some obscure patterns.
- Tools to recommend tasks and automatically adapt course materials
- Tools can be personalized, manually or automatically

Tentative Schedule