Data Mining for an Educational Web-based System


Data Mining for an Educational Web-based System
  • Behrouz Minaei
  • Department of Computer Science and Engineering
  • Thesis Proposal
  • January 30th 2004

  • Statement of problem
  • Classification (Predicting student performance)
  • Clustering (Ensembles of multiple clusterings)
  • Additional Proposed work
  • Tentative Schedule

Statement of problem
  • Statement of problem
  • Data Mining
  • Data Preprocessing
  • Contributions
  • G. Albertelli, B. Minaei-Bidgoli, W.F. Punch, G.
    Kortemeyer, and E. Kashy, Concept Feedback in
    Computer-Assisted Assignments, Proceedings of
    the (IEEE/ASEE) Frontiers in Education
    conference, 2002, Boston
  • M. Hall, J. Parker, B. Minaei-Bidgoli, G.
    Albertelli, G. Kortemeyer, and E. Kashy,
    Gathering and Timely Use of Feedback from
    Individualized On-line Work with an Open-Source
    CMS, submitted to (IEEE/ASEE) FIE 2004 Frontiers
    in Education, Oct. 2004, Savannah
  • Classification (Predicting student performance)
  • Clustering (Ensembles of multiple clusterings)
  • Additional proposed work
  • Tentative Schedule

  • This research is part of the latest online
    educational system developed at Michigan State
    University (MSU), the Learning Online Network
    with Computer-Assisted Personalized Approach
    (LON-CAPA)
  • Learning Content Management System
  • 9 high schools, 2 community colleges, and 17
    universities nationwide
  • Assessment System
  • Online assessment with immediate feedback and
    multiple tries
  • Different students get different versions of the
    same problem
  • Different options, graphs, images, numbers, or
  • Open-Source and Free (GPL, Runs on Linux)

  • Three kinds of growing data sets
  • Educational resources: web pages, demonstrations,
    simulations, individualized problems, quizzes,
    and examinations.
  • Information about users who create, modify,
    assess, or use these resources.
  • Data about how students use and access the
    educational materials

MSU Fall 2003
  • 40 courses used LON-CAPA at MSU
  • Total student enrollment approximately 3,067 (out
    of 13,400 total global student-users)
  • Disciplines included Advertising, Biochemistry,
    Biology, Chemistry, Finance, Geology, Math,
    Physics, Plant Biology, Statistics for Psychology

Statement of problem
  • LON-CAPA collects data for every single access to
    the resources in both activity log and student
  • Logs are not only huge but also distributed and
    specific to a web-based educational system
  • Intelligent automated tools are needed to discover
    relevant, useful, and interesting patterns
  • Apply the discovered rules to produce a more
    intelligent system

Knowledge Discovery Process
  • Data Integration, removing inconsistency,
  • Data Cleansing, correcting errors, missing values
  • Discretization, transform continuous to
  • Feature Selection, features are more relevant
  • Mining process, rule discovery
  • Post-processing,
  • Large set of rules → simplify
  • 1) More comprehensible, 2) More interesting
  • Use combination of objective and subjective

Data Mining Tasks
  • Classification
  • The goal is to predict the class variable based
    on the feature values of samples Avoid
  • Clustering (unsupervised learning)
  • Association Analysis
  • Find binary relationships among the data items
  • Any feature variable can occur both in antecedent
    and in the consequent of a rule.

Contributions (1)
  • Our claim is that data mining can help to design
    better and more intelligent educational web-based

  • Can help instructors design the course more
    effectively and detect anomalies
  • Can help students to use the resources more

Contributions (2)
Contributions (3)
  • Can find associative rules between students'
    educational activities
  • Can be used to identify students who are at
    risk, especially in very large classes
  • Can help instructors predict the approaches that
    students will take for some types of problems

Predicting student performance
  • Statement of problem
  • Classification
  • Combination of Classifiers
  • Weighting the features
  • Using a Genetic Algorithm to find the best set of
  • B. Minaei-Bidgoli, W.F. Punch, Using Genetic
    Algorithms for Data Mining Optimization in an
    Educational Web-based System, GECCO 2003,
    pp. 2252-2263, July 2003, Chicago.
  • B. Minaei-Bidgoli, D.A. Kashy, G. Kortemeyer,
    W.F. Punch, Predicting Student Performance: An
    Application of Data Mining Methods with an
    Educational Web-based System, (IEEE/ASEE) FIE
    2003 Frontiers in Education, Nov. 2003, Boulder
  • Clustering (Ensembles of multiple clusterings)
  • Proposed work
  • Tentative Schedule

Data Set: PHY183 SS02
  • 227 students
  • 12 Homework sets
  • 184 Problems
  • 80 MB activity log
  • 26 MB useful data
  • 220,000 transactions
  • Extracted Features
  • Total number of correct answers. (Success rate)
  • Success at the first try
  • Number of attempts to get answer
  • Time spent until correct
  • Total time spent on the problem
  • Participating in the communication mechanisms

Class Labels (3 possibilities)
  • Non-Tree Classifiers (Using MATLAB)
  • Bayesian Classifier
  • 1NN
  • kNN
  • Multi-Layer Perceptron
  • Parzen Window
  • Combination of Multiple Classifiers (CMC)
  • Genetic Algorithm (GA), Optimizer
  • Decision Tree-Based Software
  • C5.0 (RuleQuest << C4.5 << ID3)
  • CART (Salford-systems)
  • QUEST (Univ. of Wisconsin)
  • CRUISE uses an unbiased variable selection

Fitness/Evaluation Function
  • 5 classifiers
  • Multi-Layer Perceptron: 2 minutes
  • Bayesian Classifier
  • 1NN
  • kNN
  • Parzen Window
  • CMC: 3 seconds
  • Divide data into training and test sets (10-fold
  • Fitness function performance achieved by
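
One way to read the fitness function on this slide: a chromosome holds one weight per feature, and its fitness is the accuracy a classifier reaches with those weights. The sketch below assumes a single 1NN classifier with leave-one-out evaluation, swapped in for the five classifiers and the 10-fold split named on the slide; `weighted_dist`, `fitness`, and the toy data are illustrative names, not part of the original work.

```python
def weighted_dist(a, b, w):
    """Squared Euclidean distance with one weight per feature."""
    return sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b))

def fitness(weights, samples, labels):
    """Leave-one-out accuracy of a 1NN classifier under the given weights."""
    correct = 0
    for i, x in enumerate(samples):
        # nearest neighbor among all other samples
        j = min((k for k in range(len(samples)) if k != i),
                key=lambda k: weighted_dist(x, samples[k], weights))
        correct += int(labels[j] == labels[i])
    return correct / len(samples)

# Toy data (invented): feature 0 separates the classes, feature 1 is noise.
X = [(0.0, 0.9), (0.1, 0.1), (1.0, 0.8), (0.9, 0.0)]
y = ["fail", "fail", "pass", "pass"]

good = fitness((1, 0), X, y)   # weight only the informative feature
bad = fitness((0, 1), X, y)    # weight only the noisy feature
```

A GA would then search for the weight vector that maximizes this value.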

Individual Representation
  • The GA Toolbox supports binary, integer and
    floating-point chromosome representations.
  • Chrom = crtrp(N, FieldDR) creates a random
    real-valued matrix of size N x d, where N is the
    number of individuals (200) and FieldDR is a matrix
    of size 2 x d that contains the boundaries of each
    variable of an individual.
  • FieldDR = [ 0 0 0 0 0 0    (lower bound)
                1 1 1 1 1 1 ]  (upper bound)
  • Chrom = 0.23 0.17 0.95 0.38 0.06 0.26
            0.35 0.09 0.43 0.64 0.20
            0.50 0.10 0.09 0.65 0.68
            0.21 0.29 0.89 0.48 0.63
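
For readers without the MATLAB GA Toolbox, the population creation described above can be approximated in a few lines of Python. This is a minimal sketch, assuming genes are drawn uniformly within the bounds; `create_real_population` is an illustrative name, not a toolbox function.

```python
import random

def create_real_population(n_ind, field_dr, seed=None):
    """Return n_ind chromosomes; gene j is uniform in [lower[j], upper[j]]."""
    lower, upper = field_dr              # 2 x d bounds, mirroring FieldDR
    rng = random.Random(seed)
    return [[rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
            for _ in range(n_ind)]

field_dr = ([0] * 6, [1] * 6)            # six feature weights in [0, 1]
chrom = create_real_population(200, field_dr, seed=0)
```

Each row of `chrom` plays the role of one individual, that is, one candidate vector of feature weights.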

Results of using GA
GA Optimization Results
Features importance

Contribution of the classification
  • A new approach to evaluating student usage of
    web-based instruction
  • An approach that is easily adaptable to different
    types of courses, different population sizes, and
    different attributes to be analyzed
  • Rigorous application of known classifiers as a
    means of analyzing and comparing use and
    performance of students who have taken a
    technical course that was partially/completely
    administered via the web

  • Statement of problem
  • Classification (Predicting student performance)
  • Clustering (Ensembles of multiple clusterings)
  • B. Minaei-Bidgoli, A. Topchy and W.F. Punch,
    Ensembles of Partitions via Data Resampling,
    Proc. Intl. Conf. on Information Technology,
    ITCC/IEEE 2004, in press
  • B. Minaei-Bidgoli, A. Topchy and W.F. Punch,
    Effect of the Resampling Methods on Clustering
    Ensemble Efficacy, prepared to submit to Intl.
    Conf. on Machine Learning Models, Technologies
    and Applications, 2004
  • A. Topchy, B. Minaei-Bidgoli, A.K. Jain, W.F.
    Punch, Adaptive Clustering Ensembles, submitted
    to Intl. Conf on Pattern Recognition, ICPR 2004
  • Proposed work
  • Tentative Schedule

  • Combinations of classifiers have proved very
    effective in the supervised learning framework,
    e.g. bagging and boosting algorithms
  • In LON-CAPA, the course and student data are
  • Distributed data mining requires efficient
    algorithms capable of integrating the solutions
    obtained from multiple sources of data and
  • Ensembles of clusterings can provide novel,
    robust, and stable solutions
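
One common way to combine several partitions of the same objects is a co-association matrix, which records how often each pair of objects lands in the same cluster. The slides do not say which consensus function is used, so the sketch below only illustrates the general idea; `co_association` is a hypothetical helper.

```python
def co_association(partitions):
    """Fraction of partitions in which objects i and j share a cluster."""
    n, m = len(partitions[0]), len(partitions)
    return [[sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
            for i in range(n)]

# Three clusterings of five objects. Label names differ between runs,
# which is fine because only co-membership is counted.
runs = [[0, 0, 0, 1, 1],
        [1, 1, 0, 0, 0],
        [2, 2, 2, 5, 5]]
co = co_association(runs)
```

Pairs with a high value (here objects 0 and 1) are grouped together consistently, so a final partition can be obtained by clustering this matrix as a similarity matrix.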

Taxonomy of Clustering Combination Approaches
Resampling Methods
  • Bootstrapping (Sampling with replacement)
  • Create an artificial list by randomly drawing N
    elements from the original list of N elements.
    Some elements will be picked more than once.
  • Statistically, on average 37% of elements are
  • Subsampling (Sampling without replacement)
  • Control over the size of subsample
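
The 37% figure follows from the chance that a given element is never drawn in N draws with replacement: (1 - 1/N)^N, which approaches 1/e ≈ 0.368 for large N. A small sketch that compares the two resampling schemes and checks this empirically; the function names and sizes are illustrative.

```python
import random

def bootstrap(items, rng):
    """Draw len(items) elements with replacement."""
    return [rng.choice(items) for _ in items]

def subsample(items, size, rng):
    """Draw `size` distinct elements (without replacement)."""
    return rng.sample(items, size)

rng = random.Random(0)
data = list(range(10_000))

boot = bootstrap(data, rng)
missing = 1 - len(set(boot)) / len(data)   # fraction never drawn, near 1/e

sub = subsample(data, 2_000, rng)          # subsample size chosen freely
```

With subsampling every drawn element is distinct, and the sample size is a free parameter rather than being fixed at N.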

Related work on bootstrap partitioning
  • Estimate the number of clusters
  • (Jain & Moreau 1987), (Fridlyand & Dudoit 2001),
  • Clustering validity/reliability
  • (Jain & Moreau 1987), (Fischer & Buhmann 2003)
  • Find a measure for clustering stability
  • (Ben-Hur et al., 2002),
  • Clustering combination
  • (Fridlyand & Dudoit 2001), (Fischer & Buhmann 2003)
  • (Monti et al., 2003)

Experiment Data sets
Two-spiral and Halfrings data sets
Halfrings: 400 patterns (100-300)
2-Spirals: 200 patterns (100-100)
Bootstrap results on Iris
Subsampling on Halfrings
Subsampling results on Galaxy/Star
Error Rate for Individual Clustering
Summary of the best results of Bootstrap
Additional Proposed work
  • Statement of problem
  • Classification (Predicting student performance)
  • Clustering (Ensembles of multiple clusterings)
  • Additional Proposed work
  • Association Analysis
  • Dynamic mining
  • Tentative Schedule

A sequence-based clustering
  • The problem
  • Given students' browsing data and course
    contents, find clusters of learners with similar
  • Order of browsed pages matters
  • P → P → R1 → R2 → P → A
  • R1 → R2 → P → A
  • P → A
  • P → P → P → P → P → P → P → P → P → A
  • P → P → P → P → P
  • R3 → R2 → P
  • Cluster students based on a similarity function
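
The slide does not specify the similarity function. As one possibility, an order-sensitive similarity can be based on the longest common subsequence (LCS) of two browsing paths, normalized by the longer path. A sketch under that assumption, using three of the paths listed above:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """LCS length divided by the longer sequence length, in [0, 1]."""
    return lcs_length(a, b) / max(len(a), len(b))

s1 = ["P", "P", "R1", "R2", "P", "A"]
s2 = ["R1", "R2", "P", "A"]
s3 = ["P", "A"]
```

A pairwise matrix of such similarities could then be passed to any clustering method that accepts a similarity or distance matrix.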

Web usage mining
  • Many techniques have been investigated in
    e-commerce and CRM
  • Some can be adapted, some cannot
  • The goals are different
  • The user model is different
  • Analyze students' interactions with LON-CAPA
    and take actions accordingly. This is path
    traversal pattern mining, similar to web
    sequential pattern mining or web log mining.

Association Analysis: postprocessing
  • Association rule mining studies the frequency of
    items occurring together in a given set of data.
  • Solving the discretization problem for continuous
  • Post-analysis of the discovered knowledge in
    terms of interestingness, usefulness, and so
    on. What is useful or interesting is domain
    dependent; need to talk to LON-CAPA
  • Strategic use of data and discovered knowledge
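
Support and confidence are the usual objective measures for judging an association rule X → Y. The sketch below computes both on made-up transactions; the item names are invented for illustration and are not LON-CAPA data.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Invented, already-discretized student activity records.
transactions = [
    {"many_tries", "used_forum", "passed"},
    {"many_tries", "passed"},
    {"few_tries", "passed"},
    {"many_tries", "used_forum", "failed"},
]

sup = support({"many_tries"}, transactions)
conf = confidence({"many_tries"}, {"passed"}, transactions)
```

Objective measures like these can prune a large rule set, while the question of which remaining rules are interesting stays domain dependent.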

Dynamic mining: LON-CAPA Examples
  • You are about to start a test. Other students
    similar to you, who succeeded in this test, have
    also accessed Section 5 of Chapter 3. You did
    not. Would you like to access it now before
    attempting the test? (Yes / No)
  • Based on your access time for the problem
    (Circular Motion), it seems that you are not
    thinking about the problem. It is better to see
    the following pages and then submit your
    answers: Motion in 2 Dimensions; Force and
    Motion; Momentum and Collisions
  • Someone answered the question you posted on the
    Bulletin Board yesterday. Would you like to read
    it now? (Yes / No)
  • The degree of difficulty of problem 3 in homework
    set 5 jumped to greater than 90 in the first
    5 hours of student access. There might be
    something wrong in the design of the problem.
    Would you like to revise it now? (Yes)

Conclusion: Enhancing web-based learning
  • LON-CAPA servers track students' activities in
    large logs
  • The knowledge discovered could be analyzed and
    evaluated by knowledge experts (off-line mining)
  • Integrated mining: the patterns discovered are
    fed back to a system that would seamlessly and
    transparently make systems behave
  • We could pass the data through a magic box to
    find some obscure patterns.
  • Tools to recommend tasks, automatically adapt
    course materials
  • Tools can be personalized, manually or

Tentative Schedule