Data Mining and Milwaukee: Mining Community Health Risk Factors

1
Data Mining and Milwaukee: Mining Community
Health Risk Factors
  • Team Members
  • Srinivas Bodapati
  • Priya Hastagiri
  • Dale Steber
  • Madhuri Battu

2
Goal of the Project
  • Create a smaller survey from the existing
    comprehensive survey

3
Data Mining Tasks
  • Classification
  • Build a good classifier which predicts the
    overall health status of the individuals
  • Feature Selection
  • Identify the meaningful attributes that impact
    the overall health status

4
Data Transformation
  • 1200 instances
  • 275 attributes
  • Cleaned and consolidated the attributes down to 94
  • Eliminated nulls
  • Consolidated attributes that had been split
    by gender
  • Eliminated irrelevant attributes (ZIP
    code, state, city)
  • Converted the SPSS file into Excel format for
    cleanup
  • Loaded the cleaned data into an Access database
  • Makes the data easy to load into WEKA
  • Modified WEKA's JAR file to add the new
    database connection to the JDBC connection file
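
The clean-up steps above can be sketched in a few lines of Python. This is only an illustration, not the team's actual Excel and Access workflow: the column names (`exam_male`, `exam_female`, `income`) are hypothetical, and "eliminated nulls" is read here as dropping rows that still contain a missing value, which is one possible interpretation.

```python
# Hypothetical column names; the real survey fields are not listed in the slides.
IRRELEVANT = {"zip_code", "state", "city"}


def consolidate_gender_split(row, male_key, female_key, merged_key):
    """Merge two gender-specific columns into one, keeping whichever is filled."""
    merged = dict(row)
    male = merged.pop(male_key, None)
    female = merged.pop(female_key, None)
    merged[merged_key] = male if male is not None else female
    return merged


def clean(rows):
    """Drop location columns, merge gender-split columns, drop incomplete rows."""
    cleaned = []
    for row in rows:
        row = {k: v for k, v in row.items() if k not in IRRELEVANT}
        row = consolidate_gender_split(row, "exam_male", "exam_female", "exam")
        if any(v is None for v in row.values()):
            continue  # assumption: "eliminated nulls" means dropping incomplete rows
        cleaned.append(row)
    return cleaned


sample = [
    {"zip_code": "53201", "state": "WI", "city": "Milwaukee",
     "income": "low", "exam_male": "yes", "exam_female": None},
    {"zip_code": "53202", "state": "WI", "city": "Milwaukee",
     "income": None, "exam_male": None, "exam_female": "no"},
]
cleaned = clean(sample)
```

Only the first sample row survives, because the second still has a missing `income` after consolidation.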

5
Methods
  • Split data set into 2/3 and 1/3 sets (800
    instances training set and 400 test set)
  • Classification
  • ZeroR model
  • OneR Model
  • Decision tree (J48 algorithm)
  • Naïve Bayes
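
A minimal sketch of the 2/3 and 1/3 split, assuming a shuffled random split (the slides do not say how the instances were assigned to each set). The function name and the seed are illustrative.

```python
import random


def split_two_thirds(instances, seed=0):
    """Shuffle the instances and return (training set, test set) as 2/3 and 1/3."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# 1200 instance ids stand in for the survey records.
train, test = split_two_thirds(range(1200))
```

With 1200 instances this yields the 800 and 400 instance sets described above.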

6
Baseline Models
  • ZeroR
  • Accuracy: 34.125%
  • OneR
  • Accuracy: 44.75%
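
For readers unfamiliar with the two baselines, here is a rough Python sketch of what they do on nominal data. ZeroR always predicts the majority class; OneR builds a one-attribute rule and keeps the attribute with the fewest training errors. This is not WEKA's implementation (WEKA's OneR also discretizes numeric attributes), and the toy attributes and labels are made up for illustration.

```python
from collections import Counter, defaultdict


def zero_r(labels):
    """Return the majority class; ZeroR predicts it for every instance."""
    return Counter(labels).most_common(1)[0][0]


def one_r(rows, labels):
    """Return (attribute, value -> class rule) with the fewest training errors."""
    best = None
    for attr in rows[0]:
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # Each attribute value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    return best[1], best[2]


def accuracy(predictions, labels):
    """Percentage of predictions that match the true labels."""
    return 100.0 * sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy data, not taken from the survey.
rows = [
    {"income": "high", "smoker": "no"},
    {"income": "high", "smoker": "yes"},
    {"income": "low", "smoker": "no"},
    {"income": "low", "smoker": "yes"},
    {"income": "high", "smoker": "no"},
]
labels = ["Good", "Good", "Poor", "Poor", "Good"]

majority = zero_r(labels)
zero_r_accuracy = accuracy([majority] * len(labels), labels)
best_attr, rule = one_r(rows, labels)
one_r_accuracy = accuracy([rule[r[best_attr]] for r in rows], labels)
```

Both baselines set a floor: any real classifier should beat the majority-class and single-attribute accuracies before it is worth interpreting.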

7
Model Interpretation (Before Feature Selection)

8
Attribute Selection
  • Algorithms
  • Information Gain
  • Relief-F
  • Principal Component Analysis

9
Information Gain
  • Top attributes selected
  • Income
  • Health Care Coverage
  • Education
  • Depression and Stress
  • Marital Status
  • Exercise
  • Smoking
  • Alcohol Usage
  • Pneumonia
  • Blood Pressure and Cholesterol
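
As a reminder of what the ranking measures, information gain is the entropy of the class minus the weighted entropy of the class within each value of an attribute. The sketch below computes it for nominal attributes; the toy values are illustrative and do not come from the survey.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Class entropy minus the weighted class entropy within each attribute value."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data: income separates the classes perfectly, smoking not at all.
health = ["Good", "Good", "Poor", "Poor"]
income = ["high", "high", "low", "low"]
smoking = ["yes", "no", "yes", "no"]

gain_income = information_gain(income, health)
gain_smoking = information_gain(smoking, health)
```

Attributes are then ranked by gain, highest first, and the top of that ranking is what a list like the one above reports.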

10
Relief-F
  • Top attributes selected
  • Income
  • Health Care Coverage
  • Education
  • Depression and Stress
  • Marital Status
  • Exercise
  • Smoking (support for a ban)
  • Colonoscopy Exam
  • Sunscreen
  • Fast Food
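
The idea behind Relief-style ranking can be sketched as follows. Note that this is the simpler original Relief with a single nearest hit and nearest miss, looping over every instance, and not the Relief-F variant used in the project (Relief-F averages over k neighbours and weights misses by class prior). The attribute names and data are illustrative.

```python
def hamming(a, b, attrs):
    """Number of nominal attributes on which two instances differ."""
    return sum(a[k] != b[k] for k in attrs)


def relief(rows, labels):
    """Simplified Relief: reward attributes that differ on the nearest miss
    and penalize attributes that differ on the nearest hit."""
    attrs = list(rows[0])
    weights = {k: 0.0 for k in attrs}
    for i, (row, label) in enumerate(zip(rows, labels)):
        hits = [r for j, (r, y) in enumerate(zip(rows, labels)) if j != i and y == label]
        misses = [r for r, y in zip(rows, labels) if y != label]
        if not hits or not misses:
            continue
        near_hit = min(hits, key=lambda r: hamming(row, r, attrs))
        near_miss = min(misses, key=lambda r: hamming(row, r, attrs))
        for k in attrs:
            diff_miss = row[k] != near_miss[k]
            diff_hit = row[k] != near_hit[k]
            weights[k] += (diff_miss - diff_hit) / len(rows)
    return weights


# Toy data: income tracks the class, sunscreen use does not.
rows = [
    {"income": "high", "sunscreen": "yes"},
    {"income": "high", "sunscreen": "no"},
    {"income": "low", "sunscreen": "yes"},
    {"income": "low", "sunscreen": "no"},
]
labels = ["Good", "Good", "Poor", "Poor"]
weights = relief(rows, labels)
```

Unlike information gain, which scores each attribute in isolation, this family of methods scores attributes by how well they separate neighbouring instances, which is one reason the two top-ten lists can differ.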

11
Model Interpretation (After Feature Selection)

12
Principal Components
  • Previously the analysis would not complete
  • WEKA did not complete because of out-of-memory
    errors
  • Modified the runweka.bat file to instruct the
    virtual machine to use more system memory by
    adding -Xmx1000M
  • WEKA completes, but we have not had time to
    interpret the results.

13
Where do we go from here?
  • Complete Principal Component Analysis
  • Reevaluate representation of attributes
  • Eliminate attributes with large numbers of missing
    values
  • Standardize the data set
  • Eliminate instances that have a class of "Not
    Sure"
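
Two of these follow-up steps are easy to sketch. The code below assumes "standardize" means z-scoring each numeric attribute (zero mean, unit standard deviation), which is only one possible reading, and the function names are illustrative.

```python
from statistics import mean, pstdev


def drop_not_sure(rows, labels, unwanted="Not Sure"):
    """Remove instances whose class label is the unwanted value."""
    kept = [(r, y) for r, y in zip(rows, labels) if y != unwanted]
    return [r for r, _ in kept], [y for _, y in kept]


def standardize(column):
    """Z-score a numeric column; a constant column maps to all zeros."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Toy data, not taken from the survey.
rows, labels = drop_not_sure(
    [{"id": 1}, {"id": 2}, {"id": 3}],
    ["Good", "Not Sure", "Poor"],
)
z = standardize([1.0, 2.0, 3.0])
```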

14
Split the Data Set
  • Confusion matrix
  • Split data set on goal attribute
  • Good, Very Good
  • Poor, Fair, and Excellent
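
A small sketch of how a confusion matrix is tallied and how the data set could then be partitioned on the goal attribute into the two class groups listed above. The group names "A" and "B" and the toy predictions are hypothetical; the slides do not show the actual matrix.

```python
from collections import Counter

CLASSES = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
# Hypothetical group names for the two class groups on this slide.
GROUP = {"Good": "A", "Very Good": "A", "Poor": "B", "Fair": "B", "Excellent": "B"}


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def split_by_group(rows, labels):
    """Partition (row, label) pairs into the two class groups."""
    subsets = {"A": [], "B": []}
    for row, label in zip(rows, labels):
        subsets[GROUP[label]].append((row, label))
    return subsets


# Toy labels and predictions, not results from the project.
actual = ["Good", "Very Good", "Poor", "Good"]
predicted = ["Good", "Good", "Poor", "Very Good"]
matrix = confusion_matrix(actual, predicted, CLASSES)
subsets = split_by_group([0, 1, 2, 3], actual)
```

Off-diagonal cells show which classes the model confuses with one another, which is the kind of evidence that would motivate grouping classes before retraining.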

15
Confusion Matrix

16
Steps for split
  • Excellent, Fair, Poor
  • Repeat steps
  • Very good, Good
  • Apply support vector machines

17
Thank You
  • Questions?