MIDN 1/C David G. Underhill - PowerPoint PPT Presentation

1
Exploring Dimensionality Reduction for Text Mining
  • MIDN 1/C David G. Underhill
  • Assistant Professor Lucas K. McDowell
  • Computer Science Department

2
How to make sense of an overwhelming amount of
data?
3
How to make sense of an overwhelming amount of
data?
  • Can dimensionality reduction help?

4
Outline
  • Problem Statement
  • Background
  • Text Mining Process
  • Dimensionality Reduction
  • Experimental Analysis
  • Task 1 Classification
  • Task 2 Literature-Based Discovery
  • Contributions and Conclusions
  • Future Work

5
Text Mining Overview
Distance Matrix
Term Document Matrix
Encode
Compare
Analyze
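
The Encode and Compare stages above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' pipeline: it assumes whitespace tokenization, raw term counts as weights, one matrix row per document, and cosine distance as the comparison measure. The function names and the sample documents are made up for the example.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Encode: one row per document, one column per vocabulary term (raw counts)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

def cosine_distance(u, v):
    """Compare: 1 - cosine similarity, so 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(rows):
    """Pairwise distances between all document vectors."""
    return [[cosine_distance(u, v) for v in rows] for u in rows]

# Illustrative documents only.
docs = ["fish oil lowers blood viscosity",
        "blood viscosity and raynaud disease",
        "coin metal properties"]
vocab, tdm = term_document_matrix(docs)
dist = distance_matrix(tdm)
```

The Analyze stage (classification or discovery) then works from `tdm` or `dist`, optionally after dimensionality reduction.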
6
Text Mining Overview
7
Dimensionality Reduction (DR)
  • Goal: simplify a complex data set in a way that
    preserves meanings inherent in the original data
  • Usually applied to geometric or numerical data
  • How can DR improve text mining?
  • May reveal patterns obscured in the original data
  • Improves analysis time over the original, larger
    data
  • Greatly decreases storage and transmission costs
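
As a concrete instance of DR, the sketch below projects data onto its top principal components (PCA) using only the Python standard library. It is an illustration under stated assumptions, not the implementation used in the study: eigenvectors are found by power iteration with deflation, the start vector is assumed not to be orthogonal to the dominant eigenvector, and the sample points are made up.

```python
import math
import random

def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows (needs n >= 2)."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def dominant_eigenpair(matrix, iters=500, seed=0):
    """Power iteration: approximate the largest eigenvalue and its eigenvector."""
    d = len(matrix)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero matrix: nothing left to find
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))
    return v, eigenvalue

def pca(rows, k):
    """Project rows onto the k directions of greatest variance."""
    x = center(rows)
    d = len(x[0])
    cov = covariance(x)
    components = []
    for _ in range(k):
        v, lam = dominant_eigenpair(cov)
        components.append(v)
        # Deflation: remove the component just found before searching again.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in x]

# Illustrative 3-D points whose variance is largest along the first axis.
points = [[2, 0, 0], [-2, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 0.1], [0, 0, -0.1]]
reduced = pca(points, 2)
```

Each row of `reduced` is a 2-dimensional stand-in for the original point. The sign of each coordinate is arbitrary, since an eigenvector and its negation are equally valid.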

8
2-Dimensional Visualizations
  • Reduction to just 2 dimensions
  • Easy visualization: graph on a Cartesian plot
  • Each point is colored according to its category
  • Assess quality of separation with best 2
    dimensions
  • Highlight areas of confusion

9
Outline
  • Problem Statement
  • Background
  • Experimental Analysis
  • Experimental Question and Method
  • Task 1 Classification
  • Nearest Neighbor Classifier
  • Linear Classifier
  • Task 2 Literature-Based Discovery
  • Contributions and Conclusions
  • Future Work

10
Experimental Question
  • Can DR improve text mining performance?
  • Many valid DR approaches
  • Relative DR performance unknown for textual data

Ultimate Goal: Identify DR techniques that best
facilitate text mining.
11
Experimental Method
  • Evaluate 5 DR methods
  • Linear
  • 1) PCA (Principal Components Analysis)
  • 2) MDS (Multidimensional Scaling)
  • Non-Linear
  • 3) Isomap
  • 4) LLE (Locally Linear Embedding)
  • 5) LDM (Lafon's Diffusion Maps)
  • Baseline
  • None-Sort: original features sorted by average
    weight
  • Evaluate on two text mining tasks
  • 1) Classification
  • 2) Literature-Based Discovery
  • Evaluate with three data sets
  • 1) Science News
  • 2) Google News
  • 3) Science Technology
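
The None-Sort baseline listed above can be sketched as follows. The slides only say that the original features are sorted by average weight. This sketch further assumes that the sort is descending and that the first d features are kept when comparing against a DR method at d dimensions. The function name and the sample weights are hypothetical.

```python
def none_sort(rows, d):
    """Keep the d original features (columns) with the highest average weight.

    Assumption: "sorted by average weight" means descending order, and the
    baseline at d dimensions uses the first d features of that ordering.
    """
    n_features = len(rows[0])
    avg = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
    # sorted() is stable, so features with equal averages keep their original order.
    order = sorted(range(n_features), key=lambda j: avg[j], reverse=True)
    keep = order[:d]
    return keep, [[r[j] for j in keep] for r in rows]

# Illustrative term weights: 3 documents x 3 features.
weights = [[0.1, 0.9, 0.0],
           [0.2, 0.7, 0.4],
           [0.0, 0.8, 0.2]]
kept, reduced = none_sort(weights, 2)
```

No transformation is applied to the data, so this baseline isolates the effect of simply using fewer features.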

12
Outline
  • Problem Statement
  • Background
  • Experimental Analysis
  • Experimental Question
  • Task 1 Classification
  • Nearest Neighbor Classifier
  • Linear Classifier
  • Task 2 Literature-Based Discovery
  • Contributions and Conclusions
  • Future Work

13
Classification
  • Labeling documents with known categories based on
    training data

Assessment: accuracy of category assignments
14
k-Nearest Neighbor Classifier
  • Assign category based on k nearest neighbors
  • Most frequent category is assigned
  • k = 9 used for the following graphs
  • Trends similar for other values
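
A minimal sketch of the k-nearest-neighbor rule described above. It assumes Euclidean distance between document vectors, which the slides do not specify, and the training points are made up. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=9):
    """Assign the most frequent category among the k nearest training points."""
    neighbors = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Illustrative 2-D points in two categories.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]
near_a = knn_classify(train_points, train_labels, (0.5, 0.5), k=3)
near_b = knn_classify(train_points, train_labels, (5.5, 5.5), k=3)
```

The default of k = 9 matches the value used for the graphs. The example passes k = 3 only because the toy training set has six points.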

15
kNN Classifier on Science News (8-category)
16
kNN Classifier on Google News
17
kNN Classifier on Science Technology
18
kNN Classifier on Science News (2-category)
19
kNN Classifier on Science News (4-category)
20
Linear Classifier
  • Assign category based on a linear combination of
    features
  • Assumes features are normally distributed
  • Results for the quadratic classifier, which
    doesn't make this assumption, were comparable
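
The slides do not say which linear classifier was used. As one illustrative possibility, the sketch below fits a Gaussian classifier with a single per-feature variance shared by all categories, which makes each category's score a linear combination of the features. The shared diagonal variance, the `eps` smoothing term, and all names are assumptions made for this example.

```python
import math
from collections import defaultdict

def fit_linear(points, labels, eps=1e-9):
    """Fit per-category weights and bias under a shared diagonal Gaussian model."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    d = len(points[0])
    means = {label: [sum(p[j] for p in members) / len(members) for j in range(d)]
             for label, members in groups.items()}
    # Pooled within-category variance per feature (eps avoids division by zero).
    dof = max(len(points) - len(groups), 1)
    var = [sum((p[j] - means[y][j]) ** 2 for p, y in zip(points, labels)) / dof + eps
           for j in range(d)]
    model = {}
    for label, mu in means.items():
        weights = [mu[j] / var[j] for j in range(d)]
        prior = len(groups[label]) / len(points)
        bias = -0.5 * sum(mu[j] ** 2 / var[j] for j in range(d)) + math.log(prior)
        model[label] = (weights, bias)
    return model

def predict_linear(model, x):
    """Pick the category whose linear score w . x + b is largest."""
    def score(label):
        weights, bias = model[label]
        return sum(w * xj for w, xj in zip(weights, x)) + bias
    return max(model, key=score)

# Illustrative 2-D points in two categories.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]
model = fit_linear(train_points, train_labels)
pred_a = predict_linear(model, (0.5, 0.5))
pred_b = predict_linear(model, (5.5, 5.5))
```

Because the variance is shared, the quadratic terms in x cancel when two categories are compared, leaving a linear decision boundary.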

21
Linear Classifier on Science News (8-category)
22
Linear Classifier on Google News
23
Linear Classifier on Science Technology
24
Classification Results
  • Applying DR improves accuracy versus not applying
    DR
  • Best DR techniques achieve high accuracy in few
    dimensions
  • MDS and Isomap yield the most consistent and
    reliable results
  • This advantage is more pronounced on difficult
    corpora
  • Contradicts van der Maaten et al. (2007), whose
    results show PCA as best but evaluate only one
    textual data set
  • PCA is good, but not the best: it suffers on
    harder data sets

25
Outline
  • Problem Statement
  • Background
  • Experimental Analysis
  • Experimental Question
  • Task 1 Classification
  • Nearest Neighbor Classifier
  • Linear Classifier
  • 2D Visualizations
  • Task 2 Literature-Based Discovery
  • Contributions and Conclusions
  • Future Work

26
Literature-Based Discovery (LBD)
  • Identify candidate interesting associations
    between seemingly unrelated documents
  • Example: Swanson's manual discovery

Fish Oil -> High Blood Viscosity, Platelet Aggregation -> Raynaud's Disease
27
Automatic LBD Assessment
  • Time-Consuming
  • Subjective
  • Expensive

28
Literature-Based Discovery
Score Pairs
Assessment: novelty scores of candidate
discoveries
29
Novelty Scoring Metric
  • Example: Interesting Connection Found
  • Student sleep deprivation paired with
    cross-cultural approach to improving sleep
  • Computing the Novelty Score
  • 1) Compute relative significance on Google
    Scholar
  • 2) Compute relative significance on Google
  • 3) Compute novelty score estimate
  • Relevant on Google Scholar and not well-known on
    Google => high novelty estimate

30
Novelty Scoring Metric
  • Example: Uninteresting Pair => Low Score
  • Anti-Counterfeiting paired with Adjusting
    Electromagnetic Properties of the Sacagawea Coin
  • Computing the Novelty Score
  • 1) Compute relative significance on Google
    Scholar
  • 2) Compute relative significance on Google
  • 3) Compute novelty score estimate
  • No distinction between Google and Google Scholar
    => low novelty estimate
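
The slides give the three scoring steps but not the formulas, so the sketch below is purely hypothetical. It assumes that relative significance is the share of the rarer topic's search results that mention both topics, and that the novelty estimate is the amount by which significance on Google Scholar exceeds significance on Google. The hit counts are made-up inputs, and no search API is called.

```python
def relative_significance(joint_hits, hits_a, hits_b):
    """Hypothetical: share of the rarer topic's results that mention both topics."""
    rarer = min(hits_a, hits_b)
    return joint_hits / rarer if rarer > 0 else 0.0

def novelty_estimate(scholar_significance, web_significance):
    """Hypothetical: high when a pairing matters in scholarly results but is
    not well known on the general web; zero when there is no distinction."""
    return max(0.0, scholar_significance - web_significance)

# Made-up hit counts for two candidate pairs.
interesting = novelty_estimate(
    relative_significance(40, 100, 500),     # Google Scholar counts
    relative_significance(50, 1000, 9000),   # Google counts
)
uninteresting = novelty_estimate(
    relative_significance(20, 100, 500),
    relative_significance(200, 1000, 9000),
)
```

Under these assumptions the first pair scores higher because it is relatively more significant on Google Scholar than on Google, while the second pair shows no distinction between the two and scores zero.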

31
LBD Effectiveness on Science News (4-cat)
32
LBD Effectiveness on Science Technology
33
LBD Relative Effectiveness
[Figure: relative LBD effectiveness of None-Sort, LDM, PCA, LLE, MDS, and Isomap on Science News (4-cat), Science News (8-cat), Science Technology, and Google News]
34
LBD Results
  • For a fixed number of dimensions, applying DR can
    improve quality of candidate discoveries over not
    applying DR
  • Performance is effective even with relatively few
    dimensions
  • PCA and Isomap yield the most consistent and
    reliable results

35
Outline
  • Problem Statement
  • Background
  • Experimental Analysis
  • Contributions and Conclusions
  • Future Work

36
Conclusions and Contributions
  • Evaluated two distinct text mining processes with
    regard to dimensionality reduction techniques
  • Showed that DR can be highly effective
  • Surprisingly, non-linear techniques did not
    improve performance
  • Classification
  • PCA (most commonly used) is inconsistent for text
    classification
  • MDS and Isomap are often the best
  • Literature-Based Discovery
  • Developed novel keyword extraction and LBD
    scoring techniques
  • PCA was often the best
  • May want to combine results from multiple
    techniques to maximize performance

37
Outline
  • Problem Statement
  • Background
  • Experimental Analysis
  • Contributions and Conclusions
  • Future Work

38
Future Work
  • Human evaluation of LBD results
  • LBD is subjective
  • Would benefit from human analysis
  • Examination of document pairs found by more than
    one technique
  • May be superior
  • Improved keyword extraction
  • Part of speech tagging
  • Improve DR process
  • Ability to insert new documents
  • Efficiency

39
Acknowledgements
  • Naval Surface Warfare Center, Dahlgren Division
  • Dr. David Marchette
  • Dr. Jeff Solka
  • Trident Scholar Committee
  • Office of Naval Research
  • Multimedia Support Center (MSC)
  • Publications Office (PAO)

40
Exploring Dimensionality Reduction for Text Mining
  • MIDN 1/C David G. Underhill
  • Assistant Professor Lucas K. McDowell
  • Computer Science Department

41
2D Visualization of Science News (2-cat)
42
2D Visualization of Science News (8-cat)
43
2D Visualization of Google News
44
2D Visualization of Science Technology