Title: Graphical%20Data%20Mining%20for%20Computational%20Estimation%20in%20Materials%20Science%20Applications
1Graphical Data Mining for Computational
Estimation in Materials Science Applications
- Aparna Varde
- Ph.D. Dissertation
- August 15, 2006
- Committee Members
- Prof. Elke Rundensteiner (Advisor)
- Prof. Carolina Ruiz
- Prof. David Brown
- Prof. Neil Heffernan
- Prof. Richard Sisson Jr. (Head of Materials
Science, WPI)
This work is supported by the Center for Heat
Treating Excellence and by Department of Energy
Award DE-FC-07-01ID14197
2Introduction
- Scientific domains Experiments conducted with
given input conditions - Results plotted as graphs Good visual depictions
- Experimental results help in analysis Assist
decision-making - Performing experiment Consumes time and resources
3Motivating Example
- Heat Treating of Materials
- Controlled heating cooling of materials to
achieve mechanical thermal properties - Performing experiments involves
- One time cost 1000s
- Recurrent costs 100s
- Time 5 to 6 hours
- Human labor
- Desirable to estimate
- Graphs given input conditions
- Conditions to achieve given graph
CHTE Experimental Setup
4Problem Definition
- To develop an estimation technique with following
goals
- Given input conditions in an experiment, estimate
resulting graph
2. Given desired graph in an experiment, estimate
conditions to obtain it
5Proposed Estimation Approach AutoDomainMine
6Knowledge Discovery in AutoDomainMine
7Estimation of Graph in AutoDomainMine
8Estimation of Conditions in AutoDomainMine
9Main Tasks
Task 1 AutoDomainMine Learning Strategy of
Integrating Clustering and Classification AAAI-0
6 Poster, ACM SIGARTs ICICIS-05
Task 2 Learning Domain-Specific Distance Metrics
for Graphs ACM KDDs MDM-05, MTAP-06 Journal
Task 3 Designing Semantics-Preserving Representati
ves for Clusters ACM SIGMODS IQIS-06, ACM
CIKM-06
10Task 2 Learning Domain-Specific Distance
Metrics for Graphs
11Motivation
- Various distance metrics
- Absolute position of points
- Statistical observations
- Critical features
- Issues
- Not known what metrics apply
- Multiple metrics may be relevant
- Need for distance metric learning in graphs
Example of domain-specific problem
12Proposed Distance Metric Learning Approach
LearnMet
- Given
- Training set with actual clusters of graphs
- Additional Input
- Components distance metrics applicable to graphs
- LearnMet Metric
- D ?wiDi
13Evaluate Accuracy
- Use pairs of graphs
- A pair (ga,gb) is
- TP - same predicted, same actual cluster (g1,
g2) - TN - different predicted, different actual
clusters (g2,g3) - FP - same predicted cluster, different actual
clusters (g3,g4) - FN - different predicted, same actual clusters
(g4,g5)
14Evaluate Accuracy (Contd.)
- How do we compute error for whole set of graphs?
- For all pairs
- Error Measure
- Failure Rate FR
- FR (FPFN) / (TPTNFPFN)
- Error Threshold (t)
- Extent of FR allowed
- If (FR lt t) then clustering is accurate
15Adjust the Metric
- Weight Adjustment Heuristic for each Di
- New wi wi sfi (DFNi/DFN DFPi/DFP)
KDDs MDM-05
16Evaluation of LearnMet
- Details MTAP-06
- Effect of pairs per epoch (ppe)
- G number of graphs, e.g., 25
- GC2 total number of pairs, e.g., 300
- Select subset of GC2 pairs per epoch
- Observations
- Highest accuracy with middle range of ppe
- Learning efficiency best with low ppe
- Average accuracy with LearnMet 86
Accuracy of Learned Metrics over Test Set
Learning Efficiency over Training Set
17Task 3 Designing Semantics-Preserving
Representatives for Clusters
18Motivation
- Different combinations of conditions could lead
to a single cluster - Graphs in a cluster could have variations
- Need for designing representatives that
- Incorporate semantics
- Avoid visual clutter
- Cater to various users
19Proposed Approach for Designing Representatives
DesRept
20Candidates for Conditions
1. Nearest Representative
Set of conditions in Cluster A
Nearest Representative for Cluster A
- Return set of conditions closest to all others in
cluster - Notion of distance Domain-specific distance
metric from decision tree paths CIKM-06
21Candidates for Conditions (Contd.)
2. Summarized Representative
Cluster A
- Build sub-clusters of condition using domain
knowledge - Return nearest sub-cluster representatives
- Sort them
Sub-clusters within the Cluster A
Summarized Representative for Cluster
22Candidates for Conditions (Contd.)
3. Combined Representative
Cluster A
Combined Representative for Cluster A
- Return all sets of conditions
- Sort them in ascending order
23Candidates for Graphs
1. Nearest Representative
- Select graph that is nearest neighbor for all
others - Notion of distance Domain-specific metric from
LearnMet
24Candidates for Graphs (Contd.)
2. Medoid Representative
- Select graph closest to average of all graphs
- Average of y-coordinate values since
x-coordinates are same
25Candidates for Graphs (Contd.)
3. Summarized Representative
- Construct average graph with prediction limits
- Average centroid, Prediction limits
domain-specific thresholds
26Candidates for Graphs (Contd.)
3. Combined Representative
- Construct superimposed graph of all graphs in
cluster - Same x-values, so plot y-values on a common x-axis
27Effectiveness Measure for Candidates
- Minimum Description Length Principle
- Theory Representative, Examples all items in
cluster - Representative Measure Complexity (ease of
interpretation) - Complexity log2 N for graphs, log2 AV for
conditions, - N number of points to store representative
graph - A number of attributes for conditions,
- V number of values in representative set of
conditions - Examples Measure distance of items from
representative (information loss) - Distance for graphs log2 (1/G)?i1 to G
D(r,gi) - D distance using domain-specific metric
- G total number of graphs in cluster
- gi each graph
- r representative graph
- Encoding SIGMOD IQIS-06
- Effectiveness UBCComplexity UBDDistance
28Evaluation of DesRept Conditions
- Details
- Data Set Size 400, Number of Clusters 20
- Observations
- Overall winner is Summarized
- As weight for complexity increases, Nearest wins
- Designed better than Random
29Evaluation of DesRept Graphs
- Details
- Data Set Size 400, Number of Clusters 20
- Observations
- Overall winner is Summarized
- As weight for complexity increases, Nearest /
Medoid wins - Designed better than Random
30User Evaluation of AutoDomainMine System
- Formal user surveys in different applications
- Evaluation Process
- Compare estimation with real data in test set
- If they match estimation is accurate
- Observations
- Estimation Accuracy around 90 to 95
Accuracy Estimating Conditions
Accuracy Estimating Graphs
31Related Work
- Similarity Search HK-01, WF-00
- Non-matching conditions could be significant
- Mathematical Modeling M-95, S-60
- Existing models not applicable under certain
situations - Case-based Reasoning K-93, AP-03
- Adaptation of cases not feasible with graphs
- Learning nearest neighbor in high-dimensional
spaces HAK-00 - Focus is dimensionality reduction, do not deal
with graphs - Distance metric learning given basic formula
XNJR-03 - Deal with position-based distances for points, no
graphs involved - Similarity search in multimedia databases KB-04
- Use various metrics in different applications, do
not learn a single metric - Image Rating HH-01
- User intervention involved in manual rating
- Semantic Fish Eye Views JP-04
- Display multiple objects in small space, no
representatives - PDA Displays in Levels of Detail BGMP-01
32Summary
- Dissertation Contributions
- AutoDomainMine Integrating Clustering and
Classification for Estimation AAAI-06 Poster,
ACM SIGARTs ICICIS-05 - LearnMet Learning Domain-Specific Distance
Metrics for Graphs ACM KDDs MDM-05, MTAP-06
Journal - DesRept Designing Semantics-Preserving
Representatives for Clusters ACM SIGMODs
IQIS-06, ACM CIKM-06 - Trademarked Tool for Computational Estimation in
Materials Science ASM HTS-05, ASM HTS-03 - Future Work
- Image Mining, e.g., Comparing Nanostructures
- Data Stream Matching, e.g., Stock Market Analysis
- Visual Displays, e.g., Summarizing Web
Information
33Publications
Dissertation-Related Papers 1. Designing
Semantics-Preserving Representatives for
Scientific Input Conditions, A. Varde, E.
Rundensteiner, C. Ruiz, D. Brown, M. Maniruzzaman
and R. Sisson Jr., In CIKM, Arlington, VA, Nov
2006. 2. Integrating Clustering and
Classification for Estimating Process Variables
in Materials Science. A. Varde, E. Rundensteiner,
C. Ruiz, D. Brown, M. Maniruzzaman and R. Sisson
Jr. In AAAI, Poster Track, Boston, MA, Jul
2006. 3. Effectiveness of Domain-Specific
Cluster Representatives for Graphical Plots. A.
Varde, E. Rundensteiner, C. Ruiz, D. Brown, M.
Maniruzzaman and R. Sisson Jr. In ACM SIGMOD
IQIS, Chicago, IL, Jun 2006. 4. LearnMet
Learning Domain-Specific Distance Metrics for
Plots of Scientific Functions. A. Varde, E.
Rundensteiner, C. Ruiz, M. Maniruzzaman and R.
Sisson Jr. Accepted in the International MTAP
Journal, Springer Publications, Special Issue on
Multimedia Data Mining, 2006. 5. Learning
Semantics-Preserving Distance Metrics for
Clustering Graphical Data. A. Varde, E.
Rundensteiner, C. Ruiz, M. Maniruzzaman and R.
Sisson Jr. In ACM KDD MDM, Chicago, IL, Aug 2005,
pp. 107-112. 6. Apriori Algorithm and
Game-of-Life for Predictive Analysis in Materials
Science. A. Varde, M. Takahashi, E.
Rundensteiner, M. Ward, M. Maniruzzaman and R.
Sisson Jr. In KES Journal, IOS Press,
Netherlands, Vol. 8, No. 4, 2004, pp. 213
228. 7. Data Mining over Graphical Results
of Experiments with Domain Semantics. A. Varde,
E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R.
Sisson Jr. In ACM SIGART ICICIS, Cairo, Egypt,
Mar 2005, pp. 603 611.
34Publications (Contd.)
- 8. QuenchMiner Decision Support for
Optimization of Heat Treating Processes. A.
Varde, M. Takahashi, E. Rundensteiner, M. Ward,
M. Maniruzzaman and R. Sisson Jr. In IEEE IICAI,
Hyderabad, India, Dec 2003, pp. 993 1003. - 9. Estimating Heat Transfer Coefficients as
a Function of Temperature by Data Mining. A.
Varde, E. Rundensteiner, M. Maniruzzaman and R.
Sisson Jr. In ASM HTS, Pittsburgh, PA, Sep 2005. - 10 . The QuenchMiner Expert System for
Quenching and Distortion Control. A. Varde, E.
Rundensteiner, M. Maniruzzaman and R. Sisson Jr.
In ASM HTS, Indianapolis, IN, Sep 2003, pp. 174
183. - Other Papers
- 11. MEDWRAP Consistent View Maintenance over
Distributed Multi-Relation Sources. A. Varde and
E. Rundensteiner. In DEXA. Aix-en-Provence,
France, Sep 2002, pp. 341 350. - 12. SWECCA for Data Warehouse Maintenance. A.
Varde and E. Rundensteiner. In SCI, Orlando, FL,
Jul 2002, Vol. 5, pp. 352 357. - 13. MatML XML for Information Exchange with
Materials Property Data, A. Varde, E. Begley, S.
Fahrenholz-Mann. In ACM KDD DM-SPP, Philadelphia,
PA, Aug 2006. - 14. Semantic Extensions to Domain-Specific
Markup Languages. A. Varde, E. Rundensteiner, M.
Mani, M. Maniruzzaman and R. Sisson Jr. In IEEE
CCCT, Austin, TX, Aug 2004, Vol. 2, pp. 55 60.
35Acknowledgments
- First of all, my Advisor Prof. Elke
Rundensteiner - Committee Prof. Carolina Ruiz,Prof. David Brown,
Prof Neil Heffernan - External Member Prof. Richard D. Sisson Jr.,
Head of Materials Program - Director of Metal Processing Institute Prof.
Diran Apelian - Domain Expert Dr. Mohammed Maniruzzaman
- Members of Center for Heat Treating Excellence
- CS Department Head Prof. Michael Gennert
- Former CS Department Head Prof. Micha Hofri
- WPI Administration (CS, Materials) In particular
Mrs. Rita Shilansky - Reviewers of Conferences and Journals where my
papers got accepted - Members of DSRG, AIRG, KDDRG and Quenching
Research Group - Colleagues and Friends Shuhui, Sujoy, Viren,
Olly, Mariana, Rimma, Maged, Bin, Lydia, Shimin
and others - Great Thanks to my Family Parents Dr. Sharad
Varde and Dr. (Mrs.) Varsha Varde, Grandparents
Mr. D.A. Varde and Mrs. Vimal Varde, Brother
Ameya Varde and Sister-in-law Deepa Varde - All the attendees of my Ph.D. Defense
- Finally, God for guiding me throughout my
doctoral journey
36Thank You