Title: Data Mining Applied to Chemistry and Chemical Engineering
- Department of Chemistry, College of Sciences, 
 Shanghai University, P. R. China
1 Introduction
1.1 Concept
Data mining is an analytic process designed to
explore data in search of consistent patterns
and/or systematic relationships between variables,
and then to validate the findings by applying the
detected patterns to new subsets of data.
1.2 Main Focuses
(1) Materials design
Find the best preparation conditions or the
structure-property relationships of materials, in
order to design experiments for preparing new
materials or to predict the physico-chemical
properties of unknown material systems.
(2) Molecular design
Find the structure-activity relationships of
molecules, in order to design new compounds with
the expected biological activities or to predict
the physico-chemical properties of unknown
molecules.
(3) Industrial optimization
Find the optimal processing conditions, in order
to achieve good results in industrial production.
2 Methods in MASTER
(1) Optimal map recognition (OMR)
The projection map with the best separability is
selected according to the rate of correct
classification.
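The selection idea can be sketched as follows. This is only an illustration under stated assumptions: the candidate maps are taken to be all axis-pair projections, and separability is scored with a nearest-centroid rule on the training samples. Neither choice is confirmed as the criterion used in MASTER, and the toy data and the names `map_accuracy` and `best_map` are hypothetical.

```python
from itertools import combinations


def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def map_accuracy(X, y, i, j):
    """Rate of correct classification on the (i, j) axis-pair map.

    Uses a nearest-centroid rule (an assumption, not the MASTER criterion).
    """
    projected = [(row[i], row[j]) for row in X]
    centroids = {
        label: centroid([p for p, t in zip(projected, y) if t == label])
        for label in set(y)
    }
    correct = 0
    for p, t in zip(projected, y):
        predicted = min(
            centroids,
            key=lambda c: (p[0] - centroids[c][0]) ** 2
            + (p[1] - centroids[c][1]) ** 2,
        )
        correct += predicted == t
    return correct / len(y)


def best_map(X, y):
    """Return ((i, j), accuracy) for the map with the highest accuracy."""
    pairs = combinations(range(len(X[0])), 2)
    scored = ((pair, map_accuracy(X, y, *pair)) for pair in pairs)
    return max(scored, key=lambda r: r[1])


# Hypothetical toy data: only feature 0 separates the two classes.
X = [
    [1.0, 1.0, 1.0],
    [1.2, 5.0, 5.0],
    [0.8, 3.0, 2.7],
    [3.0, 1.0, 5.0],
    [3.2, 5.0, 1.0],
    [2.9, 3.0, 3.3],
]
y = [1, 1, 1, 2, 2, 2]

pair, accuracy = best_map(X, y)
```

On this toy set the selected map contains feature 0, while the map built from features 1 and 2 alone misclassifies some samples.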
Fig. 1 Comparison of OMR with PCA
(a) Classification diagram obtained with Optimal Map
Recognition (OMR)
(b) Classification diagram obtained with Principal
Component Analysis (PCA)
(2) Hyper-polyhedron (HP)
The HP model expresses the optimal zone as a
series of inequalities that describe the
boundaries between two types of samples.
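A minimal sketch of a zone described by inequalities is given below. It assumes the simplest possible polyhedron, an axis-parallel box fitted around the "good" samples. A real hyper-polyhedron may also have oblique faces, and how MASTER builds them is not stated here. The sample values and function names are hypothetical.

```python
def fit_box(samples):
    """Return per-feature (low, high) bounds enclosing all given samples."""
    columns = list(zip(*samples))
    return [(min(col), max(col)) for col in columns]


def inside(bounds, x):
    """True when x satisfies every inequality low_j <= x_j <= high_j."""
    return all(low <= value <= high for (low, high), value in zip(bounds, x))


def as_inequalities(bounds, names):
    """Render the bounds as readable inequalities, one per feature."""
    return [f"{low} <= {name} <= {high}" for (low, high), name in zip(bounds, names)]


# Hypothetical "good" samples described by two features.
good = [[1.0, 10.0], [1.5, 12.0], [2.0, 11.0]]
bounds = fit_box(good)
rules = as_inequalities(bounds, ["x1", "x2"])
```

A new sample is then assigned to the optimal zone only if `inside(bounds, sample)` is true for all inequalities at once.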
Fig. 2 Conceptual HP model
(3) Optimal projection regression (OPR)
-  OPR is a quantitative modelling method that
 combines regression with Optimal Map Recognition
 (OMR). It uses the classification information of
 the data set to select the most appropriate
 features for regression.
Fig. 3 Conceptual OPR model: projection from
hyperspace to 2-dimensional space (axes X1, X2)
(4) Inverse projection
Fig. 4 Projection from 2-dimensional space to
hyperspace
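As an illustration of going from a map point back to hyperspace, the sketch below assumes a linear map defined by a centre `mean` and two orthonormal directions `u1` and `u2`. All three are hypothetical names and values. Many hyperspace points share the same map coordinates, so the sketch returns the one lying in the map plane, which may differ from the choice made in MASTER.

```python
def project(x, mean, u1, u2):
    """Map a hyperspace point to 2-D map coordinates."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return (
        sum(c * a for c, a in zip(centered, u1)),
        sum(c * a for c, a in zip(centered, u2)),
    )


def inverse_project(point, mean, u1, u2):
    """Map a 2-D map point back to the hyperspace point in the map plane."""
    t1, t2 = point
    return [m + t1 * a + t2 * b for m, a, b in zip(mean, u1, u2)]


# Hypothetical map: centre and two orthonormal directions in 3-D.
mean = [1.0, 2.0, 3.0]
u1 = [1.0, 0.0, 0.0]
u2 = [0.0, 0.6, 0.8]

# A target picked on the 2-D map, returned as a point in hyperspace.
candidate = inverse_project((2.0, 5.0), mean, u1, u2)
```

Projecting `candidate` again with `project` recovers the map point that was picked, which is a quick consistency check for the two functions.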
(5) Hierarchical projection model
Fig. 5 Conceptual hierarchical projection
(6) Support Vector Machine
Support Vector Classification
Support Vector Regression
3 Examples of Application
3.1 Applications in Materials Design
(1) Optimization of a high-temperature
superconductor
A nonlinear function with 5 terms and a PRESS
value of 0.128 was obtained. By using inverse
projection and the OPR method, the critical
temperature was raised from 116 K to 121 K.
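For readers unfamiliar with PRESS (predicted residual error sum of squares), the sketch below shows how such a value is typically computed by leave-one-out refitting. A straight-line model stands in for the 5-term nonlinear function, which is not given here, and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def press(xs, ys):
    """Sum of squared errors, each sample predicted by a model fitted without it."""
    total = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total


# Hypothetical data, not taken from the superconductor study.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
press_value = press(xs, ys)
```

A smaller PRESS indicates better predictive ability on samples the model has not seen, which is why it is a stricter figure than the ordinary fitting residual.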
Inverse projection result for the high-temperature
superconductor
(2) Composition design of rare-earth-containing
phosphors
By extrapolation we obtained a series of new
compositions outside the scope of the German
patents. Our experiments confirmed that the
brightness of these newly designed phosphors was
higher than that claimed in the German patents.
Importance of features
Classification diagram using the Fisher method
(3) Optimization of VPTC ceramic semiconductors
With MASTER, new compositions and technological
conditions proposed for VPTC materials gave much
better results: the ratio of the electrical
resistance at 273 K to the minimum resistance was
raised from 20 to 27.3.
Partial Least Squares (PLS) result for VPTC ceramic
semiconductors
(4) Composition design of cathode materials for
Ni/H batteries
Using Support Vector Machine (SVM), mathematical
models with strong predictive ability were built,
and new formulations were predicted and confirmed
by experiments.
Calculated vs. experimental values of C400/C0
(5) Formation conditions for the amorphous phase
of ternary fluorides
The inequalities obtained with the OMR method
were used to predict whether a new ternary
fluoride can form an amorphous phase. The
predictions agreed with the experimental results.
OMR result for the formation conditions of the
amorphous phase of ternary fluorides
(6) Formation conditions of ternary intermetallic
compounds
Using 2400 known phase diagrams as the training
set, regularities in the formation conditions of
ternary intermetallic compounds were found. A
series of newly discovered ternary intermetallic
compounds was predicted in this way with good
results.
OMR result for the formation conditions of ternary
intermetallic compounds
3.2 Applications in Molecular Design
(1) Molecular screening of guanidine compounds
The hyper-polyhedron (HP) and Support Vector
Classification (SVC) methods were used for the
computer-aided molecular screening of guanidine
compounds. The predictions of HP and SVC were
better than those of PCA, KNN, FDV and other
methods.
(2) Structure-activity relationship of
antagonists
SVC was used to investigate the structure-activity
relationship (SAR) of 26 antagonist compounds.
Leave-one-out cross-validation showed that the
predictive ability of SVC was better than that of
PCA, KNN, FDV and other methods.
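Leave-one-out cross-validation can be sketched as below: each compound is held out in turn, the classifier is built on the rest, and the held-out compound is predicted. A 1-nearest-neighbour rule is swapped in for SVC purely to keep the example self-contained. The descriptors and labels are hypothetical.

```python
def nearest_label(x, train_X, train_y):
    """Label of the training sample closest to x (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in train_X]
    return train_y[distances.index(min(distances))]


def leave_one_out_accuracy(X, y):
    """Fraction of samples predicted correctly when each is held out in turn."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_label(X[i], train_X, train_y) == y[i]
    return hits / len(X)


# Hypothetical descriptors for six compounds in two activity classes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 1.1], [1.2, 0.9], [0.9, 1.0]]
y = ["active", "active", "active", "inactive", "inactive", "inactive"]

accuracy = leave_one_out_accuracy(X, y)
```

Because every prediction is made for a compound the model has not seen, this accuracy is a fairer basis for comparing methods than the fit on the training set, especially with as few as 26 samples.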
(3) Molecular screening of triazole compounds
(1) An OMR model was used for the molecular
screening of new triazole compounds likely to
have higher antifungal activity.
(2) The predictions of SVC were better than those
of PCA, KNN, FDV and other methods.
(4) Structure-property relationship of azo
dyestuffs
The Support Vector Regression (SVR) method was
used to predict the absorption maximum wavelength
of 37 azo dyestuff molecules. The mean relative
error was 4.22% for the training set and 4.52%
for the prediction set.
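The mean relative error can be computed as in the short sketch below, assuming the usual definition (mean of |predicted - measured| / |measured|) and that the quoted figures are percentages. The wavelengths are hypothetical, not data from the study.

```python
def mean_relative_error(actual, predicted):
    """Mean of |predicted - actual| / |actual|, expressed in percent."""
    ratios = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical absorption maxima in nm.
measured = [400.0, 500.0]
modelled = [420.0, 490.0]
error_percent = mean_relative_error(measured, modelled)
```

Here the two relative errors are 5% and 2%, so `error_percent` comes out at about 3.5.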
3.3 Applications in Industrial Optimization
(1) Optimization of the nitriding technique for
crankshaft production
The surface hardness of crankshaft products at
the Wuxi Diesel Engine Factory was too low. An
optimal zone was found in the multidimensional
feature space. After optimization, the rejection
rate decreased from 1.7% to 0.3%.
(2) Springback prediction in sheet metal forming
MASTER combined with FEA software (ANSYS/LS-DYNA
5.71) was used to predict the springback in
V-type sheet steel forming. The relative error of
the predicted springback was kept within 10% of
the experimental values.
4 Conclusion
-  (1) The MASTER software package is a comprehensive
 system comprising orthogonal design, statistical
 analysis, data visualization, pattern recognition,
 regression analysis, artificial neural networks
 (ANN), support vector machines (SVM) and more.
-  (2) MASTER can be used to
- optimize formulas and technological conditions
- predict biological activities and
 physico-chemical properties
- improve product quality and analyze faults in
 processing and production.
Thank you