Title: Considering uncertainties in multivariate curve resolution alternating least squares strategies
1 Considering uncertainties in multivariate curve resolution alternating least squares strategies
- Romà Tauler, rtaqam_at_iiqab.csic.es, Department of Environmental Chemistry, IIQAB-CSIC, Jordi Girona 18, 08034 Barcelona
- Peter Wentzell, peter.wentzell_at_dal.ca, Department of Chemistry, Dalhousie University, Halifax, NS B3H 4J3, Canada
2- MOTIVATIONS OF THIS WORK
- The effect of data uncertainties on traditional alternating least squares strategies in multivariate curve resolution is investigated.
- Examples of application to the resolution of environmental patterns in contamination studies, considering uncertainties, will be given.
- P.D. Wentzell, T.K. Karakach, S. Roy, M.J. Martinez, C.P. Allen and M. Werner-Washburne, "Multivariate Curve Resolution of Time Course Microarray Data", BMC Bioinformatics, 7, 343 (2006).
3- OUTLINE
- Introduction to Multivariate Curve Resolution by Alternating Weighted Least Squares (MCR-AWLS)
- Testing the MCR-AWLS method
- Application of the MCR-AWLS method to the resolution and apportionment of air particulate contamination sources
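4 Chemometric data models
Bilinear models for the resolution of two-way data:
$$d_{ij} = \sum_{n=1}^{N} c_{in}\, s_{nj} + e_{ij}, \qquad \mathbf{D} = \mathbf{C}\,\mathbf{S}^{T} + \mathbf{E}$$
D is the (I x J) data matrix, with I samples (rows) and J variables (columns).
- d_ij is the data measurement (response) of variable j in sample i
- n = 1, ..., N indexes the components (species, sources, ...)
- c_in is the concentration of component n in sample i
- s_nj is the response of component n at variable j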
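5 An algorithm to solve bilinear models: Multivariate Curve Resolution Alternating Least Squares (MCR-ALS)
C and S^T are obtained by iteratively solving the two alternating LS equations:
$$\min_{\mathbf{C}} \left\| \mathbf{D} - \mathbf{C}\,\mathbf{S}^{T} \right\| \quad (\mathbf{S}^{T}\ \text{fixed}), \qquad \min_{\mathbf{S}^{T}} \left\| \mathbf{D} - \mathbf{C}\,\mathbf{S}^{T} \right\| \quad (\mathbf{C}\ \text{fixed})$$
- Optional constraints (local rank, non-negativity, unimodality, closure, ...) are applied at each iteration
- Initial estimates of C or S^T are needed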
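The alternating scheme described on this slide can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the MCR-ALS toolbox code: the function name `mcr_als` and its parameters are hypothetical, non-negativity is imposed by simply clipping negative values after each unconstrained LS step (an assumption made for brevity, not a true non-negative least squares solver), and the test data are simulated.

```python
import numpy as np

def mcr_als(D, ST_init, n_iter=200, tol=1e-10):
    """Illustrative MCR-ALS sketch: D (I x J) ~ C (I x N) @ ST (N x J).

    Assumption of this sketch: non-negativity is enforced by clipping,
    and rows of ST are normalized to unit length at each iteration.
    """
    D = np.asarray(D, dtype=float)
    ST = np.clip(np.asarray(ST_init, dtype=float), 0.0, None)
    total_ssq = np.sum(D ** 2)
    prev_ssq = np.inf
    for _ in range(n_iter):
        # LS step 1: solve for C with ST fixed, then clip negatives
        C = np.clip(D @ np.linalg.pinv(ST), 0.0, None)
        # LS step 2: solve for ST with C fixed, then clip negatives
        ST = np.clip(np.linalg.pinv(C) @ D, 0.0, None)
        # Normalization: unit-length rows of ST, scale moved into C
        norms = np.linalg.norm(ST, axis=1)
        norms[norms == 0.0] = 1.0
        ST = ST / norms[:, None]
        C = C * norms[None, :]
        # Stop when the residual sum of squares no longer changes
        ssq = np.sum((D - C @ ST) ** 2)
        if abs(prev_ssq - ssq) <= tol * total_ssq:
            break
        prev_ssq = ssq
    lof = 100.0 * np.sqrt(ssq / total_ssq)  # lack of fit in percent
    return C, ST, lof

# Simulated noise-free example, started from the true ST
rng = np.random.default_rng(0)
C_true = rng.lognormal(0.01, 1.0, size=(30, 4))
ST_true = rng.lognormal(0.01, 1.0, size=(4, 50))
D = C_true @ ST_true
C_hat, ST_hat, lof = mcr_als(D, ST_true)
```

Starting from the true profiles, the noise-free data are reproduced essentially exactly; other initial estimates may converge to different, equally well-fitting profiles.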
6 Flowchart of MCR-ALS (http://www.ub.es/gesq/mcr/mcr.htm)
References: Journal of Chemometrics, 1995, 9, 31; 2001, 15, 749. Chemometrics and Intelligent Laboratory Systems, 1995, 30, 133; 2005, 76, 101. Analytica Chimica Acta, 2003, 500, 195-210.
[Flowchart: data matrix D -> SVD or PCA (estimation of the number of components) -> initial estimation -> ALS optimization with CONSTRAINTS -> data matrix decomposition according to a bilinear model, D = C S^T + E, giving the resolved concentration profiles C and the resolved spectra profiles S^T -> results of the ALS optimization procedure: fit and diagnostics]
7 Introduction: Chemometrical Methods
MCR-ALS Quality Assessment: rotational ambiguities
When two or more components overlap (in the spectra or in the concentration profiles), rotational ambiguities appear: resolved profiles can be an unknown linear combination of the true (sought) profiles. If this ambiguity is present, the concentration and spectra profiles cannot be represented by a single unique profile and have to be represented as a band of feasible solutions.
- Tmax gives the maximum of the feasible band boundary
- Tmin gives the minimum of the feasible band boundary
- Boundaries of the feasible bands of the MCR-ALS solutions can be calculated using a constrained non-linear optimization procedure.
- J. of Chemometrics, 2001, 15, 627-646
- Analytica Chimica Acta, 2007, 595, 289-298
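A small numerical illustration of the rotational ambiguity may help here. The sketch below uses simulated, hypothetical profiles and an arbitrarily chosen invertible matrix `T`. It shows that C T and T^-1 S^T reproduce the data exactly as well as C and S^T do, while still obeying non-negativity, so the fit alone cannot distinguish between them.

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.uniform(0.1, 1.0, size=(20, 2))    # hypothetical concentration profiles
ST = rng.uniform(0.5, 1.0, size=(2, 15))   # hypothetical, strongly overlapped spectra
D = C @ ST                                 # noise-free bilinear data

# An arbitrary invertible transformation (chosen for illustration only)
T = np.array([[1.0, 0.2],
              [0.1, 1.0]])
C_rot = C @ T                       # transformed concentration profiles
ST_rot = np.linalg.inv(T) @ ST      # compensating transformation of the spectra

same_fit = np.allclose(D, C_rot @ ST_rot)                            # identical reproduction of D
still_nonneg = bool((C_rot >= 0).all() and (ST_rot >= 0).all())      # constraints still satisfied
profiles_differ = not np.allclose(C, C_rot)                          # yet the profiles are different
```

Constraints shrink the set of admissible T matrices; the Tmax and Tmin boundaries describe what remains of that set.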
8 MCR-ALS Quality Assessment: propagation of experimental noise into the MCR-ALS solutions
- Experimental noise is propagated into the MCR-ALS solutions and causes uncertainties in the obtained results.
- To estimate these uncertainties for non-linear models like MCR-ALS, computer-intensive resampling methods can be used.
[Figure: noise added; mean, max and min profiles; confidence range profiles]
(J. of Chemometrics, 2004, 18, 327-340; J. Chemometrics, 2006, 20, 4-67)
9 Including uncertainties in data: MCR-AWLS (Alternating Weighted Least Squares algorithm)
- In some circumstances (e.g. in the analysis of environmental data tables, microarray data, etc.), measurement errors can be high and not uniformly distributed.
- These measurement errors need to be incorporated into the MCR-ALS data analysis to obtain optimal solutions under these circumstances.
- Effect of experimental errors on MCR-ALS estimations: weighting schemes and use of errors-in-variables methods
- P.D. Wentzell et al., J. of Chemometrics, 11 (1997), 339-366
- P.D. Wentzell et al., BMC Bioinformatics, 7 (2006) 343
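10 MCR-ALS model and solutions
LS objective function without considering experimental uncertainties:
$$\min \sum_{i=1}^{I}\sum_{j=1}^{J} \left( d_{ij} - \hat{d}_{ij} \right)^{2}, \qquad \hat{d}_{ij} = \sum_{n=1}^{N} c_{in}\, s_{nj}$$
MCR-ALS unconstrained solutions (the superscript + denotes the pseudoinverse):
$$\mathbf{C} = \mathbf{D}\,(\mathbf{S}^{T})^{+}, \qquad \mathbf{S}^{T} = \mathbf{C}^{+}\,\mathbf{D}$$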
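11 MCR-AWLS model and solutions
P.D. Wentzell et al., BMC Bioinformatics, 7 (2006) 343
LS objective function considering experimental uncertainties (errors) \sigma_{ij}:
$$\min \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{\left( d_{ij} - \hat{d}_{ij} \right)^{2}}{\sigma_{ij}^{2}}$$
MCR-AWLS unconstrained solutions, obtained one row or one column at a time (uncorrelated errors assumed):
$$\text{row } i:\quad \mathbf{c}_{i} = \mathbf{d}_{i}\,\mathbf{W}_{i}\,\mathbf{S}\,(\mathbf{S}^{T}\mathbf{W}_{i}\,\mathbf{S})^{-1}, \qquad \mathbf{W}_{i} = \mathrm{diag}(1/\sigma_{i1}^{2}, \ldots, 1/\sigma_{iJ}^{2})$$
$$\text{column } j:\quad \mathbf{s}_{j} = (\mathbf{C}^{T}\mathbf{W}_{j}\,\mathbf{C})^{-1}\,\mathbf{C}^{T}\mathbf{W}_{j}\,\mathbf{d}_{j}, \qquad \mathbf{W}_{j} = \mathrm{diag}(1/\sigma_{1j}^{2}, \ldots, 1/\sigma_{Ij}^{2})$$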
12 MCR model and ALS solutions
- Without including uncertainties: unconstrained ALS solution
- Including uncertainties \sigma_{ij}: unconstrained AWLS solution (by rows or columns)
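The row-wise and column-wise weighted solutions can be sketched as follows. This is a minimal illustration that assumes uncorrelated errors with known standard deviations `sigma` (same shape as the data); the function names `wls_rows` and `wls_cols` are hypothetical, and the code is not the published MCR-AWLS implementation. With equal uncertainties everywhere it reduces to the ordinary unweighted LS solution.

```python
import numpy as np

def wls_rows(D, ST, sigma):
    """Weighted LS estimate of C, one sample (row) at a time, with ST fixed."""
    n_rows, n_comp = D.shape[0], ST.shape[0]
    C = np.empty((n_rows, n_comp))
    for i in range(n_rows):
        w = 1.0 / sigma[i, :] ** 2        # diagonal of W_i (one weight per variable)
        A = (ST * w) @ ST.T               # S^T W_i S, shape (N, N)
        b = (ST * w) @ D[i, :]            # S^T W_i d_i
        C[i, :] = np.linalg.solve(A, b)
    return C

def wls_cols(D, C, sigma):
    """Weighted LS estimate of ST, one variable (column) at a time, with C fixed."""
    n_cols, n_comp = D.shape[1], C.shape[1]
    ST = np.empty((n_comp, n_cols))
    for j in range(n_cols):
        w = 1.0 / sigma[:, j] ** 2        # diagonal of W_j (one weight per sample)
        A = (C.T * w) @ C                 # C^T W_j C, shape (N, N)
        b = (C.T * w) @ D[:, j]           # C^T W_j d_j
        ST[:, j] = np.linalg.solve(A, b)
    return ST

# Simulated example with heteroscedastic noise (all values are illustrative)
rng = np.random.default_rng(2)
C_true = rng.lognormal(0.01, 1.0, size=(30, 4))
ST_true = rng.lognormal(0.01, 1.0, size=(4, 50))
X = C_true @ ST_true
sigma = 5.0 * rng.uniform(0.05, 1.0, size=X.shape)
Y = X + sigma * rng.standard_normal(X.shape)

C_weighted = wls_rows(Y, ST_true, sigma)          # uses the uncertainties
C_unweighted = Y @ np.linalg.pinv(ST_true)        # ordinary LS, ignores them
```

Alternating `wls_rows` and `wls_cols`, with constraints applied after each step, gives the weighted counterpart of the ALS loop.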
13 Different weighting alternatives in MCR-AWLS
- Traditional MCR-ALS, without weighting, on the D_PCA projection or directly on D (experimental)
- MCR-AWLS weighting from externally estimated uncertainties in variables (without correlation)
- MCR-AWLS with weights equal to the standard deviations of the variables (as in scaling variables)
- MCR-AWLS with weights proportional to variable intensities
- MCR-AWLS weighting recalculated iteratively from residuals
- MCR-AWLS weighting using asymmetric least squares principles (i.e. to promote positive or negative residuals)
14- OUTLINE
- Introduction to Multivariate Curve Resolution by Alternating Weighted Least Squares (MCR-AWLS)
- Testing the MCR-AWLS method
- Application of the MCR-AWLS method to the resolution and apportionment of air particulate contamination sources
15- Example: simulation of an environmental two-way data set following a bilinear model
X = G F^T + E
Environmental factors G and F^T
- Map of variables: F^T(N,NC) -> F^T(nr. of sources, nr. of variables)
  - Chemical composition of the different sources
  - This will identify/define/describe the major contamination sources/patterns
- Map of samples: G(NR,N) -> G(nr. of samples, nr. of sources)
  - Distribution/contribution of the sources on the samples
  - This will indicate the geographical/temporal/compartment contribution and distribution of the contamination sources defined by F^T
16 Matrix F^T: 4 factor loadings for 50 variables
f = lognrnd(0.01,1,4,50)
Map of variables: composition profiles
Correlation between factor loadings:
   1.0000   0.6990  -0.0247  -0.0488
   0.6990   1.0000  -0.0230  -0.1578
  -0.0247  -0.0230   1.0000   0.1781
  -0.0488  -0.1578   0.1781   1.0000
There is correlation between the 1st and 2nd loadings (0.6990).
Each factor has a strongly positively skewed distribution of values.
[Figure: F^T(4,50) loadings (normalized)]
17 Matrix G: 4 factor scores for 30 samples
g = lognrnd(0.01,1,30,4)
Map of samples: contribution profiles
Correlation between factor scores:
   1.0000  -0.1027   0.0483  -0.0107
  -0.1027   1.0000   0.0900   0.1030
   0.0483   0.0900   1.0000   0.1370
  -0.0107   0.1030   0.1370   1.0000
There is little correlation among the scores.
[Figure: G(30,4) scores]
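The simulated factors can be reproduced in spirit with NumPy, where `rng.lognormal(mean, sigma, size)` plays the role of MATLAB's `lognrnd(mu, sigma, m, n)`. This is a sketch only: the random seed is arbitrary, so the correlation values will differ from those on the slides, and the row normalization of F^T to unit length is an assumption about what "normalized" means here.

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed, not the one used for the slides
N, NR, NC = 4, 30, 50                      # sources, samples, variables

FT = rng.lognormal(mean=0.01, sigma=1.0, size=(N, NC))   # source composition profiles
G = rng.lognormal(mean=0.01, sigma=1.0, size=(NR, N))    # source contribution profiles

# Assumed normalization: each loading profile scaled to unit length
FT = FT / np.linalg.norm(FT, axis=1, keepdims=True)

X = G @ FT                                 # noise-free bilinear data, 30 x 50

corr_F = np.corrcoef(FT)                   # 4 x 4 correlation between loadings (rows)
corr_G = np.corrcoef(G, rowvar=False)      # 4 x 4 correlation between scores (columns)
```

Because lognormal values are strictly positive and right-skewed, the simulated profiles satisfy non-negativity by construction.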
18 Y = E + X
lof (%) = 14, R2 (%) = 98.0, mean(S/N) = 21.7
HOMOSCEDASTIC NOISE CASE
Noise structure: r = 0.01*max(max(Y)) = 3.21; S = I .* r; E = S .* N(0,1)
SVD (largest singular values):
Y:          818.1  348.9  112.9  66.1  37.0
X = G F^T:  815.2  346.6  104.1  62.9   0.0
E:           39.4   36.6
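19 [Figure: recovery of the F^T profiles. Red: max and min bands; blue: true F^T; solutions from true and from pure initial estimates]
20 [Figure: recovery of the G profiles. Red: max and min bands; blue: true G; solutions from true and from pure initial estimates]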
21 MCR-ALS results quality assessment
Data fitting:
- lof (%)
- R2 (%)
Profiles recovery:
- r2 (similarity)
- recovery angles θ, measured by the inverse cosine of r2 and expressed in sexagesimal degrees

r2     1    0.99  0.95  0.90  0.80  0.70  0.60  0.50  0.40  0.30  0.20  0.10  0.00
θ (°)  0    8.1   18    26    37    46    53    60    66    72    78    84    90
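These quality measures can be sketched as below. The formulas are assumed to be the usual MCR-ALS definitions (lof as the relative root sum of squared residuals, R2 as the explained sum of squares, and r2 as the cosine between the true and recovered profiles); they are consistent with the numbers on these slides (lof = 14 % corresponds to R2 = 98.0 %, and r2 = 0.99 to 8.1 degrees), but the function names are hypothetical.

```python
import numpy as np

def lack_of_fit(D, D_hat):
    """lof (%) = 100 * sqrt(sum of squared residuals / sum of squared data)."""
    return 100.0 * np.sqrt(np.sum((D - D_hat) ** 2) / np.sum(D ** 2))

def explained_variance(D, D_hat):
    """R2 (%) = 100 * (1 - sum of squared residuals / sum of squared data)."""
    return 100.0 * (1.0 - np.sum((D - D_hat) ** 2) / np.sum(D ** 2))

def recovery_angle(true_profile, est_profile):
    """Angle (degrees) between two profiles, from the inverse cosine of r2."""
    a = np.asarray(true_profile, dtype=float)
    b = np.asarray(est_profile, dtype=float)
    r2 = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(r2, -1.0, 1.0)))

# Toy check: a model that reproduces 86 % of every data value
D_demo = np.ones((3, 4))
D_fit = 0.86 * D_demo
lof_demo = lack_of_fit(D_demo, D_fit)          # 14 % relative residual
r2_demo = explained_variance(D_demo, D_fit)    # 100 * (1 - 0.14**2)
angle_demo = recovery_angle([1.0, 0.0], [0.0, 1.0])   # orthogonal profiles
```

The angle is insensitive to the scale of the profiles, which suits MCR solutions whose intensities are only defined up to a normalization.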
22 No noise and homoscedastic noise cases: results
Recovery angles θ (degrees) for the loadings f1-f4 and the scores g1-g4

System      init    method  lof   R2    f1    f2    f3    f4    g1    g2    g3    g4
No noise    true    ALS     0     100   0     0     0     0     0     0     0     0
No noise    purest  ALS     0     100   1.8   11    7.9   5.0   5.9   9.1   13    2.8
max band    -       Bands   0     100   3.1   13    7.5   5.5   8.2   18    10    1.7
min band    -       Bands   0     100   2.1   3.7   3.9   3.9   5.2   8.1   14    3.0
Homo noise  true    ALS     12.6  98.4  3.0   12    8.7   2.1   4.8   12    9.0   2.4
Homo noise  purest  ALS     12.6  98.4  3.0   17    8.5   5.0   7.1   12    16    3.7
Homo noise  -----   Theor   14.0  98.0  ----  ----  ----  ----  ----  ----  ----  ----
Homo noise  -----   PCA     12.6  98.4  ----  ----  ----  ----  ----  ----  ----  ----
23 No noise and homoscedastic noise cases: results
- Only non-negativity and normalization constraints were used.
- Data fitting is perfect in the case of no noise, but solutions differ slightly when different initial estimates are used, due to rotation ambiguity.
- For environmental profiles, rotation ambiguity effects are present, giving recovery angles for the band boundaries that are always below 20 degrees.
- Data fitting in the case of homoscedastic noise reflects the noise level, although with a slight tendency to overfit.
- Rotation ambiguity effects are mixed with incipient noise propagation effects, giving slightly worse recoveries than in the case of no noise, but still with recovery angles below 20 degrees, within the feasible bands (rotation ambiguity max/min bands).
- PCA also slightly overfits, since PCA fits better than the theoretical value even for random noise (lof is also 12.6 and R2 is 98.4).
24 Y = E + X
lof (%) = 12, 25, 44; R2 (%) = 99, 94, 80; mean(S/N) = 17, 10, 3
HETEROSCEDASTIC NOISE CASE: low (L), medium (M) and high (H) noise
Noise structure: r = 5, 10, 20; S = r .* R(0,1) (random numbers in the interval 0-1); E = S .* N(0,1) (normally distributed)
SVD (largest singular values):
             1     2     3     4     5
Y (L):      814   348   111    67    33
Y (M):      829   340   118    82    64
Y (H):      823   347   154   135   130
X = G F^T:  815   347   104    63     0
E (L):       36    34
E (M):       71    69
E (H):      145   134
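The two noise structures used in the tests can be sketched as follows. This is an illustration under stated assumptions: the data are re-simulated with an arbitrary seed, the homoscedastic level is taken as 1 % of the maximum of the noise-free matrix X (the slide writes it in terms of Y), and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)                       # arbitrary seed
G = rng.lognormal(0.01, 1.0, size=(30, 4))
FT = rng.lognormal(0.01, 1.0, size=(4, 50))
X = G @ FT                                           # noise-free data

def homoscedastic_noise(X, frac, rng):
    """Same standard deviation everywhere: S = r * ones, E = S * N(0,1)."""
    r = frac * X.max()
    S = r * np.ones_like(X)
    return S, S * rng.standard_normal(X.shape)

def heteroscedastic_noise(X, r, rng):
    """Element-wise standard deviations: S = r * U(0,1), E = S * N(0,1)."""
    S = r * rng.uniform(0.0, 1.0, size=X.shape)
    return S, S * rng.standard_normal(X.shape)

S_homo, E_homo = homoscedastic_noise(X, 0.01, rng)
S_het, E_het = heteroscedastic_noise(X, 20.0, rng)   # r = 20 is the "high" level
Y_homo = X + E_homo
Y_het = X + E_het

# Theoretical lack of fit (%) if the model recovered X exactly
lof_homo = 100.0 * np.sqrt(np.sum(E_homo ** 2) / np.sum(Y_homo ** 2))
lof_het = 100.0 * np.sqrt(np.sum(E_het ** 2) / np.sum(Y_het ** 2))
```

The matrix S is what a weighted method would receive as the uncertainty estimate; in the homoscedastic case all weights are equal, so weighting changes nothing.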
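25 [Figure: recovery of the F^T profiles, no weighting. Red: max and min bands; blue: true F^T; solutions from true and from pure initial estimates]
26 [Figure: recovery of the F^T profiles, weighting. Red: max and min bands; blue: true F^T; solutions from true and from pure initial estimates]
Weighting improves recoveries.
27 [Figure: recovery of the G profiles, no weighting. Red: max and min bands; blue: true G; solutions from true and from pure initial estimates]
28 [Figure: recovery of the G profiles, weighting. Red: max and min bands; blue: true G; solutions from true and from pure initial estimates]
Weighting gives an overall improvement in recovery.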
29 Heteroscedastic noise case: results
Recovery angles θ (degrees) for the loadings f1-f4 and the scores g1-g4; lof and R2 are given with respect to the experimental data (exp)

System (case)          init    w     lof exp  R2 exp  f1    f2    f3    f4    g1    g2    g3    g4
Hetero noise (low)     purest  ALS   10.7     98.8    3.1   14    9.0   3.8   7.0   10    15    4.3
Hetero noise (low)     purest  WALS  12.0     98.6    2.6   12    15    4.3   7.8   15    15    3.7
Theoretical            ----    ----  12.0     98.6    ----  ----  ----  ----  ----  ----  ----  ----
PCA                    ----    ----  10.7     98.8    ----  ----  ----  ----  ----  ----  ----  ----
Hetero noise (medium)  purest  ALS   22.3     95.0    7.7   22    22    5.7   7.2   21    24    4.5
Hetero noise (medium)  purest  WALS  24.0     94.2    6.6   22    18    5.7   7.4   14    17    5.5
Theoretical            ----    ----  25.0     93.6    ----  ----  ----  ----  ----  ----  ----  ----
PCA                    ----    ----  22.0     95.1    ----  ----  ----  ----  ----  ----  ----  ----
Hetero noise (high)    purest  ALS   40.0     84.0    12    33    38    10    15    38    34    9.0
Hetero noise (high)    purest  WALS  43.1     81.4    12    26    25    6.0   5.0   27    16    3.0
Theoretical            ----    ----  44.2     80.4    ----  ----  ----  ----  ----  ----  ----  ----
PCA                    ----    ----  40.8     83.4    ----  ----  ----  ----  ----  ----  ----  ----
30 Heteroscedastic high noise, w = 0, c = 2
lof Dw  lof Dexp  R2 Dw    R2 Dexp  rmsdif(s)  rmsdif(c)  Niter  flag
12.15   40.0      98.5     84.0     0.00001    0.0139     201    0
31 Heteroscedastic high noise, w = 2, c = 2
lof Dw  lof Dexp  R2 Dw    R2 Dexp  rmsdif(s)  rmsdif(c)  Niter  flag
0.92    44.6      99.9916  80.0750  0.0003     0.1385     340    1
32 Heteroscedastic (non-correlated) noise case: results
- Low and medium levels of heteroscedastic noise do not seem to affect much the parameters (G and F^T) of the bilinear model estimated by ordinary MCR-ALS (without weighting).
- Worse results are obtained by MCR-ALS (without weighting) when the heteroscedastic noise contributions are high.
- In these cases, the weighting approach (MCR-AWLS) produces better estimations of the model parameters.
- A tendency to overfit is observed for unweighted MCR-ALS and PCA. This problem is only partly solved by MCR-AWLS.
- Further research is needed to check the quality of the residuals in both cases. A better behavior of the residuals is expected in the case of MCR-AWLS.
33- OUTLINE
- Introduction to Multivariate Curve Resolution by Alternating Weighted Least Squares (MCR-AWLS)
- Testing the MCR-AWLS method
- Application of the MCR-AWLS method to the resolution and apportionment of air particulate contamination sources
34 Figure 1. Geographical location of the Llodio site and plot of the samples taken during the whole year 2001
35 [Figure: plots of the measured variables; labelled variables: SO4^2-, NH4+, Ctotal, NO3-, Ca, Fe, Cl-, Zn, Na, Al2O3]
36 Plot of variables and uncertainties
[Figure panels: raw data and its scaling; uncertainties (a proportional and a constant part) and their scaling; labelled variables: SO4^2-, Ctotal, NH4+, Zn]
37 Principal Components Analysis Model
Developed: 10-Jul-2007 16:03:40.39
X-block: xscaled, 87 by 34
Included: [1-87] [1-34]
Preprocessing: None
Num. PCs: 6
Cross validation: random samples w/ 9 splits and 20 iterations
RMSEC: 0.540988
RMSECV: 4.90579

Percent Variance Captured by PCA Model
Principal   Eigenvalue   % Variance   % Variance
Component   of           Captured     Captured
Number      Cov(X)       This PC      Total
---------   ----------   ----------   ----------
1           6.81e+001    72.79        72.79
2           5.80e+000     6.21        79.00
3           3.42e+000     3.66        82.65
4           2.43e+000     2.60        85.25
5           2.04e+000     2.18        87.43
6           1.68e+000     1.80        89.23
38 Blue: ALS (R2 = 99.3); black: AWLS (R2 = 88.1) (raw data)
[Figure: resolved profiles; labelled variables, in order of appearance: Ctotal, SO4^2-, Ctotal, NH4+, Cl-, Na, SO4^2-, Zn, SO4^2-, Fe, Zn]
39 Blue: ALS; black: AWLS (scaled)
[Figure: resolved loadings for six sources: Crustal, Steel, Valley, Marine, Pigment, Traffic. Labelled variables across the panels: Ctotal, Sn, Ca, Sr, As, Zn, Fe, Cd, Pb, Ba, Ti, Na, Rb, Mg, Mo, Mn, K, Ge, SO4^2-, Cl-, Tl, NH4+, La, Co, Cr, Ni, Cu, Se]
40 Loadings comparison (correlation) considering different methods
41 Blue: ALS; black: AWLS (scaled). Scores
[Figure annotation: MCR-ALS fails]
42 Scores comparison (correlation) considering different methods
43 Conclusions: Llodio experimental data
- Scaling as a data pretreatment allows minor contributions to be distinguished better, but then whether or not uncertainties are used for weighting has a larger effect and becomes more critical.
- For major contamination sources, conclusions are similar whether or not uncertainties and weighting are considered.
- Including uncertainties has more important effects on the interpretation of minor contamination sources, especially if these uncertainties are large.
44- ACKNOWLEDGEMENTS
- P. Wentzell, Dalhousie University
- Llodio data: M. Viana and X. Querol from IJA-CSIC
- Research project CTQ-15052-C02-01
45- Recent advances in the MCR-ALS method
- Hybrid soft- hard- (grey) bilinear models (kinetic and equilibrium chemical reactions, profile response shapes, ...)
- Extension of MCR-ALS to multiway data analysis (MA-MCR-ALS including PARAFAC, Tucker3 and mixed models, ...)
- Spectroscopic image analysis using MCR-ALS
- Calculation of feasible band boundaries (rotation ambiguity)
- Error propagation in MCR-ALS solutions
- Alternating Weighted Least Squares (AWLS)
- Applications
  - Environmental: contamination sources resolution and apportionment
  - Bioanalytical: polynucleotides, proteins, DNA µ-arrays, ...
  - Analytical: hyphenated methods (LC-DAD, LC-MS, GC-MS, FIA-DAD, ...), multidimensional spectroscopies (2D-NMR, EEM, ...)
  - On-line spectroscopic monitoring of (bio)chemical processes and reactions, ...
- New user interface: http://www.ub.es/gesq/mcr/mcr.htm
- J. Jaumot et al., Chemometrics and Intelligent Laboratory Systems, 2005, 76(1), 101-110