The weighting of samples in multivariate regression model updating

X. Capron (a), B. Walczak (a,b), O.E. de Noord (c), D.L. Massart (a)

(a) ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium
(b) On leave from Silesian University, 9 Szkolna Street, 40-006 Katowice, Poland
(c) Shell International Chemicals B.V., Shell Research and Technology Centre, Amsterdam, P.O. Box 38000, 1030 BN Amsterdam, The Netherlands
Why update a regression model?
The initial PLS regression model built on Cluster I has an optimal complexity of 4 factors, and the errors of prediction assessed on the independent test sets before the update are RMSEP0 = 0.17 and RMSEPnew = 3.25. RMSEPnew for new incoming samples is unacceptable and underlines the need for an update of the model (see Fig. 2).
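
For reference (this aside is not part of the original poster), RMSEP is the root-mean-square difference between measured and predicted y values over an independent test set; a minimal sketch in Python:

    import numpy as np

    def rmsep(y_measured, y_predicted):
        # Root-mean-square error of prediction over an independent test set.
        y_measured = np.asarray(y_measured, dtype=float)
        y_predicted = np.asarray(y_predicted, dtype=float)
        return np.sqrt(np.mean((y_measured - y_predicted) ** 2))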
A multivariate regression model relates a property of interest y, such as a concentration, to independent variables X. Mathematically, a regression model can be formulated as

y = Xb + e

where b is the regression vector and e is the vector of residuals, i.e. the part of y not explained by the model. However, a model is only valid for samples well represented by the data used during the calibration step, i.e. for samples fitting the calibration domain of the model.
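
As a purely illustrative sketch in Python (not the authors' code), such a model can be fitted with, for example, scikit-learn's PLSRegression; the data here are random stand-ins with the poster's dimensions:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(75, 200))   # hypothetical calibration spectra
    y = rng.normal(size=75)          # hypothetical property of interest

    # y = Xb + e, with b estimated on 4 latent variables as in the poster
    pls = PLSRegression(n_components=4).fit(X, y)
    b = pls.coef_                    # estimated regression coefficients
    e = y - pls.predict(X).ravel()   # residuals: the part of y not explained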
Fig. 2. y predicted vs y measured for the initial
PLS model. () stands for test samples falling
into the original calibration domain (cluster I),
and (o) represents test samples carrying new
sources of variance (cluster II).
[Schematic: the calibration domain, with sample A (high leverage) and sample B (high X residuals) lying outside it]
For instance, samples A and B are prediction outliers, since they do not fit into the calibration domain. As a consequence, the model should not be used to predict their properties of interest, yA and yB. To predict the properties of interest of such outlying samples, the model must be updated to deal with the specificities of these particular objects.
How many samples? How to select them?
First, the influence of the number of samples added to the calibration set on RMSEP0 and RMSEPnew is studied. For each selection method, 1 to 8 samples are added to the calibration set without any weighting, and the predictive ability of the model is evaluated. The evolution of RMSEPnew (Fig. 3) shows that very few samples are necessary to perform an efficient update of the model. Whatever the selection method, 3 samples are enough to decrease RMSEPnew from 3.25 to approximately 0.3; adding more samples does not improve results further. The Kennard and Stone and Duplex algorithms deliver similar performance, while selecting samples according to their Mahalanobis distance yields a less satisfactory update when only one or two samples are added. Therefore, the representativeness of the samples is to be preferred over the selection of the most extreme samples.
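
For illustration only, a minimal Python sketch of the Kennard and Stone idea as it is usually described (start from the two most distant samples, then repeatedly add the sample whose nearest already selected sample is farthest away); this paraphrases the published algorithm, not the authors' implementation:

    import numpy as np

    def kennard_stone(X, n_select):
        # Select n_select representative rows of X (Euclidean distances).
        X = np.asarray(X, dtype=float)
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        selected = list(np.unravel_index(np.argmax(d), d.shape))  # two most distant samples
        while len(selected) < n_select:
            remaining = [i for i in range(len(X)) if i not in selected]
            # distance of each candidate to its nearest selected sample
            nearest = d[np.ix_(remaining, selected)].min(axis=1)
            selected.append(remaining[int(np.argmax(nearest))])
        return selected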
Sample A has a high leverage. Samples of this kind do not carry uncalibrated spectral features, i.e. new sources of variance; nevertheless, the regression model cannot be extrapolated without risk. If the model is only slightly nonlinear outside the calibration domain, an update of the model is possible; if the nonlinearity is too strong, a local model should be developed.
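
Leverage can be screened from the sample scores; the sketch below assumes the common definition h_i = 1/n + t_i'(T'T)^(-1)t_i, where T holds the calibration scores (notation and function name mine, not the poster's):

    import numpy as np

    def leverage(T_cal, T_new=None):
        # Leverage of samples from their score vectors (n_samples x n_factors).
        T_cal = np.asarray(T_cal, dtype=float)
        G = np.linalg.inv(T_cal.T @ T_cal)
        T = T_cal if T_new is None else np.asarray(T_new, dtype=float)
        return 1.0 / len(T_cal) + np.einsum('ij,jk,ik->i', T, G, T)

A common screening rule flags samples whose leverage exceeds two or three times the average leverage of the calibration set.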
Fig. 3. RMSEPnew vs number of added samples. (o),
(?), (?) represent results for samples selected
by the KS algorithm, the Duplex algorithm and
according to the MD of samples from the initial
calibration set, respectively.
The weighting scheme
Sample B has high X residuals, which are due to uncalibrated spectral features resulting from the presence of uncalibrated sources of variance. Two different approaches are possible. If the new spectral features are important for predicting yB, the new information they carry must be included in the model, i.e. the model must be updated. If the new information is not related to the property of interest, either the influence of the disturbance is corrected by means of preprocessing, or this new variability is calibrated, which requires an update of the model. Unfortunately, correcting disturbances is a very difficult problem, and the update should be considered the first choice in this situation.
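
High X residuals such as sample B's are commonly quantified by the squared reconstruction error of a (centred) spectrum after projection onto the model loadings; a hedged sketch under that assumption, with P a hypothetical loading matrix (n_variables x n_factors):

    import numpy as np

    def x_residual(x_new, P):
        # Squared X residual: large values signal uncalibrated spectral features.
        x_new = np.asarray(x_new, dtype=float)
        P = np.asarray(P, dtype=float)
        t = np.linalg.lstsq(P, x_new, rcond=None)[0]  # project onto the model space
        return float(np.sum((x_new - P @ t) ** 2))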
The possibility of using the weighting scheme to update a model with fewer samples is investigated, i.e. the predictive ability of a model updated with only one or two samples, which are given a higher weight, is compared with that of a model updated with more samples. Likewise, it is investigated whether the weighting of samples can improve the predictive power of a model updated with a sufficient number of samples, i.e. 3 samples in this case.
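
Since the weighting scheme studied here includes several copies of the selected samples in the data set (see "Is weighting of samples necessary or helpful?" below), an integer weight amounts to replicating rows; a minimal Python sketch with hypothetical array names:

    import numpy as np

    def weight_by_replication(X0, y0, X_new, y_new, weight):
        # Append each new sample `weight` times to the old calibration set.
        X_rep = np.repeat(X_new, weight, axis=0)
        y_rep = np.repeat(y_new, weight, axis=0)
        return np.vstack([X0, X_rep]), np.concatenate([y0, y_rep])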
Conclusion: to predict the property of interest of samples A and B, an update of the model is required.
How to perform the update?
Model updating consists of including the new sources of data variance into the model, i.e. adding some outlying samples Xnew to the old calibration set X0.
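
A sketch of this augmentation step with random stand-in arrays and a refit at the same 4-factor complexity (illustrative only, not the authors' code):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X0, y0 = rng.normal(size=(75, 200)), rng.normal(size=75)    # old calibration set
    Xnew, ynew = rng.normal(size=(3, 200)), rng.normal(size=3)  # selected new samples

    X_upd = np.vstack([X0, Xnew])        # augmented calibration spectra
    y_upd = np.concatenate([y0, ynew])   # augmented reference values
    pls_upd = PLSRegression(n_components=4).fit(X_upd, y_upd)   # refitted model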
Fig. 4. RMSEPnew as a function of the weight
applied to a) one sample c) three samples added
to the calibration set. (o), (?), (?) represent
results for new samples selected by the KS
algorithm, the Duplex algorithm and according to
their MD from the initial calibration set,
respectively.
Results from Fig. 4 show that the weighting scheme does not seem useful in the context of updating. Independently of the selection method, increasing the weight of one sample does not have any significant influence on RMSEPnew. When 3 samples are weighted, no clear tendency can be observed: although the weighting perceptibly improves the error of prediction with the Duplex approach, it is clearly not useful when samples are selected with the Kennard and Stone algorithm or according to their distance from the initial calibration set.
[Schematic: the calibration set before and after the update procedure]
The problem of validation
  • Two constraints have to be fulfilled to perform an efficient update. First, the error of prediction for samples fitting into the calibration domain (RMSEP0) should not be significantly higher after the update than before. Second, the error of prediction for new incoming samples (RMSEPnew) should be decreased as much as possible. In the same way, updating a model by recalibration with few new samples raises questions that must be carefully investigated:
  • How many samples are necessary? It can be expensive and time-consuming to collect and analyse new samples for the update; therefore, the number of samples necessary to achieve an efficient update must be optimised.
  • How to select the samples? Most of the time, only the spectra of the new samples are available, and the selection of the samples carrying new sources of variance must rely only on this information. Three different selection approaches are investigated, namely the Kennard and Stone algorithm, the Duplex algorithm, and a method based on the Mahalanobis distance. The first two approaches are known to select representative samples from a data set, while the third selects the samples most different from the initial calibration set.
  • Is weighting of samples necessary or helpful? Because only a few samples are added to the initial calibration set, it seems logical to balance the importance of old and new data by giving a higher weight to the new samples. The approach investigated here is the weighting scheme proposed by Stork and Kowalski [1], in which several copies of the selected samples are included in the data set.
  • How to validate the updated model? The expected error of prediction must be assessed again after the update. For PCR and PLS models, the complexity, i.e. the number of factors necessary, has to be optimised.

For PCR and PLS regression approaches, the optimal number of latent variables is usually estimated by cross-validation. When only a few samples are used to update the model, this approach might not be appropriate: if one of the updating samples is left out, the resulting model gives an unacceptable error of prediction for that sample, which makes the cross-validation error unreliable. Alternative approaches such as the Akaike information criterion [2], the ICOMP criterion [3], or the H-error [4] have to be investigated. In practice, no samples are available to estimate RMSEPnew, and hence to assess the predictive ability of the updated model. This issue is important, since a model should never be used if its expected error of prediction is unknown. However, no solution is available so far, and this problem has to be studied in more detail.
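
For context, a minimal sketch of the usual cross-validation for choosing the PLS complexity (the very procedure whose limitation with few update samples is discussed above); the function name and defaults are mine:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_predict

    def choose_n_factors(X, y, max_factors=10, cv=10):
        # Pick the number of latent variables minimising the cross-validated RMSE.
        errors = []
        for a in range(1, max_factors + 1):
            y_cv = cross_val_predict(PLSRegression(n_components=a), X, y, cv=cv).ravel()
            errors.append(np.sqrt(np.mean((np.asarray(y, dtype=float) - y_cv) ** 2)))
        return int(np.argmin(errors)) + 1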
The updated model
After the addition to the calibration set of 3 unweighted new samples selected with the Kennard and Stone algorithm, an efficient update is achieved. The complexity of the new model is estimated at 4 PLS factors, and the errors of prediction estimated on the independent test sets are RMSEP0 = 0.22 and RMSEPnew = 0.28. The error of prediction for samples falling into the initial calibration domain increased slightly, from 0.17 to 0.22; nonetheless, the 91% improvement observed for RMSEPnew justifies the update of the model.
Fig. 5. y predicted vs y measured for the updated
PLS model. () and (o) stand for test samples
inside and outside of the initial calibration
domain, respectively.
The data set
A data set of NIR spectra is used to investigate the updating of a PLS regression model. The independent variables are the near-infrared spectra of polyols, while the dependent variable is the corresponding hydroxyl number. The data are split into two main clusters (see Fig. 1). Cluster I is split into 75 samples used to calibrate the initial PLS model and 22 test samples used to evaluate RMSEP0, while Cluster II is split into 40 samples available for the PLS model update and 31 test samples used to evaluate the error of prediction for new incoming samples, RMSEPnew.
Conclusions
  • Few samples are necessary to efficiently update an existing regression model
  • Samples should be selected according to their representativeness of the new source of variance
  • The weighting scheme does not appear to be useful in the updating context
  • Validation of the updated model is problematic and should be investigated further

[1] C.L. Stork, B.R. Kowalski, Weighting schemes for updating regression models - a theoretical approach, Chemometrics and Intelligent Laboratory Systems 48 (1999) 151-166.
[2] H. Akaike, Information theory and an extension of the maximum likelihood principle, in: 2nd International Symposium on Information Theory, Akademiai Kiado, Budapest, 1973, pp. 267-281.
[3] H. Bozdogan, ICOMP: A new model selection criterion, in: Classification and Related Methods of Data Analysis, North-Holland, Amsterdam, pp. 599-608.
[4] A. Höskuldsson, The H-principle in modelling with applications to chemometrics, Chemometrics and Intelligent Laboratory Systems 14 (1992) 139-153.
Fig. 1. PLS scores, factor 1 vs factor 2. () and
(x) stand for calibration samples and samples
available for the update, () and (o) are
independent test samples used to compute RMSEP0
and RMSEPnew, respectively.