Title: Predicting Dominant Phytoplankton Quantities in Reservoirs Using PCA-Based Artificial Neural Networks
1. Predicting Dominant Phytoplankton Quantities in Reservoirs Using PCA-Based Artificial Neural Networks
- Ersin Kivrak, Hasan Gürbüz, Hürevren Kiliç and Selçuk Soyupak
2. Outline
- Introduction
- Materials and methods
- Demirdöven Reservoir
- Sampling and measurements
- Data analysis and modeling methods
- Results and Discussion
- Conclusions
3. Introduction - Limnological Databases
- Limnological databases include:
- Water quality information
- Nutrients (P and N compounds)
- Cations (Na, K, Mg, Ca)
- Anionic properties (SO4)
- Acidity, alkalinity, pH, temperature, conductivity, dissolved oxygen
- Light penetration properties
- Location (depth) and time
- Primary productivity (Chlorophyll-a)
- Phytoplanktonic and zooplanktonic structure (species counts and dominance)
4. Introduction - Case Limnological Database for DDR
- T, DO, pH, EC, 20 cm diameter Secchi Disc transparency (SD), PO4-P, NH4-N, NO2-N, NO3-N, K, Na, Ca, Mg, SO4--, Cl- and HCO3-, Alk.
- Primary production (Chl-a), and
- Phytoplankton species counts for the families:
- Bacillariophyta
- Chlorophyta
- Euglenophyta
- Dinophyta
- Type of data: snapshot daytime data at monthly intervals
5. Introduction - Purpose of the Research
- Main purpose of this specific research:
- Quantification of the responses (Y):
- Primary production (Chl-a),
- Total phytoplankton counts, and
- Dominant phytoplankton species counts from the families Bacillariophyta, Chlorophyta, Euglenophyta and Dinophyta
- utilizing the predictors (X):
- T, DO, pH, EC, 20 cm diameter Secchi Disc transparency (SD), PO4-P, NH4-N, NO2-N, NO3-N, K, Na, Ca, Mg, SO4--, Cl- and HCO3-, Alk.
6. Introduction - Purpose of the Research
- The problem identification:
- The numbers of independent (X) and dependent (Y) variables were large,
- They displayed multicollinearity.
- The selection of predictors for primary production quantification has generally been based on data availability and common expert opinion rather than on quantitative arguments derived from the structures of the predictor and response matrices.
- We investigated the possibilities of developing a systematic approach (rather than expert judgement) for efficient interpretation and utilization of the snapshot daytime data with monthly intervals for quantification of primary productivity in a reservoir.
7. Materials and methods (1) - The selected water body
- Water body was DDR (Demirdöven Reservoir)
- A relatively small reservoir
- A reservoir for irrigation
- Start of operation: 1995
- Surface area at the maximum operation level: 1.45 km²
- Useful volume: 44.5 × 10⁶ m³
8. (No transcript: figure slide)
9. Materials and methods (2) - The selected water body
Table 1. The descriptive statistical summary for predictors
10. Materials and methods (3) - The selected water body
Table 1. The descriptive statistical summary for predictors and responses
11. Table 2. The frequencies of phytoplankton species in percentages (i.e. number of samples where the species was found / total number of samples). Total number of samples: 16.
12. Materials and methods (5) - The selected water body
- The trophic state of the reservoir:
- oligo-mesotrophic, as indicated by
- primary productivity levels,
- the dominant phytoplankton species,
- phytoplankton distribution and succession,
- the levels of nutrient concentrations, and
- the level of anthropogenic activities in the catchment area.
13. Materials and methods (6) - Available data matrices
- The data:
- Monthly from April to November (within years 2000 and 2001).
- December to March: ice cover period.
- (X): an input (predictor) matrix, 112 × 18
- 112 observations for each of the eighteen predictor variables (time (Ti), D, T, DO, Alk, pH, EC, PO4-P, NH4-N, NO2-N, NO3-N, K, Na, Ca, Mg, SO4--, Cl- and HCO3).
- (Y): a response matrix of column vectors, 112 × 7
- The target variables were total phytoplankton, Bacillariophyta, Chlorophyta, dominating phytoplankton species (Sphaerocystis schroeteri, Staurastrum longiradiatum, Cyclotella ocellata), and Chl-a concentrations.
14. Materials and methods (6) - Quantification approaches
- Base-line feedforward MLP ANN models
- Base-line feedforward MLP ANN models (double layer)
- Base-line feedforward MLP ANN models (single layer)
- MLP ANN models with PREPCA applications
- Elman Network ANN models
- Partial least squares
15. Materials and methods (7) - Data analysis and modeling methods - ANNs (1)
- Pre-processing
- Randomization
- Normalization: to increase the efficiency of training, the network inputs and targets were scaled so that they have zero mean and unit standard deviation.
- Data division: ½ of the data was utilized for training, ¼ for validation and ¼ for testing (a sketch follows below).
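- A minimal pre-processing sketch (not the authors' code), assuming z-score scaling and the ½ / ¼ / ¼ split described above; the synthetic arrays only stand in for the reservoir data:

```python
import numpy as np

# Minimal sketch: randomize, z-score normalize, then split the data
# 1/2 training, 1/4 validation, 1/4 testing. X and y are synthetic
# stand-ins for the 112 x 18 predictor matrix and one response column.

rng = np.random.default_rng(0)
X = rng.normal(size=(112, 18))                 # predictor matrix stand-in
y = rng.normal(size=(112, 1))                  # one response column

# Randomization: shuffle the observations before splitting
order = rng.permutation(len(X))
X, y = X[order], y[order]

# Normalization: zero mean and unit standard deviation per variable
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
y_mean, y_std = y.mean(axis=0), y.std(axis=0)
Xn, yn = (X - X_mean) / X_std, (y - y_mean) / y_std

# Data division: 1/2 training, 1/4 validation, 1/4 testing
n = len(Xn)
n_train, n_val = n // 2, n // 4
X_train, y_train = Xn[:n_train], yn[:n_train]
X_val, y_val = Xn[n_train:n_train + n_val], yn[n_train:n_train + n_val]
X_test, y_test = Xn[n_train + n_val:], yn[n_train + n_val:]
print(len(X_train), len(X_val), len(X_test))   # 56 28 28
```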
16. Materials and methods (8) - Data analysis and modeling methods - ANNs (2)
- Types of ANN models:
- Base-line feedforward MLP ANN models: all predictors used in model development, with a double-layer structure (5 neurons in each layer).
- Base-line feedforward MLP ANN models: all predictors used in model development, with a single-layer structure (10 neurons).
- MLP ANN models with PREPCA applications: additional ANN models were developed after reducing the dimension of the input matrix with PREPCA, by eliminating the components that contribute less than 5% of the total variation in the data set (see the sketch after this list).
- Elman Network ANN models
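- A hypothetical sketch of a PREPCA-style reduction using scikit-learn's PCA; it mirrors the idea (drop components that each explain less than 5% of the total variation), not the exact routine used in the study:

```python
import numpy as np
from sklearn.decomposition import PCA

# PREPCA-style input reduction: transform the (normalized) inputs to
# principal components and keep only the components that each explain
# at least 5% of the total variation in the data set.

rng = np.random.default_rng(1)
Xn = rng.normal(size=(112, 18))                 # already-normalized inputs

pca = PCA()                                     # full eigen-decomposition
scores = pca.fit_transform(Xn)                  # principal component scores

keep = pca.explained_variance_ratio_ >= 0.05    # components with >= 5% share
X_reduced = scores[:, keep]                     # reduced ANN input matrix
print(f"kept {keep.sum()} of {Xn.shape[1]} components")
```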
17. Materials and methods (9) - Data analysis and modeling methods - ANNs (2)
- Training methods
- 1) The BFGS algorithm, based on the quasi-Newton (or secant) method, was employed in training the MLP NNs. The method is based on Newton's method of optimization, which employs the basic step given by the equation
- x_{k+1} = x_k - A_k^{-1} g_k
- A_k: the second derivatives (Hessian matrix) of the performance index at the current values of the weights and biases,
- x_k: the vector of current weights and biases,
- g_k: the current gradient.
- The quasi-Newton method eliminates the calculation of second derivatives; instead, it updates an approximate Hessian matrix at each iteration of the algorithm. The update is computed as a function of the gradient (an illustrative iteration is sketched below).
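- An illustrative quasi-Newton iteration for the step above, with a BFGS update of the Hessian approximation; the toy quadratic only stands in for the ANN performance index, and this is not the actual network training code:

```python
import numpy as np

# Quasi-Newton step x_{k+1} = x_k - A_k^{-1} g_k, where A_k is a BFGS
# approximation of the Hessian updated from gradient changes rather than
# computed from second derivatives.

Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x            # toy performance index
grad = lambda x: Q @ x                   # its gradient g_k

x = np.array([3.0, -2.0])                # current weights/biases vector x_k
A = np.eye(2)                            # initial Hessian approximation A_0

for k in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-8:         # close enough to the minimum
        break
    step = -np.linalg.solve(A, g)        # the step -A_k^{-1} g_k
    t = 1.0
    while f(x + t * step) > f(x):        # crude backtracking to keep descent
        t *= 0.5
    x_new = x + t * step
    s, yv = x_new - x, grad(x_new) - g   # parameter and gradient changes
    A = (A - np.outer(A @ s, A @ s) / (s @ A @ s)
           + np.outer(yv, yv) / (yv @ s))    # BFGS update of A_k
    x = x_new

print(x, f(x))                           # approaches the minimum at the origin
```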
18. Materials and methods (10) - Data analysis and modeling methods - ANNs (2)
- Training methods
- 2) As an alternative: the Elman Network.
- Elman networks differ from the conventional two-layer feedforward MLP ANN models in that the first layer has a recurrent connection.
- The addition of a feedback connection from the output of the hidden layer to its input enables the network to recognize both temporal and spatial patterns (sketched below).
- On the other hand, it is a known fact that, for the best chance at the learning problem, an Elman network needs more hidden neurons in its hidden layer than the other models.
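- A hypothetical sketch of an Elman-style recurrent layer: the hidden layer receives the current inputs plus its own previous output (the context), which is what distinguishes it from a plain feedforward layer. Dimensions and weights are illustrative, not the study's fitted model:

```python
import numpy as np

# Elman-style forward pass: the hidden state h is fed back into the
# hidden layer at the next time step (the "context" connection).

rng = np.random.default_rng(1)
n_in, n_hid = 18, 12                       # 18 predictors, 12 hidden neurons

W_in = 0.1 * rng.normal(size=(n_hid, n_in))    # input -> hidden weights
W_rec = 0.1 * rng.normal(size=(n_hid, n_hid))  # hidden -> hidden (recurrent)
w_out = 0.1 * rng.normal(size=n_hid)           # hidden -> output weights
b_h = np.zeros(n_hid)
b_o = 0.0

def elman_forward(X_seq):
    """Run a sequence of monthly input vectors through the Elman layer."""
    h = np.zeros(n_hid)                    # context starts at zero
    outputs = []
    for x in X_seq:
        h = np.tanh(W_in @ x + W_rec @ h + b_h)   # recurrent hidden update
        outputs.append(w_out @ h + b_o)           # linear output neuron
    return np.array(outputs)

X_seq = rng.normal(size=(8, n_in))         # e.g. April-November snapshots
print(elman_forward(X_seq))
```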
19. Materials and methods (11) - Data analysis and modeling methods - ANNs (3)
- Cross-validation studies
- When training was complete, the simulation results were de-normalized by reversing the scaling, both for applications with and without PREPCA.
- Over-fitting was addressed by the method of regularization.
- R and RMSE were used for performance and model precision evaluation (see the sketch below).
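- A minimal sketch, with illustrative numbers, of the de-normalization step and the R / RMSE evaluation:

```python
import numpy as np

# Reverse the z-score scaling of the network output, then compute the
# correlation coefficient R and the RMSE against the observations.

def denormalize(y_scaled, mean, std):
    return y_scaled * std + mean          # reverse zero-mean / unit-std scaling

def r_and_rmse(y_obs, y_pred):
    r = np.corrcoef(y_obs, y_pred)[0, 1]  # Pearson correlation coefficient R
    rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
    return r, rmse

y_obs = np.array([2.1, 3.4, 1.8, 4.0, 2.9])      # illustrative observations
mean, std = y_obs.mean(), y_obs.std()
y_net = np.array([-0.6, 0.7, -1.1, 1.3, -0.1])   # normalized ANN output
y_pred = denormalize(y_net, mean, std)
print(r_and_rmse(y_obs, y_pred))
```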
20. Materials and methods (12) - Data analysis and modeling methods - PLS (1)
- Since the data have a multicollinearity problem, we considered the PLS method as an alternative predictive tool to MLP ANNs.
- PLS: a predictive technique for handling many independent variables, even when these display multicollinearity.
- Advantages:
- Handles multiple dependents as well as multiple independents.
- Ability to handle multicollinearity among the independents.
- Creates independent latents directly on the basis of cross products involving the response variables.
- Makes stronger predictions.
- Disadvantage:
- Difficulty of interpreting the loadings of the independent latent variables.
21. Materials and methods (13) - Data analysis and modeling methods - PLS (2)
- Partial least squares (PLS)
- The PLS method was applied as an alternative approach for the following basic reasons:
- 1) The predictor matrix had 18 components, which can be considered a high number,
- 2) Some of these components were correlated, and finally
- 3) The responses were many (7 in our case) and some were also correlated.
22. Materials and methods (14) - Data analysis and modeling methods - PLS (3)
- PLS regression specifically examines both predictor and response matrices to find latent vectors that explain as much as possible of the covariance between predictors and responses.
- The adopted PLS algorithm generates a sequence of models, where each consecutive model contains one additional component.
- The cross-validation step determines the number of components that minimizes the prediction error. During this research, a leave-one-observation-out methodology was adopted. The utilized algorithm selects the model with the number of components that produces the highest predicted correlation coefficient (R); a sketch follows below.
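- A hypothetical sketch (not the authors' software) of selecting the number of PLS components by leave-one-out cross-validation, keeping the model whose cross-validated predictions give the highest correlation R; the data are synthetic:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Choose the number of PLS components by leave-one-out cross-validation,
# scoring each candidate model by the predicted correlation coefficient R.

rng = np.random.default_rng(3)
X = rng.normal(size=(112, 18))                 # stand-in predictor matrix
y = X[:, :3] @ np.array([1.0, -0.5, 0.3]) + 0.2 * rng.normal(size=112)

best_r, best_k = -np.inf, None
for k in range(1, 11):                          # candidate component counts
    pls = PLSRegression(n_components=k)
    y_cv = cross_val_predict(pls, X, y, cv=LeaveOneOut()).ravel()
    r = np.corrcoef(y, y_cv)[0, 1]              # predicted correlation R
    if r > best_r:
        best_r, best_k = r, k

print(f"selected {best_k} components, predicted R = {best_r:.3f}")
```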
23. Materials and methods (14) - Data analysis and modeling methods - PLS (4): Model fitting
- The nonlinear iterative partial least squares (NIPALS) algorithm developed by Herman Wold has been adopted.
- PLS reduces the number of predictors by extracting uncorrelated components based on the covariance between the predictor and response variables (Frank and Kowalski).
- The PLS algorithm produces a sequence of models, where each consecutive model contains one additional component. Components are calculated one at a time, starting with the standardized x- and y-matrices. Subsequent components are calculated from the x- and y-residual matrices; iterations stop upon reaching the maximum number of components or when the x-residuals become the zero matrix (a sketch of the component extraction follows below).
- Cross-validation is used to identify the number of components that minimizes prediction error.
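- A hypothetical sketch of extracting PLS components with a NIPALS-style loop for a single response: each component's weights come from the X-y covariance, and X and y are deflated before the next component. It illustrates the idea, not the exact algorithm used in the study:

```python
import numpy as np

# NIPALS-style PLS1 component extraction with deflation of the residuals.

def nipals_pls1(X, y, n_components):
    X = X - X.mean(axis=0)                 # work on centered copies
    y = y - y.mean()
    scores, weights = [], []
    for _ in range(n_components):
        w = X.T @ y                        # covariance-based weight vector
        w /= np.linalg.norm(w)
        t = X @ w                          # component scores
        p = X.T @ t / (t @ t)              # X loadings
        q = y @ t / (t @ t)                # y loading
        X = X - np.outer(t, p)             # deflate X residuals
        y = y - q * t                      # deflate y residuals
        scores.append(t); weights.append(w)
        if np.allclose(X, 0):              # stop if X residuals vanish
            break
    return np.column_stack(scores), np.column_stack(weights)

rng = np.random.default_rng(4)
X = rng.normal(size=(112, 18))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=112)
T, W = nipals_pls1(X, y, n_components=3)
print(T.shape, W.shape)                    # (112, 3), (18, 3)
```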
24. RESULTS AND CONCLUSIONS
25. DOMINATING SPECIES
26. Table 2. The frequencies of phytoplankton species in percentages
27. Conclusions - MLP ANNs (1)
- Feed-forward MLP back-propagation ANNs utilizing the quasi-Newton algorithm with the full predictor matrix yielded the best results for the variables Bacillariophyta, Chlorophyta and Staurastrum longiradiatum.
- Double-layer structures yielded better results compared to the single-layer structure utilizing the same number of neurons.
- ANN models with PREPCA applications did not bring additional precision or better performance compared to standard ANN modeling with the full predictor matrices, since 14 of the 18 components were identified as responsible for 95% of the variability in an eigenvalue analysis.
28. Conclusions - MLP ANNs (2)
- The signs and magnitudes of the coefficients of the principal components (PCs) indicate that all the variables of the input matrix have different effect levels on different PCs; therefore, their inclusion is justifiable.
29. Feed-forward MLP ANNs: double hidden layer 5-5-1
30. Feed-forward MLP ANNs cross-validation study: double hidden layer 5-5-1
31. Eigenvalue structure of principal components of the input data matrix
32. Figure: The coefficients for principal component scores for three input variables and 10 PCs
33. Figure: The coefficients for principal component scores for three input variables and 10 PCs
34. Results and Conclusions - PLS
- The simple statistical PLS analysis delineates the relative importance of the predictors on the quantities of the responses.
- PLS modeling using a single response at a time gave better results compared to the multivariate (multi-response) modeling approach.
- The PLS modeling approach with single-response modeling yielded the best results for four of the variables: Log10 (total phytoplankton count), Chl-a, Cyclotella ocellata and Sphaerocystis schroeteri.
35. Table: The first five principal components for each response, considering absolute magnitudes of standardized regression coefficients
36. (No transcript: figure slide)
37. (No transcript: figure slide)
38. Rankings of models based on performance and precision criteria
39. Table: Rankings of models based on performance and precision criteria
40. Table: Rankings of models based on performance and precision criteria
41. Thank you for your time and patience!
- Comments and suggestions are welcome!
42. PLS (Additional-1)
- Predictive technique for handling many independent variables, even when these display multicollinearity.
- Advantages:
- Handles multiple dependents as well as multiple independents.
- Ability to handle multicollinearity among the independents.
- Creates independent latents directly on the basis of cross products involving the response variables.
- Makes stronger predictions.
- Disadvantage:
- Difficulty of interpreting the loadings of the independent latent variables.
43. PLS (Additional-2)
- Cross-validation procedure
- For each potential model:
- 1. Omit one observation or group of observations, depending on the cross-validation method used.
- 2. Recalculate the model without the omitted observation/group of observations.
- 3. Predict the response (the cross-validated fitted value) for the omitted observation/group of observations using the recalculated model, and calculate the cross-validated residual value.
- 4. Repeat steps 1-3 until all observations have been omitted and fitted.
- 5. Calculate the prediction sum of squares (PRESS) and predicted R2 values (see the sketch below).
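- A hypothetical sketch of steps 1-5 with synthetic data, using ordinary least squares as a stand-in model for brevity: the leave-one-out cross-validated residuals give PRESS, and predicted R2 follows as 1 - PRESS / SS_total:

```python
import numpy as np

# Steps 1-4: omit each observation, refit, predict it, collect the
# cross-validated residuals. Step 5: compute PRESS and predicted R2.

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))                     # small synthetic data set
y = X @ np.array([1.0, -0.5, 0.3]) + 0.2 * rng.normal(size=40)

n = len(y)
y_cv = np.empty(n)
for i in range(n):                               # leave-one-out loop
    mask = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    y_cv[i] = X[i] @ beta                        # cross-validated fitted value

press = np.sum((y - y_cv) ** 2)                  # prediction sum of squares
predicted_r2 = 1.0 - press / np.sum((y - y.mean()) ** 2)
print(f"PRESS = {press:.3f}, predicted R2 = {predicted_r2:.3f}")
```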
44. PLS (Additional-3)
- After performing steps 1-5 for each model, Minitab selects the model with the number of components that produces the highest predicted R2 and lowest PRESS. With multiple response variables, Minitab selects the model with the highest average predicted R2 and lowest average PRESS.