Title: Predicting Dominant Phytoplankton Quantities in Reservoirs Using PCA-Based Artificial Neural Networks
1. Predicting Dominant Phytoplankton Quantities in Reservoirs Using PCA-Based Artificial Neural Networks
- Ersin Kivrak, Hasan Gürbüz, Hürevren Kiliç and Selçuk Soyupak
2. Outline
- Introduction
- Materials and methods
- Demirdöven Reservoir
- Sampling and measurements
- Data analysis and modeling methods
- Results and Discussion
- Conclusions
3. Introduction - Limnological Databases
- Limnological databases include:
- Water quality information
- Nutrients (P and N compounds)
- Cations (Na, K, Mg, Ca)
- Anionic properties (SO4)
- Acidity, alkalinity, pH, temperature, conductivity, dissolved oxygen
- Light penetration properties
- Location (depth) and time
- Primary productivity (Chlorophyll-a)
- Phytoplanktonic and zooplanktonic structure (species counts and dominance)
4. Introduction - Case Limnological Database for DDR
- T, DO, pH, EC, 20 cm diameter Secchi Disc transparency (SD), PO4-P, NH4-N, NO2-N, NO3-N, K, Na, Ca, Mg, SO4--, Cl- and HCO3-, Alk.
- Primary production (Chl-a), and
- Phytoplankton species counts for the families:
- Bacillariophyta
- Chlorophyta
- Euglenophyta
- Dinophyta
- Type of data: snapshot daytime data at monthly intervals
5. Introduction - Purpose of the Research
- Main purpose of this specific research:
- Quantification of the responses (Y):
- Primary production (Chl-a),
- Total phytoplankton counts, and
- Dominant phytoplankton species counts from the families Bacillariophyta, Chlorophyta, Euglenophyta and Dinophyta
- utilizing the predictors (X):
- T, DO, pH, EC, 20 cm diameter Secchi Disc transparency (SD), PO4-P, NH4-N, NO2-N, NO3-N, K, Na, Ca, Mg, SO4--, Cl- and HCO3-, Alk.
6. Introduction - Purpose of the Research
- The problem identification:
- The numbers of independent (X) and dependent (Y) variables were large,
- They displayed multicollinearity.
- The selection of predictors for primary production quantification has generally been based on data availability and common expert opinion rather than on quantitative arguments derived from the structures of the predictor and response matrices.
- We investigated the possibilities of developing a systematic approach (rather than expert judgement) for efficient interpretation and utilization of the snapshot daytime data with monthly intervals for quantification of primary productivity in a reservoir.
7. Materials and methods (1) - The selected water body
- Water body was DDR (Demirdöven Reservoir)
- A relatively small reservoir
- A reservoir for irrigation
- Start of operation: 1995
- Surface area at the maximum operation level: 1.45 km²
- Useful volume: 44.5 × 10⁶ m³
8. (No transcript: figure slide)
9. Materials and methods (2) - The selected water body
Table 1. The descriptive statistical summary for predictors
10. Materials and methods (3) - The selected water body
Table 1. The descriptive statistical summary for predictors and responses
11. Table 2. The frequencies of phytoplankton species in percentages (i.e. number of samples where the species was found / total number of samples). Total number of samples: 16.
12. Materials and methods (5) - The selected water body
- The trophic state of the reservoir:
- oligo-mesotrophic, as indicated by
- primary productivity levels,
- the dominant phytoplankton species,
- phytoplankton distribution and succession,
- the levels of nutrient concentrations, and
- the level of anthropogenic activities in the catchment area.
13. Materials and methods (6) - Available data matrices
- The data:
- Monthly from April to November (within years 2000 and 2001).
- December to March: ice cover period.
- (X): an input (predictor) matrix, 112 × 18
- 112 observations for each of the eighteen predictor variables (time (Ti), D, T, DO, Alk, pH, EC, PO4-P, NH4-N, NO2-N, NO3-N, K, Na, Ca, Mg, SO4--, Cl- and HCO3).
- (Y): a response matrix of column vectors, 112 × 7
- The target variables were total phytoplankton, Bacillariophyta, Chlorophyta, dominating phytoplankton species (Sphaerocystis schroeteri, Staurastrum longiradiatum, Cyclotella ocellata), and Chl-a concentrations.
14. Materials and methods (6) - Quantification approaches
- Base-line feedforward MLP ANN models
- Base-line feedforward MLP ANN models (double layer)
- Base-line feedforward MLP ANN models (single layer)
- MLP ANN models with PREPCA applications
- Elman Network ANN models
- Partial least squares
15. Materials and methods (7) - Data analysis and modeling methods - ANNs (1)
- Pre-processing
- Randomization
- Normalization: to increase the efficiency of training, the network inputs and targets were scaled so that they have zero mean and unit standard deviation.
- Data division: ½ of the data was utilized for training, ¼ for validation and ¼ for testing (a sketch follows below).
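- A minimal pre-processing sketch (not the authors' code), assuming z-score scaling and the ½ / ¼ / ¼ split described above; the synthetic arrays only stand in for the reservoir data:

```python
import numpy as np

# Minimal sketch: randomize, z-score normalize, then split the data
# 1/2 training, 1/4 validation, 1/4 testing. X and y are synthetic
# stand-ins for the 112 x 18 predictor matrix and one response column.

rng = np.random.default_rng(0)
X = rng.normal(size=(112, 18))                 # predictor matrix stand-in
y = rng.normal(size=(112, 1))                  # one response column

# Randomization: shuffle the observations before splitting
order = rng.permutation(len(X))
X, y = X[order], y[order]

# Normalization: zero mean and unit standard deviation per variable
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
y_mean, y_std = y.mean(axis=0), y.std(axis=0)
Xn, yn = (X - X_mean) / X_std, (y - y_mean) / y_std

# Data division: 1/2 training, 1/4 validation, 1/4 testing
n = len(Xn)
n_train, n_val = n // 2, n // 4
X_train, y_train = Xn[:n_train], yn[:n_train]
X_val, y_val = Xn[n_train:n_train + n_val], yn[n_train:n_train + n_val]
X_test, y_test = Xn[n_train + n_val:], yn[n_train + n_val:]
print(len(X_train), len(X_val), len(X_test))   # 56 28 28
```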
16. Materials and methods (8) - Data analysis and modeling methods - ANNs (2)
- Types of ANN models:
- Base-line feedforward MLP ANN models: all predictors used in model development, with a double-layer structure (5 neurons in each layer).
- Base-line feedforward MLP ANN models: all predictors used in model development, with a single-layer structure (10 neurons).
- MLP ANN models with PREPCA applications: additional ANN models were developed after reducing the dimension of the input matrix with PREPCA, by eliminating the components that contribute less than 5% of the total variation in the data set (see the sketch after this list).
- Elman Network ANN models
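- A hypothetical sketch of a PREPCA-style reduction using scikit-learn's PCA; it mirrors the idea (drop components that each explain less than 5% of the total variation), not the exact routine used in the study:

```python
import numpy as np
from sklearn.decomposition import PCA

# PREPCA-style input reduction: transform the (normalized) inputs to
# principal components and keep only the components that each explain
# at least 5% of the total variation in the data set.

rng = np.random.default_rng(1)
Xn = rng.normal(size=(112, 18))                 # already-normalized inputs

pca = PCA()                                     # full eigen-decomposition
scores = pca.fit_transform(Xn)                  # principal component scores

keep = pca.explained_variance_ratio_ >= 0.05    # components with >= 5% share
X_reduced = scores[:, keep]                     # reduced ANN input matrix
print(f"kept {keep.sum()} of {Xn.shape[1]} components")
```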
17. Materials and methods (9) - Data analysis and modeling methods - ANNs (2)
- Training methods
- 1) The BFGS algorithm, based on the quasi-Newton (or secant) method, was employed in training the MLP NNs. The method is based on Newton's method of optimization, which employs the basic step given by the equation
- x_{k+1} = x_k - A_k^{-1} g_k
- A_k: the second derivatives (Hessian matrix) of the performance index at the current values of the weights and biases,
- x_k: the vector of current weights and biases,
- g_k: the current gradient.
- The quasi-Newton method eliminates the calculation of second derivatives; instead, it updates an approximate Hessian matrix at each iteration of the algorithm. The update is computed as a function of the gradient (an illustrative iteration is sketched below).
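- An illustrative quasi-Newton iteration for the step above, with a BFGS update of the Hessian approximation; the toy quadratic only stands in for the ANN performance index, and this is not the actual network training code:

```python
import numpy as np

# Quasi-Newton step x_{k+1} = x_k - A_k^{-1} g_k, where A_k is a BFGS
# approximation of the Hessian updated from gradient changes rather than
# computed from second derivatives.

Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x            # toy performance index
grad = lambda x: Q @ x                   # its gradient g_k

x = np.array([3.0, -2.0])                # current weights/biases vector x_k
A = np.eye(2)                            # initial Hessian approximation A_0

for k in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-8:         # close enough to the minimum
        break
    step = -np.linalg.solve(A, g)        # the step -A_k^{-1} g_k
    t = 1.0
    while f(x + t * step) > f(x):        # crude backtracking to keep descent
        t *= 0.5
    x_new = x + t * step
    s, yv = x_new - x, grad(x_new) - g   # parameter and gradient changes
    A = (A - np.outer(A @ s, A @ s) / (s @ A @ s)
           + np.outer(yv, yv) / (yv @ s))    # BFGS update of A_k
    x = x_new

print(x, f(x))                           # approaches the minimum at the origin
```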
18. Materials and methods (10) - Data analysis and modeling methods - ANNs (2)
- Training methods
- 2) As an alternative: the Elman Network.
- Elman networks differ from the conventional two-layer feedforward MLP ANN models in that the first layer has a recurrent connection.
- The addition of a feedback connection from the output of the hidden layer to its input enables the network to recognize both temporal and spatial patterns (sketched below).
- On the other hand, it is a known fact that, for the best chance at the learning problem, an Elman network needs more hidden neurons in its hidden layer than the other models.
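- A hypothetical sketch of an Elman-style recurrent layer: the hidden layer receives the current inputs plus its own previous output (the context), which is what distinguishes it from a plain feedforward layer. Dimensions and weights are illustrative, not the study's fitted model:

```python
import numpy as np

# Elman-style forward pass: the hidden state h is fed back into the
# hidden layer at the next time step (the "context" connection).

rng = np.random.default_rng(1)
n_in, n_hid = 18, 12                       # 18 predictors, 12 hidden neurons

W_in = 0.1 * rng.normal(size=(n_hid, n_in))    # input -> hidden weights
W_rec = 0.1 * rng.normal(size=(n_hid, n_hid))  # hidden -> hidden (recurrent)
w_out = 0.1 * rng.normal(size=n_hid)           # hidden -> output weights
b_h = np.zeros(n_hid)
b_o = 0.0

def elman_forward(X_seq):
    """Run a sequence of monthly input vectors through the Elman layer."""
    h = np.zeros(n_hid)                    # context starts at zero
    outputs = []
    for x in X_seq:
        h = np.tanh(W_in @ x + W_rec @ h + b_h)   # recurrent hidden update
        outputs.append(w_out @ h + b_o)           # linear output neuron
    return np.array(outputs)

X_seq = rng.normal(size=(8, n_in))         # e.g. April-November snapshots
print(elman_forward(X_seq))
```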
19. Materials and methods (11) - Data analysis and modeling methods - ANNs (3)
- Cross-validation studies
- When training was complete, the simulation results were de-normalized by reversing the scaling, both for applications with and without PREPCA.
- Over-fitting was addressed by the method of regularization.
- R and RMSE were used for performance and model precision evaluation (see the sketch below).
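- A minimal sketch, with illustrative numbers, of the de-normalization step and the R / RMSE evaluation:

```python
import numpy as np

# Reverse the z-score scaling of the network output, then compute the
# correlation coefficient R and the RMSE against the observations.

def denormalize(y_scaled, mean, std):
    return y_scaled * std + mean          # reverse zero-mean / unit-std scaling

def r_and_rmse(y_obs, y_pred):
    r = np.corrcoef(y_obs, y_pred)[0, 1]  # Pearson correlation coefficient R
    rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
    return r, rmse

y_obs = np.array([2.1, 3.4, 1.8, 4.0, 2.9])      # illustrative observations
mean, std = y_obs.mean(), y_obs.std()
y_net = np.array([-0.6, 0.7, -1.1, 1.3, -0.1])   # normalized ANN output
y_pred = denormalize(y_net, mean, std)
print(r_and_rmse(y_obs, y_pred))
```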
20. Materials and methods (12) - Data analysis and modeling methods - PLS (1)
- Since the data have a multicollinearity problem, we considered the PLS method as an alternative predictive tool to MLP ANNs.
- PLS: a predictive technique for handling many independent variables, even when these display multicollinearity.
- Advantages:
- Handles multiple dependents as well as multiple independents.
- Ability to handle multicollinearity among the independents.
- Creates independent latents directly on the basis of cross products involving the response variables.
- Makes stronger predictions.
- Disadvantage:
- Difficulty of interpreting the loadings of the independent latent variables.
21. Materials and methods (13) - Data analysis and modeling methods - PLS (2)
- Partial least squares (PLS)
- The PLS method was applied as an alternative approach for the following basic reasons:
- 1) The predictor matrix had 18 components, which can be considered a high number,
- 2) Some of these components were correlated, and finally
- 3) The responses were many (7 in our case) and some were also correlated.
22. Materials and methods (14) - Data analysis and modeling methods - PLS (3)
- PLS regression specifically examines both predictor and response matrices to find latent vectors that explain as much as possible of the covariance between predictors and responses.
- The adopted PLS algorithm generates a sequence of models, where each consecutive model contains one additional component.
- The cross-validation step determines the number of components that minimizes the prediction error. During this research, a leave-one-observation-out methodology was adopted. The utilized algorithm selects the model with the number of components that produces the highest predicted correlation coefficient (R); a sketch follows below.
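- A hypothetical sketch (not the authors' software) of selecting the number of PLS components by leave-one-out cross-validation, keeping the model whose cross-validated predictions give the highest correlation R; the data are synthetic:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Choose the number of PLS components by leave-one-out cross-validation,
# scoring each candidate model by the predicted correlation coefficient R.

rng = np.random.default_rng(3)
X = rng.normal(size=(112, 18))                 # stand-in predictor matrix
y = X[:, :3] @ np.array([1.0, -0.5, 0.3]) + 0.2 * rng.normal(size=112)

best_r, best_k = -np.inf, None
for k in range(1, 11):                          # candidate component counts
    pls = PLSRegression(n_components=k)
    y_cv = cross_val_predict(pls, X, y, cv=LeaveOneOut()).ravel()
    r = np.corrcoef(y, y_cv)[0, 1]              # predicted correlation R
    if r > best_r:
        best_r, best_k = r, k

print(f"selected {best_k} components, predicted R = {best_r:.3f}")
```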
23. Materials and methods (14) - Data analysis and modeling methods - PLS (4): Model fitting
- The nonlinear iterative partial least squares (NIPALS) algorithm developed by Herman Wold has been adopted.
- PLS reduces the number of predictors by extracting uncorrelated components based on the covariance between the predictor and response variables (Frank and Kowalski).
- The PLS algorithm produces a sequence of models, where each consecutive model contains one additional component. Components are calculated one at a time, starting with the standardized x- and y-matrices. Subsequent components are calculated from the x- and y-residual matrices; iterations stop upon reaching the maximum number of components or when the x-residuals become the zero matrix (a sketch of the component extraction follows below).
- Cross-validation is used to identify the number of components that minimizes prediction error.
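- A hypothetical sketch of extracting PLS components with a NIPALS-style loop for a single response: each component's weights come from the X-y covariance, and X and y are deflated before the next component. It illustrates the idea, not the exact algorithm used in the study:

```python
import numpy as np

# NIPALS-style PLS1 component extraction with deflation of the residuals.

def nipals_pls1(X, y, n_components):
    X = X - X.mean(axis=0)                 # work on centered copies
    y = y - y.mean()
    scores, weights = [], []
    for _ in range(n_components):
        w = X.T @ y                        # covariance-based weight vector
        w /= np.linalg.norm(w)
        t = X @ w                          # component scores
        p = X.T @ t / (t @ t)              # X loadings
        q = y @ t / (t @ t)                # y loading
        X = X - np.outer(t, p)             # deflate X residuals
        y = y - q * t                      # deflate y residuals
        scores.append(t); weights.append(w)
        if np.allclose(X, 0):              # stop if X residuals vanish
            break
    return np.column_stack(scores), np.column_stack(weights)

rng = np.random.default_rng(4)
X = rng.normal(size=(112, 18))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=112)
T, W = nipals_pls1(X, y, n_components=3)
print(T.shape, W.shape)                    # (112, 3), (18, 3)
```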
24. RESULTS AND CONCLUSIONS
25. DOMINATING SPECIES
26. Table 2. The frequencies of phytoplankton species in percentages
27. Conclusions - MLP ANNs (1)
- Feed-forward MLP back-propagation ANNs utilizing the quasi-Newton algorithm with the full predictor matrix yielded the best results for the variables Bacillariophyta, Chlorophyta and Staurastrum longiradiatum.
- Double-layer structures yielded better results compared to the single-layer structure utilizing the same number of neurons.
- ANN models with PREPCA applications did not bring additional precision or better performance compared to standard ANN modeling with the full predictor matrices, since 14 of the 18 components were identified as responsible for 95% of the variability in an eigenvalue analysis.
28. Conclusions - MLP ANNs (2)
- The signs and magnitudes of the coefficients of the principal components (PCs) indicate that all the variables of the input matrix have different effect levels on different PCs; therefore, their inclusion is justifiable.
29. Feed-forward MLP ANNs: double hidden layer 5-5-1
30. Feed-forward MLP ANNs cross-validation study: double hidden layer 5-5-1
31. Eigenvalue structure of principal components of the input data matrix
32. Figure: The coefficients for principal component scores for three input variables and 10 PCs
33. Figure: The coefficients for principal component scores for three input variables and 10 PCs
34. Results and Conclusions - PLS
- The simple statistical PLS analysis delineates the relative importance of the predictors on the quantities of the responses.
- PLS modeling using a single response at a time gave better results compared to the multivariate (multi-response) modeling approach.
- The PLS modeling approach with single-response modeling yielded the best results for four of the variables: Log10 (total phytoplankton count), Chl-a, Cyclotella ocellata and Sphaerocystis schroeteri.
35. Table: The first five principal components for each response, considering absolute magnitudes of standardized regression coefficients
36. (No transcript: figure slide)
37. (No transcript: figure slide)
38. Rankings of models based on performance and precision criteria
39. Table: Rankings of models based on performance and precision criteria
40. Table: Rankings of models based on performance and precision criteria
41. Thank you for your time and patience!
- Comments and suggestions are welcome!
42. PLS (Additional-1)
- Predictive technique for handling many independent variables, even when these display multicollinearity.
- Advantages:
- Handles multiple dependents as well as multiple independents.
- Ability to handle multicollinearity among the independents.
- Creates independent latents directly on the basis of cross products involving the response variables.
- Makes stronger predictions.
- Disadvantage:
- Difficulty of interpreting the loadings of the independent latent variables.
43. PLS (Additional-2)
- Cross-validation procedure
- For each potential model:
- 1. Omit one observation or group of observations, depending on the cross-validation method used.
- 2. Recalculate the model without the omitted observation/group of observations.
- 3. Predict the response (the cross-validated fitted value) for the omitted observation/group of observations using the recalculated model, and calculate the cross-validated residual value.
- 4. Repeat steps 1-3 until all observations have been omitted and fitted.
- 5. Calculate the prediction sum of squares (PRESS) and predicted R2 values (see the sketch below).
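- A hypothetical sketch of steps 1-5 with synthetic data, using ordinary least squares as a stand-in model for brevity: the leave-one-out cross-validated residuals give PRESS, and predicted R2 follows as 1 - PRESS / SS_total:

```python
import numpy as np

# Steps 1-4: omit each observation, refit, predict it, collect the
# cross-validated residuals. Step 5: compute PRESS and predicted R2.

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))                     # small synthetic data set
y = X @ np.array([1.0, -0.5, 0.3]) + 0.2 * rng.normal(size=40)

n = len(y)
y_cv = np.empty(n)
for i in range(n):                               # leave-one-out loop
    mask = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    y_cv[i] = X[i] @ beta                        # cross-validated fitted value

press = np.sum((y - y_cv) ** 2)                  # prediction sum of squares
predicted_r2 = 1.0 - press / np.sum((y - y.mean()) ** 2)
print(f"PRESS = {press:.3f}, predicted R2 = {predicted_r2:.3f}")
```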
44. PLS (Additional-3)
- After performing steps 1-5 for each model, Minitab selects the model with the number of components that produces the highest predicted R2 and lowest PRESS. With multiple response variables, Minitab selects the model with the highest average predicted R2 and lowest average PRESS.