1
Predicting Dominant Phytoplankton Quantities in
Reservoirs Using PCA Based Artificial Neural
Networks
  • Ersin Kivrak , Hasan Gürbüz , Hürevren Kiliç and
    Selçuk Soyupak

2
Titles
  • Introduction
  • Materials and methods
  • Demirdöven Reservoir
  • Sampling and measurements
  • Data analysis and modeling methods
  • Results and Discussions
  • Conclusions

3
Introduction-Limnological Data Bases
  • Limnological data bases include
  • Water quality information
  • Nutrients (P and N compounds)
  • Cations (Na, K, Mg, Ca)
  • Anionic properties (SO4)
  • Acidity, alkalinity, pH, temperature,
    conductivity, dissolved oxygen
  • Light penetration properties
  • Location (depth) and time
  • Primary productivity (Chlorophyll-a)
  • Phytoplanktonic and zooplanktonic structure
    (species counts and dominance)

4
Introduction-Case: Limnological Data Base for DDR
  • T, DO, pH, EC, 20 cm diameter Secchi Disc
    transparency (SD), PO4-P, NH4-N, NO2-N, NO3-N,
    K, Na, Ca, Mg, SO4--, Cl- and HCO3-, Alk.
  • Primary production ( Chl-a) and
  • Phytoplankton species counts for families.
  • Bacillariophyta
  • Chlorophyta
  • Euglenophyta
  • Dinophyta
  • Type of data: snap-shot daytime data with
    monthly intervals

5
Introduction-Purpose of the research
  • The main purpose of this specific research:
  • Quantification of
  • Responses (Y): primary production (Chl-a),
  • Total phytoplankton counts, and
  • Dominant phytoplankton species counts from the
    families
  • Bacillariophyta, Chlorophyta, Euglenophyta,
    Dinophyta
  • utilizing
  • The predictors (X):
  • T, DO, pH, EC, 20 cm diameter Secchi Disc
    transparency (SD), PO4-P, NH4-N, NO2-N, NO3-N,
    K, Na, Ca, Mg, SO4--, Cl- and HCO3-, Alk.

6
Introduction-Purpose of the research
  • The problem identification:
  • The numbers of independent (X) and dependent (Y)
    variables were large,
  • They displayed multicollinearity.
  • The selections of predictors in primary
    production quantification were generally based on
    the data availability and common expert opinion
    rather than quantitative arguments based on the
    structures of predictor and response matrices.
  • We have investigated the possibilities of
    developing a systematic approach for efficient
    interpretation and utilization of the snap-shot
    day-time data with monthly intervals for
    quantification of primary productivity in a
    reservoir (rather than expert judgement) .

7
Materials and methods(1)The selected water body
  • Water body was DDR (Demirdöven Reservoir)
  • A relatively small reservoir
  • A reservoir for irrigation
  • Start of operation: 1995
  • Surface area at the maximum operation level:
    1.45 km²
  • Useful volume: 44.5×10⁶ m³

8
(No Transcript)
9
Materials and methods(2)The selected water body
Table 1. The descriptive statistical summary for
predictors
10
Materials and methods(3)The selected water body
Table 1. The descriptive statistical summary for
predictors and responses
11
Table 2.  The frequencies of phytoplankton
species in percentages (i.e. number of samples
where the species was found / total number of
samples). Total number of samples: 16.
12
Materials and methods(5)The selected water body
  • The trophic state of the reservoir
  • oligo-mesotrophic as indicated by
  • primary productivity levels,
  • the dominant phytoplankton species
  • phytoplankton distribution and succession,
  • the levels of nutrient concentrations, and
  • the level of anthropogenic activities in the
    catchment area .

13
Materials and methods(6)Available data matrices
  • The data: monthly, from April to November
    (within years 2000 and 2001).
  • December to March: ice-cover period.
  • (X): an input (predictor) matrix, 112×18
  • 112 observations for each of the eighteen
    predictor variables (time (Ti), D, T, DO, Alk,
    pH, EC, PO4-P, NH4-N, NO2-N, NO3-N, K, Na, Ca,
    Mg, SO4--, Cl- and HCO3-).
  • (Y): a response matrix of column vectors, 112×7
  • The target variables were total phytoplankton,
    Bacillariophyta, Chlorophyta, dominating
    phytoplankton species (Sphaerocystis schroeteri,
    Staurastrum longiradiatum, Cyclotella ocellata),
    and Chl-a concentrations.

14
Materials and methods(6)Quantification Approaches
  • Base-line Feedforward MLP ANN models
  • Base-line Feedforward MLP ANN models (Double
    layer)
  • Base-line Feedforward MLP ANN models (Single
    layer)
  • MLP ANN Models with PREPCA applications
  • Elman Network ANN Models
  • Partial least squares

15
Materials and methods(7) Data analysis and
modeling methods-ANNs(1)
  • Pre-processing
  • Randomization
  • Normalization: to increase the efficiency of
    training, the network inputs and targets were
    scaled so that they have zero mean and unit
    standard deviation.
  • Data division: ½ of the data was utilized for
    training, ¼ for validation and ¼ for testing.
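The pre-processing steps above can be sketched as follows. This is a minimal illustration with random placeholder data standing in for the 112×18 predictor and 112×7 response matrices; it is not the authors' code.

```python
# Sketch of the pre-processing: randomize, z-score normalize, split 1/2 : 1/4 : 1/4.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(112, 18))   # placeholder predictor matrix (112 obs x 18 vars)
Y = rng.normal(size=(112, 7))    # placeholder response matrix (7 targets)

# Randomization: shuffle observations
perm = rng.permutation(len(X))
X, Y = X[perm], Y[perm]

# Normalization: zero mean, unit standard deviation per variable
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
Xn = (X - X_mean) / X_std

# Data division: 1/2 training, 1/4 validation, 1/4 testing
n = len(Xn)
n_train, n_val = n // 2, n // 4
X_train = Xn[:n_train]
X_val = Xn[n_train:n_train + n_val]
X_test = Xn[n_train + n_val:]
```

The saved `X_mean` and `X_std` are what allow the de-normalization step mentioned later (reversing the scaling after simulation).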

16
Materials and methods(8) Data analysis and
modeling methods-ANNs(2)
  • Types of ANN Models
  • Base-line Feedforward MLP ANN models all
    predictors in model development with double layer
    structure (5 neurons in each layer).
  • Base-line Feedforward MLP ANN models all
    predictors in model development with single
    layer structure ( 10 neurons).
  • MLP ANN Models with PREPCA applications Some
    additional ANN models were developed after
    reducing the dimension of input matrix by using
    the method PREPCA ( by eliminating the components
    that contribute less than 5 to the total
    variation in the data set.)
  • Elman Network ANN Models
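The PREPCA reduction described above (dropping principal components that explain less than 5% of total variance) can be sketched with scikit-learn; the data here are random placeholders, not the reservoir measurements, and the 5% threshold is taken from the text.

```python
# Sketch of PREPCA-style dimension reduction before ANN training.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(112, 18))   # placeholder standardized predictor matrix

pca = PCA().fit(X)
# Keep only components that contribute at least 5% of total variation
keep = pca.explained_variance_ratio_ >= 0.05
X_reduced = pca.transform(X)[:, keep]
print(X_reduced.shape)           # typically fewer columns than the original 18
```

The reduced matrix `X_reduced` would then replace the full 18-column input matrix when training the PREPCA variants of the MLP models.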

17
Materials and methods(9) Data analysis and
modeling methods-ANNs(2)
  • Training methods
  • 1) The BFGS algorithm, based on the quasi-Newton
    (or secant) method, was employed in training MLP
    NNs. The method is based on Newton's method of
    optimization, which employs the basic step given
    by the equation
  •                                 x_(k+1) = x_k − A_k⁻¹ g_k
  • A_k: the matrix of second derivatives (Hessian
    matrix) of the performance index at the current
    values of the weights and biases,
  • x_k: a vector of current weights and biases,
  • g_k: the current gradient. The quasi-Newton
    method eliminates the calculation of second
    derivatives; instead it updates an approximate
    Hessian matrix at each iteration of the
    algorithm. The update is computed as a function
    of the gradient.
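The quasi-Newton step can be demonstrated on a toy quadratic "performance index" using SciPy's BFGS implementation, which builds the approximate inverse Hessian from gradients only, exactly as described above. The loss function here is an illustrative stand-in, not the network's actual error surface.

```python
# Sketch of BFGS (quasi-Newton) optimization on a toy quadratic loss.
import numpy as np
from scipy.optimize import minimize

def loss(w):
    # toy performance index with minimum at (1, -2)
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def grad(w):
    # gradient g_k; BFGS approximates the Hessian A_k from these
    return np.array([2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)])

res = minimize(loss, x0=np.zeros(2), jac=grad, method="BFGS")
print(res.x)  # converges to approximately [1, -2]
```

In ANN training, `w` would be the vector of all weights and biases and `loss` the network's error on the training set.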

18
Materials and methods(10) Data analysis and
modeling methods-ANNs(2)
  • Training methods
  • 2) As an alternative: the Elman Network.
  • Elman networks differ from the conventional
    two-layer feedforward MLP ANN models in that the
    first layer has a recurrent connection.
  • The addition of a feedback connection from the
    output of the hidden layer to its input enables
    the network to recognize both temporal and
    spatial patterns.
  • On the other hand, it is a known fact that, for
    the best chance at the learning problem, an
    Elman network needs more hidden neurons in its
    hidden layer than the other models.
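The recurrent connection described above can be sketched as a minimal numpy forward pass: the hidden layer at step t also receives its own output from step t-1 through a context (feedback) connection. Weights and layer sizes here are arbitrary placeholders; no training loop is shown.

```python
# Sketch of an Elman-style forward pass: hidden state feeds back into itself.
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hidden, n_out = 18, 10, 1
W_in = rng.normal(size=(n_hidden, n_in)) * 0.1     # input -> hidden
W_rec = rng.normal(size=(n_hidden, n_hidden)) * 0.1  # hidden -> hidden (recurrent)
W_out = rng.normal(size=(n_out, n_hidden)) * 0.1   # hidden -> output

def elman_forward(x_seq):
    h = np.zeros(n_hidden)                  # context starts at zero
    outputs = []
    for x in x_seq:                         # iterate over time steps
        h = np.tanh(W_in @ x + W_rec @ h)   # previous h re-enters the hidden layer
        outputs.append(W_out @ h)
    return np.array(outputs)

y = elman_forward(rng.normal(size=(5, n_in)))  # e.g. 5 monthly samples in sequence
print(y.shape)
```

The `W_rec @ h` term is what gives the network its memory of earlier time steps, and is absent from a plain feedforward MLP.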

19
Materials and methods(11) Data analysis and
modeling methods-ANNs(3)
  • Cross-validation studies
  • When training was complete, the simulation
    results were de-normalized by reversing the
    scaling, both for applications with and without
    PREPCA.
  • Over-fitting was countered by the method of
    regularization.
  • R and RMSE were used for performance and model
    precision evaluation.

20
Materials and methods(12)Data analysis and
modeling methods-PLS(1)
  • Since the data have a multicollinearity problem,
    we considered the PLS method as an alternative
    predictive tool to MLP ANNs.
  • PLS: a predictive technique to handle many
    independent variables, even when these display
    multicollinearity.
  • Advantages
  • Handles multiple dependents as well as multiple
    independents
  • Ability to handle multicollinearity among the
    independents
  • Creates independent latents directly on the
    basis of cross products involving the response
    variables
  • Makes stronger predictions
  • Disadvantage
  • Difficulty of interpreting the loadings of the
    independent latent variables

21
Materials and methods(13)Data analysis and
modeling methods-PLS(2)
  • Partial least squares (PLS)
  • PLS method has been applied as an alternative
    approach due to following basic reasons
  • 1) The predictor matrix had 18 components that
    can be considered high in number,
  • 2) Some of these components were correlated,
    and finally
  • 3) The responses were many (in our case it is 7)
    and some were also correlated.

22
Materials and methods(14)Data analysis and
modeling methods-PLS(3)
  • PLS regression specifically examines both
    predictor and response matrices to find latent
    vectors that explain as much as possible of the
    co-variance between predictors and responses.
  • The adopted PLS algorithm generates a sequence of
    models, where each consecutive model contains one
    additional component.
  • A cross-validation step determines the number of
    components that minimizes the prediction error.
    During this research, an omit-one-observation-
    at-a-time methodology was adopted. The utilized
    algorithm selects the model with the number of
    components that produces the highest predicted
    correlation coefficient (R).

23
Materials and methods(14)Data analysis and
modeling methods-PLS(4) Model fitting
  • The nonlinear iterative partial least squares
    (NIPALS) algorithm developed by Herman Wold has
    been adopted.
  • PLS reduces the number of predictors by
    extracting uncorrelated components based on the
    covariance between the predictor and response
    variables (Frank and Kowalski).
  • The PLS algorithm produces a sequence of models,
    where each consecutive model contains one
    additional component. Components are calculated
    one at a time, starting with the standardized x-
    and y-matrix. Subsequent components are
    calculated from the x- and y-residual matrices;
    iterations stop upon reaching the maximum number
    of components or when the x-residuals become the
    zero matrix.
  • Cross-validation is used to identify the number
    of components that minimizes prediction error.
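One component-extraction-and-deflation step of the kind described above can be sketched in numpy for a single response: start from the standardized x-matrix and y-vector, extract a component from the X'y covariance, then deflate to residual matrices that feed the next component. This is an illustrative single-response sketch, not the full NIPALS implementation used by the authors.

```python
# Sketch of one PLS component extraction followed by deflation (single response).
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(112, 18))
y = X @ rng.normal(size=18) + 0.1 * rng.normal(size=112)

# Start with the standardized x-matrix and y-vector
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()

w = X.T @ y                    # weight vector from the X'y covariance
w /= np.linalg.norm(w)
t = X @ w                      # scores of the first component
p = X.T @ t / (t @ t)          # x-loadings
q = (y @ t) / (t @ t)          # y-loading

# Deflation: the residual matrices are the starting point for the next component
X_res = X - np.outer(t, p)
y_res = y - q * t
```

Iterating this step on `X_res` and `y_res` yields the sequence of models with one additional component each, stopping at the maximum number of components or when the x-residuals vanish.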

24
RESULTS AND CONCLUSIONS
25
DOMINATING SPECIES
26
Table 2.  The frequencies of phytoplankton
species in percentages
27
Conclusions-MLP ANNs(1)
  • Feed-forward MLP back-propagation ANNs utilizing
    the quasi-Newton algorithm with the full
    predictor matrix yielded the best results for
    the variables Bacillariophyta, Chlorophyta and
    Staurastrum longiradiatum.
  • Double layer structures yielded better results
    than a single layer structure utilizing the same
    number of neurons.
  • ANN models with PREPCA applications did not bring
    additional precision or better performance
    compared to standard ANN modeling with full
    predictor matrices, since 14 of the 18
    predictors were identified as responsible for
    95% of the variability after an eigenvalue
    analysis.

28
Conclusions-MLP ANNs(2)
  • The signs and magnitudes of the coefficients of
    the principal components (PCs) indicate that all
    the variables of the input matrix have different
    effect levels on different PCs; therefore their
    inclusion is justifiable.

29
Feed-forward MLP ANNs double hidden layer 5-5-1

30
Feed-forward MLP ANNs cross-validation study,
double hidden layer 5-5-1
31
Eigenvalue structure of principal components of
input data matrix
32
Figure: The coefficients for principal component
scores for three input variables and 10 PCs
33
Figure: The coefficients for principal component
scores for three input variables and 10 PCs
34
Results and Conclusions-PLS
  • A simplistic statistical PLS analysis delineates
    the relative importance of the predictors on the
    quantities of the responses.
  • PLS modeling using a single response at a time
    gave better results than the multivariate
    (multi-response) modeling approach.
  • The PLS modeling approach with single-response
    modeling yielded the best results for four of
    the variables: Log10 (total phytoplankton
    count), Chl-a, Cyclotella ocellata and
    Sphaerocystis schroeteri.

35
Table . The first five principal components for
each response considering absolute magnitudes of
standardized regression coefficients
36
(No Transcript)
37
(No Transcript)
38
Rankings of models based on performance and
precision criteria
39
Table . Rankings of models based on performance
and precision criteria
40
Table . Rankings of models based on performance
and precision criteria
41
Thank you for your time and patience!
  • Comments and Suggestions are welcome!

42
PLS (Additional-1)
  • Predictive technique to handle many independent
    variables, even when these display
    multicollinearity.
  • Advantages
  • handles multiple dependents as well as multiple
    independents.
  • Ability to handle multicollinearity among the
    independents
  • Creating independent latents directly on the
    basis of cross products involving the response
    variables
  • Making stronger predictions
  • Disadvantage
  • Difficulty of interpreting the loadings of
    independent latent variables

43
PLS (Additional-2)
  • Cross-validation procedure
  • For each potential model,
  • 1    Omit one observation or group of
    observations, depending on the cross-validation
    method you use.
  • 2    Recalculate the model without the
    observation/group of observations.
  • 3    Predict the response, or the cross-validated
    fitted value, for the omitted observation/group
    of observations using the recalculated model and
    calculate the cross-validated residual value.
  • 4    Repeat steps 1-3 until all observations have
    been omitted and fitted.
  • 5    Calculate the prediction sum of squares
    (PRESS) and predicted R2 values.
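Steps 1-5 above can be sketched for an ordinary least-squares fit (the document applies the same procedure to each candidate PLS model); the toy data are illustrative placeholders.

```python
# Sketch of leave-one-out cross-validation: residuals, PRESS, and predicted R^2.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

press = 0.0
for i in range(len(y)):                    # 1) omit one observation
    mask = np.arange(len(y)) != i
    coef, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)  # 2) refit model
    resid = y[i] - X[i] @ coef             # 3) cross-validated residual
    press += resid ** 2                    # 4) repeat; 5) accumulate PRESS

ss_tot = ((y - y.mean()) ** 2).sum()
pred_r2 = 1.0 - press / ss_tot             # predicted R^2
print(round(pred_r2, 3))
```

The candidate model with the highest predicted R² (equivalently, lowest PRESS) is the one selected in step 5.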

44
PLS (Additional-3)
  • After performing steps 1-5 for each model,
    Minitab selects the model with the number of
    components that produces the highest predicted
    R2 and lowest PRESS. With multiple response
    variables, Minitab selects the model with the
    highest average predicted R2 and lowest average
    PRESS.