Program for North American Mobility in Higher Education - PowerPoint PPT Presentation

Description:

Title: PowerPoint Presentation Author: lafourcs Last modified by: Agnes Devarieux-Martin Created Date: 7/25/2001 7:57:15 PM Document presentation format – PowerPoint PPT presentation

Slides: 40
Provided by: laf85

Transcript and Presenter's Notes



1
NC STATE UNIVERSITY
Program for North American Mobility in Higher Education
Introducing Process Integration for Environmental Control in Engineering Curricula
MODULE 17: Introduction to Multivariate Analysis
Created at Ecole Polytechnique de Montreal and North Carolina State University, 2003.
2
TIER 2: Worked Examples
3
Tier 2 Statement of Intent
  • The goal of Tier 2 is to demonstrate the various
    MVA concepts using real examples. At the end of
    Tier 2, the student should be able to grasp the
    following:
  • How to read the basic MVA outputs
  • How to deal with raw, messy data
  • How to deal with a large number of variables
  • How to deal with shorter timespans
  • The purpose is to teach the concepts behind MVA,
    and not merely how to run the software itself,
    which could be gleaned from any user's manual.
    The biggest danger of this technique is using the
    software blindly, without understanding what's
    inside the black box.

4
Tier 2 Contents
Tier 2 is broken down into four sections:
  • 2.1 Where are the data coming from?
  • 2.2 Example 1: PCA on Raw, Messy Data
  • 2.3 Example 2: Using Fewer Variables
  • 2.4 Example 3: Using Shorter Timescales
At the end of Tier 2 there is a short
multiple-answer quiz.
5
2.1 Where are the data coming from?
6
Where are data coming from?
A standard joke is that teenagers think milk
comes from a refrigerator. Similarly,
we could wrongly say that process data come from
the plant's data historian. They are, of course,
generated somewhere else. We must fully
understand each data tag if we are to make
sense of the final MVA results.
7
Types of Data Tags
  • A tag is a label or address for a certain
    measurement. For instance, the tag TempRT01
    might refer to the temperature measured by a
    thermocouple in the top of reactor 1, in degrees
    Celsius, updated every 5 seconds. There are five
    major categories of tags, shown in descending
    order of immediacy:
  • Immediate, on-line
  • These are instantaneous readings, like those
    provided by a pressure gauge. Even if the
    instrument operates continuously, there will be a
    sampling frequency which we must know and
    understand.
  • Delayed, on-line
  • These are delayed readings, like those from an
    on-line water quality analyser. Not only must we
    understand the sampling frequency, but also the
    lag between the time the sample is taken and the
    time the values are logged.

8
Types of Data Tags (contd.)
  • Delayed, off-line
  • This category is even further removed, in that
    samples are carried by hand to an automatic
    analyser. Here the lag between sampling and
    logging of analytical results may differ
    from sample to sample.
  • Manual, off-line
  • These are laboratory measurements which are
    logged by hand, often literally typed into the
    system on a keyboard by a human being.
  • Calculations
  • These are values calculated from other tags.

9
Timescales
Each value in the database will also have a
timescale associated with it.
  • Discrete values are taken only at the precise
    instant in question, for example the main steam
    header pressure at exactly 10:00 a.m., zero
    seconds. If no reading was taken at that precise
    moment, the discrete value is logged as 0 (or
    999, or blank, or N/A).
  • Average values are the mean or median over
    some designated timespan, for instance the
    average main steam header pressure between 9:59
    a.m. and 10:00 a.m.
Frequency of measurement and
of data-logging is extremely important. Some
values may be updated every few seconds, while
others only twice a day.
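A minimal sketch of why the placeholder codes matter when averages are built from discrete readings. The readings, the set of placeholder codes and the function names are invented for illustration; treating 0 as "missing" is only safe for a tag that can never legitimately read zero.

```python
from statistics import mean

# Assumed placeholder codes that a historian might log for a missing reading.
MISSING_CODES = {0.0, 999.0}

def clean(readings):
    """Replace blanks (None) and placeholder codes with None."""
    return [None if r is None or r in MISSING_CODES else r for r in readings]

def average(readings):
    """Mean of the valid readings in a timespan, or None if there are none."""
    valid = [r for r in readings if r is not None]
    return mean(valid) if valid else None

# Twelve made-up 5-second pressure readings covering one minute.
raw = [41.0, 41.2, 999.0, 41.1, 0.0, 40.9, 41.0, None, 41.3, 41.1, 999.0, 41.0]

cleaned = clean(raw)
one_minute_average = average(cleaned)

# What the average looks like if the placeholder codes are left in.
naive_average = mean(r for r in raw if r is not None)

print(one_minute_average, naive_average)
```

Leaving the codes in inflates the average several-fold, which is exactly the kind of artificial outlier that later dominates a score plot.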
10
Process Lags
If you are using daily averages for your MVA,
then a ten-minute residence time in a reactor or
vessel will not impact your results. However, if
you are comparing one-minute averages, then
obviously such a process lag must be taken into
account. Estimating these lags is not
straightforward, since they can change with time (e.g.,
fluctuating tank levels).
11
Preparing the Spreadsheet
  • Generally, the data are downloaded into a
    standard spreadsheet, which then serves as the
    input to the MVA software.
  • This offers several advantages:
  • Rows and columns can be set up appropriately,
    with tag numbers, long variable names, short
    variable names (to show on plots), observation
    numbers, time stamps and so forth. This greatly
    facilitates the use of the MVA software.
  • Additional calculations can be done, if required,
    for instance taking the log of certain variables
    for use in the MVA.
  • Time lags can be incorporated right from the
    start, by shifting data from certain tags forward
    or backward in time. For instance, input
    variables for a process with a 30-minute
    residence time can be shifted to the same row as
    product quality variables measured 30 minutes
    later.
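The lag shift and the log transform described above can be sketched in a few lines. The tag names, values and 10-minute sampling interval are assumptions made for the illustration; only the 30-minute residence time comes from the example in the text.

```python
import math

def shift_forward(values, steps):
    """Delay a column by `steps` rows, padding the start with None."""
    if steps <= 0:
        return list(values)
    return ([None] * steps + list(values))[:len(values)]

# Made-up 10-minute data: one input tag (X) and one quality tag (Y).
refiner_energy = [10.0, 12.0, 11.0, 13.0, 12.5, 11.5]
pulp_quality = [0.50, 0.52, 0.51, 0.55, 0.60, 0.57]

# 30-minute residence time / 10-minute sampling interval = 3 rows.
LAG_ROWS = 3

aligned_energy = shift_forward(refiner_energy, LAG_ROWS)

# Keep only complete rows and add the log of the quality variable.
rows = [
    (x, y, math.log(y))
    for x, y in zip(aligned_energy, pulp_quality)
    if x is not None
]

for row in rows:
    print(row)
```

Each remaining row now pairs an input with the quality value measured 30 minutes later. The first three rows are lost, which is negligible in a large dataset.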

12
2.2 Example 1: PCA of Raw, Messy Data
13
Process Example: TMP Refining Line
All the examples in Tier 2 are based on the
thermo-mechanical pulping (TMP) process, used to
convert wood chips into pulp. This is a
straightforward process, with well-known
underlying physical characteristics. A
generic flowsheet for the TMP process is shown on
the next page. The wood chips are about 3 cm x 4
cm x 0.5 cm. They are pre-heated and pass
through two refiners, where huge spinning disks
cut them down into individual cellulose fibre
strands. The resulting pulp, a cellulose-water
slurry, resembles the stuffing in a disposable
diaper. This pulp is held for 45 minutes in the
latency chest, to allow the cellulose strands to
disentangle themselves, before being sent to the
papermaking section of the plant.
TMP is used to make newsprint
Example 1
14
Thermomechanical Pulping (TMP) Generic Flowsheet
[Figure: generic TMP flowsheet showing the Xs, the Ys and the 45-minute residence time.]
No expertise on the TMP process is required to
understand the examples.
15
Dozens of Variables Measured
The many dozens of variables that are measured on
a TMP line fall into two categories, those which
impact the process (Xs) and those which are
impacted by the process (Ys). Note that for
some variables, this categorisation is not
obvious.
[Figure: raw material quality (X), unit operation 1, unit operation 2, final product quality (Y).]
What about intermediate product quality? X or Y?
16
The Actual Data Used
  • The data used in this example came from a real
    TMP mill in North America. The data have been
    modified to ensure that no confidential
    information is revealed.
  • About 130 tags were selected, corresponding to
    the X and Y list on the next page. It is not
    necessary for the student to understand all
    these, just to be aware that it is complicated
    and involves many different measurements.
  • Remember the terminology:
  • Variables: These are the types of
    measurements or tags (e.g., refiner body
    temperature). Variables are shown on the
    Loadings plot.
  • Observations: These are the individual
    measurements, separated in time (e.g., March 19, 2000).
    Observations are shown on the Score plot.

17
The X and Y Variables
  • The X variables for the TMP process are:
  • Incoming chips: size distribution, bulk density,
    humidity.
  • Refiner operating data: throughput; specific
    energy imparted to the chips; energy split
    between the primary and secondary refiner;
    vertical and conical plate distances; dilution
    rates; levels, pressures and temperatures in
    various units immediately connected to the
    refiners; voltage at chip screw conveyors;
    specific hydrosulphite consumption; refiner body
    temperature.
  • Season, represented by the average monthly
    temperature measured at a nearby meteorological
    station.
  • The Y variables are:
  • Steam generation rate (an indicator of waste
    heat generated by friction inside the refiners).
  • Pulp quality data after the latency chest
    (automated, on-line analysis of grab samples):
    standard industry parameters including fibre
    length distribution, freeness, consistency, and
    brightness.

18
Pretreatment of data
For this first example, daily averages were
obtained for all 130 tags over a 34-month period,
corresponding to 1044 observations. Note that
the data historian can provide averages over many
different time periods, from seconds to
months. The purpose of this exercise was
simply to determine which variables trended
together over this multi-year period. The
spreadsheet contained over 100,000 values (130
variables x 1044 observations), obviously far too
much for manual analysis. Because these are
daily averages, the 45-minute residence time in
the latency chest was ignored.
Daily averages
19
PCA of All the Data
As a first step, all the data were put into the
MVA program to look for outliers. No distinction
was made between Xs and Ys (everything lumped
together). The software immediately rejected
four variables for having zero or close to zero
variance. This means that they did not vary
enough to be of use to the MVA exercise
(remember, this is not a planned experiment).
The rest of the variables were accepted.
The score plot for this initial PCA
exercise is shown on the next page.
Some variables did not change enough to be
accepted by the MVA software tool
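The variance screen described on this slide can be approximated as follows. This is a sketch only: the tag names, the values and the threshold are invented, and the actual cut-off used by the MVA software is not given in the module.

```python
from statistics import pvariance

def low_variance_tags(table, threshold=1e-8):
    """Return the names of columns whose variance is at or below `threshold`."""
    return [name for name, values in table.items()
            if pvariance(values) <= threshold]

# Made-up daily averages for three hypothetical tags.
table = {
    "throughput": [210.0, 190.0, 205.0, 220.0],
    "valve_position": [100.0, 100.0, 100.0, 100.0],  # never moves
    "steam_rate": [31.0, 28.5, 30.2, 33.1],
}

rejected = low_variance_tags(table)
kept = {name: values for name, values in table.items() if name not in rejected}

print("rejected:", rejected)
print("kept:", sorted(kept))
```

A tag that never moves carries no information about how the process varies, and it cannot be scaled to unit variance, so it is dropped before the PCA is run.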
20
Initial PCA Score Plot
Already something looks suspicious. Note how a
small number of observations dominate the rest.
MVA is extremely sensitive to outliers. What do
you notice about the dates?
21
Extreme Outliers
Some of these strange dates fall on Christmas Eve
and Christmas Day! These holidays are radically
different somehow. An obvious guess is that
production was lower on those days. To confirm
this we check the original data.
22
Low Production Days!
Days with production < 100 t/d
Days with production < 50 t/d
Our suspicions are confirmed. A quick check of
the original dataset shows that all these dates
correspond to lower production.
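Checking the original data against a production cut-off is a simple filter. In the sketch below the dates and tonnages are made up; the 100 t/d threshold is the one quoted on this slide.

```python
MIN_PRODUCTION_T_PER_DAY = 100.0

# Made-up daily production figures around a holiday shutdown.
observations = [
    {"date": "2000-12-23", "production": 412.0},
    {"date": "2000-12-24", "production": 85.0},
    {"date": "2000-12-25", "production": 30.0},
    {"date": "2000-12-26", "production": 398.0},
]

low_days = [o["date"] for o in observations
            if o["production"] < MIN_PRODUCTION_T_PER_DAY]
normal_days = [o for o in observations
               if o["production"] >= MIN_PRODUCTION_T_PER_DAY]

print("low production days:", low_days)
print("observations kept:", len(normal_days))
```

If the dates flagged here match the points that dominate the score plot, the cause of those outliers has been identified.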
23
Decision to remove outliers
Now that we know why these dates are outliers, we
can remove them with confidence. It is
generally a bad idea to remove outliers without
determining why they are different. It may be
that these are not outliers at all, but actually
interesting and important shifts in the process,
the very thing we would like to know
about. Determining the cause of outliers is
usually more difficult than in this Christmas
holiday example. We will see other techniques
in the examples that follow.
Chopping the outliers
24
PCA with extreme outliers removed
Much better (on average, 5% of observations are
supposed to be outside the ellipse)
SECOND COMPONENT ALONG THIS AXIS
FIRST COMPONENT ALONG THIS AXIS
Here is the new score plot, with low production
days removed. It hardly resembles the initial
one, proof of the extreme effect of outliers.
25
R2 and Q2 for PCA Model
This is the R2 and Q2 plot for this same model.
The R2 values tell us that the first component
explains 32% of the variability in the original
data, the second another 7% and the third another
6%, for a cumulative R2 of 45%. The Q2 values are
lower, as always: the predictive power of the model is
around 40% when using all three components. This
may seem low, but is normal for real process data.
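To make the R2 figures less of a black box, here is a rough sketch of how the share of variability explained by each component can be computed. It uses NIPALS, a common iterative algorithm for PCA; the module does not say which algorithm or pretreatment the MVA software applies. The unit-variance scaling and the tiny dataset are therefore assumptions. Q2 requires cross-validation and is not computed here.

```python
import math

def autoscale(X):
    """Centre each column and scale it to unit variance (assumed pretreatment)."""
    n, m = len(X), len(X[0])
    scaled = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col = [row[j] for row in X]
        mu = sum(col) / n
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / (n - 1))
        for i in range(n):
            scaled[i][j] = (col[i] - mu) / sd
    return scaled

def sum_of_squares(E):
    return sum(v * v for row in E for v in row)

def nipals_component(E, max_iter=500, tol=1e-12):
    """Return scores t and unit-length loadings p for the largest remaining component."""
    n, m = len(E), len(E[0])
    # Start from the column with the most remaining variation.
    start = max(range(m), key=lambda j: sum(E[i][j] ** 2 for i in range(n)))
    t = [E[i][start] for i in range(n)]
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        t_new = [sum(E[i][j] * p[j] for j in range(m)) for i in range(n)]
        change = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if change < tol:
            break
    return t, p

def r2_per_component(X, n_components):
    """Fraction of the total sum of squares explained by each successive component."""
    E = autoscale(X)
    total = sum_of_squares(E)
    fractions = []
    for _ in range(n_components):
        t, p = nipals_component(E)
        fractions.append(sum(v * v for v in t) / total)
        # Deflate: remove what this component explains before extracting the next.
        E = [[E[i][j] - t[i] * p[j] for j in range(len(p))] for i in range(len(t))]
    return fractions

# Made-up data: 6 observations (rows) of 3 variables (columns).
X = [
    [200.0, 30.0, 7.1],
    [180.0, 27.5, 6.2],
    [220.0, 33.0, 6.8],
    [150.0, 22.0, 7.4],
    [210.0, 31.0, 6.0],
    [170.0, 26.0, 6.9],
]

r2 = r2_per_component(X, 3)
cumulative_r2 = sum(r2)
print([round(v, 3) for v in r2], round(cumulative_r2, 3))
```

With only three variables, three components explain everything, so the cumulative R2 reaches 1. With 130 noisy process tags, three components capturing 45% is a far more typical result.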
26
Moderate outliers in residuals
Moderate outliers
EACH POINT IS AN INDIVIDUAL DAY (DATES NOT
LEGIBLE)
This is the Distance to Model, or residual, plot
for this model. It shows the distance, in
multi-dimensional space, between each real
observation (date) in the initial dataset and the
predicted value based on the model. Clearly
there are some moderate outliers that need
investigating, different from the extreme
outliers we saw on the score plot. This can be
done by looking at the original data, or by using
other techniques.
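The idea behind the residual plot can be sketched as the distance between an observation and its projection onto the model. The loading vector and the two observations below are hypothetical. Commercial MVA packages usually normalise this distance further, which this sketch leaves out.

```python
import math

def residual_distance(x, loadings):
    """Distance between a scaled observation and its projection onto the model.

    `loadings` is a list of orthonormal loading vectors, one per component.
    """
    x_hat = [0.0] * len(x)
    for p in loadings:
        t = sum(a * b for a, b in zip(x, p))  # score on this component
        x_hat = [h + t * b for h, b in zip(x_hat, p)]
    return math.sqrt(sum((a - h) ** 2 for a, h in zip(x, x_hat)))

# Hypothetical one-component model in a 3-variable space.
p1 = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]

on_model = [1.0, 1.0, 0.0]   # lies exactly along the component
off_model = [1.0, 1.0, 2.0]  # same score, but far from the model

d_on = residual_distance(on_model, [p1])
d_off = residual_distance(off_model, [p1])
print(d_on, d_off)
```

Both observations have the same score on the component and would sit on the same spot of a score plot. Only the residual distance shows that the second one is poorly described by the model.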
27
Looking at the Results
So what do these results mean? Obviously the
score plot showing the dates is totally
illegible. We will therefore remove the date
labels. However, in order not to lose the
seasonal information, we will colour-code each
day to show which time of year it occurred in.
It is very easy to modify the graphical outputs
in this way. Let's have a look at the result.
28
Score plot of first 2 components
Note that all days < 100 t/d were systematically
removed, plus major outliers. In all, only a few
dozen observations were removed (out of 1044).
Same plot as before, only backwards
(mathematically identical)
Variation in this direction appears to occur
BETWEEN individual seasons (→ Component 2)
Autumn Winter Spring Summer
Variation in this direction appears to occur
WITHIN a given season (→ Component 1)
29
First 3 components
Autumn Winter Spring Summer
To show the first 3 components, we need a 3-D
plot, of course. The third component is on the
vertical axis. If the points were to drop onto
the bottom surface, you would just get the
previous image.
Each point represents an INDIVIDUAL DAY
2000 2001 2002
By looking at the original data, it became clear
that the three years were separated in the 2nd
component.
30
Loadings Plot
The MVA software generates a set of new axes
called components that are statistically
significant. However, the software does not tell
us what these new components actually mean. To
figure out how the original variables relate to
the newly created MVA components, we must look at
the Loadings Plot. For this example, the 1st /
2nd component loadings plot is shown on the next
slide. It looks somewhat daunting, because the
tag numbers are shown. It is not necessary for
the purposes of this exercise to understand what
all the tag numbers mean. The important point is
that similar tags trend together, as indicated by
the text box. In this case, many variables
related to the throughput tend to increase and
decrease together, as shown by their clustering.
Also, they are clearly related to the first
component, on the negative side (positive and
negative are totally arbitrary in MVA component
space).
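Clustering on a loadings plot reflects how the underlying tags correlate. As a rough illustration, with invented values and one hypothetical opposing tag, tags that rise and fall together have a correlation near +1. A tag that moves the opposite way has a correlation near -1, and for a well-modelled tag that would put it on the other side of the origin.

```python
import math

def correlation(a, b):
    """Pearson correlation coefficient between two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

# Made-up daily averages.
throughput = [200.0, 180.0, 220.0, 150.0, 210.0, 170.0]
steam_generation = [30.0, 27.5, 33.0, 22.0, 31.0, 26.0]
opposing_tag = [5.0, 6.1, 4.2, 7.4, 4.6, 6.5]  # hypothetical, falls as throughput rises

r_same = correlation(throughput, steam_generation)
r_opposite = correlation(throughput, opposing_tag)
print(round(r_same, 3), round(r_opposite, 3))
```

Pairwise correlations like these become unmanageable with 130 tags, which is why the loadings plot is used to show all the relationships at once.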
31
PCA Loadings Plot (p1/p2)
Clustered tags: pulp throughput, refining energy, dilution flows, steam generation
ORIGIN
See-saw principle
32
Conclusions: p1
INTERPRETATION: Component 1 = Throughput
33
Interpretation of 1st component
Our conclusion is that the first component
corresponds to throughput. This is logical, for
two reasons: 1) many process variables are
related either directly or indirectly to
throughput; 2) the extreme outliers we removed
at the beginning, which dominated the model, were
also related to throughput (low production
days). Now we are ready to look at the score plot
again. Remember we said that the 1st component
was something that varied within an individual
season? Now we know what it is: throughput. So
what have we accomplished? We've reduced the
dimensionality by going from dozens of variables
to a single latent variable.
34
2nd component: Same plot as before
Prominent tags: bleach consumption, pulp brightness, Season
35
Interpretation of 2nd component
If you recall, we said that the 2nd component
explains only 7% of the total variability. It is
therefore messier than the first component, and
will be less easy to interpret. We also noted
that the three years were separated with respect
to this second component. A major clue lies in
the prominence of two important and related tags:
bleach consumption and pulp brightness. This
would suggest that perhaps the brightness of the
incoming wood chips was different from year to
year, requiring more bleaching to get a less
white pulp. Note also that Season is
prominent. We already knew this from the obvious
separation of the seasons on the score plot.
This suggests that winter chips are less bright
than summer chips.
36
Conclusions: p1 and p2
INTERPRETATION: Component 2 = Brightness of
incoming wood chips
37
Looking at 3rd component
To look at the 3rd component, we must generate a
new plot showing the 1st component vs. the 3rd.
In other words, we ignore the 2nd component.
This 3rd component is orthogonal to, and thus
uncorrelated with, the first two
components. We said that the 3rd component
explains only 6% of the total variability. It is
therefore even messier than the 2nd
component. Let's have a look at this new
score plot. Note that this is exactly the image
you would get if all the points on the 3-D score
plot were projected onto the back wall.
1st vs. 3rd
38
PCA Score Plot (t1/t3)
No segregation by year: SUMMERS VS. WINTERS!
Autumn Winter Spring Summer
39
Looking at 3rd component
One very interesting result is that the three
years are not separated on this plot. All the
winters line up, and all the summers line
up. This suggests that the 3rd component
is related to the time of year, pure and simple.
This is confirmed by the corresponding
loadings plot, which shows SEASON to be the
single most prominent variable. A reasonable
interpretation would be that summer chips differ
from winter chips in some way other than
brightness, which was already covered by the
second component. This could be, for instance,
the ease with which the wood fibres can be
separated from each other.