Title: Discovery of Patterns in the Global Climate System using Data Mining
1Discovery of Patterns in the Global Climate
System using Data Mining
- Vipin Kumar
- Army High Performance Computing Research Center
- Department of Computer Science
- University of Minnesota
- http://www.cs.umn.edu/~kumar
- Collaborators
- G. Karypis, S. Shekhar, M. Steinbach, P.N. Tan (AHPCRC)
- C. Potter (NASA Ames Research Center)
- S. Klooster (California State University, Monterey Bay)
- This work was partially funded by NASA and the Army High Performance Computing Research Center.
2Research Goals
- Research Goals
- Find global climate patterns of interest to Earth scientists.
- A key interest is finding connections between the ocean and the land.
- Global snapshots of values for a number of variables on land surfaces or water.
- Monthly, over a range of 10 to 50 years.
3Sources of Earth Science Data
- Before 1950, very sparse, unreliable data.
- Since 1950, reliable global data.
- Ocean temperature and pressure are based on data from ships.
- Most land data (solar, precipitation, temperature and pressure) comes from weather stations.
- Since 1981, data has been available from Earth-orbiting satellites.
- FPAR, a measure related to plants and greenness.
- Since 1999, TERRA, the flagship of the NASA Earth Observing System, has been providing much more detailed data.
4Importance of Global Climate Patterns
- The climate of the Earth's land surface is strongly influenced by the behavior of the Earth's oceans.
- El Nino is the anomalous warming of the eastern tropical region of the Pacific.
- Associated with droughts in Australia and Southern Africa and heavy rainfall along the western coast of South America.
Figure: El Nino events. Sea Surface Temperature anomalies off Peru (ANOM 1+2).
5Importance of Global Climate Patterns and NPP
- Net Primary Production (NPP) is the net assimilation of atmospheric carbon dioxide (CO2) into organic matter by plants.
- NPP is driven by solar radiation and can be constrained by precipitation and temperature.
- Keeping track of NPP is important because it includes the food source of humans and all other organisms.
- Sudden changes in the NPP of a region can have a direct impact on the regional ecology.
- NPP is impacted by global climate patterns.
- Precipitation and temperature are directly affected by global climate patterns such as El Nino.
- Solar radiation is affected indirectly, through cloudiness.
6Role of Statistics and Data Mining
- Previously, Earth scientists have relied on statistical techniques.
- The hypothesize-and-test paradigm is extremely labor-intensive.
- Data mining provides Earth scientists with tools that allow them to spend more time choosing and exploring interesting families of hypotheses.
- By applying the proposed data mining techniques, some of the steps of hypothesis generation and evaluation will be automated, facilitated and improved.
- However, statistics is needed to provide methods for determining the statistical significance of results.
7Patterns of Interest
- Zone Formation
- Find regions of the land or ocean which have similar behavior.
- Teleconnections
- Teleconnections are the simultaneous variation in climate and related processes over widely separated points on the Earth.
- Associations
- Find relations between climate events and land cover.
- River Discharge
- Relationship between water discharged from a river and precipitation, climate, and man.
8Clustering for Zone Formation
- Interested in relationships between regions, not points.
- For ocean, clustering based on SST (Sea Surface Temperature) or SLP (Sea Level Pressure).
- For land, clustering based on NPP or other variables, e.g., precipitation, temperature.
- Typically we work with the points.
- When raw NPP and SST are used, clustering can find seasonal patterns.
- Anomalous regions have plant growth patterns which are reversed from those typically observed in the hemisphere in which they reside, and are easy to spot.
9K-Means Clustering of Raw NPP and Raw SST (Num clusters = 2)
10K-Means Clustering of Raw NPP and Raw SST (Num clusters = 2)
Land Cluster Cohesion: North = 0.78, South = 0.59
Ocean Cluster Cohesion: North = 0.77, South = 0.80
11K-Means Clustering of Raw NPP and Raw SST (Num clusters = 6)
12Preprocessing
- Time series preprocessing issues
- Need to remove seasonality
- Earth scientists are mostly interested in anomalies
- Need to remove most of the autocorrelation
- Statistical tests are affected
- Need to remove trends
- Normally want to detect patterns and trends separately
- Normally interested in similarity once differences in means and scale have been considered.
- Pearson's correlation coefficient has this property
13Sample NPP Time Series
Correlations between time series
14Seasonality Accounts for Much Correlation
Normalized using monthly Z-score: subtract the monthly mean and divide by the monthly standard deviation.
Correlations between time series
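The monthly Z-score normalization described on this slide can be sketched as follows. This is a minimal illustration, assuming the series starts in January, covers a whole number of years, and has no missing values; the function name `monthly_zscore` is hypothetical and not taken from the original work.

```python
import numpy as np

def monthly_zscore(series, period=12):
    """Remove seasonality: for each calendar month, subtract that month's
    mean and divide by that month's standard deviation."""
    x = np.asarray(series, dtype=float)
    if x.size % period != 0:
        raise ValueError("series length must be a whole number of years")
    by_year = x.reshape(-1, period)      # rows = years, columns = months
    mean = by_year.mean(axis=0)          # one mean per calendar month
    std = by_year.std(axis=0)            # one standard deviation per month
    std[std == 0] = 1.0                  # assumption: constant months map to 0
    return ((by_year - mean) / std).ravel()

# Three years of a pure seasonal cycle plus a small year-to-year offset.
raw = np.tile(np.arange(12.0), 3) + np.repeat([0.0, 1.0, 2.0], 12)
anomalies = monthly_zscore(raw)
```

After this step every calendar month has mean 0 and unit variance, so the seasonal cycle no longer dominates correlations between series.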
15Removing Seasonality Removes Most Autocorrelation
16Preprocessing Removing Trends
A slight linear trend added to two random time
series increases their correlation dramatically,
from 0.01 to 0.17.
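The effect of a shared trend can be explored with a short sketch. It assumes a least-squares straight-line fit is an acceptable way to remove the trend (the slides do not say which detrending method was used); the helper names `detrend_linear` and `pearson` are illustrative only, and the random series are synthetic.

```python
import numpy as np

def detrend_linear(series):
    """Subtract the least-squares straight line from a time series."""
    x = np.asarray(series, dtype=float)
    t = np.arange(x.size)
    slope, intercept = np.polyfit(t, x, 1)
    return x - (slope * t + intercept)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    return float(np.corrcoef(a, b)[0, 1])

# Two unrelated random series, then the same series with a shared slight trend.
rng = np.random.default_rng(0)
n = 480
a, b = rng.standard_normal(n), rng.standard_normal(n)
trend = np.linspace(0.0, 1.5, n)
print("no trend:  ", pearson(a, b))
print("with trend:", pearson(a + trend, b + trend))
print("detrended: ", pearson(detrend_linear(a + trend), detrend_linear(b + trend)))
```

Because the fit is linear in the data, detrending the trended series gives the same result as detrending the original series, which is why the spurious part of the correlation disappears.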
17Ocean Climate Indices Connecting the Ocean and
the Land
- An OCI is a time series of temperature or pressure
- Based on Sea Surface Temperature (SST) or Sea Level Pressure (SLP)
- OCIs are important because
- They distill climate variability at a regional or global scale into a single time series.
- They are well-accepted by Earth scientists.
- They are related to well-known climate phenomena such as El Niño.
18Ocean Climate Indices ANOM 1+2
- ANOM 1+2 is associated with El Niño and La Niña.
- Defined as the Sea Surface Temperature (SST) anomalies in a region off the coast of Peru
- El Nino is associated with
- Droughts in Australia and Southern Africa
- Heavy rainfall along the western coast of South America
- Milder winters in the Midwest
El Nino Events
19Connection of ANOM 1+2 to Land Temp
- OCIs capture teleconnections, i.e., the simultaneous variation in climate and related processes over widely separated points on the Earth.
20Ocean Climate Indices - NAO
- The North Atlantic Oscillation (NAO) is associated with climate variation in Europe and North America.
- Normalized pressure differences between Ponta Delgada, Azores and Stykkisholmur, Iceland.
- Associated with warm and wet winters in Europe and with cold and dry winters in northern Canada and Greenland
- The eastern US experiences mild and wet winter conditions.
Figure labels: Iceland, Azores
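A station-based index of this kind can be sketched as the difference of two standardized pressure series. This is only an illustration: it assumes each station series is standardized over its whole record (operational NAO definitions may normalize per month or over a reference period), and the function names are hypothetical.

```python
import numpy as np

def standardize(series):
    """Zero-mean, unit-variance version of a series."""
    x = np.asarray(series, dtype=float)
    return (x - x.mean()) / x.std()

def pressure_difference_index(slp_south, slp_north):
    """Normalized pressure difference: southern station minus northern station."""
    return standardize(slp_south) - standardize(slp_north)

# Toy sea level pressure values (not real station data).
azores = [1.0, 2.0, 3.0]
iceland = [3.0, 2.0, 1.0]
index = pressure_difference_index(azores, iceland)
```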
21Connection of NAO to Land Temp
22Influence of OCI on Land Area Weighted
Correlation
- Correlation of an OCI with a land variable is a standard way to evaluate its influence.
- Correlation does not imply causality.
- Temperature and precipitation are the typical land variables.
- If relatively many land points have a relatively high correlation, then an OCI is influential.
- To evaluate whether clusters (or pairs) are potential OCIs we compute their area weighted correlation.
- Weighted average of the correlation with land points, where weight is based on area.
- May exclude points whose correlation is low and then calculate area weighted correlation.
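Area weighted correlation as described above might be computed along these lines. The sketch makes several assumptions that the slides do not state: land points lie on a regular latitude-longitude grid so that cell area is proportional to cos(latitude), the absolute value of the correlation is averaged, and no land series is constant. The function name and argument layout are hypothetical.

```python
import numpy as np

def area_weighted_correlation(index, land, lat_deg, threshold=0.0):
    """Weighted mean of |correlation| between an index and each land point.

    index   : 1-D time series (n_months)
    land    : 2-D array (n_points, n_months), one row per land point
    lat_deg : latitude of each land point in degrees
    """
    index = np.asarray(index, dtype=float)
    land = np.asarray(land, dtype=float)
    lat_deg = np.asarray(lat_deg, dtype=float)
    # Pearson correlation of the index with every land point at once.
    zi = (index - index.mean()) / index.std()
    zl = (land - land.mean(axis=1, keepdims=True)) / land.std(axis=1, keepdims=True)
    corr = (zl * zi).mean(axis=1)
    weights = np.cos(np.deg2rad(lat_deg))    # assumed area weight
    keep = np.abs(corr) >= threshold         # optionally drop weak points
    if not keep.any():
        return 0.0
    return float(np.sum(weights[keep] * np.abs(corr[keep])) / np.sum(weights[keep]))

oci = [1.0, 2.0, 3.0, 4.0]
land_temp = [[1.0, 2.0, 3.0, 4.0],     # perfectly correlated point at the equator
             [1.0, -1.0, -1.0, 1.0]]   # uncorrelated point at 60 degrees north
all_points = area_weighted_correlation(oci, land_temp, [0.0, 60.0])
strong_only = area_weighted_correlation(oci, land_temp, [0.0, 60.0], threshold=0.5)
```

Raising `threshold` restricts the average to land points that correlate at least that strongly with the index, which corresponds to the optional exclusion step in the last bullet.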
23Evaluation of Known OCIs via Area Weighted
Correlation
Area Weighted Correlation of Known OCIs to Land
Temp Overlapping, threshold 0
24Evaluation of Known OCIs via Area Weighted
Correlation
Area weighted correlation declines as we consider
only land points whose temperature correlates
with the OCI above a given threshold.
25Discovering OCIs via Data Mining
- Earth scientists discovered the currently known OCIs using
- Observation
- Eigenvalue techniques such as Principal Components Analysis (PCA) and Singular Value Decomposition (SVD).
- Clustering provides an alternative approach.
- Clusters represent ocean regions with relatively homogeneous behavior.
- The centroids of these clusters are time series that summarize the behavior of these ocean areas, and thus represent potential OCIs.
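Given cluster labels for ocean points, the centroid time series can be computed as a simple average over the member points. The sketch below assumes the clustering step has already been run and that a label of -1 marks points left unclustered; both the label convention and the function name are assumptions for illustration.

```python
import numpy as np

def cluster_centroids(series, labels):
    """Average the time series of all points in each cluster.

    series : 2-D array (n_points, n_months)
    labels : cluster id per point; negative ids are treated as unclustered
    """
    series = np.asarray(series, dtype=float)
    labels = np.asarray(labels)
    return {int(k): series[labels == k].mean(axis=0)
            for k in np.unique(labels) if k >= 0}

sst = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [0.0, 0.0]]
centroids = cluster_centroids(sst, [0, 0, 1, -1])
```

Each centroid is itself a time series, so it can be correlated with land variables in the same way as a known index.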
26Finding Influential Ocean Regions
- Not all points on the ocean correlate well with land variables such as temperature and precipitation.
- Best points are those which have a high density
- Dense points are relatively homogeneous with respect to their neighboring points.
27Discovery of Ocean Climate Indices
- Use clustering to find areas of the oceans that have high density, i.e., relatively homogeneous behavior.
- Cluster centroids are potential OCIs.
- For SLP, pairs of cluster centroids are potential OCIs.
- Evaluate the influence of potential OCIs on land points.
- Determine if the potential OCI matches a known OCI.
- For potential OCIs that are not well-known, conduct further evaluation.
- Are there land points that have higher correlation for the potential OCI than for known indices?
28SST Clusters
29Evaluating Cluster Centroids as Potential OCIs
- Evaluation will be based on area weighted correlation
- Ignore clusters whose area weighted correlation is low.
- Three cases
- Clusters are highly similar to known OCIs (corr > 0.4)
- May represent a known OCI
- Clusters may be better, i.e., higher coverage
- Clusters may cover different area, i.e., some points for which the new OCI is a better predictor
- Clusters are moderately similar to known OCIs (0.25 < corr < 0.4)
- Again, new OCIs may be better predictors for some points.
- Clusters are not similar to known OCIs (corr < 0.25)
- These clusters may represent as yet undiscovered Earth Science phenomena.
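The three cases above can be expressed as a small classification step. In this sketch the thresholds 0.4 and 0.25 come from the slide, while the use of the absolute correlation, the handling of values exactly on a threshold, and the function names are assumptions.

```python
import numpy as np

def best_match(centroid, known_indices):
    """Return the known index that correlates most strongly with the centroid."""
    corrs = {name: float(np.corrcoef(centroid, series)[0, 1])
             for name, series in known_indices.items()}
    name = max(corrs, key=lambda k: abs(corrs[k]))
    return name, corrs[name]

def classify_candidate(corr, high=0.4, moderate=0.25):
    """Map a correlation with the best-matching known OCI to one of three cases."""
    c = abs(corr)    # assumption: sign is ignored
    if c > high:
        return "highly similar"
    if c > moderate:
        return "moderately similar"
    return "not similar"

centroid = [1.0, 2.0, 3.0, 4.0, 5.0]
known = {"A": [2.0, 4.0, 6.0, 8.0, 10.0],    # toy series, not real indices
         "B": [1.0, -1.0, 1.0, -1.0, 1.0]}
name, corr = best_match(centroid, known)
case = classify_candidate(corr)
```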
30SST Clusters Highly Correlated to Known Indices
Area Weighted Correlation of Cluster Centroids to
Land Temp Overlapping, threshold 0
31SST Clusters Highly Correlated to Known Indices
32SST Clusters that Correspond to El Nino Climate
Indices
Figure labels: 75, 78, 67, 94
El Nino Regions Defined by Earth Scientists
SNN clusters of SST that are highly correlated with El Nino indices, 0.93 correlation.
33SST Clusters Highly Correlated to Known Indices
- Examples of some SST clusters that are
highly correlated to known OCIs and have high
area weighted correlation with land temperature.
These indices have a significant correlation with
El Nino indices.
34SST Clusters Highly Correlated to Known Indices
- However, there are areas (yellow) where these
clusters correlate better.
35SST Clusters Highly Correlated to Known Indices
36SST Cluster Moderately Correlated to Known Indices
37Comments from our NASA collaborators
- Ocean cluster results based on SST correlations with land surface temperature suggest that:
- New areas of the ocean may be identified that are not yet known to be highly representative of the El Nino Southern Oscillation (ENSO) and the Arctic Oscillation (AO).
- New predictive indices for land climate over the past 40 years can be identified that will improve upon predictions using any known ocean climate index to date, including SOI and AO.
38Issues in Mining Associations from Earth Science
Data
- Data is continuous rather than discrete.
- Data has spatial and temporal components.
- Data can be multilevel
- time and spatial granularities.
- Observations are not i.i.d. due to spatial and temporal autocorrelations.
- Data may contain noise, missing information and measurement errors
- historical SST data between 1856-1941 was measured using wooden buckets.
- Data may come from heterogeneous sources
- Calibration issues.
39Mining Associations in Earth Science Data
Challenges
- How to transform Earth Science data into transactions?
- What are the baskets?
- What are the items?
- How to define support?
40Mining Associations Patterns in Earth Science
Data Challenges
1. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI => NPP-HI (support count = 145, confidence = 100%)
2. FPAR-HI PET-HI PREC-HI TEMP-HI => NPP-HI (support count = 933, confidence = 99.3%)
3. FPAR-HI PET-HI PREC-HI => NPP-HI (support count = 1655, confidence = 98.8%)
4. FPAR-HI PET-HI PREC-HI SOLAR-HI => NPP-HI (support count = 268, confidence = 98.2%)
- How to efficiently discover spatio-temporal associations?
- Use existing algorithms.
- Develop new algorithms.
41Event Definition
- Items are events abstracted from time series.
- Events of interest include
- Temporal events
- Anomalous temporal events such as warmer winters and droughts.
- Changes in the periodic behavior such as longer growing seasons or earlier month of onset of greenup.
- Spatial events
- Large percentage of land areas in a certain region having below-average precipitation.
- Spatio-temporal events
- Changes in circulation or trajectory of jet-streams.
42Example of Anomalous Event Definition
With a Z-score threshold of 1.5, there are on average 20 events per time series.
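A threshold-based event definition like this one can be sketched as follows. The 1.5 threshold is from the slide; the symmetric low threshold, the strict inequalities, and the `-HI` / `-LO` naming (borrowed from the rule listings elsewhere in the deck) are assumptions, as is the function name.

```python
def extract_events(z_scores, variable, threshold=1.5):
    """Turn a Z-score series into (month_index, event_label) pairs."""
    events = []
    for month, z in enumerate(z_scores):
        if z > threshold:
            events.append((month, f"{variable}-HI"))   # anomalously high month
        elif z < -threshold:
            events.append((month, f"{variable}-LO"))   # anomalously low month
    return events

prec_events = extract_events([0.0, 1.6, -2.0, 1.5, 0.3], "PREC")
```

Months whose Z-score stays within the threshold produce no event, so most of a series contributes nothing to the event sequence.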
43Transaction and Support Definitions
- Convert the time series into a sequence of events for each spatial location.
44Examples of Association Patterns
- min support = 0.001, min confidence = 10%
1. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI => NPP-HI (support count = 145, confidence = 100%)
2. FPAR-HI PET-HI PREC-HI TEMP-HI => NPP-HI (support count = 933, confidence = 99.3%)
3. FPAR-HI PET-HI PREC-HI => NPP-HI (support count = 1655, confidence = 98.8%)
4. FPAR-HI PET-HI PREC-HI SOLAR-HI => NPP-HI (support count = 268, confidence = 98.2%)
5. FPAR-HI PET-HI PREC-HI SOLAR-LO TEMP-HI => NPP-HI (support count = 44, confidence = 97.8%)
6. FPAR-LO PET-LO PREC-LO SOLAR-LO => NPP-LO (support count = 216, confidence = 96.9%)
7. FPAR-LO PREC-LO SOLAR-LO TEMP-HI => NPP-LO (support count = 152, confidence = 96.2%)
8. FPAR-LO PET-LO PREC-LO SOLAR-LO TEMP-LO => NPP-LO (support count = 47, confidence = 95.9%)
9. FPAR-LO PREC-LO SOLAR-LO TEMP-LO => NPP-LO (support count = 49, confidence = 94.2%)
10. FPAR-LO PREC-LO SOLAR-LO => NPP-LO (support count = 595, confidence = 93.7%)
...
75. FPAR-HI => NPP-HI (support count = 216924, confidence = 55.7%)
Figure: diagram relating NPP, Solar, FPAR, Temperature and Moisture.
45Example of Interesting Association Patterns
FPAR-Hi => NPP-Hi (sup = 5.9%, conf = 55.7%)
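Support and confidence for a rule such as FPAR-Hi => NPP-Hi follow the standard association-rule definitions. The sketch below assumes each transaction is the set of events observed at one location in one month; that transaction definition is an assumption, since the slide text does not spell it out, and the toy transactions are invented for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support count, support, confidence) for antecedent => consequent."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return n_both, support, confidence

# Toy transactions, one set of events per (location, month).
transactions = [
    {"FPAR-HI", "NPP-HI"},
    {"FPAR-HI"},
    {"NPP-HI"},
    {"FPAR-HI", "NPP-HI", "PREC-HI"},
]
count, support, confidence = rule_stats(transactions, {"FPAR-HI"}, {"NPP-HI"})
```

Here the rule holds in 2 of 4 transactions (support 0.5) and in 2 of the 3 transactions containing FPAR-HI (confidence about 0.67).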
46Land Cover Types
47Using Land Cover as Additional Features
1. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI => NPP-HI (support count = 145, confidence = 100%)
2. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI GRASSLAND => NPP-HI (support count = 145, confidence = 100%)
3. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI FOREST => NPP-HI (support count = 44, confidence = 100%)
4. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI CROPLAND => NPP-HI (support count = 44, confidence = 100%)
5. FPAR-HI PET-HI PREC-HI SOLAR-HI FOREST => NPP-HI (support count = 75, confidence = 100%)
6. FPAR-HI PET-HI PREC-HI SOLAR-HI CROPLAND => NPP-HI (support count = 81, confidence = 100%)
7. FPAR-HI PREC-HI SOLAR-HI TEMP-HI CROPLAND => NPP-HI (support count = 58, confidence = 100%)
8. FPAR-HI PET-HI PREC-HI TEMP-HI GRASSLAND => NPP-HI (support count = 376, confidence = 99.5%)
9. FPAR-HI PET-HI PREC-HI TEMP-HI CROPLAND => NPP-HI (support count = 170, confidence = 99.4%)
10. FPAR-HI PET-HI PREC-HI CROPLAND => NPP-HI (support count = 277, confidence = 99.3%)
..
- Produce multiple rules that have the same form
- A => B; A, Grassland => B; A, Cropland => B; etc.
- Some of the support counts could be missing if itemsets fall below the minimum support threshold.
48Finding Interesting Earth-Science Patterns
- A pattern is interesting if it occurs relatively
more frequently in some homogeneous regions.
- If the relative frequency of a pattern is similar in all groups of land areas, then it is less interesting.
- If the pattern occurs mostly in a certain group of land areas, then it is potentially interesting.
49Filtering Patterns using Land Cover Types
- For each pattern $p$
- Actual coverage for land cover type $i$: $s_i / S$
- Expected coverage for land cover type $i$: $n_i / N$
- Ratio of actual to expected coverage for land cover type $i$: $e_i = \frac{s_i / S}{n_i / N} = \frac{s_i N}{n_i S}$
- Interest Measure
- If pattern occurs in arbitrary regions, interest measure will be low.
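The ratio of actual to expected coverage can be computed per land cover type as sketched below. The symbol meanings are assumptions inferred from the slide: s_i is taken as the support count of the pattern within land cover type i, n_i as the number of land points of type i, and S and N as the corresponding totals. The interest measure itself is not reproduced in the slide text, so the sketch stops at the ratios.

```python
def coverage_ratios(pattern_counts, land_counts):
    """Ratio e_i = (s_i / S) / (n_i / N) for every land cover type i."""
    S = sum(pattern_counts.values())    # total support of the pattern
    N = sum(land_counts.values())       # total number of land points
    if S == 0:
        raise ValueError("pattern never occurs")
    return {cover: (pattern_counts.get(cover, 0) * N) / (n_i * S)
            for cover, n_i in land_counts.items()}

# Toy counts, not taken from the slides.
ratios = coverage_ratios({"forest": 30, "grassland": 10},
                         {"forest": 100, "grassland": 300})
```

A ratio well above 1 means the pattern is over-represented in that land cover type; ratios near 1 everywhere correspond to a pattern that occurs in arbitrary regions.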
50Interesting Spatial Association Pattern
51Interesting Spatial Association Pattern
Figure (axis: Land Cover)
- Prec-Hi => NPP-Hi tends to occur in grassland and cropland regions.
52Other Interesting Spatial Association Patterns
Figure (axes: Support Count, Land Cover)
- Temp-Hi => NPP-Hi tends to occur in the forest and cropland regions in the northern hemisphere (Forests (33.5), Grassland (8.7), Cropland (24.5), Desert (0.4))
53Global River Discharge Data
- Global River Discharge Data
- 30 rivers, 0.5 degree resolution
- Two measurement stations: mouth and source of river system/basin
- Minimum of ten continuous years of monthly station discharge records
- Interesting associations
- e.g., Amazon discharge is highly correlated with ANOM 3.4 (r = -0.5)
54Relationship Between River Basin PREC and OCI
Amazon
Parana
- Correlation between precipitation (PREC) aggregated over river basins and OCI is shown in the left figure
- Interesting Observations
- Amazon and Parana are nearby; however, their signals with respect to the OCIs are almost reversed
55Discharge Data Amazon vs. Parana
56Relationship Between River Basin PREC and OCI ..
Petchora
- Interesting Observations
- Petchora and Pacific Decadal Oscillation (PDO)
are highly correlated.
57Correlation between PREC and DISCHARGE
58Correlation between OCI and DIS (r 0.3)
Amu-Darya
Brahmaputra
Amazon
Columbia
Colorado
59Proposed Framework of River Analysis
60Conclusions
- Association rules can uncover interesting patterns for Earth scientists to investigate.
- Challenges arise due to the spatio-temporal nature of the data.
- Need to incorporate domain knowledge to prune out uninteresting patterns.
- By using clustering we have made some progress towards automatically finding climate patterns that display interesting connections between the ocean and the land.
- Need to further evaluate candidates for new climate indices.
- Correlation analysis on river discharge data can be used to evaluate the effects of climate and man.
61Case Studies Earth Science Data
- Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Chris Potter, Steven Klooster, Alicia Torregrosa, Clustering Earth Science Data: Goals, Issues and Results, Workshop on Mining Scientific Data, KDD 2001, San Francisco, CA, 2001.
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Steven Klooster, Christopher Potter, Alicia Torregrosa, Finding Spatio-Temporal Patterns in Earth Science Data: Goals, Issues and Results, Temporal Data Mining Workshop, KDD 2001, San Francisco, CA, 2001.
- Vipin Kumar, Michael Steinbach, Pang-Ning Tan, Steven Klooster, Chris Potter, Alicia Torregrosa, Mining Scientific Data: Discovery of Patterns in the Global Climate System, Joint Statistical Meetings, Atlanta, GA, 2001.
- Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Chris Potter, Steven Klooster, Data Mining for the Discovery of Ocean Climate Indices, Workshop on Mining Scientific Data, SDM 2002.
62Statistical Issues
- Temporal Autocorrelation
- Makes it difficult to calculate degrees of freedom and determine significance levels for tests, e.g., non-zero correlation.
- Moving average is nice for smoothing and seeing the overall behavior, but introduces additional autocorrelation.
- Removal of seasonality removes much of the autocorrelation (as long as not performed via the moving average).
- Measures of time series similarity
- Detecting non-linear connections
- Detecting connections that only exist at certain times.
- Sometimes only extreme events have an effect.
- Automatically detecting appropriate time lags.
- Statistical tests for more sophisticated measures.
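The point about moving averages introducing autocorrelation can be checked with a small experiment. This sketch uses synthetic white noise rather than climate data, a 12-point window chosen only for illustration, and hypothetical helper names.

```python
import numpy as np

def lag1_autocorrelation(series):
    """Lag-1 autocorrelation of a series (mean removed first)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x * x))

def moving_average(series, window):
    """Simple unweighted moving average over full windows only."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")

rng = np.random.default_rng(0)
noise = rng.standard_normal(600)            # uncorrelated synthetic series
smoothed = moving_average(noise, 12)
print("white noise lag-1:", lag1_autocorrelation(noise))
print("smoothed lag-1:   ", lag1_autocorrelation(smoothed))
```

Consecutive smoothed values share most of their window, so the smoothed series is strongly autocorrelated even though the input is not.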
63Statistical Issues
- Detecting spurious connections.
- We are performing many correlation calculations and there is a chance of spurious correlations.
- Given that we have 100,000 locations on the Earth for which we have time series, how many spuriously high correlations will we get when we calculate the correlation between these locations and a climate index?
- Because of spatial autocorrelation, these correlations are not independent.
- Again we have trouble calculating the degrees of freedom.
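As a rough baseline for the question above, the expected number of spurious results can be worked out under the simplifying assumption that the tests are independent. As the slide notes, spatial autocorrelation violates this, so the numbers are only a reference point; the significance levels and function names are illustrative assumptions.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests that pass by chance at significance level alpha."""
    return n_tests * alpha

def bonferroni_alpha(n_tests, family_alpha=0.05):
    """Per-test level that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests

n_locations = 100_000
by_chance = expected_false_positives(n_locations, 0.05)
per_test = bonferroni_alpha(n_locations)
```

With 100,000 locations and a 0.05 level, about 5,000 locations would appear significantly correlated with an index purely by chance under independence; a Bonferroni correction would require a per-test level of 5e-7.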