Title: Recent Advances in Data Mining and Applications for Heliophysics


1
Recent Advances in Data Mining and Applications
for Heliophysics
  • Kirk D. Borne

George Mason University and QSS Group Inc., NASA-Goddard
kborne@gmu.edu or kirk.borne@gsfc.nasa.gov
http://rings.gsfc.nasa.gov/nvo_datamining.html
2
Recent Advances in Data Mining and Applications
for Solar and Space Physics
LSSP Seminar, GSFC Code 612, June 9, 2006
Kirk Borne (QSS / SSDOO)
  • ABSTRACT: Modern approaches to long-standing
    scientific research problems now rely heavily
    upon novel computational techniques. Much of this
    is driven by a common research challenge that
    pervades most disciplines: the growing volumes of
    scientific data that need to be processed and
    analyzed. Relevant computational techniques
    include data mining, evolutionary computing, and
    high-performance computing (HPC). For example,
    facilitating data-intensive science is a focus of
    the Goddard space sciences HPC initiative.
    Several examples of scientific data mining in
    large data sets will be presented from the
    author's own astronomy research. Additional
    examples will be given from the author's
    collaborations in the fields of data mining of
    remote sensing data for wildfire detection and
    data mining within Solar coronal mass ejection
    data sets. The goals of the talk are to
    illustrate and to motivate collaborative research
    opportunities across the LSSP, involving
    scientific discovery within existing and upcoming
    large solar and space physics mission data
    collections. Our goals would be (a) to
    demonstrate and augment the legacy value of the
    tremendous investment of resources that have gone
    into the acquisition of large NASA mission data
    sets and (b) to reap the maximum scientific
    benefit from those investments.
  • BIO: Dr. Kirk Borne has a PhD in Astronomy from
    Caltech, and he subsequently had positions at the
    University of Michigan, Carnegie's Department of
    Terrestrial Magnetism, Space Telescope Science
    Institute, and Hughes/Raytheon STX in Goddard's
    Code 631. He currently works for QSS Group Inc.
    as Program Manager for Goddard's SSDOO support
    contract, managing staff in Codes 612.4, 690.1,
    and 605. Dr. Borne is also Associate Research
    Professor of Astrophysics and Computational
    Sciences at George Mason University (GMU) in
    Fairfax, Virginia, and he is also Adjunct
    Associate Professor in the Database Technologies
    Program at the University of Maryland University
    College where he teaches a graduate course in
    data mining. He is a senior member of the U.S.
    National Virtual Observatory (NVO) project and of
    the planned Large Synoptic Survey Telescope
    project. His research interests include
    extragalactic astronomy, numerical modeling,
    scientific data mining, computational science,
    and science education technologies.

3
OUTLINE
  • The New Face of Science
  • Heliophysics (Data) Environment
  • Knowledge Discovery
  • Data Mining Examples and Techniques
  • Discovery Informatics for Large Database Science
  • Heliophysics Example

4
OUTLINE
  • The New Face of Science
  • Heliophysics (Data) Environment
  • Knowledge Discovery
  • Data Mining Examples and Techniques
  • Discovery Informatics for Large Database Science
  • Heliophysics Example

5
The New Face of Science (1)
  • Big Data (usually geographically distributed)
  • High-Energy Particle Physics
  • Astronomy and Space Physics
  • Earth Observing System (Remote Sensing)
  • Human Genome and Bioinformatics
  • Numerical Simulations of any kind
  • Digital Libraries (electronic publication
    repositories)
  • e-Science
  • Built on Web Services (e-Gov, e-Biz) paradigm
  • Distributed heterogeneous data are the norm
  • Data integration across projects and institutions
  • One-stop shopping: the right data, right now.

6
The New Face of Science (2)
  • Databases enable scientific discovery
  • Data Handling and Archiving (management of
    massive data resources)
  • Data Discovery (finding data wherever they exist)
  • Data Access (WWW-Database interfaces)
  • Data/Metadata Browsing (serendipity)
  • Data Sharing and Reuse (within project teams and
    by other scientists; scientific validation)
  • Data Integration (from multiple sources)
  • Data Fusion (across multiple modalities and
    domains)
  • Data Mining (KDD: Knowledge Discovery in
    Databases)

7
The Promise of e-Science
  • The best of Google and Amazon.com
  • Go to one place to shop for all your data needs
  • Use scientific indexing (through scientific
    metadata)
  • Find the data that you need
  • Ignore data that are not relevant
  • Recommend also relevant data sets
  • Access distributed data seamlessly
    (transparently)
  • Integrate multiple data sets
  • Integrate data sets into analysis/visualization
    software packages
  • Provide value-added services
  • Provide intelligence within the archive
  • Provide intelligence at the point of service

8
OUTLINE
  • The New Face of Science
  • Heliophysics (Data) Environment
  • Knowledge Discovery
  • Data Mining Examples and Techniques
  • Discovery Informatics for Large Database Science
  • Heliophysics Example

9
Sun-Earth Space Environment: Rich Source of
Heliophysical Phenomena
10
Multi-point Observations and Models of Space
Plasmas Deliver a Deluge of Physical Measurements
11
(No Transcript)
12
Space Science data volumes are growing and
growing and ...
  • a few terabytes "yesterday" (10,000 CDROMs)
  • tens of terabytes "today" (100,000 CDROMs)
  • 100s of petabytes "tomorrow"
    (within 10-20 years) (1,000,000,000 CDROMs)
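
As a rough check of the CD-ROM equivalents quoted on this slide, the sketch below converts a data volume into a disc count. It assumes about 650 MB per CD-ROM and decimal units (1 TB = 10^6 MB); the representative volumes of 6.5 TB, 65 TB and 650 PB are picked for illustration and do not come from the slides.

```python
# Assumption: one CD-ROM holds about 650 MB (not stated in the slides).
CDROM_MB = 650

MB_PER_TB = 10**6  # decimal units
MB_PER_PB = 10**9


def cdroms_needed(volume_mb):
    """Number of CD-ROMs needed to hold volume_mb megabytes (rounded up)."""
    return -(-volume_mb // CDROM_MB)  # ceiling division on integers


# Representative volumes, chosen for illustration only.
volumes_mb = {
    "yesterday (6.5 TB)": 6_500_000,
    "today (65 TB)": 65 * MB_PER_TB,
    "tomorrow (650 PB)": 650 * MB_PER_PB,
}

for label, mb in volumes_mb.items():
    print(f"{label}: {cdroms_needed(mb):,} CD-ROMs")
```

Under these assumptions the three counts come out at ten thousand, one hundred thousand and one billion discs, matching the orders of magnitude on the slide.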

13
Technological Advances: the cause and the
solution?
14
Data Access and Analysis Tools are Essential,
but do not scale well with Exponential Data
Growth
15
The Data Flood is Everywhere!
  • Huge quantities of data are being generated in
    all business, government, and research domains
  • Banking, retail, marketing, telecommunications,
    homeland security, computer networks, other
    business transactions ...
  • Scientific data: genomics, space science,
    physics, etc.
  • Web, text, and e-commerce

16
(Credit: Tim Eastman)
17
OUTLINE
  • The New Face of Science
  • Heliophysics (Data) Environment
  • Knowledge Discovery
  • Data Mining Examples and Techniques
  • Discovery Informatics for Large Database Science
  • Heliophysics Example

18
How do we learn about our Universe and the World
around us?
Data → Information → Knowledge → Understanding /
Wisdom!
WE GATHER INFORMATION, FROM WHICH WE DERIVE
KNOWLEDGE, FROM WHICH WE LEARN WHAT IT ALL MEANS
19
Data-Information-Knowledge-Wisdom
  • T.S. Eliot (1934)
  • Where is the wisdom we have lost in knowledge?
  • Where is the knowledge we have lost in
    information?

20
Astronomy Example
Data
(a) Imaging data (ones and zeroes)
(b) Spectral data (ones and zeroes)
  • Information (catalogs / databases)
  • Measure brightness of galaxies from image (e.g.,
    14.2 or 21.7)
  • Measure redshift of galaxies from spectrum (e.g.,
    0.0167 or 0.346)

Knowledge: Hubble Diagram → Redshift-Brightness
Correlation → Redshift-Distance Relation
Understanding: the Universe is expanding!!
21
So what is Data Mining?
  • Data Mining is Knowledge Discovery in Databases
    (KDD)
  • Data mining is defined as "an information
    extraction activity whose goal is to discover
    hidden facts contained in (large) databases."

22
OUTLINE
  • The New Face of Science
  • Heliophysics (Data) Environment
  • Knowledge Discovery
  • Data Mining Examples and Techniques
  • Discovery Informatics for Large Database Science
  • Heliophysics Example

23
Data Mining
  • Data Mining is the Killer App for Scientific
    Databases.
  • Scientific Data Mining References:
  • http://rings.gsfc.nasa.gov/nvo_datamining.html
  • http://www.itsc.uah.edu/f-mass/
  • Framework for Mining and Analysis of Space
    Science data (F-MASS)
  • Data mining is used to find patterns and
    relationships in data. (EDA: Exploratory Data
    Analysis)
  • Patterns can be analyzed via 2 types of models:
  • Descriptive: Describe patterns and create
    meaningful subgroups or clusters. (Unsupervised
    Learning, Clustering)
  • Predictive: Forecast explicit values, based
    upon patterns in known results. (Supervised
    Learning, Classification)
  • How does this apply to Scientific Research?
  • through KNOWLEDGE DISCOVERY
  • Data → Information → Knowledge →
    Understanding / Wisdom!
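
To make the descriptive/predictive distinction concrete, here is a minimal sketch in plain Python. The descriptive part groups unlabelled values with a small 1-D k-means; the predictive part learns one centroid per known label and assigns new values to the nearest one. The function names and the brightness-like numbers are hypothetical illustrations, not taken from the talk.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Descriptive model: group unlabelled 1-D values into k clusters."""
    ordered = sorted(values)
    # Initial centroids: k values spread through the sorted data (needs len >= k).
    centroids = ordered[::max(1, len(ordered) // k)][:k]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters


def train_nearest_centroid(labelled):
    """Predictive model: learn one centroid per known class label."""
    sums, counts = {}, {}
    for value, label in labelled:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, value):
    """Assign a new value to the class with the nearest centroid."""
    return min(model, key=lambda label: abs(value - model[label]))


# Hypothetical brightness-like measurements (illustration only).
unlabelled = [14.1, 14.3, 14.2, 21.6, 21.7, 21.9]
centroids, clusters = kmeans_1d(unlabelled)
print("clusters found:", clusters)

labelled = [(14.1, "bright"), (14.3, "bright"), (21.6, "faint"), (21.9, "faint")]
model = train_nearest_centroid(labelled)
print("15.0 is classified as", predict(model, 15.0))
```

The clustering step needs no labels and only describes structure in the data; the classifier needs known results to learn from before it can forecast a label for a new item.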

24
Data Mining is a core database function
  • Data Mining has many names / aliases
  • Knowledge Discovery in Databases (KDD)
  • Machine Learning (ML)
  • Exploratory Data Analysis (EDA)
  • Intelligent Data Analysis (IDA)
  • On-Line Analytical Processing (OLAP)
  • Business Intelligence (BI)
  • Customer Relationship Management (CRM)
  • Business Analytics
  • Target Marketing
  • Cross-Selling
  • Market Basket Analysis
  • Credit Scoring
  • Case-Based Reasoning (CBR)
  • Connecting the Dots
  • Intrusion Detection Systems (IDS)
  • Recommendation / Personalization Systems!

25
Data Mining is Ready for Prime Time
  • Why are Data Mining and Knowledge Discovery such
    hot topics? -- because of the enormous interest
    in existing huge databases and their potential
    for new discoveries.
  • Data mining is ready for general application
    because it engages three technologies that are
    now sufficiently mature:
  • Massive data collection and delivery
  • Powerful multiprocessor computers
  • Sophisticated data mining algorithms
  • 5 Reasons to use Data Mining:
  • Most agencies collect and refine massive
    quantities of data.
  • Data mining moves beyond the analysis of past
    events to predicting future trends and
    behaviors that may be missed because they lie
    outside the experts' expectations.
  • Data mining tools can answer complex questions
    that traditionally were too time-consuming to
    resolve.
  • Data mining tools can explore the intricate
    interdependencies within databases in order to
    discover hidden patterns and relationships.
  • Data mining allows decision-makers to make
    proactive, knowledge-driven decisions.

26
Examples of real Data Mining in Action
  • Classic Textbook Example of Data Mining
    (Legend?): Data mining of grocery store logs
    indicated that men who buy diapers also tend to
    buy beer at the same time.
  • Blockbuster Entertainment mines its video rental
    history database to recommend rentals to
    individual customers.
  • Astronomers examined objects with extreme colors
    in a huge database to discover the most distant
    Quasars ever seen.
  • Credit card companies recommend products to
    cardholders based on analysis of their monthly
    expenditures.
  • Airline purchase transaction logs revealed that
    9-11 hijackers bought one-way airline tickets
    with the same credit card.
  • Wal-Mart studied product sales in their Florida
    stores in 2004 when several hurricanes passed
    through Florida. Wal-Mart found that, before the
    hurricanes arrived, people purchased 7 times as
    many strawberry pop tarts compared to normal
    shopping days.
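
The diapers-and-beer story is an instance of association (market basket) analysis. The sketch below computes the three usual rule metrics, support, confidence and lift, for a rule "antecedent -> consequent". The function name and the eight shopping baskets are made up for illustration and are not real store data.

```python
def association_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent in t)
    n_c = sum(1 for t in transactions if consequent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    if n_a == 0 or n_c == 0:
        raise ValueError("both items must occur at least once")
    support = n_both / n         # fraction of baskets containing both items
    confidence = n_both / n_a    # P(consequent | antecedent)
    lift = confidence / (n_c / n)  # > 1 means the items co-occur more than chance
    return support, confidence, lift


# Hypothetical baskets (illustration only).
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"diapers", "beer", "milk"},
    {"eggs", "milk"},
]

support, confidence, lift = association_metrics(baskets, "diapers", "beer")
print(f"support={support}, confidence={confidence}, lift={lift}")
# prints: support=0.375, confidence=0.75, lift=1.5
```

A lift above 1 says the two items appear together more often than they would if purchases were independent, which is the kind of "unusual co-occurring association" the examples above describe.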

27
Strawberry pop tarts???
28
Astronomy Data Mining in Action
Exploring the Time Domain
Mega-Flares on normal Sun-like stars: a star
like our Sun increased in brightness 300X one
night! Say what??
29
Data Mining Methods and Some Examples
  • Clustering
  • Classification
  • Associations
  • Neural Nets
  • Decision Trees
  • Pattern Recognition
  • Correlation/Trend Analysis
  • Principal Component Analysis
  • Independent Component Analysis
  • Regression Analysis
  • Outlier/Glitch Identification
  • Visualization
  • Autonomous Agents
  • Self-Organizing Maps (SOM)
  • Link (Affinity) Analysis

Group together similar items and separate
dissimilar items in DB
Classify new data items using the known classes
and groups
Find unusual co-occurring associations of
attribute values among DB items
Predict a numeric attribute value
Organize information in the database based on
relationships among key data descriptors
Identify linkages between data items based on
features shared in common
30
Some Data Mining Techniques Graphically
Represented
  • Self-Organizing Map (SOM)

Clustering
Neural Network
Outlier (Anomaly) Detection
Link Analysis
Decision Tree
31
Data Mining Application: Outlier Detection
Figure: The clustering of data clouds (dc)
within a multidimensional parameter space
(p). Such a mapping can be used to search for
and identify clusters, voids, outliers,
one-of-a-kinds, relationships, and associations
among arbitrary parameters in a database (or
among various parameters in geographically
distributed databases).
  • statistical analysis of typical events
  • automated search for rare events
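
One simple way to automate the search for rare events in a parameter space is to score each point by how far away its neighbours are. The sketch below is an illustration under stated assumptions, not the method used in the talk: it scores each point by the distance to its k-th nearest neighbour and flags points whose score is several times the median. The function names, the `factor` threshold and the sample points are hypothetical. It uses `math.dist`, available in Python 3.8 and later.

```python
import math
import statistics


def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def find_outliers(points, k=2, factor=3.0):
    """Flag points whose score exceeds `factor` times the median score."""
    scores = knn_outlier_scores(points, k)
    threshold = factor * statistics.median(scores)
    return [p for p, s in zip(points, scores) if s > threshold]


# Two hypothetical data clouds plus one isolated point (illustration only).
cloud_a = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
cloud_b = [(5.0, 5.0), (5.1, 4.8), (4.9, 5.2), (5.2, 5.1)]
rare = [(9.0, 0.5)]
points = cloud_a + cloud_b + rare

print(find_outliers(points))  # prints: [(9.0, 0.5)]
```

Points inside either cloud have close neighbours and low scores; the isolated point does not, so it is the only one reported. The same scores can also be summarised statistically to characterise the typical events.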

32
Outlier Detection: Serendipitous Discovery of
Rare or New Objects and Events
33
Learning From Legacy Temporal Data (Time
Series): Classify New Data (Bayes Analysis or
Markov Modeling)
34
Principal Components Analysis and Independent
Components Analysis
Cepheid Variables: Cosmic Yardsticks -- One
Correlation -- Two Classes!
35
Classification Methods: Decision Trees, Neural
Networks, SVM (Support Vector Machines)
  • There are 2 Classes!
  • How do you ...
  • Separate them?
  • Distinguish them?
  • Learn the rules?
  • Classify them?

Apply Kernel
(SVM)
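
The "apply kernel" step can be illustrated without training a full SVM. In the sketch below, two hypothetical classes sit in concentric rings, so no straight line in the (x, y) plane separates them. Adding the squared radius as a third coordinate makes them separable by a simple threshold, which is a flat plane in the mapped space. This shows only the explicit feature-map idea that kernel methods exploit; it is not an SVM implementation, and all names and data points are assumptions made for illustration.

```python
def feature_map(point):
    """Map (x, y) to (x, y, x^2 + y^2); the new coordinate is the squared radius."""
    x, y = point
    return (x, y, x * x + y * y)


def classify(point, threshold):
    """Linear rule in the mapped space: threshold the third coordinate."""
    return "outer" if feature_map(point)[2] > threshold else "inner"


# Hypothetical two-class data: an inner blob surrounded by an outer ring.
inner = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.4), (-0.2, -0.2)]
outer = [(2.0, 0.1), (0.2, 1.9), (-2.0, 0.3), (0.3, -2.1), (-1.5, -1.6)]

# Place the separating plane midway between the two classes in the new coordinate.
max_inner = max(feature_map(p)[2] for p in inner)
min_outer = min(feature_map(p)[2] for p in outer)
threshold = (max_inner + min_outer) / 2

print([classify(p, threshold) for p in inner])
print([classify(p, threshold) for p in outer])
```

A real SVM would choose the separating surface by maximising the margin and would typically use a kernel function to avoid computing the mapped coordinates explicitly.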
36
Data Mining For Exploration, Discovery, and
Decision Support (in science, government,
homeland security, and business)
37
Sample Space Science Data Mining Use Cases
  • Discover data stored in geographically
    distributed heterogeneous systems.
  • Search huge databases for trends and correlations
    in high-dimensional parameter spaces; identify
    new properties or new classes of scientific
    objects.
  • Discover new linkages and associations among data
    parameters.
  • Search for rare, one-of-a-kind, and exotic
    objects in huge databases.
  • Identify repeating patterns of temporal
    variations from millions or billions of
    observations.
  • Identify parameter glitches / anomalies /
    deviations either in static databases (e.g.,
    archives) or in dynamic data (e.g., science /
    instrumental / engineering data streams).
  • Find clusters, nearest neighbors, outliers,
    and/or zones of avoidance in the distribution of
    objects or other observables in arbitrary
    parameter spaces.
  • Serendipitously explore huge scientific databases
    through access to distributed, autonomous,
    federated, heterogeneous, multi-experiment,
    multi-institutional scientific data archives.

38
OUTLINE
  • The New Face of Science
  • Heliophysics (Data) Environment
  • Knowledge Discovery
  • Data Mining Examples and Techniques
  • Discovery Informatics for Large Database Science
  • Heliophysics Example

39
Existing Space Science Data Infrastructure
  • The Recent Past: many independent distributed
    heterogeneous data archives
  • Today: VxOs (Virtual Observatories)
  • Web Services-enabled e-Science paradigm
    (middleware, standards, protocols)
  • Provides seamless uniform access to distributed
    heterogeneous data sources
  • Find the right data, right now
  • One-stop shopping for all of your data needs
  • Emerging environment consists of many VxOs, for
    example:
  • NVO: National Virtual Observatory (precursor to
    VAO: Virtual Astro Obs)
  • VSO: Virtual Solar Observatory
  • VSPO: Virtual Space Physics Observatory
  • NVAO: National Virtual Aeronomy Observatory
  • VITMO: Virtual Ionospheric, Thermospheric,
    Magnetospheric Observatory
  • VHO: Virtual Heliospheric Observatory
  • VMO: Virtual Magnetospheric Observatory
  • Standards for data formats, data/metadata
    exchange, data models, registries, Web Services,
    VO queries, query results, semantics
  • And of course: The Grid, Web Services,
    Semantic Web, etc. ...

40
Our science data systems should enable
distributed multi-mission database access,
discovery, mining, and analysis.
→ DISCOVERY INFORMATICS
41
What is Informatics?
  • Informatics is the discipline of structuring,
    storing, accessing, and distributing information
    describing complex systems.
  • Examples:
  • Bioinformatics
  • Geographic Information Systems (Geoinformatics)
  • New! Discovery Informatics for Space Science
  • Common features of X-informatics:
  • Basic object granule is defined
  • Common community tools operate on object granules
  • Data-centric and Information-centric approaches
  • Data-driven science
  • X-informatics is a key enabler of scientific
    discovery in the era of large data science

42
X-Informatics Compared
  • Discipline X: Bioinformatics
    Common Tools: BLAST, FASTA
    Object Granules: Gene Sequence
  • Discipline X: Geoinformatics
    Common Tools: GIS
    Object Granules: Points, Vectors, Polygons
  • Discipline X: Space Science Informatics
    Common Tools: Classification, Clustering, Bayes
    Inference, Cross Correlations, Principal
    Components, ???
    Object Granules: Time Series, Event List, Catalog

43
Discovery Informatics
  • Key enabler for new science discovery in large
    databases
  • Essential tool (Large data science is here to
    stay)
  • Common data integration, browse, and discovery
    tools will enable exponential knowledge discovery
    within exponentially growing data collections
  • X-informatics represents the 3rd leg of
    scientific research: experiment, theory, and
    data-driven exploration (Reference: Jim Gray,
    KDD-2003)
  • Discovery Informatics should parallel
    Bioinformatics and Geoinformatics: become a
    stand-alone research sub-discipline

44
Key Role of Data Mining
  • Data Mining (KDD) is the killer app for
    scientific databases
  • Space and Earth Science Examples:
  • Neural Network for Pixel Classification: Event
    Detection and Prediction (e.g., Wildfires)
  • Bayesian Network for Object Classification
  • PCA for finding Fundamental Planes of Galaxy
    Parameters
  • PCA (weakest component) for Outlier Detection:
    anomalies, novel discoveries, new objects
  • Link Analysis (Association Mining) for Causal
    Event Detection (e.g., linking Solar Surface,
    CME, and Space Weather events)
  • Clustering analysis: Spatial, Temporal, or any
    scientific database parameters
  • Markov models: Temporal mining of time series
    data
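
The "PCA (weakest component)" idea can be sketched in two dimensions with plain Python. When most objects follow a tight correlation, the principal axis with the smallest variance measures departure from that correlation, so objects with a large projection onto it are candidate outliers. This is a minimal illustration under stated assumptions: the function names, the median-based threshold and the sample data (points near y = 2x plus one stray object) are hypothetical, and a real analysis would use many parameters and a linear algebra library.

```python
import math
import statistics


def weakest_component(points):
    """Return (mean, unit vector) for the smallest-variance principal axis in 2-D."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the symmetric 2x2 covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Smallest eigenvalue of a symmetric 2x2 matrix (closed form).
    lam_min = (a + c) / 2 - math.hypot((a - c) / 2, b)
    if b != 0:
        vx, vy = b, lam_min - a  # eigenvector belonging to lam_min
    else:  # already uncorrelated: take the axis with the smaller variance
        vx, vy = (1.0, 0.0) if a <= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (mx, my), (vx / norm, vy / norm)


def pca_outliers(points, factor=4.0):
    """Flag points whose projection on the weakest axis is far above the median."""
    (mx, my), (vx, vy) = weakest_component(points)
    proj = [abs((x - mx) * vx + (y - my) * vy) for x, y in points]
    threshold = factor * statistics.median(proj)
    return [p for p, d in zip(points, proj) if d > threshold]


# Hypothetical objects following y ~ 2x, plus one that breaks the relation.
points = [(0, 0.1), (1, 1.9), (2, 4.1), (3, 5.9), (4, 8.1),
          (5, 9.9), (6, 12.1), (7, 13.9), (3.5, 11.0)]
print(pca_outliers(points))
```

The strongest component captures the correlation itself; the weakest one is where anomalies and potentially new kinds of objects stand out.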

45
Space Science Knowledge Discovery
46
This is the Informatics Layer
47
This is the Informatics Layer
  • Informatics Layer
  • Provides standardized representations of the
    information extracted for use in the KDD
    (data mining) layer.
  • Standardization is not required (nor feasible) at
    the data source layer.
  • The informatics is discipline-specific.
  • Informatics enables KDD across large distributed
    heterogeneous scientific data repositories.

48
Space Weather Example
CME: Coronal Mass Ejection; SEP: Solar Energetic
Particle
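
Linking CME and SEP event lists, as in the link analysis use case mentioned earlier, can start with a simple temporal join: pair each SEP event with any CME that began shortly before it. The sketch below assumes a 48-hour window and uses invented event identifiers and timestamps; neither the window nor the data comes from the talk, and a temporal match alone does not establish a causal link.

```python
from datetime import datetime, timedelta


def link_events(cmes, seps, max_delay=timedelta(hours=48)):
    """Pair each SEP event with CMEs that started within max_delay before it."""
    links = []
    for sep_id, sep_time in seps:
        for cme_id, cme_time in cmes:
            if timedelta(0) <= sep_time - cme_time <= max_delay:
                links.append((cme_id, sep_id))
    return links


# Hypothetical event lists (identifiers and times are made up).
cmes = [
    ("CME-A", datetime(2006, 1, 1, 0, 0)),
    ("CME-B", datetime(2006, 1, 5, 12, 0)),
    ("CME-C", datetime(2006, 1, 20, 6, 0)),
]
seps = [
    ("SEP-1", datetime(2006, 1, 1, 18, 0)),
    ("SEP-2", datetime(2006, 1, 10, 0, 0)),
]

print(link_events(cmes, seps))  # prints: [('CME-A', 'SEP-1')]
```

Candidate pairs found this way would then be examined with additional parameters (location, speed, particle flux) before drawing any physical conclusion.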
49
Key Role of Discovery Informatics
  • The key role of Discovery Informatics is
  • ... data integration and fusion ...
  • ... across multiple heterogeneous data
    collections ...
  • ... to enable scientific knowledge discovery ...
  • ... and decision support.

50
Future Work: Discovery Informatics Applications
  • Query-By-Example (QBE) science data systems
  • Find more data entries similar to this one
  • Find the data entry most dissimilar to this one
  • Automated Recommendation (Filtering) Systems
  • Other users who examined these data also
    retrieved the following...
  • Other data that are relevant to these data
    include...
  • Information Retrieval Metrics for Scientific
    Databases
  • Precision: How much of the retrieved data is
    relevant to my query?
  • Recall: How much of the relevant data did my
    query retrieve?
  • Semantic Annotation (Tagging) Services
  • Report discoveries back to the science database
    for community reuse
  • Science / Technical / Math (STEM) Education
  • Transparent reuse and analysis of scientific data
    in inquiry-based classroom learning
    (http://serc.carleton.edu/usingdata/, DLESE.org)
  • Key concepts that need defining (by community
    consensus): Similarity, Relevance, Semantics
    (dictionaries, ontologies)
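
The two retrieval metrics defined above reduce to simple set arithmetic. The sketch below computes both for a single query; the function name and the data granule identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved items that are relevant
    recall    = fraction of relevant items that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical granule identifiers (illustration only).
retrieved = ["g1", "g2", "g3", "g4"]
relevant = ["g2", "g4", "g5", "g6", "g7"]

print(precision_recall(retrieved, relevant))  # prints: (0.5, 0.4)
```

Both metrics depend on an agreed definition of "relevant", which is why the slide lists relevance among the concepts that need community consensus.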

51
(No Transcript)
52
(science knowledge sharing and re-use)
( Repositories of information, knowledge, and
scientific results.)
53
Informatics Synergy between Scientific
Measurement, Mining, and Modeling
54
Data Mining and Discovery Informatics: It is more
than just connecting the dots
Reference: http://homepage.interaccess.com/purcellm/lcas/Cartoons/cartoons.htm