Astronomy Data Collections of the Future: Massive Data Mining Opportunities - PowerPoint PPT Presentation

About This Presentation

Title:

Astronomy Data Collections of the Future: Massive Data Mining Opportunities

Slides: 22

Provided by: drkirk


Transcript and Presenter's Notes

Title: Astronomy Data Collections of the Future: Massive Data Mining Opportunities

1
Astronomy Data Collections of the Future: Massive
Data Mining Opportunities

Kirk D. Borne

George Mason University and QSS Group Inc.,
NASA-Goddard
kborne_at_gmu.edu or kirk.borne_at_gsfc.nasa.gov
http://rings.gsfc.nasa.gov/borne/nvo_datamining.html
2
Discovery Informatics for Large Database Astronomy

Kirk D. Borne

George Mason University and QSS Group Inc.,
NASA-Goddard
kborne_at_gmu.edu or kirk.borne_at_gsfc.nasa.gov
http://rings.gsfc.nasa.gov/borne/nvo_datamining.html
3
Astronomy Data Collections of the Future: Massive
Data Mining Opportunities

INTERFACE 2006 Symposium on Massive Data Sets and Streams
Pasadena, CA
http://www.galaxy.gmu.edu/Interface2006/i2006webpage.html
Thursday May 25, Session 10:30am-12:00noon
San Marino Room, Westin Pasadena, 191 North Los Robles Avenue
Distributed Data, Parallel Computing, and Computational Strategies I
(David Marchette, chair)
Kirk Borne (George Mason University): Astronomy Data Collections of
the Future: Massive Data Mining Opportunities
Thursday 10:30-10:50, San Marino
Abstract: Astronomers are producing ever-increasing data volumes from
a variety of large projects, including NASA missions and ground-based
telescope projects. One of the largest of these will be the Large
Synoptic Survey Telescope (LSST) project, which is planned to produce
nearly 30 Terabytes of data per night of observation, every night, for
10 years. The LSST will image the entire visible sky repeatedly every
4 nights. Astronomers refer to the resulting combined data set as
"cosmic cinematography". Other similar projects are also envisioned.
The temporal, spatial, and high-dimensional data mining opportunities
posed by this massive data environment are exploding before us.
Consequently, the scientific discovery potential of these data sets is
enormous. All of these large astronomical data repositories will be
geographically distributed. To facilitate access to these distributed
data collections, astronomers are working with database experts and
computer scientists to build a worldwide virtual data system: the
National Virtual Observatory (NVO). I will describe the NVO and LSST
projects, plus other large astronomy data-producing projects,
including some of the corresponding discipline-driven research
challenges, and finally some astronomical data mining activities now
underway. It is anticipated that an Astro-informatics research
paradigm will evolve from this -- in fact, this is already becoming
important and soon will become imperative. Grid-based mining, Web
Services-enabled mining, and ontology-enhanced semantic mining will
all play a role in the astronomical research of the future.
Send proceedings manuscript to ysaid99_at_hotmail.com
by July 31, 2006 (Yasmin Said).

4
Existing Astronomy and Space Science Data
Infrastructure

The Recent Past: many independent distributed heterogeneous data archives
Today: NVO, the National Virtual Observatory ( http://www.us-vo.org/ )
  Web Services-enabled e-Science paradigm (middleware, standards, protocols)
  Part of the IVOA (International VO Alliance _at_ IVOA.net )
  Precursor to VAO, the Virtual Astronomical Observatory (NSF and NASA co-funded)
  Provides seamless uniform access to distributed heterogeneous data sources
  Find the right data, right now
  One-stop shopping for all of your data needs
One of many VxOs, for example:
  VSO: Virtual Solar Observatory
  VSPO: Virtual Space Physics Observatory
  NVAO: National Virtual Aeronomy Observatory
  VITMO: Virtual Ionospheric, Thermospheric, Magnetospheric Observatory
  VHO: Virtual Heliospheric Observatory
IVOA-approved standards for data formats, data/metadata exchange, data
models, registries, Web Services, VO queries, query results, semantic
astronomical catalog table headings (UCDs: Unified Content Descriptors)
And of course: The Grid, Web Services, Semantic Web, etc. ...

5
Massive Astronomy Data Collections

NVO (IVOA) registry of 14,000 data resources (collections, repositories, services)
Large Astronomy Data Archives (including NASA space astronomy missions)
Large Astronomy Sky Surveys (past and present): uniform data sets, including:
  MACHO and related surveys for dark matter objects: 1 Terabyte
  DPOSS (Digital Palomar Observatory Sky Survey): 3 Terabytes
  2MASS (2-Micron All Sky Survey): 10 Terabytes
  GALEX (Ultraviolet Space Telescope): 30 Terabytes
  SDSS (Sloan Digital Sky Survey): 40 Terabytes
Future Massive Astronomy Sky Surveys, including:
  PanSTARRS: 10 Terabytes per night, 40 Petabyte final archive anticipated
  LSST (Large Synoptic Survey Telescope _at_ http://www.lssto.org/ ):
    Begin operations in 2012, with 3-Gigapixel camera
    10 Gigabytes every 30 seconds
    30 Terabytes every night for 10 years
    100 Petabyte final image data archive anticipated; all data are public!!!
    30 Petabyte final catalog anticipated
    Real-Time Event Mining: 10,000-100,000 events per night, every night, for 10 yrs
    Repeat images of the entire night sky every 3 nights: Celestial Cinematography
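
As a rough check on how the LSST figures above relate to each other, the arithmetic can be sketched in a few lines of Python. This is only a back-of-envelope illustration: the function name, the decimal unit convention (1 Petabyte = 1000 Terabytes), and the assumption that data are taken on every calendar night are choices made here, not details from the slide.

```python
# Back-of-envelope sketch; names and unit conventions are assumptions.
TB_PER_PB = 1000            # decimal units assumed
NIGHTS_PER_YEAR = 365.25    # assumes observing on every calendar night

def archive_size_pb(tb_per_night, years, nights_per_year=NIGHTS_PER_YEAR):
    """Total accumulated volume in Petabytes for a fixed nightly rate."""
    return tb_per_night * nights_per_year * years / TB_PER_PB

# 30 Terabytes per night for 10 years (figures quoted on the slide)
total_pb = archive_size_pb(30, 10)

# 10,000 to 100,000 events per night over the same 10 years
total_nights = NIGHTS_PER_YEAR * 10
events_low = 10_000 * total_nights
events_high = 100_000 * total_nights

print(f"image archive: about {total_pb:.0f} PB")
print(f"events: {events_low:.2e} to {events_high:.2e}")
```

Under these assumptions the ten-year total comes out at the same order of magnitude as the 100 Petabyte archive quoted above; weather and maintenance downtime would lower it.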

6
NASA Astronomy Mission Data: the tip of the data
mountain
NSSDC's astrophysics data holdings
One of many science data collections for astronomy
across the US and the world!
NSSDC: National Space Science Data Center _at_ NASA/GSFC
http://nssdc.gsfc.nasa.gov/astro/astrolist.html
7
The Electromagnetic Spectrum

Radiation is the Astronomer's only source of
information about the Universe!
And it is a remarkably rich and diverse source!

8
Why so many Telescopes?
9
Why so many Telescopes?
Because:

Many great astronomical discoveries have come from
inter-comparisons of various wavelengths:
  Quasars
  Gamma-ray bursts
  Ultraluminous IR galaxies
  X-ray black-hole binaries
  Radio galaxies
  . . .

10
Therefore, our science data systems should enable
multi-wavelength distributed database access,
discovery, mining, and analysis.
→ DISCOVERY INFORMATICS
11
What is Informatics?

Informatics is the discipline of structuring, storing, accessing, and
distributing information describing complex systems.
Examples:
  Bioinformatics
  Geographic Information Systems (Geoinformatics)
  New! Discovery Informatics for Astronomy (Astroinformatics)
Common features of X-informatics:
  Basic object granule is defined
  Common community tools operate on object granules
  Data-centric and Information-centric approaches
  Data-driven science
X-informatics is a key enabler of scientific discovery in the era of
large data science

12
X-Informatics Compared

Discipline X: Bioinformatics
  Common Tools: BLAST, FASTA
  Object Granules: Gene Sequence

Discipline X: Geoinformatics
  Common Tools: GIS
  Object Granules: Points, Vectors, Polygons

Discipline X: Astroinformatics
  Common Tools: Classification, Clustering, Bayes Inference, Cross Correlations, Principal Components, ???
  Object Granules: Time Series, Event List, Catalog, Astronomical Object

13
General Themes in Informatics Research

Information and knowledge processing, including natural language processing, information extraction, integration of data from heterogeneous sources or domains, event detection, feature recognition

Tools for analyzing and/or storing very large datasets, data supporting ongoing experiments, and other data used in scientific research

Knowledge representation, including vocabularies, ontologies, simulations, and virtual reality

Linkage of experimental and model results to benefit research

Innovative uses of information technology in science applications, including decision support, error reduction, outcomes analysis, and information at the point of end-use

Efficient management and utilization of information and data, including knowledge acquisition and management, process modeling, data mining, acquisition and dissemination, novel visual presentations, and stewardship of large-scale data repositories and archives

Human-machine interaction, including interface design, use and understanding of science discipline-specific information, intelligent agents, information needs and uses

High-performance computing and communications relating to scientific applications, including efficient machine-machine interfaces, transmission and storage, real-time decision support

Innovative uses of information technology to enhance learning, retention and understanding of science discipline-specific information

14
Discovery Informatics

Key enabler for new science discovery in large databases
Essential tool (Large data science is here to stay)
Common data integration, browse, and discovery tools will enable
exponential knowledge discovery within exponentially growing data
collections
X-informatics represents the 3rd leg of scientific research:
experiment, theory, and data-driven exploration (Reference: Jim Gray,
KDD-2003)
Discovery Informatics should parallel Bioinformatics and
Geoinformatics: become a stand-alone research sub-discipline

15
Key Role of Data Mining

Data Mining (KDD) is the killer app for scientific databases
Space and Earth Science Examples:
  Neural Network for Pixel Classification: Event Detection and Prediction (e.g., Wildfires)
  Bayesian Network for Object Classification
  PCA for finding Fundamental Planes of Galaxy Parameters
  PCA (weakest component) for Outlier Detection: anomalies, novel discoveries, new objects
  Link Analysis (Association Mining) for Causal Event Detection (e.g., linking Solar Surface, CME, and Space Weather events)
  Clustering analysis: Spatial, Temporal, or any scientific database parameters
  Markov models: Temporal mining of time series data
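
The "PCA (weakest component) for Outlier Detection" idea can be illustrated with a minimal NumPy sketch. It is not the method used in the talk: the synthetic data, the plane z = 2x - y, the helper name weakest_component_scores, and the choice to standardize each parameter are all assumptions made for illustration. Objects that follow a tight correlation (a "fundamental plane") have almost no variance along the weakest principal component, so an object with a large offset along that direction stands out.

```python
import numpy as np

def weakest_component_scores(X):
    """Score each row by its standardized offset along the lowest-variance
    principal component (large score = far from the bulk's 'plane')."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each parameter
    cov = np.cov(Z, rowvar=False)              # parameter covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    weakest = eigvecs[:, 0]                    # direction of least variance
    return np.abs(Z @ weakest) / np.sqrt(eigvals[0])

# Synthetic, made-up data: 200 objects lying close to the plane z = 2x - y.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)
z = 2 * x - y + 0.05 * rng.normal(size=200)
X = np.column_stack([x, y, z])

# Append one hypothetical object that sits well off the plane.
X = np.vstack([X, [0.0, 0.0, 3.0]])

scores = weakest_component_scores(X)
outlier_index = int(np.argmax(scores))   # index of the most anomalous object
print(outlier_index, scores[outlier_index])
```

In a real catalog the rows would be measured object parameters, and a threshold on the score (rather than a single argmax) would select candidate anomalies for follow-up.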

16
Space Science Knowledge Discovery
17
This is the Informatics Layer
18
This is the Informatics Layer

Informatics Layer
Provides standardized representations of the
information extracted for use in the KDD
(data mining) layer.
Standardization is not required (nor feasible) at
the data source layer.
The informatics is discipline-specific.
Informatics enables KDD across large distributed
heterogeneous scientific data repositories.
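
One way to picture the layering described above is a set of small adapters that translate each archive's native records into a single standardized record for the mining layer. The sketch below is purely illustrative: the SkyObject record, both archive layouts, and every field name are hypothetical, not taken from any real archive or from the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SkyObject:
    """Hypothetical standardized record handed to the KDD (mining) layer."""
    source: str
    ra_deg: float
    dec_deg: float
    magnitude: float

def from_archive_a(row: dict) -> SkyObject:
    # Hypothetical archive A already stores coordinates in degrees.
    return SkyObject("A", row["ra"], row["dec"], row["mag"])

def from_archive_b(row: dict) -> SkyObject:
    # Hypothetical archive B stores right ascension in hours (1 h = 15 deg).
    return SkyObject("B", row["RA_HOURS"] * 15.0, row["DEC_DEG"], row["MAG_V"])

# The source layer stays heterogeneous; only the adapters know its formats.
records = [
    from_archive_a({"ra": 150.0, "dec": 2.2, "mag": 18.1}),
    from_archive_b({"RA_HOURS": 10.0, "DEC_DEG": 2.2, "MAG_V": 18.3}),
]

# The mining layer works on the uniform representation only.
brightest = min(records, key=lambda r: r.magnitude)
print(brightest)
```

The design point is that adding a new archive means writing one more adapter, while the mining code stays unchanged.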

19
Space Weather Example
CME: Coronal Mass Ejection; SEP: Solar Energetic
Particle
20
Key Role of Discovery Informatics

The key role of Discovery Informatics is
... data integration and fusion ...
... across multiple heterogeneous data
collections ...
... to enable scientific knowledge discovery ...
... and decision support.

21
Personal Reflection

Distributed Data Mining
(mining of distributed data)
versus
Distributed Data Mining
(distributed mining of distributed data)

22
Future Work: Discovery Informatics Applications

Query-By-Example (QBE) science data systems
  "Find more data entries similar to this one"
  "Find the data entry most dissimilar to this one"
Automated Recommendation (Filtering) Systems
  "Other users who examined these data also retrieved the following..."
  "Other data that are relevant to these data include..."
Information Retrieval Metrics for Scientific Databases
  Precision: How much of the retrieved data is relevant to my query?
  Recall: How much of the relevant data did my query retrieve?
Semantic Annotation (Tagging) Services
  Report discoveries back to the science database for community reuse
Science / Technical / Math (STEM) Education
  Transparent reuse and analysis of scientific data in inquiry-based
  classroom learning ( http://serc.carleton.edu/usingdata/ , DLESE.org )
Key concepts that need defining (by community consensus): Similarity,
Relevance, Semantics (dictionaries, ontologies)
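
The two information retrieval metrics listed above can be written down directly. The following Python sketch uses the standard set-based definitions; the function name and the object identifiers are made up for illustration, and what counts as "relevant" is exactly the notion the slide says still needs community definition.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    precision = fraction of retrieved items that are relevant
    recall    = fraction of relevant items that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result and hypothetical ground truth.
retrieved_ids = ["obj1", "obj2", "obj3", "obj4"]
relevant_ids = ["obj2", "obj4", "obj7"]

p, r = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved entries are relevant, and two of the three relevant entries were retrieved.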

23
Data Mining and Discovery Informatics: It is more
than just connecting the dots
Reference: http://homepage.interaccess.com/purcellm/lcas/Cartoons/cartoons.htm
