Title: Astronomy Data Collections of the Future: Massive Data Mining Opportunities
1Astronomy Data Collections of the Future: Massive Data Mining Opportunities
George Mason University and QSS Group Inc., NASA-Goddard
kborne_at_gmu.edu or kirk.borne_at_gsfc.nasa.gov
http://rings.gsfc.nasa.gov/borne/nvo_datamining.html
2Discovery Informatics for Large Database Astronomy
3Astronomy Data Collections of the Future: Massive Data Mining Opportunities
- INTERFACE 2006 Symposium on Massive Data Sets and Streams
- Pasadena, CA
- http://www.galaxy.gmu.edu/Interface2006/i2006webpage.html
- Thursday, May 25, Session 10:30am-12:00 noon
- San Marino Room, Westin Pasadena, 191 North Los Robles Avenue
- Distributed Data, Parallel Computing, and Computational Strategies I (David Marchette, chair)
- Kirk Borne (George Mason University), "Astronomy Data Collections of the Future: Massive Data Mining Opportunities", Thursday 10:30-10:50, San Marino
- Abstract: Astronomers are producing ever-increasing data volumes from a variety of large projects, including NASA missions and ground-based telescope projects. One of the largest of these will be the Large Synoptic Survey Telescope (LSST) project, which is planned to produce nearly 30 Terabytes of data per night of observation, every night, for 10 years. The LSST will image the entire visible sky repeatedly every 4 nights. Astronomers refer to the resulting combined data set as "cosmic cinematography". Other similar projects are also envisioned. The temporal, spatial, and high-dimensional data mining opportunities posed by this massive data environment are exploding before us. Consequently, the scientific discovery potential of these data sets is enormous. All of these large astronomical data repositories will be geographically distributed. To facilitate access to these distributed data collections, astronomers are working with database experts and computer scientists to build a worldwide virtual data system: the National Virtual Observatory (NVO). I will describe the NVO and LSST projects, plus other large astronomy data-producing projects, including some of the corresponding discipline-driven research challenges, and finally some astronomical data mining activities now underway. It is anticipated that an Astro-informatics research paradigm will evolve from this; in fact, this is already becoming important and soon will become imperative. Grid-based mining, Web Services-enabled mining, and ontology-enhanced semantic mining will all play a role in the astronomical research of the future.
- Send proceedings manuscript to ysaid99_at_hotmail.com by July 31, 2006 (Yasmin Said).
4Existing Astronomy and Space Science Data Infrastructure
- The Recent Past: many independent, distributed, heterogeneous data archives
- Today: NVO, the National Virtual Observatory ( http://www.us-vo.org/ )
- Web Services-enabled e-Science paradigm (middleware, standards, protocols)
- Part of the IVOA (International VO Alliance _at_ IVOA.net )
- Precursor to VAO, the Virtual Astronomical Observatory (NSF/NASA co-funded)
- Provides seamless, uniform access to distributed heterogeneous data sources
- Find the right data, right now
- One-stop shopping for all of your data needs
- One of many VxOs, for example:
- VSO: Virtual Solar Observatory
- VSPO: Virtual Space Physics Observatory
- NVAO: National Virtual Aeronomy Observatory
- VITMO: Virtual Ionospheric, Thermospheric, Magnetospheric Observatory
- VHO: Virtual Heliospheric Observatory
- IVOA-approved standards for data formats, data/metadata exchange, data models, registries, Web Services, VO queries, query results, semantic astronomical catalog table headings (UCDs: Unified Content Descriptors)
- And of course: The Grid, Web Services, Semantic Web, etc. ...
5Massive Astronomy Data Collections
- NVO (IVOA) registry of 14,000 data resources (collections, repositories, services)
- Large Astronomy Data Archives (including NASA space astronomy missions)
- Large Astronomy Sky Surveys (past and present), uniform data sets, including:
- MACHO and related surveys for dark matter objects: 1 Terabyte
- DPOSS (Digital Palomar Observatory Sky Survey): 3 Terabytes
- 2MASS (2-Micron All Sky Survey): 10 Terabytes
- GALEX (Ultraviolet Space Telescope): 30 Terabytes
- SDSS (Sloan Digital Sky Survey): 40 Terabytes
- Future Massive Astronomy Sky Surveys, including:
- PanSTARRS: 10 Terabytes per night, 40 Petabyte final archive anticipated
- LSST (Large Synoptic Survey Telescope _at_ http://www.lssto.org/ )
- Begin operations in 2012, with 3-Gigapixel camera
- 10 Gigabytes every 30 seconds
- 30 Terabytes every night for 10 years
- 100 Petabyte final image data archive anticipated; all data are public!!!
- 30 Petabyte final catalog anticipated
- Real-Time Event Mining: 10,000-100,000 events per night, every night, for 10 yrs
- Repeat images of the entire night sky every 3 nights: Celestial Cinematography
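The LSST figures above can be turned into a rough archive-size estimate. The sketch below is only back-of-the-envelope arithmetic on the quoted numbers. It assumes decimal units and, as an upper bound that is not stated on the slide, observing on all 365 nights of every year. Under that assumption the image archive comes to about 110 Petabytes, the same order as the 100 Petabytes quoted.

```python
# Figures quoted on the slide (decimal units: 1 TB = 1000 GB, 1 PB = 1000 TB).
GB_PER_BURST = 10                      # "10 Gigabytes every 30 seconds"
SECONDS_PER_BURST = 30
TB_PER_NIGHT = 30                      # "30 Terabytes every night"
SURVEY_YEARS = 10                      # "for 10 years"
EVENTS_PER_NIGHT = (10_000, 100_000)   # real-time events per night

# Assumption, not from the slide: observing on every calendar night.
NIGHTS_PER_YEAR = 365


def sustained_rate_gb_per_s():
    """Average data rate while the camera is taking data."""
    return GB_PER_BURST / SECONDS_PER_BURST


def archive_petabytes(nights_per_year=NIGHTS_PER_YEAR):
    """Total image volume after the full survey, in Petabytes."""
    return TB_PER_NIGHT * nights_per_year * SURVEY_YEARS / 1000


def total_events(nights_per_year=NIGHTS_PER_YEAR):
    """(low, high) number of real-time events over the full survey."""
    nights = nights_per_year * SURVEY_YEARS
    low, high = EVENTS_PER_NIGHT
    return low * nights, high * nights


if __name__ == "__main__":
    print(f"rate while observing: {sustained_rate_gb_per_s():.2f} GB/s")
    print(f"image archive: {archive_petabytes():.1f} PB")
    print("events over the survey: {:,} to {:,}".format(*total_events()))
```

Passing a smaller `nights_per_year` (weather and maintenance reduce the real number) lowers all totals proportionally.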
6NASA Astronomy Mission Data: the tip of the data mountain
NSSDC's astrophysics data holdings
One of many science data collections for astronomy across the US and the world!
NSSDC: National Space Science Data Center _at_ NASA/GSFC
http://nssdc.gsfc.nasa.gov/astro/astrolist.html
7The Electromagnetic Spectrum
- Radiation is the Astronomer's only source of information about the Universe!
- And it is a remarkably rich, diverse source!
8Why so many Telescopes?
9Why so many Telescopes?
Because:
- Many great astronomical discoveries have come from inter-comparisons of various wavelengths:
- Quasars
- Gamma-ray bursts
- Ultraluminous IR galaxies
- X-ray black-hole binaries
- Radio galaxies
- . . .
10Therefore, our science data systems should enable
multi-wavelength distributed database access,
discovery, mining, and analysis.
=> DISCOVERY INFORMATICS
11What is Informatics?
- Informatics is the discipline of structuring,
storing, accessing, and distributing information
describing complex systems.
- Examples:
- Bioinformatics
- Geographic Information Systems (Geoinformatics)
- New! Discovery Informatics for Astronomy (Astroinformatics)
- Common features of X-informatics:
- Basic object granule is defined
- Common community tools operate on object granules
- Data-centric and Information-centric approaches
- Data-driven science
- X-informatics is a key enabler of scientific discovery in the era of large data science
12X-Informatics Compared
- Discipline X: Bioinformatics. Common Tools: BLAST, FASTA. Object Granules: Gene Sequence.
- Discipline X: Geoinformatics. Common Tools: GIS. Object Granules: Points, Vectors, Polygons.
- Discipline X: Astroinformatics. Common Tools: Classification, Clustering, Bayes Inference, Cross Correlations, Principal Components, ??? Object Granules: Time Series, Event List, Catalog, Astronomical Object.
13General Themes in Informatics Research
- Information and knowledge processing, including natural language processing, information extraction, integration of data from heterogeneous sources or domains, event detection, feature recognition
- Tools for analyzing and/or storing very large datasets, data supporting ongoing experiments, and other data used in scientific research
- Knowledge representation, including vocabularies, ontologies, simulations, and virtual reality
- Linkage of experimental and model results to benefit research
- Innovative uses of information technology in science applications, including decision support, error reduction, outcomes analysis, and information at the point of end-use
- Efficient management and utilization of information and data, including knowledge acquisition and management, process modeling, data mining, acquisition and dissemination, novel visual presentations, and stewardship of large-scale data repositories and archives
- Human-machine interaction, including interface design, use and understanding of science discipline-specific information, intelligent agents, information needs and uses
- High-performance computing and communications relating to scientific applications, including efficient machine-machine interfaces, transmission and storage, real-time decision support
- Innovative uses of information technology to enhance learning, retention and understanding of science discipline-specific information
14Discovery Informatics
- Key enabler for new science discovery in large
databases
- Essential tool (Large data science is here to
stay)
- Common data integration, browse, and discovery
tools will enable exponential knowledge discovery
within exponentially growing data collections
- X-informatics represents the 3rd leg of scientific research: experiment, theory, and data-driven exploration (Reference: Jim Gray, KDD-2003)
- Discovery Informatics should parallel Bioinformatics and Geoinformatics: become a stand-alone research sub-discipline
15Key Role of Data Mining
- Data Mining (KDD) is the killer app for
scientific databases
- Space and Earth Science Examples:
- Neural Network for Pixel Classification: Event Detection and Prediction (e.g., Wildfires)
- Bayesian Network for Object Classification
- PCA for finding Fundamental Planes of Galaxy Parameters
- PCA (weakest component) for Outlier Detection: anomalies, novel discoveries, new objects
- Link Analysis (Association Mining) for Causal Event Detection (e.g., linking Solar Surface, CME, and Space Weather events)
- Clustering analysis: Spatial, Temporal, or any scientific database parameters
- Markov models: Temporal mining of time series data
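The two PCA bullets above are related: when objects lie close to a plane in parameter space, the lowest-variance ("weakest") principal component is the normal to that plane, and objects with a large projection onto it are candidate outliers. The following is a minimal sketch of that idea with NumPy on synthetic data. The data, the planted outlier at row 42, and the function name `weakest_component_scores` are illustrative assumptions, not taken from the talk.

```python
import numpy as np


def weakest_component_scores(X):
    """Absolute projection of each standardized row onto the
    lowest-variance principal component of X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each parameter
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    weakest = eigvecs[:, 0]                    # direction of least variance
    return np.abs(Z @ weakest)


# Synthetic catalog (assumption): three parameters lying near the plane
# z = x + y, with small scatter about the plane.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.normal(size=500)
z = x + y + 0.05 * rng.normal(size=500)
X = np.column_stack([x, y, z])

# Plant one object far off the plane.
X[42] = [0.0, 0.0, 3.0]

scores = weakest_component_scores(X)
outlier = int(np.argmax(scores))
print("most anomalous row:", outlier)
```

Note that the planted object is unremarkable in each parameter taken separately; it stands out only in the direction in which the rest of the catalog shows almost no variance.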
16Space Science Knowledge Discovery
17This is the Informatics Layer
18This is the Informatics Layer
- Informatics Layer
- Provides standardized representations of the
information extracted for use in the KDD
(data mining) layer.
- Standardization is not required (nor feasible) at
the data source layer.
- The informatics is discipline-specific.
- Informatics enables KDD across large distributed
heterogeneous scientific data repositories.
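One way to picture the informatics layer is as a set of small adapters: each data source keeps its own format, and an adapter maps its records into one standardized representation that the KDD layer consumes. The sketch below illustrates this under stated assumptions. The archives, their column names, and the `SkyObject` record are all hypothetical and do not correspond to any real archive or IVOA standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkyObject:
    """Hypothetical standardized record handed to the data mining layer."""
    source: str
    ra_deg: float
    dec_deg: float
    magnitude: float


def from_archive_a(row):
    # Hypothetical archive A: right ascension in hours, numeric fields.
    return SkyObject("archive_a", row["RA_hours"] * 15.0, row["Dec"], row["mag"])


def from_archive_b(row):
    # Hypothetical archive B: everything stored as strings, degrees throughout.
    return SkyObject("archive_b", float(row["ra"]), float(row["dec"]),
                     float(row["m_r"]))


# The informatics layer: one adapter per source, no change to the sources.
ADAPTERS = {"archive_a": from_archive_a, "archive_b": from_archive_b}


def standardize(source, rows):
    adapt = ADAPTERS[source]
    return [adapt(row) for row in rows]


def brightest(objects):
    """A stand-in for the KDD layer: it only ever sees SkyObject records.
    Smaller magnitude means brighter."""
    return min(objects, key=lambda obj: obj.magnitude)


rows_a = [{"RA_hours": 12.0, "Dec": -5.0, "mag": 17.2}]
rows_b = [{"ra": "180.5", "dec": "-4.9", "m_r": "15.8"}]
unified = standardize("archive_a", rows_a) + standardize("archive_b", rows_b)
print(brightest(unified))
```

Adding a new data source means writing one more adapter; the mining code is untouched, which is the sense in which standardization is needed at the informatics layer but not at the data source layer.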
19Space Weather Example
CME: Coronal Mass Ejection; SEP: Solar Energetic Particle
20Key Role of Discovery Informatics
- The key role of Discovery Informatics is
- ... data integration and fusion ...
- ... across multiple heterogeneous data
collections ...
- ... to enable scientific knowledge discovery ...
- ... and decision support.
21Personal Reflection
- Distributed Data Mining
- (mining of distributed data)
- versus
- Distributed Data Mining
- (distributed mining of distributed data)
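The distinction above can be illustrated with a very simple statistic. In the first reading, all distributed data are copied to one place and mined there. In the second, each site computes a small local summary and only the summaries travel. The sketch below, using synthetic data for three hypothetical sites, computes a global mean and variance both ways. It is an illustration of the contrast, not a description of any specific system from the talk.

```python
import numpy as np


def centralized_mean_var(site_arrays):
    """Mining of distributed data: pool all values in one place, then compute."""
    pooled = np.concatenate(site_arrays)
    return pooled.mean(), pooled.var()


def local_summary(values):
    """Runs at each site; only these three numbers leave the site."""
    return len(values), float(np.sum(values)), float(np.sum(values ** 2))


def combine(summaries):
    """Distributed mining: merge per-site summaries into global statistics."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    return mean, total_sq / n - mean ** 2


# Three hypothetical sites holding different amounts of data.
rng = np.random.default_rng(1)
sites = [rng.normal(loc=20.0, scale=1.0, size=n) for n in (1000, 250, 4000)]

central = centralized_mean_var(sites)
distributed = combine([local_summary(values) for values in sites])
print("centralized:", central)
print("distributed:", distributed)
```

Both routes give the same answer up to rounding, but the second moves three numbers per site instead of the full data. Many mining algorithms do not decompose this cleanly, which is where the research challenge lies.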
22Future Work: Discovery Informatics Applications
- Query-By-Example (QBE) science data systems
- Find more data entries similar to this one
- Find the data entry most dissimilar to this one
- Automated Recommendation (Filtering) Systems
- Other users who examined these data also retrieved the following...
- Other data that are relevant to these data include...
- Information Retrieval Metrics for Scientific Databases
- Precision: How much of the retrieved data is relevant to my query?
- Recall: How much of the relevant data did my query retrieve?
- Semantic Annotation (Tagging) Services
- Report discoveries back to the science database for community reuse
- Science / Technical / Math (STEM) Education
- Transparent reuse and analysis of scientific data in inquiry-based classroom learning ( http://serc.carleton.edu/usingdata/ , DLESE.org )
- Key concepts that need defining (by community consensus): Similarity, Relevance, Semantics (dictionaries, ontologies)
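Two of the ideas above are easy to make concrete: Query-By-Example and the precision and recall metrics. The sketch below assumes, purely for illustration, that "similarity" means Euclidean distance between rows of a small numeric catalog. As the last bullet notes, the real definition is something the community would need to agree on. The catalog values and identifiers are made up.

```python
import numpy as np


def query_by_example(catalog, example_index):
    """Return (most_similar, most_dissimilar) row indices relative to the
    example row, using Euclidean distance as the assumed similarity."""
    distances = np.linalg.norm(catalog - catalog[example_index], axis=1)
    most_dissimilar = int(np.argmax(distances))
    distances[example_index] = np.inf      # exclude the example itself
    most_similar = int(np.argmin(distances))
    return most_similar, most_dissimilar


def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved entries that are relevant.
    Recall: fraction of relevant entries that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up catalog: four entries, two parameters each.
catalog = np.array([[0.0, 0.0],
                    [0.1, 0.0],
                    [5.0, 5.0],
                    [1.0, 1.0]])
similar, dissimilar = query_by_example(catalog, example_index=0)
print("most similar to entry 0:", similar)
print("most dissimilar to entry 0:", dissimilar)

# A query retrieved entries 1-4; entries 2, 4, 6, 8, 10 were actually relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Here two of the four retrieved entries are relevant (precision 0.5), and two of the five relevant entries were found (recall 0.4).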
23Data Mining and Discovery Informatics: It is more than just connecting the dots
Reference: http://homepage.interaccess.com/purcellm/lcas/Cartoons/cartoons.htm