Astronomy Data Collections of the Future: Massive Data Mining Opportunities

Slides: 22
Provided by: drkirk
Transcript and Presenter's Notes

Title: Astronomy Data Collections of the Future: Massive Data Mining Opportunities


1
Astronomy Data Collections of the Future: Massive
Data Mining Opportunities
  • Kirk D. Borne

George Mason University and QSS Group Inc.,
NASA-Goddard
kborne_at_gmu.edu or kirk.borne_at_gsfc.nasa.gov
http://rings.gsfc.nasa.gov/borne/nvo_datamining.html
2
Discovery Informatics for Large Database Astronomy
  • Kirk D. Borne

George Mason University and QSS Group Inc.,
NASA-Goddard
kborne_at_gmu.edu or kirk.borne_at_gsfc.nasa.gov
http://rings.gsfc.nasa.gov/borne/nvo_datamining.html
3
Astronomy Data Collections of the Future: Massive
Data Mining Opportunities
  • INTERFACE 2006 Symposium on Massive Data Sets and
    Streams
  • Pasadena, CA
  • http://www.galaxy.gmu.edu/Interface2006/i2006webpage.html
  • Thursday, May 25, Session 10:30am-12:00 noon
  • San Marino Room, Westin Pasadena, 191 North Los
    Robles Avenue
  • Distributed Data, Parallel Computing, and
    Computational Strategies I (David Marchette,
    chair)
  • Kirk Borne (George Mason University): Astronomy
    Data Collections of the Future: Massive Data
    Mining Opportunities. Thursday 10:30-10:50, San
    Marino.
  • Abstract: Astronomers are producing
    ever-increasing data volumes from a variety of
    large projects, including NASA missions and
    ground-based telescope projects. One of the
    largest of these will be the Large Synoptic
    Survey Telescope (LSST) project, which is planned
    to produce nearly 30 Terabytes of data per night
    of observation, every night, for 10 years. The
    LSST will image the entire visible sky repeatedly
    every 4 nights. Astronomers refer to the
    resulting combined data set as "cosmic
    cinematography". Other similar projects are also
    envisioned. The temporal, spatial, and
    high-dimensional data mining opportunities posed
    by this massive data environment are exploding
    before us. Consequently, the scientific discovery
    potential of these data sets is enormous. All of
    these large astronomical data repositories will
    be geographically distributed. To facilitate
    access to these distributed data collections,
    astronomers are working with database experts and
    computer scientists to build a worldwide virtual
    data system: the National Virtual Observatory
    (NVO). I will describe the NVO and LSST projects,
    plus other large astronomy data-producing
    projects, including some of the corresponding
    discipline-driven research challenges, and
    finally some astronomical data mining activities
    now underway. It is anticipated that an
    Astro-informatics research paradigm will evolve
    from this -- in fact, this is already becoming
    important and soon will become imperative.
    Grid-based mining, Web Services-enabled mining,
    and ontology-enhanced semantic mining will all
    play a role in the astronomical research of the
    future.
  • Send proceedings manuscript to ysaid99_at_hotmail.com
    by July 31, 2006 (Yasmin Said).

4
Existing Astronomy and Space Science Data
Infrastructure
  • The Recent Past: many independent, distributed,
    heterogeneous data archives
  • Today: NVO, the National Virtual
    Observatory (http://www.us-vo.org/)
  • Web Services-enabled e-Science paradigm
    (middleware, standards, protocols)
  • Part of the IVOA (International VO Alliance _at_
    IVOA.net)
  • Precursor to VAO, the Virtual Astronomical
    Observatory (NSF/NASA co-funded)
  • Provides seamless, uniform access to distributed
    heterogeneous data sources
  • Find the right data, right now
  • One-stop shopping for all of your data needs
  • One of many VxOs, for example:
  • VSO: Virtual Solar Observatory
  • VSPO: Virtual Space Physics Observatory
  • NVAO: National Virtual Aeronomy Observatory
  • VITMO: Virtual Ionospheric, Thermospheric,
    Magnetospheric Observatory
  • VHO: Virtual Heliospheric Observatory
  • IVOA-approved standards for data formats,
    data/metadata exchange, data models, registries,
    Web Services, VO queries, query results, semantic
    astronomical catalog table headings (UCDs:
    Uniform Content Descriptors)
  • And of course: The Grid, Web Services,
    Semantic Web, etc.

5
Massive Astronomy Data Collections
  • NVO (IVOA) registry of 14,000 data resources
    (collections, repositories, services)
  • Large Astronomy Data Archives (including NASA
    space astronomy missions)
  • Large Astronomy Sky Surveys (past and present),
    uniform data sets, including:
  • MACHO and related surveys for dark matter
    objects: 1 Terabyte
  • DPOSS (Digital Palomar Observatory Sky Survey):
    3 Terabytes
  • 2MASS (2-Micron All Sky Survey): 10 Terabytes
  • GALEX (Ultraviolet Space Telescope): 30
    Terabytes
  • SDSS (Sloan Digital Sky Survey): 40 Terabytes
  • Future Massive Astronomy Sky Surveys,
    including:
  • PanSTARRS: 10 Terabytes per night, 40 Petabyte
    final archive anticipated
  • LSST (Large Synoptic Survey Telescope _at_
    http://www.lssto.org/)
  • Begin operations in 2012, with 3-Gigapixel
    camera
  • 10 Gigabytes every 30 seconds
  • 30 Terabytes every night for 10 years
  • 100 Petabyte final image data archive anticipated;
    all data are public!!!
  • 30 Petabyte final catalog anticipated
  • Real-Time Event Mining: 10,000-100,000 events
    per night, every night, for 10 yrs
  • Repeat images of the entire night sky every 3
    nights: Celestial Cinematography
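
The LSST figures above can be cross-checked with simple arithmetic. The sketch below is only a back-of-envelope estimate: it assumes decimal units (1 PB = 1000 TB, 1 GB = 1000 MB) and 365 observing nights per year with no downtime, neither of which is stated on the slide. Under those assumptions the nightly rate gives an upper bound close to the quoted 100 Petabyte image archive.

```python
# Back-of-envelope check of the LSST figures quoted on the slide.
# Assumptions (not from the slide): decimal units (1 PB = 1000 TB,
# 1 GB = 1000 MB) and 365 observing nights per year with no downtime.

TB_PER_NIGHT = 30          # slide: 30 Terabytes every night
SURVEY_YEARS = 10          # slide: for 10 years
GB_PER_EXPOSURE = 10       # slide: 10 Gigabytes ...
SECONDS_PER_EXPOSURE = 30  # slide: ... every 30 seconds
NIGHTS_PER_YEAR = 365      # assumption: every night is observed


def total_archive_pb():
    """Upper bound on the accumulated image data, in Petabytes."""
    return TB_PER_NIGHT * NIGHTS_PER_YEAR * SURVEY_YEARS / 1000


def sustained_rate_mb_per_s():
    """Average camera output while observing, in Megabytes per second."""
    return GB_PER_EXPOSURE * 1000 / SECONDS_PER_EXPOSURE


if __name__ == "__main__":
    print(f"Archive upper bound: {total_archive_pb():.1f} PB")
    print(f"Sustained rate: {sustained_rate_mb_per_s():.0f} MB/s")
```

Weather and maintenance losses would lower the total, so the estimate should be read as a ceiling rather than a prediction.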

6
NASA Astronomy Mission Data: the tip of the data
mountain
NSSDC's astrophysics data holdings
One of many science data collections for astronomy
across the US and the world!
NSSDC: National Space Science Data Center _at_ NASA/GSFC
http://nssdc.gsfc.nasa.gov/astro/astrolist.html
7
The Electromagnetic Spectrum
  • Radiation is the Astronomer's only source of
    information about the Universe!
  • And it is a remarkably rich and diverse source!

8
Why so many Telescopes?
9
Why so many Telescopes?
Because:
  • Many great astronomical discoveries have come
    from inter-comparisons of various wavelengths:
  • Quasars
  • Gamma-ray bursts
  • Ultraluminous IR galaxies
  • X-ray black-hole binaries
  • Radio galaxies
  • . . .

10
Therefore, our science data systems should enable
multi-wavelength distributed database access,
discovery, mining, and analysis.
→ DISCOVERY INFORMATICS
11
What is Informatics?
  • Informatics is the discipline of structuring,
    storing, accessing, and distributing information
    describing complex systems.
  • Examples:
  • Bioinformatics
  • Geographic Information Systems (Geoinformatics)
  • New! Discovery Informatics for Astronomy
    (Astroinformatics)
  • Common features of X-informatics:
  • Basic object granule is defined
  • Common community tools operate on object
    granules
  • Data-centric and Information-centric approaches
  • Data-driven science
  • X-informatics is a key enabler of scientific
    discovery in the era of large data science

12
X-Informatics Compared
  • Discipline X: Bioinformatics
    Common Tools: BLAST, FASTA
    Object Granules: Gene Sequence
  • Discipline X: Geoinformatics
    Common Tools: GIS
    Object Granules: Points, Vectors, Polygons
  • Discipline X: Astroinformatics
    Common Tools: Classification, Clustering, Bayes
    Inference, Cross Correlations, Principal
    Components, ???
    Object Granules: Time Series, Event List, Catalog,
    Astronomical Object

13
General Themes in Informatics Research
  • Information and knowledge processing, including
    natural language processing, information
    extraction, integration of data from
    heterogeneous sources or domains, event
    detection, feature recognition
  • Tools for analyzing and/or storing very large
    datasets, data supporting ongoing experiments,
    and other data used in scientific research
  • Knowledge representation, including vocabularies,
    ontologies, simulations, and virtual reality
  • Linkage of experimental and model results to
    benefit research
  • Innovative uses of information technology in
    science applications, including decision support,
    error reduction, outcomes analysis, and
    information at the point of end-use
  • Efficient management and utilization of
    information and data, including knowledge
    acquisition and management, process modeling,
    data mining, acquisition and dissemination, novel
    visual presentations, and stewardship of
    large-scale data repositories and archives
  • Human-machine interaction, including interface
    design, use and understanding of science
    discipline-specific information, intelligent
    agents, information needs and uses.
  • High-performance computing and communications
    relating to scientific applications, including
    efficient machine-machine interfaces,
    transmission and storage, real-time decision
    support
  • Innovative uses of information technology to
    enhance learning, retention and understanding of
    science discipline-specific information.

14
Discovery Informatics
  • Key enabler for new science discovery in large
    databases
  • Essential tool (Large data science is here to
    stay)
  • Common data integration, browse, and discovery
    tools will enable exponential knowledge discovery
    within exponentially growing data collections
  • X-informatics represents the 3rd leg of
    scientific research: experiment, theory, and
    data-driven exploration (Reference: Jim Gray,
    KDD-2003)
  • Discovery Informatics should parallel
    Bioinformatics and Geoinformatics: become a
    stand-alone research sub-discipline

15
Key Role of Data Mining
  • Data Mining (KDD) is the killer app for
    scientific databases
  • Space and Earth Science Examples:
  • Neural Network for Pixel Classification: Event
    Detection and Prediction (e.g., Wildfires)
  • Bayesian Network for Object Classification
  • PCA for finding Fundamental Planes of Galaxy
    Parameters
  • PCA (weakest component) for Outlier Detection:
    anomalies, novel discoveries, new objects
  • Link Analysis (Association Mining) for Causal
    Event Detection (e.g., linking Solar Surface,
    CME, and Space Weather events)
  • Clustering analysis: Spatial, Temporal, or any
    scientific database parameters
  • Markov models: Temporal mining of time series
    data
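
One of the examples above, PCA (weakest component) for outlier detection, can be illustrated with a small sketch. This is not the presenter's code: it is a minimal two-parameter version using only the Python standard library, with hypothetical function names and synthetic data. The idea is that objects obeying a tight correlation have almost no spread along the smallest-variance principal component, so a large offset along that direction marks a candidate anomaly.

```python
import math


def weakest_component_scores(points):
    """Offset of each 2-D point along the weakest (smallest-variance)
    principal component of the data set."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Smallest eigenvalue of a symmetric 2x2 matrix, in closed form.
    lam_min = (sxx + syy) / 2 - math.hypot((sxx - syy) / 2, sxy)

    # Matching eigenvector: the direction of the weakest component.
    if sxy != 0:
        vx, vy = sxy, lam_min - sxx
    elif sxx <= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

    # Absolute projection of each centered point onto that direction.
    return [abs((x - mean_x) * vx + (y - mean_y) * vy) for x, y in points]


def most_anomalous(points):
    """Index of the point with the largest weakest-component offset."""
    scores = weakest_component_scores(points)
    return max(range(len(points)), key=scores.__getitem__)


if __name__ == "__main__":
    # Synthetic data: ten points close to y = 2x, plus one point far off it.
    data = [(float(i), 2.0 * i + (0.1 if i % 2 else -0.1)) for i in range(10)]
    data.append((4.5, 14.0))
    print("Most anomalous index:", most_anomalous(data))
```

With many catalog parameters the same approach would need a general eigen-decomposition, and a single extreme point can tilt the fitted axes, so the ranking should be treated as a list of candidates to inspect.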

16
Space Science Knowledge Discovery
17
This is the Informatics Layer
18
This is the Informatics Layer
  • Informatics Layer
  • Provides standardized representations of the
    information extracted for use in the KDD
    (data mining) layer.
  • Standardization is not required (nor feasible) at
    the data source layer.
  • The informatics is discipline-specific.
  • Informatics enables KDD across large distributed
    heterogeneous scientific data repositories.

19
Space Weather Example
CME: Coronal Mass Ejection; SEP: Solar Energetic
Particle
20
Key Role of Discovery Informatics
  • The key role of Discovery Informatics is
  • ... data integration and fusion ...
  • ... across multiple heterogeneous data
    collections ...
  • ... to enable scientific knowledge discovery ...
  • ... and decision support.

21
Personal Reflection
  • Distributed Data Mining
    (mining of distributed data)
  • versus
  • Distributed Data Mining
    (distributed mining of distributed data)
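
The distinction above can be made concrete with a toy statistic. The sketch below is one illustrative reading, not something taken from the talk: in the first approach every site ships its raw values to one place, while in the second each site computes a small local summary and only the summaries are merged. The function names and the three-site data are hypothetical.

```python
def mine_centrally(sites):
    """Mining of distributed data: copy every value to one place first."""
    values = [v for site in sites for v in site]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # population variance
    return n, mean, variance


def local_summary(values):
    """Runs at each site: returns (count, sum, sum of squares)."""
    return len(values), sum(values), sum(v * v for v in values)


def merge_summaries(summaries):
    """Distributed mining: combine per-site summaries, never the raw data."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return n, mean, variance


if __name__ == "__main__":
    sites = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]  # three hypothetical archives
    print(mine_centrally(sites))
    print(merge_summaries([local_summary(s) for s in sites]))
```

Both routes give the same answer here, but the second moves only three numbers per site. The sum-of-squares formula can lose precision on large or badly scaled data, so a production version would likely use a more stable merge.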

22
Future Work: Discovery Informatics Applications
  • Query-By-Example (QBE) science data systems:
  • "Find more data entries similar to this one"
  • "Find the data entry most dissimilar to this
    one"
  • Automated Recommendation (Filtering) Systems:
  • "Other users who examined these data also
    retrieved the following..."
  • "Other data that are relevant to these data
    include..."
  • Information Retrieval Metrics for Scientific
    Databases:
  • Precision: How much of the retrieved data is
    relevant to my query?
  • Recall: How much of the relevant data did my
    query retrieve?
  • Semantic Annotation (Tagging) Services:
  • Report discoveries back to the science database
    for community reuse
  • Science / Technical / Math (STEM) Education:
  • Transparent reuse and analysis of scientific data
    in inquiry-based classroom learning
    (http://serc.carleton.edu/usingdata/, DLESE.org)
  • Key concepts that need defining (by community
    consensus): Similarity, Relevance, Semantics
    (dictionaries, ontologies)
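
The Query-By-Example and retrieval-metric items above can be sketched in a few lines. The slide itself says that "similarity" still needs a community definition, so the Euclidean distance used here is purely an assumption, as are the function names and the toy catalog of feature vectors.

```python
import math


def euclidean(a, b):
    """Assumed similarity measure: straight-line distance between features."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def most_similar(example, catalog, k=1):
    """QBE: names of the k catalog entries closest to the example."""
    ranked = sorted(catalog, key=lambda name: euclidean(example, catalog[name]))
    return ranked[:k]


def most_dissimilar(example, catalog):
    """QBE: name of the catalog entry farthest from the example."""
    return max(catalog, key=lambda name: euclidean(example, catalog[name]))


def precision(retrieved, relevant):
    """Fraction of the retrieved entries that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of the relevant entries that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


if __name__ == "__main__":
    # Hypothetical catalog: object name -> two measured parameters.
    catalog = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (10.0, 10.0)}
    print(most_similar((0.1, 0.0), catalog, k=2))
    print(most_dissimilar((0.1, 0.0), catalog))
    print(precision({"a", "b", "x"}, {"a", "b", "c", "d"}))
    print(recall({"a", "b", "x"}, {"a", "b", "c", "d"}))
```

Swapping in a different distance function changes which entries count as "similar", which is why an agreed definition matters before such metrics are compared across archives.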

23
Data Mining and Discovery Informatics: It is more
than just connecting the dots
Reference: http://homepage.interaccess.com/purcellm/lcas/Cartoons/cartoons.htm