Tony Hey - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Tony Hey


1
e-Science and its Implications for the Library Community
  • Tony Hey
  • Corporate Vice President
  • Technical Computing
  • Microsoft Corporation

[Figure labels] Multidisciplinary Research: Life Sciences; Earth Sciences; Computer and Information Sciences; New Materials, Technologies and Processes
2
Licklider's Vision
  • "Lick had this concept: all of the stuff
    linked together throughout the world, that you
    can use a remote computer, get data from a remote
    computer, or use lots of computers in your job"
  • Larry Roberts, Principal Architect of the ARPANET

3
Physics and the Web
  • Tim Berners-Lee developed the Web at CERN as a
    tool for exchanging information between the
    partners in physics collaborations
  • The first Web Site in the USA was a link to the
    SLAC library catalogue
  • It was the international particle physics
    community who first embraced the Web
  • Killer application for the Internet
  • Transformed modern world: academia, business and
    leisure

4
Beyond the Web?
  • Scientists developing collaboration technologies
    that go far beyond the capabilities of the Web
  • To use remote computing resources
  • To integrate, federate and analyse information
    from many disparate, distributed, data resources
  • To access and control remote experimental
    equipment
  • Capability to access, move, manipulate and mine
    data is the central requirement of these new
    collaborative science applications
  • Data held in file or database repositories
  • Data generated by accelerators or telescopes
  • Data gathered from mobile sensor networks

5
What is e-Science?
  • "e-Science is about global collaboration in
    key areas of science, and the next generation of
    infrastructure that will enable it"
  • John Taylor
  • Director General of Research Councils
  • UK Office of Science and Technology

6
The e-Science Vision
  • e-Science is about multidisciplinary science and
    the technologies to support such distributed,
    collaborative scientific research
  • Many areas of science are in danger of being
    overwhelmed by a data deluge from new
    high-throughput devices, sensor networks,
    satellite surveys
  • Areas such as bioinformatics, genomics, drug
    design, engineering, healthcare require
    collaboration between different domain experts
  • e-Science is a shorthand for a set of
    technologies to support collaborative networked
    science

7
e-Science Vision and Reality
  • Vision
  • Oceanographic sensors - Project Neptune
  • Joint US-Canadian proposal
  • Reality
  • Chemistry: The Comb-e-Chem Project
  • Annotation, Remote Facilities and e-Publishing

8
http://www.neptune.washington.edu/
18
The Comb-e-Chem Project
Automatic Annotation
Video Data Stream
HPC Simulation
Data Mining and Analysis
Structures Database
Diffractometer
Combinatorial Chemistry Wet Lab
National X-Ray Service
Middleware
19
National Crystallographic Service
Send sample material to NCS service
Search materials database and predict properties
using Grid computations
Download full data on materials of interest
Collaborate in e-Lab experiment and obtain
structure
20
A digital lab book replacement that chemists were
able to use, and liked
21
Monitoring laboratory experiments using a broker
delivered over GPRS on a PDA
22
Crystallographic e-Prints
Direct Access to Raw Data from scientific
papers
Raw data sets can be very large - stored at UK
National Datastore using SRB software
23
eBank Project
Undergraduate Students
Graduate Students
E-Scientists
Digital Library
Grid
E-Experimentation
Entire E-Science Cycle: Encompassing
experimentation, analysis, publication, research,
learning
24
Support for e-Science
  • Cyberinfrastructure and e-Infrastructure
  • In the US, Europe and Asia there is a common
    vision for the cyberinfrastructure required to
    support the e-Science revolution
  • Set of Middleware Services supported on top of
    high bandwidth academic research networks
  • Similar to vision of the Grid as a set of
    services that allows scientists and industry
    to routinely set up Virtual Organizations for
    their research or business
  • Many companies emphasize computing cycle aspect
    of Grids
  • The Microsoft Grid vision is more about data
    management than about compute clusters

25
Six Key Elements for a Global
Cyberinfrastructure for e-Science
  1. High bandwidth Research Networks
  2. Internationally agreed AAA Infrastructure
  3. Development Centers for Open Standard Grid
    Middleware
  4. Technologies and standards for Data Provenance,
    Curation and Preservation
  5. Open access to Data and Publications via
    Interoperable Repositories
  6. Discovery Services and Collaborative Tools

26
The Web Services Magic Bullet
27
Computational Modeling
28
Technical Computing in Microsoft
  • Radical Computing
  • Research in potential breakthrough technologies
  • Advanced Computing for Science and Engineering
  • Application of new algorithms, tools and
    technologies to scientific and engineering
    problems
  • High Performance Computing
  • Application of high performance clusters and
    database technologies to industrial applications

29
New Science Paradigms
  • Thousand years ago: Experimental Science
  • - description of natural phenomena
  • Last few hundred years: Theoretical Science
  • - Newton's Laws, Maxwell's Equations
  • Last few decades: Computational Science
  • - simulation of complex phenomena
  • Today: e-Science or Data-centric Science
  • - unify theory, experiment, and simulation
  • - using data exploration and data mining
  • Data captured by instruments
  • Data generated by simulations
  • Processed by software
  • Scientist analyzes databases/files
  • (With thanks to Jim Gray)

30
Advanced Computing for Science and Engineering
Bioinformatics
Energy Science
Engineering
Earth Science
. . .
31
Top 500 Supercomputer Trends
Clusters over 50%
Industry usage rising
x86 is winning
GigE is gaining
32
Key Issues for e-Science
  • Workflows
  • The LEAD Project
  • The Data Chain
  • From Acquisition to Preservation
  • Scholarly Communication
  • Open Access to Data and Publications

33
The LEAD Project
Better predictions for Mesoscale weather
34
The LEAD Vision
  • Analysis/Assimilation
  • Quality Control
  • Retrieval of Unobserved Quantities
  • Creation of Gridded Fields

Prediction/Detection: PCs to Teraflop Systems
  • Product Generation, Display, Dissemination
  • DYNAMIC OBSERVATIONS

Models and Algorithms Driving Sensors
The CS challenge: Build a virtual eScience
laboratory to support experimentation and
education leading to this vision.
  • End Users
  • NWS
  • Private Companies
  • Students

35
Composing LEAD Services
  • Need to construct workflows that are
  • Data Driven
  • The weather input stream defines the nature of
    the computation
  • Persistent and Agile
  • An agent mines a data stream and notices an
    interesting feature. This event may trigger a
    workflow scenario that has been waiting for
    months
  • Adaptive
  • The weather changes
  • Workflow may have to change on-the-fly
  • Resources
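
The "persistent and agile" requirement above can be illustrated with a minimal Python sketch. It is not LEAD's actual middleware: the class name, the wind_shear field and the threshold of 10.0 are all hypothetical, chosen only to show a workflow that waits idle until an agent scanning a data stream sees a matching event.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PendingWorkflow:
    """A workflow that sits idle until a matching event arrives (hypothetical)."""
    name: str
    trigger: Callable[[Dict[str, float]], bool]  # predicate over one observation
    steps: List[str] = field(default_factory=list)


def run_agent(stream, workflows):
    """Scan a stream of observations and record every workflow whose trigger fires."""
    launched = []
    for observation in stream:
        for wf in workflows:
            if wf.trigger(observation):
                launched.append((wf.name, observation["time"]))
    return launched


# Hypothetical observations and threshold, for illustration only.
stream = [
    {"time": 0, "wind_shear": 3.0},
    {"time": 1, "wind_shear": 4.5},
    {"time": 2, "wind_shear": 12.0},
]
storm_watch = PendingWorkflow(
    name="storm_forecast",
    trigger=lambda obs: obs["wind_shear"] > 10.0,
    steps=["assimilate", "forecast", "visualize"],
)

launched = run_agent(stream, [storm_watch])
print(launched)
```

In this sketch the workflow is data driven in the simplest sense: nothing runs until the input stream contains an observation that satisfies the trigger. A real system would also need to persist the pending workflow and adapt its steps while running.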

36
Example LEAD Workflow
37
The e-Science Data Chain
  • Data Acquisition
  • Data Ingest
  • Metadata
  • Annotation
  • Provenance
  • Data Storage
  • Curation
  • Preservation
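
One way to picture the chain above is as a record that accumulates metadata, annotations and a provenance trail as it moves from acquisition to storage. The following Python sketch is only an illustration under that assumption: the DataRecord structure and function names are hypothetical, and the SHA-256 checksum stands in for whatever fixity mechanism a real archive would use.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataRecord:
    """One item moving along the data chain (hypothetical structure)."""
    payload: bytes
    metadata: Dict[str, str] = field(default_factory=dict)
    annotations: List[str] = field(default_factory=list)
    provenance: List[str] = field(default_factory=list)
    checksum: str = ""


def ingest(payload: bytes, instrument: str) -> DataRecord:
    """Acquisition and ingest: wrap raw bytes and record where they came from."""
    record = DataRecord(payload=payload, metadata={"instrument": instrument})
    record.provenance.append(f"acquired from {instrument}")
    return record


def annotate(record: DataRecord, note: str) -> DataRecord:
    """Annotation: attach a comment and log the step in the provenance trail."""
    record.annotations.append(note)
    record.provenance.append("annotated")
    return record


def store(record: DataRecord) -> DataRecord:
    """Storage: fix a SHA-256 checksum so later checks can detect corruption."""
    record.checksum = hashlib.sha256(record.payload).hexdigest()
    record.provenance.append("stored with sha256 checksum")
    return record


def is_intact(record: DataRecord) -> bool:
    """Preservation check: recompute the checksum and compare with the stored one."""
    return hashlib.sha256(record.payload).hexdigest() == record.checksum


record = store(annotate(ingest(b"diffraction frame 001", "diffractometer"),
                        "crystal slightly twinned"))
print(record.provenance)
print(is_intact(record))
```

The provenance list answers "where did this data come from?" step by step, while the checksum gives a later curator a cheap way to confirm the stored bytes are unchanged.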

38
The Data Deluge
  • In the next 5 years e-Science projects will
    produce more scientific data than has been
    collected in the whole of human history
  • Some normalizations
  • The Bible: 5 Megabytes
  • Annual refereed papers: 1 Terabyte
  • Library of Congress: 20 Terabytes
  • Internet Archive (1996 to 2002): 100 Terabytes
  • In many fields new high throughput devices,
    sensors and surveys will be producing Petabytes
    of scientific data
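
To put the normalizations above next to a petabyte, the short Python calculation below counts how many copies of each item fit into one petabyte. It assumes decimal units (1 Megabyte = 10^6 bytes, 1 Terabyte = 10^12 bytes, 1 Petabyte = 10^15 bytes); the slide does not say which convention it uses.

```python
# Decimal (SI) units are an assumption; the slide does not specify.
MB = 10**6
TB = 10**12
PB = 10**15

# Sizes as quoted on the slide.
sizes = {
    "The Bible": 5 * MB,
    "Annual refereed papers": 1 * TB,
    "Library of Congress": 20 * TB,
    "Internet Archive (1996 to 2002)": 100 * TB,
}


def copies_per_petabyte(size_bytes: int) -> int:
    """Return how many whole copies of an item fit into one petabyte."""
    return PB // size_bytes


ratios = {name: copies_per_petabyte(size) for name, size in sizes.items()}
for name, count in ratios.items():
    print(f"{name}: {count:,} per petabyte")
```

Under these assumptions a single petabyte holds 50 copies of the Library of Congress figure and 10 copies of the Internet Archive figure.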

39
The Problem for the e-Scientist
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it?
  • How to reorganize it?
  • How to coexist and cooperate with others?
  • Data Query and Visualization tools
  • Support/training
  • Performance
  • Execute queries in a minute
  • Batch (big) query scheduling

40
Digital Curation?
  • In 20 years, one can guarantee that the operating
    system and spreadsheet program and the hardware
    used to store data will not exist
  • Need research curation technologies such as
    workflow, provenance and preservation
  • Need to liaise closely with individual research
    communities, data archives and libraries
  • The UK has set up the Digital Curation Centre
    in Edinburgh with Glasgow, UKOLN and CCLRC
  • Attempt to bring together skills of scientists,
    computer scientists and librarians

41
Digital Curation Centre
  • Actions needed to maintain and utilise digital
    data and research results over entire life-cycle
  • For current and future generations of users
  • Digital Preservation
  • Long-run technological/legal accessibility and
    usability
  • Data curation in science
  • Maintenance of body of trusted data to represent
    current state of knowledge
  • Research in tools and technologies
  • Integration, annotation, provenance, metadata,
    security..

42
Berlin Declaration 2003
  • To promote the Internet as a functional
    instrument for a global scientific knowledge base
    and for human reflection
  • Defines open access contributions as including
  • original scientific research results, raw data
    and metadata, source materials, digital
    representations of pictorial and graphical
    materials and scholarly multimedia material

43
NSF Atkins Report on Cyberinfrastructure
  • the primary access to the latest findings in a
    growing number of fields is through the Web, then
    through classic preprints and conferences, and
    lastly through refereed archival papers
  • archives containing hundreds or thousands of
    terabytes of data will be affordable and
    necessary for archiving scientific and
    engineering information

44
MIT DSpace Vision
  • Much of the material produced by faculty,
    such as datasets, experimental results and rich
    media data as well as more conventional
    document-based material (e.g. articles and
    reports) is housed on an individual's hard drive
    or department Web server. Such material is often
    lost forever as faculty and departments change
    over time.

45
Publishing Data and Analysis Is Changing
Roles        Authors         Publishers        Curators           Archives          Consumers
Traditional  Scientists      Journals          Libraries          Archives          Scientists
Emerging     Collaborations  Project web site  Data+Doc Archives  Digital Archives  Scientists
46
Data Publishing: The Background
  • In some areas, notably biology, databases
    are replacing (paper) publications as a medium of
    communication
  • These databases are built and maintained with a
    great deal of human effort
  • They often do not contain source experimental
    data - sometimes just annotation/metadata
  • They borrow extensively from, and refer to, other
    databases
  • You are now judged by your databases as well as
    your (paper) publications
  • Upwards of 1000 (public databases) in genetics

47
Data Publishing: The Issues
  • Data integration
  • Tying together data from various sources
  • Annotation
  • Adding comments/observations to existing data
  • Becoming a new form of communication
  • Provenance
  • Where did this data come from?
  • Exporting/publishing in agreed formats
  • To other programs as well as people
  • Security
  • Specifying/enforcing read/write access to parts
    of your data

48
Interoperable Repositories?
  • Paul Ginsparg's arXiv at Cornell has demonstrated
    a new model of scientific publishing
  • Electronic version of preprints hosted on the
    Web
  • David Lipman of the NIH National Library of
    Medicine has developed PubMedCentral as a
    repository for NIH funded research papers
  • Microsoft funded development of portable PMC
    now being deployed in UK and other countries
  • Stevan Harnad's self-archiving EPrints project
    in Southampton provides a basis for OAI-compliant
    Institutional Repositories
  • Many national initiatives around the world moving
    towards mandating deposition of full text of
    publicly funded research papers in repositories
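
Interoperability between OAI-compliant repositories rests on the OAI-PMH harvesting protocol, in which a harvester sends plain HTTP requests with a verb parameter. The Python sketch below only builds such a request URL and performs no network access. The base URL is a placeholder, not a real repository; ListRecords, metadataPrefix, set and from are standard OAI-PMH request arguments, and oai_dc is the Dublin Core format that every OAI-PMH repository is required to support.

```python
from typing import Optional
from urllib.parse import urlencode


def oai_list_records_url(base_url: str,
                         metadata_prefix: str = "oai_dc",
                         set_spec: Optional[str] = None,
                         from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (no network access is performed)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # optional: restrict to one repository set
    if from_date:
        params["from"] = from_date    # optional: only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, for illustration only.
url = oai_list_records_url("https://repository.example.org/oai",
                           from_date="2005-01-01")
print(url)
```

Because every compliant repository answers the same small set of verbs, one harvester can collect metadata from many institutional repositories without repository-specific code.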

49
Microsoft Strategy for e-Science
  • Microsoft intends to work with the scientific
    and library communities
  • to define open standard and/or interoperable
    high-level services, work flows and tools
  • to assist the community in developing open
    scholarly communication and interoperable
    repositories

50
Acknowledgements
  • With special thanks to Kelvin Droegemeier,
    Geoffrey Fox, Jeremy Frey, Dennis Gannon, Jim
    Gray, Yike Guo, Liz Lyon and Beth Plale