Title: UK e-Science Future Infrastructure for Scientific Data Mining, Integration and Visualisation
1 UK e-Science Future Infrastructure for Scientific Data Mining, Integration and Visualisation
Malcolm Atkinson, Director of National e-Science Centre, www.nesc.ac.uk
25th October 2002, SDMIV workshop, e-Science Institute, Edinburgh
2 Overview
- UK e-Science
- Reminder of Investment and Infrastructure
- International e-Science
- Examples and Collaboration
- Data Access and Integration
- Lego Bricks for Scientific Application Developers
- Tailored Application and Computing Scientists
- A Computer Scientist's Christmas List
- Diversity and Opportunity
- The Way Ahead
3 e-Science
- Fundamentally about Collaboration
- Sharing
- Ideas
- Thought processes and Stimuli
- Effort
- Resources
- Requires
- Communication
- Common understanding Framework
- Mechanisms for sharing fairly
- Organisation and Infrastructure
Requires Trust
Scientists (Biologists) have done this for
Centuries
4 e-Science (take 2)
Text, digital media, structured, organised
curated data, computable models, visualisation,
shared instruments, shared systems, shared
administration,
- Fundamentally about Collaboration
- Sharing
- Ideas
- Thought processes and Stimuli
- Effort
- Resources
- Requires
- Communication
- Common understanding Framework
- Mechanisms for sharing fairly
- Organisation and Infrastructure
Changing the ways Science is done
Nationally and Internationally Distributed,
Routine, Daily, Automated,
That Requires very Significant Investment in
Digital Systems and their Support
5 e-Science (take 3)
- Fundamentally about Collaboration
- Sharing
- Ideas
- Thought processes and Stimuli
- Effort
- Resources
- Requires
- Communication
- Common understanding Framework
- Mechanisms for sharing fairly
- Organisation and Infrastructure
Digital networks, digital work-places, digital
instruments,
Metadata, ontologies, standards, shared curated
data, shared codes,
Common platforms, shared software, shared
training,
Citation, Authentication, Authorisation,
Accounting, Provenance, Policies,
Shared Provision of Platform,
The Grid SHOULD make this much easier
by providing a common, supported high-level of
Software and Organisational infrastructure
6 Grid Expectations
- Persistence
- Always there, Always Working, Always Supported
- Stability
- You can build on foundations that don't move
- Trustworthy and Predictable
- Honours commitments
- Digital policies, digital contracts, security,
- Data integrity, longevity and accessibility
- Performance
- High-level Extensible
- The capabilities you need are already there
- Ubiquitous
- Your collaborators use it
7 Grid Reality
Political, Economic and Technical issues to Solve
- Persistence
- Always there, Always Working, Always Supported
- Stability
- You can build on foundations that don't move
- Trustworthy and Predictable
- Honours commitments
- Digital policies, digital contracts, security,
- Data integrity, longevity and accessibility
- Performance
- High-level Extensible
- The capabilities you need are already there
- Ubiquitous
- Your collaborators use it
Early days, but Open Grid Services link with Web Services and GGF standardisation
Only Show in Town
Not yet, but very substantial global effort to achieve this
Good basis for extension: commitment to basic functionality, WS community effort
Global industrial rallying cry: must work with Web Services
8 UK Grid Network
National e-Science Centre
Edinburgh
Glasgow
Newcastle
Access Grid always-on video walls
Belfast
Manchester
Daresbury Lab
Cambridge
Oxford
Hinxton
RAL
Cardiff
London
Southampton
9 SuperJanet4, June 2002
[Network map. Link speeds: 20Gbps, 10Gbps, 2.5Gbps, 622Mbps, 155Mbps. Backbone nodes: WorldCom Glasgow, Edinburgh, Manchester, Leeds, Reading, London, Bristol, Portsmouth. Regional networks: Scotland via Glasgow, Scotland via Edinburgh, NNW, NorMAN, YHMAN, Northern Ireland, EMMAN, MidMAN, EastNet, TVN, South Wales MAN, LMN, SWAN, BWEMAN, Kentish MAN, LeNSE. External Links.]
Tony Hey July 2001
10 National e-Science Centre
- Events
- Workshops
- Research Meetings
- International Meetings
- History of Events
- GGF5
- HPDC11
- Summer school
- > 50 workshops held
- > 1000 people in total
- Many return often
- Planned Events
- 25 workshops
- Conferences to 2005
- Visitors
- 3 arrived
- 4 arranged
- International collaboration, visits and visitors
- China
- Argonne National Lab
- SDSC
- NCSA
-
- Centre Projects
- Pilot Projects
- Regional Support
- Research Projects
- EPSRC, MRC, WT, SHEFC
Please use this Facility
11 A day in the life of NeSC
12 UCSF
UIUC
From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign
13 DataGrid Testbed
(>40) Testbed Sites
Dubna
Moscow
Lund
Estec KNMI
RAL
Berlin
IPSL
Prague
Paris
Brno
CERN
Lyon
Santander
Milano
Grenoble
PD-LNL
Torino
Madrid
Marseille
BO-CNAF
Pisa
Lisboa
Barcelona
ESRIN
Roma
Valencia
Catania
Francois.Etienne_at_in2p3.fr - Antonia.Ghiselli_at_cnaf.infn.it
14 A Simplified Grid Anatomy
Scientific Application
Application Developers
Grid Plumbing and Security Infrastructure
Operations Team
Owners
15 The Crux
Keep all the (pink) groups HAPPY
Scientific Application
Application Developers
Grid Plumbing and Security Infrastructure
Operations Team
Owners
16 An SDMIV Grid Anatomy
SDMIV Users
Scientific Application
Grid Plumbing and Security Infrastructure
Data Providers and Data Curators
17 Database Growth
PDB protein structures
18 Data Mining: Science vs Commerce
Science:
- Data in files: FTP a local copy/subset, ASCII or Binary.
- Each scientist builds own analysis toolkit
- Analysis is tcl script of toolkit on local data.
- Some simple visualization tools: x vs y
Commerce:
- Data in a database
- Standard reports for standard things.
- Report writers for non-standard things
- GUI tools to explore data.
- Decision trees
- Clustering
- Anomaly finders
Jim Gray UCSC April 2002
19 But some science is hitting a wall: FTP and GREP are not adequate
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years.
- Oh!, and 1 PB is about 10,000 disks
- At some point you need indices to limit search, and parallel data search and analysis
- This is where databases can help
- You can FTP 1 MB in 1 sec
- You can FTP 1 GB / min (about $1/GB)
- ... 2 days and $1K
- ... 3 years and $1M
50,000 kg, 250 kW, 60 racks, 120 m2
Jim Gray UCSC April 2002
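The GREP figures on this slide imply a sequential scan rate of somewhere between a few MB/s and a few tens of MB/s, which is why a petabyte takes years. The sketch below is a back-of-envelope check, not anything taken from the talk: it derives the implied rates and estimates how long the same scan would take if it were spread over the roughly 10,000 disks mentioned above. Decimal units (1 MB = 10^6 bytes), the 10 MB/s per-disk rate and the helper names are all assumptions made for illustration.

```python
# Decimal units are an assumption; the slide does not say which it uses.
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15
MINUTE, DAY, YEAR = 60, 86_400, 365 * 86_400

# (data size in bytes, scan time in seconds) as quoted on the slide.
slide_figures = {
    "1 MB": (MB, 1),
    "1 GB": (GB, MINUTE),
    "1 TB": (TB, 2 * DAY),
    "1 PB": (PB, 3 * YEAR),
}


def implied_rate_mb_per_s(size_bytes, seconds):
    """Sequential scan rate implied by scanning size_bytes in seconds."""
    return size_bytes / seconds / MB


def parallel_scan_days(size_bytes, rate_mb_per_s, workers):
    """Days needed if the data is split evenly over independent scanners."""
    return size_bytes / (rate_mb_per_s * MB * workers) / DAY


for label, (size, seconds) in slide_figures.items():
    print(f"{label}: about {implied_rate_mb_per_s(size, seconds):.1f} MB/s")

# Hypothetical: one scanner per disk, each reading 10 MB/s.
print(f"1 PB over 10,000 disks: {parallel_scan_days(PB, 10, 10_000):.2f} days")
```

Parallel scanning only shortens a full pass over the data; an index avoids most of the pass altogether, which is the point the slide makes about databases.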
20 OGSA OGSI
Grid Technology
Web Services
www.gridforum.org/ogsi-wg www.gridforum.org/ogsa-wg www.gridforum.org/
21 Web Services
- Rapid Integration
- Dynamic binding
- Commercial Power
- Financial and Political
- Independence
- Client from Service
- Service from Client
- Separation
- Function from Delivery
- Description
- WSDL, WSC, WSEF,
- Tools and Platforms
- Java ONE, Visual .NET
- WebSphere, Oracle,
www.w3c.org/TR/SOAP or TR/wsdl
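As a minimal sketch of the "function from delivery" separation, the snippet below builds a SOAP 1.1 request envelope using only the Python standard library; the resulting message is independent of whichever transport later carries it. The operation name `getStructure`, its `id` parameter and the service namespace `urn:example:sdmiv` are hypothetical, chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def build_envelope(operation, params, service_ns):
    """Return a SOAP 1.1 request as text; no transport is involved here."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The operation and its parameters live in the service's own namespace.
    request = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(request, f"{{{service_ns}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and namespace, for illustration only.
message = build_envelope("getStructure", {"id": "demo-a"}, "urn:example:sdmiv")
print(message)
```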
22 Grid Technology
- Virtual Organisations
- Sharing and Collaboration
- Security
- Single sign-on, delegation
- Distribution and fast FTP
- But Various Protocols
- Resource Management
- Discovery
- Process Creation
- Scheduling
- Monitoring
- Portability
- Ubiquitous APIs and Modules
- Government Agency Buy-in
- Industrial Buy-in
Foster, I., Kesselman, C. and Tuecke, S., The Anatomy of the Grid: Enabling Virtual Organisations, Intl. J. Supercomputer Applications, 15(3), 2001
http://www.gridforum.org/ogsi-wg
23 Open Grid Services Architecture
Industrial Commitment
Foster, I., Kesselman, C., Nick, J. and Tuecke, S., The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration
24 Scientific Data
- Deluge of Data
- Exponential growth
- Doubling times: Astronomy 12 months, Bio-Sequences 9 months, Functional Genomics 6 months, Bytes/dollar 12 to 18 months
- Not How big it is but
25 Scientific Data
- Deluge of Data
- Exponential growth
- Doubling times: Astronomy 12 months, Bio-Sequences 9 months, Functional Genomics 6 months, Bytes/dollar 12 to 18 months
- Not How big it is but
- What you do with it
- Sharing
- Curation
- Metadata
- Automated movement, access and integration
- Computational Access
26 Scientific Data
- Deluge of Data
- Exponential growth
- Doubling times: Astronomy 12 months, Bio-Sequences 9 months, Functional Genomics 6 months, Bytes/dollar 12 to 18 months
- Not How big it is but
- How you Embrace and Manage Change
- The Database is a Knowledge chest
- The Database is a Communication Hub
- Autonomously Managed (Curated) change
- An Essential part of e-BioMedical, Astronomical, ... Science and Engineering
Data Federation and Integration is Hard
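The doubling times quoted on these slides can be turned into growth factors with the usual exponential relation, factor = 2^(months elapsed / doubling time). The sketch below is an illustrative calculation only: the three-year horizon and the function names are assumptions, while the doubling times are the ones listed above.

```python
def growth_factor(years, doubling_months):
    """How many times larger a quantity becomes after the given years."""
    return 2 ** (12 * years / doubling_months)


def relative_storage_cost(years, data_doubling_months, storage_doubling_months):
    """Growth in cost of holding the data once storage price gains are applied."""
    return (growth_factor(years, data_doubling_months)
            / growth_factor(years, storage_doubling_months))


# Doubling times in months, as quoted on the slide.
data_doubling = {"Astronomy": 12, "Bio-Sequences": 9, "Functional Genomics": 6}

YEARS = 3  # assumed horizon, for illustration only
for field, months in data_doubling.items():
    print(f"{field}: x{growth_factor(YEARS, months):.0f} after {YEARS} years")

# Functional genomics data against the slower end of bytes/dollar (18 months).
print(f"Relative cost: x{relative_storage_cost(YEARS, 6, 18):.0f}")
```

Whenever the data doubles faster than bytes per dollar, the cost of simply keeping it rises, before any curation or integration work is counted.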
27 Wellcome Trust Cardiovascular Functional Genomics
28 Data Access and Integration
- Central to e-Science: Astronomy, Earth Sciences, Ecology, Biology, Medicine, ...
- Collaboration
- Shared Databases
- Curated Knowledge
- Accumulated Observations
- Accumulated Simulations
- Computation
- Data mining
- Input to models
- Calibration of models
- Presentation
- Publication of results
- Visualisation
29 GGF DAIS WG
- Chairs
- Norman Paton (Manchester Uni.)
- Leanne Guy (CERN)
- Dave Pearson (Oracle UK)
- Activity
- BoF GGF4 Toronto
- WG Meeting GGF5 Edinburgh
- Papers for GGF6
- Workshops and Mail lists
- Goals
- Agree Standards for Database Access and Integration
- Freely available reference implementations
- OGSA-DAI one source and focus for discussions
Norman Paton, Inderpal Narang, Leanne Guy, Susan Malaika, Greg Riccardi,
http://www.cs.man.ac.uk/grid-db/
30 OGSA-DAI project
- Lego kit for Data Access and Integration
- Components for e-Science Applications
- Accelerated Application Development
- Multiple Data Models
- Distributed Data
- Access via Grid Proxies
- Integration, Translation and Transformation
- Open Source Reference Implementation
- For DAIS-WG standard
- Trigger for Component Construction
- Start a community
31 OGSA-DAI Partners
Partners: EPCC and NeSC, IBM UK, IBM USA, Manchester e-SC, Newcastle e-SC, Oracle
[Map labels: IBM USA, EPCC and NeSC, Glasgow, Newcastle, Belfast, Manchester, Daresbury Lab, Oxford, Oracle, RAL, Cardiff, London, IBM Hursley, Southampton]
3 million, 18 months, started February 2002
32 Primary Components
33 Advanced Components
34 Composed Components
35 Composing Components
[Diagram: OGSA-DAI Components linked by Data Transport]
36 DAI Key Components
- GridDataService (GDS): Access to data and DB operations
- GridDataServiceFactory (GDSF): Makes GDS and GDSF
- GridDataServiceRegistry (GDSR): Discovery of GDS(F) and Data
- GridDataTranslationService: Translates or Transforms Data
- GridDataTransportDepot (GDTD): Data transport with persistence
Relational and XML models supported
Role-based Authorisation
Binary structured files
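The division of labour among the registry, factory and service components can be sketched as follows: a client discovers a factory through the registry, asks the factory to make a service, then sends database operations to that service. This is a toy illustration of the pattern only, not the OGSA-DAI API; the class bodies, method names (`register`, `find`, `create_service`, `perform`), the in-memory SQLite resource and the sample rows are all hypothetical.

```python
import sqlite3


class GridDataService:
    """Stand-in for a GDS: gives access to one data resource."""

    def __init__(self, connection):
        self._connection = connection

    def perform(self, sql, params=()):
        """Run one query against the wrapped resource and return all rows."""
        return self._connection.execute(sql, params).fetchall()


class GridDataServiceFactory:
    """Stand-in for a GDSF: makes GDS instances for a named resource."""

    def __init__(self, name, setup_sql):
        self.name = name
        self._setup_sql = setup_sql

    def create_service(self):
        # Each service gets its own in-memory database (illustration only).
        connection = sqlite3.connect(":memory:")
        connection.executescript(self._setup_sql)
        return GridDataService(connection)


class GridDataServiceRegistry:
    """Stand-in for a GDSR: lets clients discover factories by name."""

    def __init__(self):
        self._factories = {}

    def register(self, factory):
        self._factories[factory.name] = factory

    def find(self, name):
        return self._factories[name]


# Made-up sample data.
SETUP = """
CREATE TABLE structures (id TEXT PRIMARY KEY, residues INTEGER);
INSERT INTO structures VALUES ('demo-a', 120);
INSERT INTO structures VALUES ('demo-b', 310);
"""

registry = GridDataServiceRegistry()
registry.register(GridDataServiceFactory("demo-proteins", SETUP))

# Discover the factory, make a service, query the data.
service = registry.find("demo-proteins").create_service()
rows = service.perform(
    "SELECT id FROM structures WHERE residues > ? ORDER BY id", (200,)
)
print(rows)
```

Keeping discovery, creation and access in separate components means a client needs to know only a resource name in advance, not where or how the data is held.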
37 OGSA Relationship
38 DAI portType Usage
39 Distributed Query
40 OGSA-DAI Time Line
[Timeline milestones, in the order they appear on the slide:]
- WS GSI UK support (> 100 downloads)
- XML OGSA Prototypes for Early Adopters
- Design Documents and Demos for DAIS WG at GGF5
- XML OGSA Prototype Available
- RDB GT2 / OGSA Prototypes Available
- GGF6 WG Papers and Prototypes
- Ship Alpha Release for GT3 Integration
- Presentation and Beta at GGF7
- Productisation, RAMPS Extension
[Timeline dates: Feb 02, May 02, Jul 02, Sep 02, Dec 02, Feb 03, May 03, Sep 03]
Phase 2 Starts
Phase 1 Starts
41 OGSA-DAI Summary
- On Schedule and Going Well
- Contributions via DAIS-WG at GGF5 and GGF6
- Releases with GT3 Releases scheduled
- Status: Early Days
- Released prototypes
- Tested Architectural Design
- Using OGSA
- Working with Early Adopter Pilot Projects
- AstroGrid and MyGrid
- First PRODUCT release Dec 02
- Influence OGSA-DAI direction
- Via DAIS-WG and direct messages to us
42 Data Processing
- Processing Characteristics
- Well defined work flow
- Correction, calibration, transformation, filtering, merging
- Relatively static reference data
- Stable processing functions (audited changes)
- Periodic reprocessing from archive
Dave Pearson Provenance and Derivation workshop
18 Oct 02, Chicago
43 Analysis and Interpretation
- Analysis Characteristics
- Variable workflow
- Standard functions
- Standard and personal filtering and summarisation
- Retain drill down capability
Dave Pearson Provenance and Derivation workshop
18 Oct 02, Chicago
44 Analysis and Interpretation
- Conclusions/Inferences
- Descriptions
- Trends
- Correlations
- Relationships
- Analysis and Interpretation Characteristics
- Highly dynamic work flow
- Multiple data types
- Volatile data
- Annotations, inferences, conclusions
- Evidential reasoning
- Shared multiple versions of truth
- Periodic version consolidation
Dave Pearson Provenance and Derivation workshop
18 Oct 02, Chicago
45 Metadata Requirements
- Technical Metadata
- Direct referencing: Physical location and data schema/structure
- Data currency/status: version, time stamping
- Accreditation/Access permissions: Ownership (Dublin Core)
- Query time/Governance: data volume, no. of records, access paths
- Contextual Metadata
- Logical referencing: physical data, semantic/syntactic ontologies
- Lexical translation: Thesaurus, ontological mapping
- Named derivations (summarisations)
- Scope of Requirements
- All science communities
- Related to provenance
Dave Pearson Provenance and Derivation workshop
18 Oct 02, Chicago
46 Metadata Requirements
- Data Versioning
- Distinguish latest/agreed version of data
- Maintain history record of change
- Synchronise and mirror replicated data
- Distinguish shared and personal interpretations and/or annotations
- Provenance
- Record of data processing: calibration, filtering, transformation
- Record of workflow: methods, standards and protocols
- Reasoning: evidential justification for inferences and conclusions
- Scope of Requirements
- All science communities
- Includes Technical and Contextual Metadata
Dave Pearson Provenance and Derivation workshop
18 Oct 02, Chicago
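One way to picture the "record of data processing" requirement is a dataset that carries its own processing history, with each step logging its name, its parameters and fingerprints of its input and output. The sketch below is a toy illustration under that assumption, not a scheme proposed in the talk; `TrackedData`, `fingerprint`, the two processing functions and the sample values are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(values):
    """Short content hash, used to link one step's output to the next input."""
    return hashlib.sha256(json.dumps(values).encode("utf-8")).hexdigest()[:12]


@dataclass
class TrackedData:
    values: list
    provenance: list = field(default_factory=list)

    def apply(self, step_name, func, **params):
        """Run one processing step and return new data with an extended record."""
        new_values = func(self.values, **params)
        record = {
            "step": step_name,
            "params": params,
            "input": fingerprint(self.values),
            "output": fingerprint(new_values),
        }
        # The original object is left untouched, so earlier versions survive.
        return TrackedData(new_values, self.provenance + [record])


def calibrate(values, offset):
    return [v + offset for v in values]


def keep_above(values, threshold):
    return [v for v in values if v > threshold]


raw = TrackedData([1.0, 5.0, 9.0])  # made-up measurements
result = (raw.apply("calibration", calibrate, offset=0.5)
             .apply("filtering", keep_above, threshold=2.0))

for record in result.provenance:
    print(record)
```

Because each record stores both fingerprints, the chain from raw data to derived result can be checked later, which touches on the granularity and storage questions raised under Provenance Issues.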
47 Provenance Issues
- Schema evolution
- Granularity of record
- Processed v Derived
- Inheritance
- Lack of structured annotations, ontologies
- Interactive analysis and dynamic workflow
- Multiple derived data sources
- Context of usage
- Best practice can change
- Multiple versions of the truth
- Evidential reasoning
- Existing data and applications
- Where is the provenance record stored
Dave Pearson Provenance and Derivation workshop
18 Oct 02, Chicago
48 Collaborative Annotation
- See DAS
- Distributed Annotation Service
- Challenges
- Autonomy
- Selective viewing
- Identification
- Provenance
- Derivation
49 Biomedical e-Scientists
- Is this one species?
- Understanding bird energy
- Understanding a river / ocean interaction
- Understanding a biochemical pathway
- Understanding a cell
- Understanding a Heart or Brain
- Understanding Rhododendra
- Understanding Evolution
-
- No One-Size fits all solutions
- But sharable and re-usable components
50 Opportunities
- Many, many
- More than we can address
- Compute needs
- Data management needs
- Data integration needs
-
- Must choose some pioneers
- To meet a range of common requirements
- To provoke rich high-level platform
- To generate re-usable components
- A Long-Term Commitment Needed
51 Advancing SDMIV Grid
SDMIV Users
Scientific Application
SDMIV (Grid) Application Component Library
Grid Plumbing and Security Infrastructure
52 Summary
- e-Science
- Data as well as Compute Challenges
- Needed to be put together
- Need ubiquitous supported consistent platforms
- Grid
- A (potentially) invaluable platform
- Only show in town
- Data Integration
- Hard, so Develop and Use a Standard kit of parts
- Started to build the kit
- No ready made general integration
- Combines application and computing science
- Opportunities
- No one-size fits all, but re-usable subsystems
- Invest in wider range of Problem driven pioneering
- Strategic choices needed