Title: HPGC 2006 Workshop on High-Performance Grid Computing
1 HPGC 2006 Workshop on High-Performance Grid Computing at IPDPS 2006, Rhodes Island, Greece, April 25-29, 2006
Major HPC Grid Projects: From Grid Testbeds to Sustainable High-Performance Grid Infrastructures
Wolfgang Gentzsch, D-Grid, RENCI, GGF GFSG, e-IRG, wgentzsch@d-grid.de
Thanks to Eric Aubanel, Virendra Bhavsar, Michael Frumkin, Rob F. Van der Wijngaart, and Intel
3 Focus
- on HPC capabilities of grids
- on sustainable grid infrastructures
- on six selected major HPC grid projects:
- UK e-Science, US TeraGrid, NAREGI (Japan),
- EGEE and DEISA (Europe), D-Grid (Germany)
- and apologies for not mentioning your favorite grid project
4 Too Many Major Grids to Mention Them All
5 UK e-Science Grid
- Started in early 2001
- 400 million funding
- Application independent
6 NGS Overview: User View
- Resources
- 4 core clusters
- UK's national HPC services
- A range of partner contributions
- Access
- Support UK academic researchers
- Lightweight peer review for limited free resources
- Central help desk
- www.grid-support.ac.uk
7 NGS Overview: Organisational View
- Management
- GOSC Board: strategic direction
- Technical Board: technical coordination and policy
- Grid Operations Support Centre
- Manages the NGS
- Operates the UK CA and over 30 RAs
- Operates the central helpdesk
- Policies and procedures
- Manages and monitors partners
8 NGS Use
(Charts: files stored, CPU time by user, users by institution; over 320 users)
9 NGS Development
- Core node refresh
- Expand partnership
- HPC
- Campus grids
- Data centres
- Digital repositories
- Experimental facilities
- Baseline services
- Aim to map user requirements onto standard solutions
- Support convergence/interoperability
- Move further towards project (VO) support
- Support collaborative projects
- Mixed economy
- Core resources
- Shared resources
- Project/contract-specific resources
10 The Architecture of Gateway Services
Grid Portal Server
TeraGrid Gateway Services
Proxy Certificate Server / vault
User Metadata Catalog
Application Workflow
Application Deployment
Application Events
Resource Broker
Replica Mgmt
App. Resource catalogs
Core Grid Services
Security
Notification Service
Data Management Service
Grid Orchestration
Resource Allocation
Accounting Service
Policy
Administration Monitoring
Reservations And Scheduling
Courtesy Jay Boisseau
Web Services Resource Framework Web Services
Notification
Physical Resource Layer
11 TeraGrid Use
1600 users
600 users
12 Delivering User Priorities in 2005
Results of in-depth discussions with 16 TeraGrid user teams during the first annual user survey (August 2004). Capabilities are plotted by overall score (depth of need) against partners in need (breadth of need), grouped by capability type: data, grid computing, and science gateways.
Capabilities surveyed: remote file read/write, high-performance file transfer, coupled applications and co-scheduling, grid portal toolkits, grid workflow tools, batch metascheduling, global file system, client-side computing tools, batch-scheduled parameter sweep tools, advanced reservations.
13 National Research Grid Infrastructure (NAREGI), 2003-2007
- Petascale grid infrastructure R&D for future deployment
- Funding: 45 mil (US); 16 mil x 5 (2003-2007); 125 mil total
- PL: Ken Miura (Fujitsu, NII)
- Sekiguchi (AIST), Matsuoka (Titech), Shimojo (Osaka-U), Aoyagi (Kyushu-U)
- Participation by multiple (> 3) vendors: Fujitsu, NEC, Hitachi, NTT, etc.
- Not an academic project: 100 FTEs
- Follow and contribute to GGF standardization, esp. OGSA
Participants in the focused grand-challenge grid application areas: NEC, Osaka-U, Titech, AIST, Fujitsu, IMS, Hitachi, U-Kyushu
14 NAREGI Software Stack (Beta Ver. 2006)
- Grid-Enabled Nano-Applications (WP6)
- Grid PSE, Grid Visualization, Grid Workflow (WFML (Unicore WF)) (WP3)
- Grid Programming (WP2): Grid RPC, Grid MPI
- Super Scheduler, Distributed Information Service (CIM), Data (WP4)
- Packaging (WP1): WSRF (GT4, Fujitsu, WP1), GT4 and other services
- Grid VM (WP1)
- Grid Security and High-Performance Grid Networking (WP5)
Runs over SuperSINET, connecting NII, IMS, research organizations, and major university computing centers (computing resources and virtual organizations).
15 GridMPI
- MPI applications run on the Grid environment
- Metropolitan-area, high-bandwidth environment: >= 10 Gbps, <= 500 miles (less than 10 ms one-way latency)
- Parallel computation larger than the metropolitan area
- MPI-IO
A single (monolithic) MPI application runs over the Grid environment, spanning computing resource sites A and B across a wide-area network.
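The 500-mile / 10-ms rule of thumb above can be sanity-checked with a small propagation-latency estimate (my illustration, assuming signals in optical fiber travel at roughly two-thirds of the vacuum speed of light):

```python
# Rough one-way propagation latency in optical fiber.
# Assumption (not from the slides): light in fiber travels at
# about 2/3 of its vacuum speed; real paths add routing overhead.

C_VACUUM_KM_PER_S = 299_792.458
FIBER_SPEED_KM_PER_S = C_VACUUM_KM_PER_S * 2 / 3

def one_way_latency_ms(distance_km: float) -> float:
    """Propagation-only latency; queueing and switching come on top."""
    return distance_km / FIBER_SPEED_KM_PER_S * 1000.0

distance_km = 500 * 1.609  # 500 miles in km
print(f"{one_way_latency_ms(distance_km):.1f} ms")  # about 4 ms, under the 10 ms bound
```

So 500 miles of fiber costs about 4 ms one way just in propagation, which is why the slide's metropolitan-area bound leaves headroom for switching and protocol overhead.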
16 EGEE Infrastructure
(Map legend: country participating in EGEE)
- Scale
- > 180 sites in 39 countries
- 20 000 CPUs
- > 5 PB storage
- > 10 000 concurrent jobs per day
- > 60 Virtual Organisations
17 The EGEE Project
- Objectives
- Large-scale, production-quality infrastructure for e-Science
- Leveraging national and regional grid activities worldwide
- Consistent, robust and secure
- Improving and maintaining the middleware
- Attracting new resources and users from industry as well as science
- EGEE
- 1 April 2004 - 31 March 2006
- 71 leading institutions in 27 countries, federated in regional Grids
- EGEE-II
- Proposed start 1 April 2006 (for 2 years)
- Expanded consortium: > 90 partners in 32 countries (also non-European partners)
- Related projects, incl. BalticGrid, SEE-GRID, EUMedGrid
18 Applications on EGEE
- More than 20 applications from 7 domains
- High Energy Physics: 4 LHC experiments (ALICE, ATLAS, CMS, LHCb); BaBar, CDF, DØ, ZEUS
- Biomedicine: bioinformatics (Drug Discovery, GPS@, Xmipp_MLrefine, etc.); medical imaging (GATE, CDSS, gPTM3D, SiMRI 3D, etc.)
- Earth Sciences: Earth observation, solid Earth physics, hydrology, climate
- Computational Chemistry
- Astronomy: MAGIC, Planck
- Geophysics: EGEODE
- Financial Simulation: E-GRID
Another 8 applications from 4 domains are in the evaluation stage.
19 Steps for Grid-Enabling Applications II
- Tools to easily access Grid resources through high-level Grid middleware (gLite)
- VO management (VOMS etc.)
- Workload management
- Data management
- Information and monitoring
- Applications can either interface directly to gLite, or use higher-level services such as portals, application-specific workflow systems, etc.
20 EGEE Performance Measurements
- Information about resources (static and dynamic)
- Computing: machine properties (CPUs, memory architecture, ...), platform properties (OS, compiler, other software, ...), load
- Data: storage location, access properties, load
- Network: bandwidth, load
- Information about applications
- Static computing and data requirements, to reduce the search space
- Dynamic changes in computing and data requirements (might need re-scheduling)
- Plus information about Grid services (static and dynamic)
- Which services are available
- Status
- Capabilities
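As a sketch of how static requirements reduce the search space before any dynamic load information is consulted, the filter below drops sites that cannot satisfy a job at all (field names are illustrative, not gLite's actual information schema):

```python
# Hypothetical resource records; a real grid information service
# (e.g. one publishing CIM or GLUE data) would supply these.
resources = [
    {"name": "siteA", "cpus": 64,  "os": "SL3", "free_gb": 500},
    {"name": "siteB", "cpus": 512, "os": "SL3", "free_gb": 2000},
    {"name": "siteC", "cpus": 256, "os": "AIX", "free_gb": 800},
]

def match_static(resources, min_cpus, os_name, min_free_gb):
    """Keep only sites meeting the job's static requirements."""
    return [r for r in resources
            if r["cpus"] >= min_cpus
            and r["os"] == os_name
            and r["free_gb"] >= min_free_gb]

candidates = match_static(resources, min_cpus=128, os_name="SL3", min_free_gb=1000)
print([r["name"] for r in candidates])  # ['siteB']
```

Only the surviving candidates then need to be ranked by dynamic properties such as current load.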
21 Sustainability Beyond EGEE-II
- Need to prepare for a permanent Grid infrastructure
- Maintain Europe's leading position in global science Grids
- Ensure reliable and adaptive support for all sciences
- Independent of project funding cycles
- Modelled on the success of GÉANT
- Infrastructure managed centrally in collaboration with national bodies
22 e-Infrastructures Reflection Group
- e-IRG Mission:
- to support, on the political, advisory and monitoring level,
- the creation of a policy and administrative framework
- for the easy and cost-effective shared use of electronic resources in Europe
- (focusing on Grid computing, data storage, and networking resources)
- across technological, administrative and national domains.
23 DEISA Perspectives: Towards Cooperative Extreme Computing in Europe
Victor Alessandrini, IDRIS - CNRS, va@idris.fr
24 The DEISA Supercomputing Environment (21,900 processors and 145 Tf in 2006, more than 190 Tf in 2007)
- IBM AIX super-cluster
- FZJ Jülich: 1312 processors, 8.9 teraflops peak
- RZG Garching: 748 processors, 3.8 teraflops peak
- IDRIS: 1024 processors, 6.7 teraflops peak
- CINECA: 512 processors, 2.6 teraflops peak
- CSC: 512 processors, 2.6 teraflops peak
- ECMWF: 2 systems of 2276 processors each, 33 teraflops peak
- HPCx: 1600 processors, 12 teraflops peak
- BSC: IBM PowerPC Linux system (MareNostrum), 4864 processors, 40 teraflops peak
- SARA: SGI Altix Linux system, 1024 processors, 7 teraflops peak
- LRZ: Linux cluster (2.7 teraflops) moving to an SGI Altix system (5120 processors and 33 teraflops peak in 2006, 70 teraflops peak in 2007)
- HLRS: NEC SX8 vector system, 646 processors, 12.7 teraflops peak
25 DEISA Objectives
- To enable Europe's terascale science by the integration of Europe's most powerful supercomputing systems
- Enabling scientific discovery across a broad spectrum of science and technology is the only criterion for success
- DEISA is a European supercomputing service built on top of existing national services
- Integration of national facilities and services, together with innovative operational models
- Main focus is HPC and Extreme Computing applications that cannot be supported by the isolated national services
- The service-providing model is the transnational extension of national HPC centers:
- Operations
- User support and applications enabling
- Network deployment and operation
- Middleware services
26 About HPC
- Dealing with large complex systems requires exceptional computational resources. For algorithmic reasons, resources grow much faster than the systems' size and complexity.
- Dealing with huge datasets, involving large files. Typical datasets are several PBytes.
- Little usage of commercial or public-domain packages. Most applications are corporate codes incorporating specialized know-how. Specialized user support is important.
- Codes are fine-tuned and targeted for a relatively small number of well-identified computing platforms. They are extremely sensitive to the production environment.
- The main requirement for high performance is bandwidth (processor to memory, processor to processor, node to node, system to system).
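The first point, that resources grow faster than system size for algorithmic reasons, can be illustrated with a dense linear solve, where memory scales as O(N^2) and arithmetic as O(N^3) (a textbook example, not one from the talk):

```python
# Dense NxN linear solve: O(N^2) memory, O(N^3) arithmetic (LU).
# Doubling the problem size quadruples memory and multiplies
# compute by eight.

def dense_solve_cost(n):
    mem_words = n * n   # matrix entries to store
    flops = n ** 3      # proportional to the LU flop count (2/3)*N^3
    return mem_words, flops

m1, f1 = dense_solve_cost(10_000)
m2, f2 = dense_solve_cost(20_000)
print(m2 / m1, f2 / f1)  # 4.0 8.0
```

This superlinear growth is why resolving a modestly larger or more complex system can exhaust a machine that handled the smaller one comfortably.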
27 HPC and Grid Computing
- Problem: the speed of light is not fast enough
- Finite signal propagation speed boosts message-passing latencies in a WAN from a few microseconds to tens of milliseconds (if A is in Paris and B in Helsinki)
- If A and B are two halves of a tightly coupled complex system, communications are frequent and the enhanced latencies will kill performance
- Grid computing works best for embarrassingly parallel applications, or coupled software modules with limited communications
- Example: A is an ocean code, and B an atmospheric code. There is no bulk interaction.
- Large, tightly coupled parallel applications should be run on a single platform. This is why we still need high-end supercomputers.
- DEISA implements this requirement by rerouting jobs and balancing the computational workload at a European scale
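A toy per-iteration cost model (my illustration, not from the talk) makes the latency argument concrete: if each step does some computation and exchanges many small messages, the WAN latency term quickly dominates.

```python
# Per-iteration time of a message-passing code, modeled as
# compute time plus (number of messages) x (one-way latency).
# Numbers below are illustrative assumptions, not measurements.

def iter_time_ms(compute_ms, n_messages, latency_ms):
    return compute_ms + n_messages * latency_ms

# Tightly coupled step: 50 ms of compute, 100 small exchanges.
cluster = iter_time_ms(50, 100, 0.005)  # ~5 us LAN latency
wan = iter_time_ms(50, 100, 20)         # ~20 ms Paris-Helsinki
print(cluster, wan)  # 50.5 2050.0
```

The same step that costs 50.5 ms on a cluster costs over 2 seconds across the WAN, a roughly 40x slowdown from latency alone, which is exactly why tightly coupled jobs stay on one platform.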
28 Applications for Grids
- Single-CPU jobs: job mix, many users, many serial applications; suitable for grids (e.g. in universities and research centers)
- Array jobs: 100s/1000s of jobs, one user, one serial application, varying input parameters; suitable for grids (e.g. parameter studies in optimization, CAE, genomics, finance)
- Massively parallel jobs, loosely coupled: one job, one user, one parallel application, no/low communication, scalable; fine-tune for grids (time-explicit algorithms, film rendering, pattern recognition)
- Parallel jobs, tightly coupled: one job, one user, one parallel application, high interprocess communication; not suitable for distribution over the grid, but for a parallel system in the grid (time-implicit algorithms, direct solvers, large linear-algebra equation systems)
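The array-job pattern above can be sketched locally with a thread pool standing in for grid worker nodes (illustrative only; a real sweep would submit independent batch jobs to the grid):

```python
# Parameter sweep: one serial function, many independent inputs.
# ThreadPoolExecutor stands in for grid nodes here; the key property
# is that runs share no state, so they distribute trivially.

from concurrent.futures import ThreadPoolExecutor

def simulate(param: float) -> float:
    """Stand-in for one serial application run with one input value."""
    return param ** 2

params = [0.1 * i for i in range(100)]  # 100 independent runs
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(simulate, params))

print(len(results))  # 100
```

Because no run depends on another, the sweep scales with the number of available workers, which is what makes array jobs such a natural grid workload.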
29 Objectives of the e-Science Initiative
German D-Grid project, part of the 100 million Euro e-Science program in Germany
- Building one Grid infrastructure in Germany
- Combine existing German grid activities
- Development of e-science services for the research community
- Science Service Grid: services for scientists
- Important: sustainability
- Production grid infrastructure after the funding period
- Integration of new grid communities (2nd generation)
- Evaluation of new business models for grid services
30 e-Science Projects
D-Grid
Knowledge Management
Astro-Grid
C3-Grid
HEP-Grid
IN-Grid
MediGrid
ONTOVERSE
WIKINGER
WIN-EM
Textgrid
Im Wissensnetz
. . .
Generic Grid Middleware and Grid Services
eSciDoc
VIOLA
31 DGI: D-Grid Middleware Infrastructure
- User: application development and user access (GAT API, plug-ins, GridSphere, UNICORE)
- High-level Grid services: scheduling / workflow management, monitoring, LCG/gLite, data management
- Basic Grid services: accounting / billing, user/VO management, Globus 4.0.1, security
- Resources in D-Grid: distributed compute resources, network infrastructure, distributed data archives, data/software
32 Key Characteristics of D-Grid
- Generic Grid infrastructure for German research communities
- Focus on sciences and scientists, not industry
- Strong influence of international projects: EGEE, DEISA, CrossGrid, CoreGrid, GridLab, GridCoord, UniGrids, NextGrid, ...
- Application-driven (80% of funding), not infrastructure-driven
- Focus on implementation, not research
- Phase 1 + 2: 50 MEuro, 100 research organizations
33 Conclusion: Moving towards Sustainable Grid Infrastructures, OR: Why Grids Are Here to Stay!
34 Reason 1: Benefits
- Resource utilization: increase from 20% to 80%
- Productivity: more work done in shorter time
- Agility: flexible actions and reactions
- On demand: get resources when you need them
- Easy access: transparent, remote, secure
- Sharing: enable collaboration over the network
- Failover: migrate/restart applications automatically
- Resource virtualization: access compute services, not servers
- Heterogeneity: platforms, OSs, devices, software
- Virtual organizations: build and dismantle on the fly
35 Reason 2: Standards. The Global Grid Forum
- Community-driven set of working groups that are developing standards and best practices for distributed computing efforts
- Three primary functions: community, standards, and operations
- Standards areas: Infrastructure, Data, Compute, Architecture, Applications, Management, Security, and Liaison
- Community areas: Research Applications, Industry Applications, Grid Operations, Technology Innovations, and Major Grid Projects
- A Community Advisory Board represents the different communities and provides input and feedback to GGF
36 Reason 3: Industry. EGA, Enterprise Grid Alliance
- Industry-driven consortium to implement standards in industry products and make them interoperable
- Founding members: EMC, Fujitsu Siemens Computers, HP, NEC, Network Appliance, Oracle and Sun, plus 20 associate members
- May 11, 2005: Enterprise Grid Reference Model v1.0
37 Reason 3: Industry. EGA, Enterprise Grid Alliance (cont.)
Feb 2006: GGF and EGA signed a letter of intent to merge. A joint team is planning the transition, expected to be complete this summer.
38 Reason 4: OGSA, ONE Open Grid Services Architecture
OGSA = Web Services + Grid Technologies
OGSA (Open Grid Services Architecture) integrates grid technologies with Web Services (OGSA builds on WS-RF) and defines the key components of the grid.
OGSA enables the integration of services and resources across distributed, heterogeneous, dynamic, virtual organizations, whether within a single enterprise or extending to external resource-sharing and service-provider relationships.
39 Reason 5: Quasi-Standard Tools. Example: The Globus Toolkit
- The Globus Toolkit provides four major functions for building grids
Courtesy Gridwise Technologies
40 . . . and UNICORE
- Seamless, secure, intuitive access to distributed resources and data
- Available as open source
- Features: intuitive GUI with single sign-on, X.509 certificates for AA, workflow engine for multi-site, multi-step workflows, job monitoring, application support, secure data transfer, resource management, and more
- In production
Courtesy Achim Streit, FZJ
41 Globus 2.4 and UNICORE
WS-Resource-based resource management framework for dynamic resource information and resource negotiation.
(Diagram: client, portal and command-line front ends connect via WS-RF to a gateway / service registry; behind the gateway sit the workflow engine, file transfer, user management (AAA), network job supervisor, monitoring, resource management, and application support, all exposed through WS-RF interfaces.)
Courtesy Achim Streit, FZJ
42 Reason 6: Global Grid Community
43 Projects/Initiatives, Testbeds, Companies
- Altair
- Avaki
- Axceleon
- Cassatt
- Datasynapse
- Egenera
- Entropia
- eXludus
- GridFrastructure
- GridIron
- GridSystems
- Gridwise
- GridXpert
- HP Utility Data Center
- IBM Grid Toolbox
- Kontiki
- Metalogic
- Noemix
- Oracle 10g
- CO Grid
- Compute-against-Cancer
- D-Grid
- DeskGrid
- DOE Science Grid
- EGEE
- EuroGrid
- European DataGrid
- FightAIDS_at_home
- Folding_at_home
- GRIP
- NASA IPG
- NC BioGrid
- NC Startup Grid
- NC Statewide Grid
- NEESgrid
- NextGrid
- Nimrod
- Ninf
- ActiveGrid
- BIRN
- Condor-G
- Deisa
- Dame
- EGA
- EnterTheGrid
- GGF
- Globus
- Globus Alliance
- GridBus
- GridLab
- GridPortal
- GRIDtoday
- GriPhyN
- I-WAY
- Knowledge Grid
- Legion
- MyGrid
44 FP6 Grid Technologies Projects
Call 5 start: Summer 2006. EU funding: 124 M, supporting the NESSI ETP Grid community.
Topics: Grid services and business models; trust and security; platforms and user environments; data, knowledge, semantics, mining.
Instrument types: specific support action, integrated project, network of excellence, specific targeted research project.
45 Reason 9: Enterprise Grids
(Diagram: SunRay access, browser access via GEP, and workstation access; optional control network (Gbit-E); Myrinet-connected servers, blades, VIZ, and Linux racks; workstations with Sun Fire Link; Grid Manager; data network (Gbit-E) with NAS/NFS: simple NFS, HA NFS, scalable QFS/NFS.)
46 Enterprise Grid Reference Architecture
(Same diagram, with the layers labeled: Access (SunRay, browser via GEP, workstation), Compute (Myrinet-connected servers, blades, VIZ, Linux racks, workstations; Grid Manager), and Data (data network (Gbit-E); NAS/NFS: simple NFS, HA NFS, scalable QFS/NFS).)
47 1000s of Enterprise Grids in Industry
- Life Sciences
- Startup and cost efficient
- Custom research or limited use applications
- Multi-day application runs (BLAST)
- Exponential Combinations
- Limited administrative staff
- Complementary techniques
- Electronic Design
- Time to Market
- Fastest platforms, largest Grids
- License Management
- Well established application suite
- Large legacy investment
- Platform Ownership issues
- Financial Services
- Market simulations
- Time IS Money
- Proprietary applications
- Multiple Platforms
- Multiple scenario execution
- Need instant results analysis tools
- High Performance Computing
- Parallel Reservoir Simulations
- Geophysical Ray Tracing
- Custom in-house codes
- Large scale, multi-platform execution
48 Reason 10: Grid Service Providers. Example: BT
Pre-Grid IT asset usage: 10-15%
- Inside the data center, within the firewall
- Virtual use of own IT assets
- The Grid virtualiser engine inside the firewall
- Opens up under-used ICT assets
- Improves TCO, ROI and application performance
- BUT:
- An intra-enterprise Grid is self-limiting
- The pool of virtualised assets is restricted by the firewall
- Does not support inter-enterprise usage
- BT is focussing on managed Grid solutions
(Diagram: enterprise WANs/LANs, virtualised assets, Grid engine)
Post-Grid IT asset usage: 70-75%
Courtesy Piet Bel, BT
49 BT's Virtual Private Grid (VPG)
(Diagram: enterprise WANs/LANs with virtualised IT assets and a Grid engine, connected to the BT network's Grid engine)
Courtesy Piet Bel, BT
50 Reason 11: There Will Be a Market for Grids
51 General Observations on Grid Performance
- Today, there are 100s of important grid projects around the world
- GGF identifies about 15 research projects with major impact
- Most research grids focus on HPC and collaboration; most industry grids focus on utilization and automation
- Many grids are driven by user/application needs; few grid projects are driven by infrastructure research
- Few projects focus on performance/benchmarks, where performance is mostly seen at the job/computation/application level
- Need for metrics and measurements that help us understand grids
- In a grid, application performance has 3 major areas of concern: system capabilities, network, and software infrastructure
- Evaluating performance in a grid is different from classic benchmarking, because grids are dynamically changing systems incorporating new components
52 The Grid Engine
Thank You!
wgentzsch@d-grid.de