Title: Trends and Gaps in the Emerging Middleware and the Impact on Handling Large Data
1. Trends and Gaps in the Emerging Middleware and the Impact on Handling Large Data
- Philip Papadopoulos
- BIRN-CC, OptIPuter
- Program Director, Grids and Clusters, SDSC
2. Agenda
- Hardware technology trends: the performance gap in storage systems
- BIRN as an example distributed data grid
- OptIPuter: investigating how networking changes the fundamentals of software/hardware decomposition
- Basic Grid services as the emerging building block for distributed systems
- Summary of gaps, from the IT perspective
3. We're at a Crossing of Technology Exponentials
- Or a triple-point of phase change
4. Technology Doubling Laws
- Moore's Law: individual computers double in processing power every 18 months
- Storage Law: disk storage capacity doubles every 12 months
- Gilder's Law: network bandwidth doubles every 9 months
- This exponential growth profoundly changes the landscape of information technology
  - (High-speed) access to networked information becomes the dominant feature of future computing
  - For large-scale images, secure remote access eventually becomes routine
5. CPU Speed Is Growing the Slowest
- Gilder's Law (32X in 4 years)
- Storage Law (16X in 4 years)
- Moore's Law (5X in 4 years)
Source: "The Triumph of Light," George Stix, Scientific American, January 2001
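The 4-year factors above follow directly from the doubling periods on the previous slide: growth over a span is 2 raised to (span / doubling period). A quick sketch in plain Python; note the slide's 32X and 5X figures round to whole numbers of doublings, so the computed values differ slightly:

```python
# Doubling periods (months) from the "technology doubling laws" slide.
DOUBLING_MONTHS = {
    "network (Gilder)": 9,   # bandwidth doubles every 9 months
    "storage": 12,           # disk capacity doubles every 12 months
    "cpu (Moore)": 18,       # processing power doubles every 18 months
}

def growth_factor(doubling_months: float, span_months: float = 48) -> float:
    """Multiplicative growth over `span_months` given a doubling period."""
    return 2 ** (span_months / doubling_months)

for name, period in DOUBLING_MONTHS.items():
    print(f"{name}: {growth_factor(period):.0f}x in 4 years")
```

Storage comes out at exactly 16x (four doublings in 48 months); network and CPU land near the slide's 32X and 5X depending on rounding.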
6. Gap in Raw Storage Transfer Rates
- 4 years ago
  - Capacity: 9 GB (SCSI), 20 GB (IDE)
  - 1 terabyte ≈ 100 disks
  - Transfer rate: 10-15 MB/sec
  - Fill time: ~10 minutes to fill a disk at peak transfer rate
- Today
  - Capacity: 146 GB (SCSI), 250 GB (IDE), SATA emerging (16X in 4 years, on track)
  - 1 terabyte ≈ 6 disks
  - Transfer rate: 40-60 MB/sec
  - Fill time: ~1 hour to fill a disk at peak transfer rate
- Extrapolate 4 years
  - 2 TB disk, 160 MB/sec, 3.5-hour fill time
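The fill times above are just capacity divided by peak transfer rate; a minimal check in Python, using the slide's figures (the "4 years ago" case takes the 9 GB SCSI disk at the upper 15 MB/s rate):

```python
# Disk fill time at peak transfer rate: capacity / rate.

def fill_time_hours(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Hours to fill a disk of `capacity_gb` at `rate_mb_per_s` (1 GB = 1000 MB)."""
    return capacity_gb * 1000 / rate_mb_per_s / 3600

print(f"4 years ago:  {fill_time_hours(9, 15) * 60:.0f} minutes")   # 9 GB SCSI
print(f"today:        {fill_time_hours(250, 50):.1f} hours")        # 250 GB IDE
print(f"extrapolated: {fill_time_hours(2000, 160):.1f} hours")      # 2 TB disk
```

Capacity grows 16x in 4 years while transfer rate grows only ~4x, so fill time roughly quadruples each generation; that widening ratio is the gap the slide is pointing at.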
7. Expectations
- Growth in storage capacity is rapidly changing common expectations
  - Keep every email I ever wrote/received (including viruses)
  - Carry my entire music collection (as MP3s)
  - Store all my digital photographs online
  - (Have all my medical records, including images, online and available to my doctor and hospital worldwide)
- Our storage use expectations are growing rapidly, but disk fill times are falling behind relative to capacity
- Managing large medical images illuminates this fundamental gap now
8. Biology Applications Push on All Three Axes
- Simulation, image comparison, data mining (computing)
- Data size and variety (storage)
  - 3D image data (large data)
  - Potentially petabytes at EM scale
  - Medical privacy implies encryption, which significantly adds to CPU requirements
  - Sensors of all kinds (lots of variety)
  - Data banks (e.g., PDB, GenBank, ...)
- Access to remote resources
  - Federation of (very large) data repositories (networking)
  - Seamless integration of sensor nets
9. The Biomedical Informatics Research Network: a Multi-Scale Brain Imaging Federated Repository
BIRN test-beds: Multiscale Mouse Models of Disease, Human Brain Morphometrics, and FIRST BIRN (a 10-site project for fMRIs of schizophrenics)
10. Some Critical Infrastructure Challenges
- BIRN scientists want to share data, databases, and resources
  - Resources are geographically distributed
  - Data throughput requirements are aggressive
  - Encryption/security is essential
  - The underlying infrastructure should hide complexity from the user whenever possible
- Concentrate on several key areas
  - Scalable, interoperable infrastructure
    - Resource size (terabytes today), resource count (10 sites)
  - Consistency of software deployment across domains
  - High-performance, secure data movement
  - Software/hardware specification and deployment (simplify replication of complex infrastructure)
11. Replication and Symmetry for Scalability
High-speed IP network
The Grid interface provides security, identity mapping, and resource access/abstraction
12. Standard Commodity Hardware for BIRN Sites
- Gigabit/10/100 network switch (Cisco 4006)
- Network statistics system
- Gigabit Ethernet network probe
- Network-attached storage (Gigabit Ethernet)
  - 1.0 to 4.0 TB
- Grid POP (Compaq/HP DL380G3)
  - SRB, Globus
  - Dual-processor Linux with 1 GB memory
[Rack diagram (NCRR): Cisco 4006 switch, GigE net probe, network stats, general compute, Grid POP, net-attached storage (1-4 TB), optional storage, APC UPS]
Leverage high-volume components for cost/scalability
13. Some BIRN Statistics
- 15 racks across 12 institutions
- We're in the "getting started" phase of distributed data
  - 300,000 data objects and associated metadata managed as a collection
  - 1 TB of raw data (8% of current BIRN capacity)
  - Function BIRN now processing human phantoms for calibration (? ? and size of data objects)
- Best-case achievable bandwidth coast-to-coast is 10 MB/sec (27 hours/terabyte, 2 min/GB)
- Expectations are outstripping software's ability to keep up
- But raw network capacity is growing, so where's the problem?
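The transfer figures above fall straight out of the 10 MB/sec best-case rate; a quick Python check (the slide's 27-hour figure rounds down slightly):

```python
# Wide-area transfer time at the observed best-case bandwidth.
BANDWIDTH_MB_S = 10  # best-case coast-to-coast rate from the slide

def transfer_time_s(size_gb: float, rate_mb_s: float = BANDWIDTH_MB_S) -> float:
    """Seconds to move `size_gb` at `rate_mb_s` (1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_s

print(f"1 TB: {transfer_time_s(1000) / 3600:.1f} hours")   # roughly a day per terabyte
print(f"1 GB: {transfer_time_s(1) / 60:.1f} minutes")      # a couple of minutes per GB
```

The point of the slide: even though the installed fiber can carry far more, the achievable end-to-end rate over TCP across domains is what determines how long a terabyte actually takes.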
14. Data-Intensive Scientific Applications Requiring Experimental Optical Networks
- Large data challenges in neuro and earth sciences
  - Each data object is 3D and gigabytes in size
  - Data are generated and stored in distributed archives
  - Research is carried out on a federated repository
- Requirements
  - Computing → PC clusters
  - Communications → dedicated lambdas over fiber
  - Data → large peer-to-peer lambda-attached storage
  - Visualization → collaborative volume algorithms
- Response
  - The OptIPuter research project
15. What Is OptIPuter?
- A large NSF ITR project funded at $13.5M from 2002 to 2007
- Fundamentally, it asks the question:
  - What happens to the structure of machines and programs when the network becomes essentially infinite?
- Enabled by improvements in photonic networking: 10 GigE, Dense Wave Division Multiplexing
- Coupled tightly with key applications (e.g., BIRN)
  - Keeps the IT research grounded and focused
- We are building (in phases) two high-capacity networks with associated modest-sized endpoints
16. The UCSD OptIPuter Deployment
UCSD is building out a high-speed packet-switched network.
[Campus map: a collocation point houses the Chiaro router and a production router with a link to CENIC; Phase I (Fall '02) and Phase II (2003) sites include SDSC, SDSC Annex, Preuss High School, JSOE (Engineering), CRCA (Arts), SOM (Medicine), 6th College (Undergrad), Chemistry, Phys. Sci.-Keck (Node M), and SIO (Earth Sciences)]
- Per-site links
  - 1 Gb/s and 4 Gb/s
  - 4 Gb/s and 10 Gb/s (2004)
  - 10 Gb/s and 40 Gb/s (2005)
  - 40 Gb/s and 160 Gb/s (2007)
Source: Phil Papadopoulos, SDSC; Greg Hidley, Cal-(IT)2
17. OptIPuter LambdaGrid Enabled by the Chiaro Networking Router
www.calit2.net/news/2002/11-18-chiaro.html
- Cluster-to-disk
- Disk-to-disk
- Viz-to-disk
- DB-to-cluster
- Cluster-to-cluster
Image source: Phil Papadopoulos, SDSC
18. The Center of the UCSD OptIPuter Network
http://132.239.26.190/view/view.shtml
- Unique optical routing core
- 10 Gigabit wire-speed routing
- Expandable to 5 Tb/s today
- We have the "baby" Chiaro at 640 Gbit/sec
19. OptIPuter Endpoints Are Modest
- Currently tiny clustered endpoints
- We are at the midpoint of procuring and installing the following on campus:
  - Four 32-node PC clusters for computation
  - One ten-node visualization cluster
    - Two 9-megapixel "Big Bertha" displays
  - One 48-node storage cluster
    - 20 TB, 300 disk spindles
- Better balance of endpoint and network
20. BIRN + OptIPuter
- BIRN pushes on distributed data (especially image data)
- OptIPuter pushes on network/storage/CPU interactions
- There is still the problem of how to build compositional, high-performance, authenticated, and secure software systems
- High-speed access to remote data still implies parallel endpoints
- Coordinating high-speed transfers is still a black art
21. Grids: Wide(r)-Area Computing and Storage
- Grids allow users to access data, computing, visualization, and instruments
- Grid Security Infrastructure (GSI) built in from the beginning
- Location transparency is key, especially when resources must be geographically distributed
- Software is rapidly moving from research grade to production grade
- The Grid service abstraction (OGSA) is the key software change
22. Workflows in a Grid Service-Oriented Environment
[Diagram: workflow spanning Osaka U., PACI resources, and ucsd.edu]
A common security, discovery, and instantiation framework of Grid services enables construction of complex workflows that cross domains. Special attention must be paid to data movement issues.
23. Simplified Grid Services
[Diagram: a client (requestor) passes through GSI to a service provider (http://ncmir.ucsd.edu), which starts a "backproject" instance]
- Grid services leverage the existing web services infrastructure
- The client formats a request (parameters + security)
- The provider starts an instance of the service for the client
- Results are returned over the net
Simple summary: Remote Procedure Call (RPC) for inter-domain execution
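The request/instantiate/return flow above can be sketched abstractly. This is a hypothetical stand-in, not the Globus/OGSA API: `Provider`, `Request`, the credential check, and the `backproject` lambda are all illustrative names, and the string credential merely stands in for GSI certificate verification.

```python
# Hypothetical sketch of the slide's grid-service RPC flow: the client
# formats a request (parameters + security), the provider authenticates it,
# starts a per-client service instance, and returns results over the net.

from dataclasses import dataclass, field

@dataclass
class Request:
    service: str     # service name, e.g. "backproject"
    params: dict     # call parameters
    credential: str  # stands in for a GSI proxy certificate

@dataclass
class Provider:
    trusted: set = field(default_factory=lambda: {"alice-cert"})
    services: dict = field(default_factory=dict)

    def handle(self, req: Request):
        # 1. Authenticate the caller (real GSI verifies a certificate chain).
        if req.credential not in self.trusted:
            raise PermissionError("authentication failed")
        # 2. Start an instance of the requested service for this client.
        instance = self.services[req.service]
        # 3. Run it and return the results to the requestor.
        return instance(**req.params)

provider = Provider(services={"backproject": lambda angle: f"volume@{angle}"})
result = provider.handle(Request("backproject", {"angle": 45}, "alice-cert"))
print(result)  # volume@45
```

The per-request service instance is the key difference from a plain web-service call: each client gets its own stateful instance, which is what OGSA layers on top of the web services stack.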
24. Redux
- Tech trends
  - Storage expectations are large, but be aware of the expanding fill times of rotating storage
  - Growth in wide-area networking will rapidly catch up to storage system speed
- Large images: challenges to the infrastructure
  - Raw storage performance
  - Interfacing storage to the network (cross-site issues)
  - Adding encryption of medical data adds complexity
  - Transfers will imply parallelism at various levels
- Software systems are moving to open Grid services
  - The start of a standard cross-domain remote procedure call (RPC)
25. The Critical Issue
- With such rapid changes, how do we build systems that meet the needs of application communities, are not standalone/one-off, and meet the challenges of:
  - Integrity
  - Security
  - Performance
  - Scalability
  - Reliability