1
Trends and Gaps in the Emerging Middleware and
the Impact on Handling Large Data
  • Philip Papadopoulos
  • BIRN-CC, OptIPuter
  • Program Director, Grids and Clusters, SDSC

2
Agenda
  • Hardware Technology Trends - performance gap in
    storage systems
  • BIRN as an example distributed data grid
  • OptIPuter: investigating how networking changes
    the fundamentals of software/hardware
    decomposition
  • Basic Grid Services as the emerging building
    block for distributed systems
  • Summary of Gaps - from the IT perspective

3
  • We're at a Crossing of Technology Exponentials
  • ... or a triple-point of phase change

4
Technology Doubling Laws
  • Moore's Law: individual computers double in
    processing power every 18 months
  • Storage Law: disk storage capacity doubles every
    12 months
  • Gilder's Law: network bandwidth doubles every 9
    months
  • This exponential growth profoundly changes the
    landscape of information technology (the implied
    4-year growth factors are sketched below)
  • (High-speed) access to networked information
    becomes the dominant feature of future computing
  • For large-scale images, secure remote access
    eventually becomes routine
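
A minimal sketch of the arithmetic behind these doubling laws, assuming the stated doubling periods and a 4-year (48-month) horizon; the rounded factors on the next slide (5X, 16X, 32X) reflect slightly different assumptions:

# Growth factor implied by a doubling law: factor = 2 ** (months / doubling_period)
def growth_factor(doubling_period_months: float, horizon_months: float = 48) -> float:
    """Multiplicative growth over the horizon for a given doubling period."""
    return 2 ** (horizon_months / doubling_period_months)

for name, period in [("Moore's Law (CPU, 18 mo)", 18),
                     ("Storage Law (capacity, 12 mo)", 12),
                     ("Gilder's Law (bandwidth, 9 mo)", 9)]:
    print(f"{name}: ~{growth_factor(period):.0f}x over 4 years")
# Moore's ~6x, Storage ~16x, Gilder's ~40x over 4 years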

5
(Chart) CPU speed is growing the slowest:
  • Gilder's Law: 32X in 4 years
  • Storage Law: 16X in 4 years
  • Moore's Law: 5X in 4 years
Source: "Triumph of Light," George Stix, Scientific
American, January 2001
6
Gap in Raw Storage Transfer Rates
  • 4 years ago
  • Capacity: 9 GB (SCSI), 20 GB (IDE)
  • 1 Terabyte ≈ 100 disks
  • Transfer rate: 10-15 MB/sec
  • Fill time: 10 minutes to fill a disk at peak
    transfer rate
  • Today
  • Capacity: 146 GB (SCSI), 250 GB (IDE), SATA
    emerging
  • 16X in 4 years, on track
  • 1 Terabyte ≈ 6 disks
  • Transfer rate: 40-60 MB/sec
  • Fill time: ~1 hour to fill a disk at peak
    transfer rate
  • Extrapolate 4 years
  • 2 TB disk, 160 MB/sec, 3.5-hour fill time (see
    the fill-time sketch below)
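
A quick sketch of the fill-time arithmetic above (fill time = capacity / peak transfer rate), using the capacities and rates quoted on this slide and taking 1 GB as 1000 MB:

def fill_time_hours(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Hours to write an entire disk at its peak sequential transfer rate."""
    return capacity_gb * 1000 / rate_mb_per_s / 3600

for label, cap_gb, rate in [("4 years ago: 9 GB @ 15 MB/s", 9, 15),
                            ("Today: 250 GB @ 60 MB/s", 250, 60),
                            ("Extrapolated: 2 TB @ 160 MB/s", 2000, 160)]:
    print(f"{label} -> {fill_time_hours(cap_gb, rate):.1f} h")
# ~0.2 h (10 minutes), ~1.2 h, and ~3.5 h respectively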

7
Expectations
  • Growth in storage capacity is rapidly changing
    common expectations
  • Keep every email I ever wrote/received (including
    viruses)
  • Carry my entire music collection (as MP3s)
  • Store all my digital photographs online
  • (Have all my medical records including images
    online and available to my doctor, hospital
    worldwide)
  • Our storage expectations are growing rapidly,
    but relative to capacity, disks are getting
    slower to fill
  • Managing large medical images illuminates this
    fundamental gap now

8
Biology applications push on all three axes
  • Simulation, image comparison, data mining
    (computing)
  • Data size and variety (Storage)
  • 3D image data (large data)
  • Potentially petabytes at EM scale
  • Medical privacy implies encryption, which
    significantly adds to CPU requirements
  • Sensors of all kinds (lots of variety)
  • Data banks (e.g., PDB, GenBank, ...)
  • Access to remote resources
  • Federation of (very large) data repositories
    (Networking)
  • Seamless integration of sensor nets

9
The Biomedical Informatics Research Network: a
Multi-Scale Brain Imaging Federated Repository
BIRN Test-beds: Multiscale Mouse Models of
Disease, Human Brain Morphometrics, and FIRST
BIRN (a 10-site project for fMRI studies of
schizophrenia)
10
Some Critical Infrastructure Challenges
  • BIRN scientists want to share data, databases,
    and resources
  • Resources are geographically distributed
  • Data throughput requirements are aggressive
  • Encryption/Security is essential
  • Underlying infrastructure should hide complexity
    from the user whenever possible
  • Concentrate on several key areas
  • Scalable, interoperable infrastructure
  • Resource size (terabytes today), resource count
    (10 sites)
  • Consistency of software deployment across domains
  • High-performance, secure data movement
  • Software/hardware specification/deployment
    (simplify replication of complex infrastructure)

11
Replication and Symmetry for Scalability
(Diagram: replicated site racks connected over a
high-speed IP network; a Grid interface provides
security, identity mapping, and resource
access/abstraction)
12
Standard Commodity Hardware for BIRN Sites
  • Gigabit/10/100 Network Switch (Cisco 4006)
  • Network Statistics System
  • Gigabit Ethernet Network Probe
  • Network-Attached Storage (Gigabit Ethernet),
    1.0 to 4.0 TB
  • Grid POP (Compaq/HP DL380 G3)
  • SRB, Globus
  • Dual-Processor Linux w/ 1 GB memory
(Rack diagram: NCRR; Cisco 4006 switch; GigE net
probe; network stats; general compute; Grid POP;
net-attached storage, 1-4 TB; optional storage;
APC UPS)
Leverage high-volume components for
cost/scalability
13
Some BIRN Statistics
  • 15 racks across 12 institutions
  • We're in the "getting started" phase of
    distributed data
  • 300,000 data objects and associated metadata
    managed as a collection
  • 1 TB of raw data (8% of current BIRN capacity)
  • Function BIRN now processing human phantoms for
    calibration (growing number and size of data
    objects)
  • Best-case achievable bandwidth coast-to-coast is
    10 MB/sec (27 hours/terabyte, ~2 minutes/GB; see
    the sketch below)
  • Expectations are outstripping the software's
    ability to keep up
  • But the raw network capacity is growing, so
    where's the problem?
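
The transfer-time figures in the bandwidth bullet follow directly from the 10 MB/sec rate; a minimal sketch, using decimal units (1 TB = 10**6 MB):

RATE_MB_PER_S = 10  # observed best-case coast-to-coast rate

def transfer_seconds(size_mb: float, rate: float = RATE_MB_PER_S) -> float:
    """Seconds to move size_mb megabytes at a sustained rate in MB/s."""
    return size_mb / rate

print(f"1 GB: {transfer_seconds(1_000) / 60:.1f} minutes")      # ~1.7 minutes
print(f"1 TB: {transfer_seconds(1_000_000) / 3600:.1f} hours")  # ~27.8 hours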

14
Data Intensive Scientific Applications Requiring
Experimental Optical Networks
  • Large Data Challenges in Neuro and Earth Sciences
  • Each Data Object is 3D and Gigabytes in Size
  • Data are Generated and Stored in Distributed
    Archives
  • Research is Carried Out on a Federated Repository
  • Requirements
  • Computing Requirements → PC Clusters
  • Communications → Dedicated Lambdas Over Fiber
  • Data → Large Peer-to-Peer Lambda-Attached Storage
  • Visualization → Collaborative Volume Algorithms
  • Response
  • OptIPuter Research Project

15
What is OptIPuter?
  • It is a large NSF ITR project funded at $13.5M
    from 2002 to 2007
  • Fundamentally, it asks the question:
  • What happens to the structure of machines and
    programs when the network becomes essentially
    infinite?
  • Enabled by improvements in photonic networking:
    10 GigE, Dense Wave Division Multiplexing
  • Coupled tightly with key applications (e.g. BIRN)
  • Keeps the IT research grounded and focused
  • We are building (in phases) two high-capacity
    networks with associated modest-sized endpoints

16
The UCSD OptIPuter Deployment
UCSD is building out a high-speed packet-switched
network
(Campus map: a Chiaro router at the Node M
collocation point links SDSC, the SDSC Annex,
Preuss High School, JSOE Engineering, CRCA Arts,
SOM Medicine, 6th College / Undergrad College,
Chemistry, Phys. Sci. - Keck, SIO Earth Sciences,
and a production router to CENIC; Phase I sites
Fall 2002, Phase II sites 2003)
  • Per-site links
  • 1 Gbit/s and 4 Gbit/s
  • 4 Gbit/s and 10 Gbit/s (2004)
  • 10 Gbit/s and 40 Gbit/s (2005)
  • 40 Gbit/s and 160 Gbit/s (2007)

Source: Phil Papadopoulos, SDSC; Greg Hidley,
Cal-(IT)2
17
OptIPuter LambdaGrid Enabled by the Chiaro
Networking Router
www.calit2.net/news/2002/11-18-chiaro.html
(Diagram of traffic patterns across the LambdaGrid:)
  • Cluster - Disk
  • Disk - Disk
  • Viz - Disk
  • DB - Cluster
  • Cluster - Cluster

Image Source: Phil Papadopoulos, SDSC
18
The Center of the UCSD OptIPuter Network
http://132.239.26.190/view/view.shtml
  • Unique optical routing core
  • 10 Gigabit wire-speed routing
  • Expandable to 5 Tb/s today
  • We have the "baby" Chiaro at 640 Gbit/sec

19
OptIPuter Endpoints are Modest
  • Currently tiny-sized clustered endpoints
  • We are at the midpoint of procuring and
    installing the following on campus:
  • Four 32-node PC clusters for computation
  • One ten-node visualization cluster
  • Two 9-Mpixel "Big Bertha" displays
  • One 48-node storage cluster
  • 20 TB, 300 disk spindles
  • Better balance of endpoint and network

20
BIRN + OptIPuter
  • BIRN pushing on distributed data (especially
    image data)
  • OptIPuter pushing on network/storage/cpu
    interactions
  • There still is the problem of how to build
    compositional, high-performance, authenticated,
    and secure software systems
  • High-speed access to remote data still implies
    parallel endpoints
  • Coordinating high-speed transfers is still a
    black art

21
Grids: Wide(r)-Area Computing and Storage
  • Grids allow users to access data, computing,
    visualization, instruments
  • Grid Security (GSI) built-in from the beginning
  • Location transparency is key, especially when
    resources must be geographically distributed
  • Software is rapidly moving from research grade to
    production grade
  • The Grid service abstraction (OGSA) is the key
    software change

22
Workflows in a Grid Service-Oriented Environment
(Diagram: a workflow spanning Osaka U., PACI
resources, and ucsd.edu)
A common security, discovery, and instantiation
framework of Grid services enables construction
of complex workflows that cross domains
Need to pay special attention to data movement
issues
23
Simplified Grid Services
(Diagram: a client/requestor contacts a service
provider at http://ncmir.ucsd.edu through GSI,
which runs a Backproject instance on its behalf)
Grid services leverage existing web services
infrastructure
  1. Client formats the request (parameters +
     security)
  2. Provider starts an instance of the service for
     the client
  3. Results are returned over the net

Simple summary: Remote Procedure Call (RPC) for
inter-domain execution (a sketch follows below)
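
A minimal sketch of this RPC-style call pattern, using plain HTTP as a stand-in for the actual OGSA/GSI machinery; the endpoint URL, the "backproject" service name, and the signing helper are hypothetical placeholders, not the real BIRN or Globus interfaces:

import json
import urllib.request

def sign_request(payload: bytes, credential: str) -> str:
    """Placeholder for a GSI-style proxy-credential signature."""
    return f"signed({credential},{len(payload)})"

def call_grid_service(url: str, params: dict, credential: str) -> dict:
    # 1. Client formats the request (parameters + security token).
    body = json.dumps(params).encode()
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json",
                 "X-Signature": sign_request(body, credential)},
    )
    # 2. Provider authenticates the caller and starts a service instance;
    # 3. results come back over the network like any RPC response.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical usage:
# result = call_grid_service("https://ncmir.ucsd.edu/services/backproject",
#                            {"dataset": "tomogram-42"}, credential="proxy-cert")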
24
Redux
  • Tech trends
  • Large storage expectations, but we need to be
    aware of the expanding fill times of rotating
    storage
  • Growth in wide-area networking will rapidly catch
    up to storage system speed
  • Large images: challenges to the infrastructure
  • Raw storage performance
  • Interfacing storage to the network (cross-site
    issues)
  • Adding encryption of medical data adds complexity
  • Transfers will imply parallelism at various
    levels
  • Software systems are moving to open grid services
  • The starting point: a standard cross-domain
    remote procedure call (RPC)

25
The Critical Issue
  • With such rapid changes, how do we build systems
    that meet the needs of application communities,
    are not standalone/one-off, and meet the
    challenges of:
  • Integrity
  • Security
  • Performance
  • Scalability
  • Reliability