Title: High Performance Presentation: 5 slides/Minute? (65 slides / 15 minutes) IO and DB
1High Performance Presentation: 5 slides/Minute? (65 slides / 15 minutes) IO and DB stuff for LSST
- A new world record?
- Jim Gray
- Microsoft Research
2TerraServer Lessons Learned
- Hardware is 5 9s (with clustering)
- Software is 5 9s (with clustering)
- Admin is 4 9s (offline maintenance)
- Network is 3 9s (mistakes, environment)
- Simple designs are best
- 10 TB DB is the management limit: 1 PB = 100 x 10 TB DBs. This is 100x better than 5 years ago. (Yahoo! and HotMail are 300 TB, Google is 2 PB)
- Minimize use of tape
- Backup to disk (snapshots)
- Portable disk TBs
3Serving BIG images
- Break into tiles (compressed)
- 10KB for modems
- 1MB for LANs
- Mosaic the tiles for pan, crop
- Store image pyramid for zoom
- 2x zoom only adds 33% overhead (1 + 1/4 + 1/16 + ... = 4/3)
- Use a spatial index to cluster find objects
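The 33% figure is the sum of a geometric series: each 2x zoom-out level holds one quarter of the pixels of the level below it. A minimal Python sketch of that arithmetic follows; the function name is illustrative and not taken from TerraServer.

```python
def pyramid_overhead(levels: int) -> float:
    """Extra storage, as a fraction of the base image, for `levels`
    coarser copies where each copy halves both width and height."""
    return sum(0.25 ** k for k in range(1, levels + 1))

# The overhead approaches 1/3 as more zoom levels are stored.
for levels in (1, 2, 5, 10):
    print(levels, f"{pyramid_overhead(levels):.4f}")
```

However many zoom levels are kept, the pyramid never costs more than one third of the base image.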
4Economics
- People are more than 50% of costs
- Disks are more than 50% of capital
- Networking is the other 50%
- People
- Phone bill
- Routers
- Cpus are free (they come with the disks)
5SkyServer/ SkyQuery Lessons
- DB is easy
- Search
- It is BEST to index
- You can put objects and attributes in a row (SQL puts big blobs off-page)
- If you can't index, you can extract attributes and quickly compare
- SQL can scan at 5M records/cpu/second
- Sequential scans are embarrassingly parallel
- Web services are easy
- XML Data Sets
- a universal way to represent answers
- minimize round trips: 1 request/response
- Diffgrams allow disconnected update
6How Will We Find Stuff? Put everything in the DB (and index it)
- Need DBMS features: consistency, indexing, pivoting, queries, speed/scalability, backup, replication. If you don't use one, you're creating one!
- Simple logical structure
- Blob and link is all that is inherent
- Additional properties (facets = extra tables) and methods on those tables (encapsulation)
- More than a file system
- Unifies data and meta-data
- Simpler to manage
- Easier to subset and reorganize
- Set-oriented access
- Allows online updates
- Automatic indexing, replication
SQL
7How Do We Represent Data To The Outside World?
<?xml version="1.0" encoding="utf-8" ?>
<DataSet xmlns="http://WWT.sdss.org/">
  <xs:schema id="radec" xmlns="" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
    <xs:element name="radec" msdata:IsDataSet="true">
      <xs:element name="Table">
        <xs:element name="ra" type="xs:double" minOccurs="0" />
        <xs:element name="dec" type="xs:double" minOccurs="0" />
  ...
  <diffgr:diffgram xmlns:msdata="urn:schemas-microsoft-com:xml-msdata" xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
    <radec xmlns="">
      <Table diffgr:id="Table1" msdata:rowOrder="0">
        <ra>184.028935351008</ra>
        <dec>-1.12590950121524</dec>
      </Table>
      ...
      <Table diffgr:id="Table10" msdata:rowOrder="9">
        <ra>184.025719033547</ra>
        <dec>-1.21795827920186</dec>
      </Table>
    </radec>
  </diffgr:diffgram>
</DataSet>
- File metaphor too primitive: just a blob
- Table metaphor too primitive: just records
- Need metadata describing data context
- Format
- Provenance (author/publisher/citations/...)
- Rights
- History
- Related documents
- In a standard format
- XML and XML schema
- DataSet is great example of this
- World is now defining standard schemas
schema
Data or diffgram
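A consumer does not need special tooling to read such an answer. The sketch below, using only Python's standard library, pulls the (ra, dec) rows out of a trimmed copy of the diffgram shown on this slide. The schema section is left out, and the helper name `read_rows` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's payload (schema section omitted).
DOC = """<DataSet xmlns="http://WWT.sdss.org/">
  <diffgr:diffgram
      xmlns:msdata="urn:schemas-microsoft-com:xml-msdata"
      xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
    <radec xmlns="">
      <Table diffgr:id="Table1" msdata:rowOrder="0">
        <ra>184.028935351008</ra>
        <dec>-1.12590950121524</dec>
      </Table>
      <Table diffgr:id="Table10" msdata:rowOrder="9">
        <ra>184.025719033547</ra>
        <dec>-1.21795827920186</dec>
      </Table>
    </radec>
  </diffgr:diffgram>
</DataSet>"""

MSDATA = "{urn:schemas-microsoft-com:xml-msdata}"

def read_rows(xml_text):
    """Return (rowOrder, ra, dec) tuples from a DataSet diffgram."""
    root = ET.fromstring(xml_text)
    rows = []
    # <radec xmlns=""> resets the default namespace, so Table/ra/dec are unqualified.
    for table in root.iter("Table"):
        rows.append((
            int(table.get(MSDATA + "rowOrder")),
            float(table.findtext("ra")),
            float(table.findtext("dec")),
        ))
    return rows

rows = read_rows(DOC)
for row in rows:
    print(row)
```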
8Emerging Concepts
- Standardizing distributed data
- Web Services, supported on all platforms
- Custom configure remote data dynamically
- XML: eXtensible Markup Language
- SOAP: Simple Object Access Protocol
- WSDL: Web Services Description Language
- DataSets: standard representation of an answer
- Standardizing distributed computing
- Grid Services
- Custom configure remote computing dynamically
- Build your own remote computer, and discard
- Virtual Data: new data sets on demand
9Szalay's Law: The utility of N comparable datasets is N^2
- Metcalfe's law applies to telephones, fax, Internet.
- Szalay argues as follows: Each new dataset gives new information. 2-way combinations give new information.
- Example: Combine these 3 datasets
- (ID, zip code)
- (ID, birth day)
- (ID, height)
- Other example: quark star. Chandra X-ray, Hubble optical, 600 year old records. Drake, J. J. et al. Is RX J185635-375 a Quark Star? Preprint (2002).
10 Science is hitting a wall: FTP and GREP are not adequate
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years.
- Oh!, and 1 PB ~ 10,000 disks
- At some point you need indices to limit search, and parallel data search and analysis tools
- This is where databases can help
- You can FTP 1 MB in 1 sec
- You can FTP 1 GB / min (~ $1/GB)
- ... 2 days and $1K
- ... 3 years and $1M
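The GREP numbers above imply a nearly flat scan rate, so elapsed time grows linearly with data size unless the scan is spread over many disks. A back-of-envelope Python sketch follows. The 10 MB/s per-disk rate is an assumed round number, not a figure from the slides, and the function names are illustrative.

```python
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15
DAY = 86_400
YEAR = 365 * DAY

def implied_rate(nbytes, seconds):
    """Throughput (bytes/s) implied by scanning nbytes in the given time."""
    return nbytes / seconds

def scan_seconds(nbytes, bytes_per_sec, workers=1):
    """Time to scan nbytes when `workers` devices each deliver bytes_per_sec."""
    return nbytes / (bytes_per_sec * workers)

# Rates implied by the slide's GREP figures.
slide = {"1 MB": (MB, 1), "1 GB": (GB, 60), "1 TB": (TB, 2 * DAY), "1 PB": (PB, 3 * YEAR)}
for label, (nbytes, seconds) in slide.items():
    print(f"{label}: {implied_rate(nbytes, seconds) / MB:.1f} MB/s")

# Assumption: each of the 10,000 disks scans its own slice at 10 MB/s.
hours = scan_seconds(PB, 10 * MB, workers=10_000) / 3600
print(f"1 PB over 10,000 disks: {hours:.1f} hours")
```

Under that assumption the parallel scan finishes in hours rather than years, which is the case for indices and parallel search made on this slide.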
11Networking: Great hardware & Software
- WANs @ 5 GBps (1 lambda = 40 Gbps)
- Gbps Ethernet common (~100 MBps)
- Offload gives 2 hz/Byte
- Will improve with RDMA zero-copy
- 10 Gbps mainstream by 2004
- Faster I/O
- 1 GB/s today (measured)
- 10 GB/s under development
- SATA (serial ATA) 150MBps/device
12Bandwidth: 3x bandwidth/year for 25 more years
- Today
- 40 Gbps per channel (?)
- 12 channels per fiber (WDM): 500 Gbps
- 32 fibers/bundle: 16 Tbps/bundle
- In lab: 3 Tbps/fiber (400 x WDM)
- In theory: 25 Tbps per fiber
- 1 Tbps = USA 1996 WAN bisection bandwidth
- Aggregate bandwidth doubles every 8 months!
13Hero/Guru Networking
Redmond/Seattle, WA
Information Sciences Institute, Microsoft, Qwest, University of Washington, Pacific Northwest Gigapop, HSCC (high speed connectivity consortium), DARPA
New York
Arlington, VA
San Francisco, CA
5626 km 10 hops
14Real Networking
- Bandwidth for 1 Gbps stunt cost $400k/month
- ~$200/Mbps/month (at each end, plus hardware and admin)
- Price not improving very fast
- Doesn't include operations / local hardware costs
- Admin costs more: ~$1/GB to $10/GB
- Challenge: Go home and FTP from a fast server
- The Guru Gap: FermiLab <-> JHU
- Both well connected
- vBNS, NGI, Internet2, Abilene, ...
- Actual desktop-to-desktop ~100 KBps
- 12 days/TB (but it crashes first).
- The reality: to move 10 GB, mail it! (TeraScale Sneakernet)
15How Do You Move A Terabyte?
Source: TeraScale Sneakernet, Microsoft Research, Jim Gray et al.
16There Is A Problem
Niklaus Wirth: Algorithms + Data Structures = Programs
- GREAT!!!!
- XML documents are portable objects
- XML documents are complex objects
- WSDL defines the methods on objects (the class)
- But will all the implementations match?
- Think of UNIX or SQL or C or ...
- This is a work in progress.
17Changes To DBMSs
- Integration of Programs and Data
- Put programs inside the database (allows OODB)
- Gives you parallel execution
- Integration of Relational, Text, XML, Time
- Scaleout (even more)
- AutoAdmin (no knobs)
- Manage Petascale databases (utilities, geoplex,
online, incremental)
18Publishing Data
Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project web site
Curators     Libraries     Data+Doc Archives
Archives     Archives      Digital Archives
Consumers    Scientists    Scientists
19The Core Problem: No Economic Model
- The archive user has not yet been born. How can he pay you to curate the data?
- The scientist gathered data for his own purpose. Why should he pay (invest time) for your needs?
- Answer to both: that's the scientific method
- Curating data (documenting the design, the acquisition and the processing) is very hard and there is no reward for doing it. The results are rewarded, not the process of getting them.
- Storage/archive NOT the problem (it's almost free)
- Curating/Publishing is expensive.
20SDSS Data Inflation: Data Pyramid
- Level 2: Derived data products ~10x smaller. But there are many catalogs.
- Publish new edition each year
- Fixes bugs in data.
- Must preserve old editions
- Creates data pyramid
- Store each edition
- 1, 2, 3, 4 ... N ~ N^2 bytes
- Net Data Inflation: L2 >= L1
- Level 1A: Grows 5 TB pixels/year, growing to 25 TB; ~2 TB/y compressed, growing to 13 TB; ~4 TB today (level 1A in NASA terms)
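The pyramid arithmetic can be checked directly. If edition k is k units in size and every edition is kept, the archive holds 1 + 2 + ... + N = N(N+1)/2 units, which grows as N^2. A small Python sketch, where the unit size and function name are illustrative assumptions:

```python
def archive_size(editions: int, growth_per_year: float = 1.0) -> float:
    """Total storage when edition k is k * growth_per_year in size
    and all earlier editions are preserved."""
    return sum(growth_per_year * k for k in range(1, editions + 1))

for n in (1, 2, 4, 10):
    # The closed form n(n+1)/2 matches the running sum.
    print(n, archive_size(n), n * (n + 1) / 2)
```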
21What's needed? (not drawn to scale)
22CS Challenges For Astronomers
- Objectify your field
- Precisely define what you are talking about.
- Objects and Methods / Attributes
- This is REALLY difficult.
- UCDs are a great start but there is a long way to go
- "Software is like entropy, it always increases." -- Norman Augustine, Augustine's Laws
- Beware of legacy software: cost can eat you alive
- Share software where possible.
- Use standard software where possible.
- Expect it will cost you 25% to 40% of project.
- Explain what you want to do with the VO
- 20 queries or something like that.
23Challenge to Data Miners: Linear and Sub-Linear Algorithms
Techniques
- Today most correlation / clustering algorithms are polynomial: N^2 or N^3 or ...
- N^2 is VERY big when N is big (10^18 is big)
- Need sub-linear algorithms
- Current approaches are near optimal given current assumptions.
- So, need new assumptions: probably heuristic and approximate
24Challenge to Data Miners: Rediscover Astronomy
- Astronomy needs deep understanding of physics.
- But, some was discovered as variable correlations then explained with physics.
- Famous example: Hertzsprung-Russell Diagram, star luminosity vs color (temperature)
- Challenge 1 (the student test): How much of astronomy can data mining discover?
- Challenge 2 (the Turing test): Can data mining discover NEW correlations?
25Plumbers: Organize and Search Petabytes
- Automate
- instrument-to-archive pipelines. It is a messy business, very labor intensive. Most current designs do not scale (too many manual steps). BaBar (1 TB/day) and ESO pipeline seem promising. A job-scheduling or workflow system.
- Physical Database design & access
- Data access patterns are difficult to anticipate
- Aggressively and automatically use indexing, sub-setting.
- Search in parallel
- Goals
- Answer easy queries in 10 seconds.
- Answer hard queries (correlations) in 10 minutes.
26Scaleable Systems
- Scale UP: grow by adding components to a single system.
- Scale Out: grow by adding more systems.
Scale OUT
27What's New: Scale Up
- 64 bit & TB size main memory
- SMP on chip: everything's SMP
- 32 ... 256 SMP: locality/affinity matters
- TB size disks
- High-speed LANs
28Who needs 64-bit addressing? You! Need 64-bit addressing!
- "640K ought to be enough for anybody." Bill Gates, 1981
- But that was 21 years ago: 2 x 21/3 = 14 bits ago.
- 20 bits + 14 bits = 34 bits, so.. "16 GB ought to be enough for anybody" Jim Gray, 2002
- 34 bits > 31 bits, so 34 bits == 64 bits
- YOU need 64 bit addressing!
2964 bit: Why bother?
- 1966 Moore's law: 4x more RAM every 3 years. 1 bit of addressing every 18 months
- 36 years later: 2 x 36/3 = 24 more bits. Not exactly right, but 32 bits not enough for servers; 32 bits gives no headroom for clients
- So, time is running out (has run out)
- Good news: Itanium and Hammer are maturing. And so is the base software (OS, drivers, DB, Web, ...). Windows & SQL @ 256 GB today!
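The address-bit arithmetic on the last two slides follows from one rule: 4x more RAM every 3 years is 2 extra address bits per 3 years, or 1 bit every 18 months. A short Python check, with an illustrative function name:

```python
def address_bits_gained(years: float, months_per_bit: float = 18) -> float:
    """Address bits consumed by RAM growth at one bit per `months_per_bit` months."""
    return years * 12 / months_per_bit

print(address_bits_gained(21))   # 1981 -> 2002: 14.0 bits
print(address_bits_gained(36))   # 1966 -> 2002: 24.0 bits

# 20 bits (the 1 MB address space of the 640K era) plus 14 bits is 34 bits.
print(2 ** (20 + 14) // 2 ** 30, "GB")   # 16 GB
```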
3064 bit: why bother?
- Memory intensive calculations
- You can trade memory for IO and processing
- Example: Data Analysis & Clustering at JHU
- in memory: CPU time is ~N log N, N ~ 100M
- Disk: M chunks -> time ~ M^2
- must run many times
- Now running on HP Itanium, Windows .NET Server 2003, SQL Server
Graph courtesy of Alex Szalay & Adrian Pope of Johns Hopkins University
31Amdahl's Balanced System Laws
- 1 mips needs 4 MB ram and needs 20 IO/s
- At 1 billion instructions per second: need 4 GB/cpu, need 50 disks/cpu!
- 64 cpus ... 3,000 disks
1 bips cpu
4 GB RAM
50 disks 10,000 IOps 7.5 TB
32The 5 Minute Rule: Trade RAM for Disk Arms
- If data is re-referenced every 5 minutes, it is cheaper to cache it in ram than to get it from disk. A disk access/second costs ~$50, or ~50 MB for 1 second, or ~50 KB for 1,000 seconds.
- Each app has a memory knee. Up to the knee, more memory helps a lot.
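The break-even interval behind this rule is usually written as (pages per MB of RAM / accesses per second per disk) x (price per disk drive / price per MB of RAM), as in Gray and Graefe's 1997 revisit of the rule. The Python sketch below evaluates it. The input prices and rates are illustrative assumptions, not figures from this slide.

```python
def break_even_seconds(pages_per_mb_ram: float,
                       accesses_per_sec_per_disk: float,
                       price_per_disk: float,
                       price_per_mb_ram: float) -> float:
    """Re-reference interval at which caching a page in RAM costs the
    same as re-reading it from disk each time."""
    return (pages_per_mb_ram / accesses_per_sec_per_disk) * (
        price_per_disk / price_per_mb_ram)

# Assumed inputs: 8 KB pages (128 per MB), 64 accesses/s per disk,
# a $2,000 disk drive, and RAM at $15 per MB.
interval = break_even_seconds(128, 64, 2000, 15)
print(f"{interval:.0f} seconds, about {interval / 60:.1f} minutes")
```

Pages re-referenced more often than this interval are cheaper to keep in RAM; colder pages are cheaper to fetch from disk.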
3364 bit: Reduces IO, saves disks
- Large memory reduces IO
- 64-bit simplifies code
- Processors can be faster (wider word)
- Ram is cheap (4 GB ~ $1k to $20k)
- Can trade ram for disk IO
- Better response time.
- Example
- tpcC
- 4x1Ghz Itanium2 vs
- 4x1.6Ghz IA32
- 40 extra GB -> 60% extra throughput
4x1.6Ghz IA32 8GB
4x1 Ghz IA64 48GB
4x1.6Ghz IA32 32GB
34AMD Hammer Coming Soon
- AMD Hammer is 64bit capable
- 2003 millions of Hammer CPUs will ship
- 2004 most AMD CPUs will be 64bit
- 4 GB ram is less than $1,000 today, less than $500 in 2004
- Desktops (Hammer) and servers (Opteron).
- You do the math: who will demand 64bit capable software?
35A 1TB Main Memory
- Amdahl's law: 1 mips/MB, now 1:5, so ~20 x 10 GHz cpus need 1 TB ram
- 1 TB ram ~ $250k ... $2m today, ~ $25k ... $200k in 5 years
in 5 years - 128 million pages
- Takes a LONG time to fill
- Takes a LONG time to refill
- Needs new algorithms
- Needs parallel processing
- Which leads us to
- The memory hierarchy
- smp
- numa
36Hyper-Threading: SMP on chip
- If cpu is always waiting for memory: predict memory requests and prefetch
- done
- If cpu still always waiting for memory: multi-program it (multiple hardware threads per cpu)
- Hyper Threading: Everything is SMP
- 2 now, more later
- Also multiple cpus/chip
- If your program is single threaded
- You waste ½ the cpu and memory bandwidth
- Eventually waste 80%
- App builders need to plan for threads.
37The Memory Hierarchy
- Locality REALLY matters
- CPU 2 GHz, RAM at 5 MHz: RAM is no longer random access.
- Organizing the code gives 3x (or more)
- Organizing the data gives 3x (or more)
Level       Latency (clocks)   Size
Registers   1                  1 KB
L1          2                  32 KB
L2          10                 256 KB
L3          30                 4 MB
Near RAM    100                16 GB
Far RAM     300                64 GB
39Scaleup Systems: Non-Uniform Memory Architecture (NUMA). Coherent but remote memory is even slower
- All cells see a common memory
- Slow local main memory
- Slower remote main memory
- Scaleup by adding cells
- Planning for 64 cpu, 1 TB ram
- Interconnect, Service Processor, Partition management are vendor specific
- Several vendors doing this: Itanium and Hammer
Diagram labels: Partition manager, Service Processor, Config DB, System interconnect (Crossbar/Switch)
40Changed Ratios Matter
- If everything changes by 2x, then nothing changes.
- So, it is the different rates that matter.
Slowly changing: Speed of light, People costs, Memory bandwidth, WAN prices
Improving FAST: CPU speed, Memory & disk size, Network bandwidth
41Disks are becoming tapes
- Capacity
- 150 GB now, 300 GB this year, 1 TB by 2007
- Bandwidth
- 40 MBps now, 150 MBps by 2007
- Read time
- 2 hours sequential, 2 days random now; 4 hours sequential, 12 days random by 2007
150 GB
150 IO/s 40 MBps
1 TB
200 IO/s 150 MBps
42Disks are becoming tapes: Consequences
- Use most disk capacity for archiving: Copy on Write (COW) file system in Windows and other OSs.
- RAID10 saves arms, costs space (OK!).
- Backup to disk: pretend it is a 100 GB disk + 1 TB disk
- Keep hot 10% of data on fastest part of disk.
- Keep cold 90% on colder part of disk
- Organize computations to read/write disks
sequentially in large blocks.
43Wiring is going serial and getting FAST!
- Gbps Ethernet and SATA built into chips
- Raid Controllers inexpensive and fast.
- 1U storage bricks @ 2-10 TB
- SAN or NAS (iSCSI or CIFS/DAFS)
44NAS - SAN Horse Race
- Storage Hardware ~$1k/TB/y; Storage Management $10k...$300k/TB/y
- So as with Server Consolidation: Storage Consolidation
- Two styles: NAS (Network Attached Storage) = File Server; SAN (System Area Network) = Disk Server
- I believe NAS is more manageable.
45SAN/NAS Evolution
Monolithic
Modular
Sealed
46IO Throughput: K Accesses Per Second vs. RPM
[Chart: Kaps vs. RPM]
47Comparison Of Disk Costs for similar
performance
Seagate Disk Prices
Source: Seagate online store, quantity one prices
48Comparison Of Disk Costs ($/MB) for different systems
Source: Dell
49Why Serial ATA Matters
- Modern interconnect
- Point-to-point drive connection
- 150 MBps -> 300 MBps
- Facilitates ATA disk arrays
- Enables inexpensive, cool storage
50Performance (on Y2k SDSS data)
- Run times on $15k HP Server (2 cpu, 1 GB, 8 disk)
- Some take 10 minutes
- Some take 1 minute
- Median 22 sec.
- Ghz processors are fast!
- (10 mips/IO, 200 ins/byte)
- 2.5 m rec/s/cpu
1,000 IO/cpu sec 64 MB IO/cpu sec
51NVO How Will It Work?
- Define commonly used atomic services
- Build higher level toolboxes/portals on top
- We do not build everything for everybody
- Use the 90-10 rule
- Define the standards and interfaces
- Build the framework
- Build the 10% of services that are used by 90%
- Let the users build the rest from the components
52Data Federations of Web Services
- Massive datasets live near their owners
- Near the instruments software pipeline
- Near the applications
- Near data knowledge and curation
- Super Computer centers become Super Data Centers
- Each Archive publishes a web service
- Schema documents the data
- Methods on objects (queries)
- Scientists get personalized extracts
- Uniform access to multiple Archives
- A common global schema
Federation
53Grid and Web Services Synergy
- I believe the Grid will be many web services that share data (computrons are free)
- IETF standards provide
- Naming
- Authorization / Security / Privacy
- Distributed Objects
- Discovery, Definition, Invocation, Object Model
- Higher level services: workflow, transactions, DB, ..
- Synergy: commercial Internet & Grid tools
54Web Services The Key?
- Web SERVER
- Given a URL + parameters
- Returns a web page (often dynamic)
- Web SERVICE
- Given an XML document (SOAP msg)
- Returns an XML document
- Tools make this look like an RPC.
- F(x,y,z) returns (u, v, w)
- Distributed objects for the web.
- naming, discovery, security,..
- Internet-scale distributed computing
Your program
Web Server
http
Web page
Your program
Web Service
soap
Data In your address space
object in xml
55Grid?
- Harvesting spare cpu cycles is not important
- They are free ($1/cpu day)
- They need applications and data (which are not free) ($1/GB shipped)
- Accessing distributed data IS important
- Send the programs to the data
- Send the questions to the databases.
- Super Computer Centers become Super Data Centers & Super Application Centers
56The Grid: Foster & Kesselman (Argonne National Laboratory)
Internet computing and GRID technologies promise to change the way we tackle complex problems. They will enable large-scale aggregation and sharing of computational, data and other resources across institutional boundaries. Transform scientific disciplines ranging from high energy physics to the life sciences
57Grid/Globus
- Leader of the pack for GRID middleware
- Layered software toolkit
- 1: Grid Fabric (OS, TCP)
- 2: Grid Services: Globus Resource Allocation Manager, Globus Information Service (meta-computing directory service), Grid Security Infrastructure, GridFTP
- 3: Application Toolkits: Job submission, MPICH-G2 message passing interface
- 4: Specific Applications: OVERFLOW Navier-Stokes flow solver
58Globus in gory detail
- SHELL SCRIPTS
    globus-mds-search '(&(hn=denali.mcs.anl.gov)(objectclass=GlobusSystemDynamicInformation))' cpuload1 | \
    sed -n -e '/^hn/p' -e '/^cpuload1/p' | \
    sed -e 's/,.*//' -e 's/=/ /g' | \
    awk '/^hn/{printf "%s", $2} /^cpuload/{printf " %s\n", $2}'

    if [ $# -eq 0 ]; then
        echo "provide argument <number of processes to start>" 1>&2
        exit 1
    fi
    if [ -z "$GRAMCONTACT" ]; then
        GRAMCONTACT="`globus-hostname2contacts -type fork pitcairn.mcs.anl.gov`"
    fi
    pwd=`/bin/pwd`
    rsl="&(executable=$pwd/myjobtest)(count=$1)"
    arch=`$GLOBUS_INSTALL_PATH/sbin/config.guess`
    $GLOBUS_INSTALL_PATH/tools/$arch/bin/globusrun -o -r "$GRAMCONTACT" "$rsl"
- LIBRARIES
    /* get process id and hostname */
    pid = getpid();
    rc = globus_libc_gethostname(hn, 256);
    globus_assert(rc == GLOBUS_SUCCESS);
    /* get current time and convert to string format.
       setting [25] to zero will strip the newline character. */
    mytime = time(GLOBUS_NULL);
    timestr = globus_libc_ctime_r(&mytime, buf, 30);
    timestr[25] = '\0';
    globus_libc_printf("%s process %d on %s came to life\n", timestr, pid, hn);
    /* THE BARRIER!!! */
    globus_duroc_runtime_barrier();
    /* Passed the barrier: get current time again and print it out. */
    mytime = time(GLOBUS_NULL);
    timestr = globus_libc_ctime_r(&mytime, buf, 30);
    globus_libc_printf("%s process %d on %s passed the barrier\n", timestr, pid, hn);
    /* TODO 1: get the layout of the DUROC job using first
       globus_duroc_runtime_intra_subjob_rank() and then
       globus_duroc_runtime_inter_subjob_structure(). */
    /* We are done. */
    rc = globus_module_deactivate_all();
59Shielding Users
- Users do not want to deal with XML, they want their data
- Users do not want to deal with configuring grid computing, they want results
- SOAP: data appears in user memory, XML is invisible
- SOAP call: just a remote procedure
60Atomic Services
- Metadata information about resources
- Waveband
- Sky coverage
- Translation of names to universal dictionary (UCD)
- Simple search patterns on the resources
- Cone Search
- Image mosaic
- Unit conversions
- Simple filtering, counting, histogramming
- On-the-fly recalibrations
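Cone Search is the simplest of these services: return every object within a given angular radius of a sky position. A minimal Python sketch follows, assuming a small in-memory catalog of dictionaries with `ra` and `dec` in degrees. The function names and sample rows are illustrative, and a real service would use a spatial index rather than this linear scan.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable for small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(catalog, ra, dec, radius_deg):
    """Return catalog rows within radius_deg of the position (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(row["ra"], row["dec"], ra, dec) <= radius_deg]

# Hypothetical sample rows for illustration only.
catalog = [
    {"id": 1, "ra": 184.0289, "dec": -1.1259},
    {"id": 2, "ra": 184.0257, "dec": -1.2180},
    {"id": 3, "ra": 10.0, "dec": 10.0},
]
hits = cone_search(catalog, ra=184.03, dec=-1.13, radius_deg=0.05)
print([row["id"] for row in hits])
```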
61Higher Level Services
- Built on Atomic Services
- Perform more complex tasks
- Examples
- Automated resource discovery
- Cross-identifications
- Photometric redshifts
- Outlier detections
- Visualization facilities
- Expectation
- Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)
62SkyQuery
- Distributed Query tool using a set of services
- Feasibility study, built in 6 weeks from scratch
- Tanu Malik (JHU CS grad student)
- Tamas Budavari (JHU astro postdoc)
- Implemented in C# and .NET
- Won 2nd prize of Microsoft XML Contest
- Allows queries like
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
  AND o.type = 3 AND (o.I - t.m_j) > 2
63Architecture
Web Page
Image cutout
SkyQuery
SkyNodeSDSS
SkyNode2Mass
SkyNodeFirst
64Cross-id Steps
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
  AND (o.i - t.m_j) > 2 AND o.type = 3
- Parse query
- Get counts
- Sort by counts
- Make plan
- Cross-match
- Recursively, from small to large
- Select necessary attributes only
- Return output
- Insert cutout image
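The "get counts, sort by counts, cross-match from small to large" steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not SkyQuery's implementation: archives are in-memory lists of hypothetical (id, ra, dec) tuples, and the match test is a plain angular-distance threshold standing in for XMATCH.

```python
import math

def sep_arcsec(a, b):
    """Great-circle separation in arcseconds between (id, ra, dec) tuples in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (a[1], a[2], b[1], b[2]))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600

def cross_match(archives, tolerance_arcsec):
    """Match objects across archives, starting from the smallest archive."""
    # Get counts and sort by counts, so the partial result stays small.
    order = sorted(archives, key=lambda name: len(archives[name]))
    partial = [{order[0]: obj} for obj in archives[order[0]]]
    # Extend each partial match with candidates from the next larger archive.
    for name in order[1:]:
        partial = [
            {**match, name: cand}
            for match in partial
            for cand in archives[name]
            if all(sep_arcsec(prev, cand) <= tolerance_arcsec
                   for prev in match.values())
        ]
    return partial

# Hypothetical sample objects for illustration only.
archives = {
    "SDSS": [(1, 181.3000, -0.7600), (2, 181.3100, -0.7500), (3, 181.2900, -0.7700)],
    "TWOMASS": [(10, 181.3001, -0.7601), (11, 181.4000, -0.7000)],
}
matches = cross_match(archives, tolerance_arcsec=3.5)
print(matches)
```

Starting from the archive with the fewest objects keeps the intermediate result small, which matters when each step is a remote call to another SkyNode.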
65Show Cutout Web Service