Title: Scaleable Computing Jim Gray Microsoft Corporation Gray@Microsoft.com
1 Scaleable Computing
Jim Gray, Microsoft Corporation, Gray@Microsoft.com
2 Thesis: Scaleable Servers
- Scaleable Servers
- Commodity hardware allows new applications
- New applications need huge servers
- Clients and servers are built of the same stuff
- Commodity software and
- Commodity hardware
- Servers should be able to
- Scale up (grow node by adding CPUs, disks, networks)
- Scale out (grow by adding nodes)
- Scale down (can start small)
- Key software technologies
- Objects, Transactions, Clusters, Parallelism
3 1987: 256 tps Benchmark
- $14 M computer (Tandem)
- A dozen people
- False floor, 2 rooms of machines
[Figure: machine room with a 32-node processor array, a 40 GB disk array (80 drives), and a simulator for 25,600 clients. Staff: manager, auditor, admin expert, hardware experts, network expert, performance expert, OS expert, DB expert]
4 1988: DB2 + CICS Mainframe, 65 tps
- IBM 4391
- Simulated network of 800 clients
- $2 M computer
- Staff of 6 to do benchmark
[Figure: refrigerator-sized CPU, 2 x 3725 network controllers, 16 GB disk farm (4 x 8 x .5 GB)]
5 1997: 10 Years Later, 1 Person and 1 Box = 1250 tps
- 1 breadbox ~ 5x 1987 machine room
- 23 GB is hand-held
- One person does all the work
- Cost/tps is 1,000x less: 25 micro dollars per transaction
[Figure: one box with 4 x 200 MHz CPUs, 1/2 GB DRAM, and 12 x 4 GB disks, plus 3 x 7 x 4 GB disk arrays. One person is hardware expert, OS expert, net expert, DB expert, and app expert]
6 What Happened?
- Moore's law: things get 4x better every 3 years (applies to computers, storage, and networks)
- New economics: commodity
  class           price ($/mips)   software (k$/year)
  mainframe       10,000           100
  minicomputer    100              10
  microcomputer   10               1
- GUI: human-computer tradeoff; optimize for people, not computers
7 What Happens Next
- Last 10 years: 1000x improvement
- Next 10 years: ????
- Today: text and image servers are free; 25 m$/hit => advertising pays for them
- Future: video, audio, ... servers are free
You ain't seen nothing yet!
8 Kinds Of Information Processing
                           Point-to-point          Broadcast
  Immediate (network)      Conversation, Money     Lecture, Concert
  Time-shifted (database)  Mail                    Book, Newspaper
- It's ALL going electronic
- Immediate is being stored for analysis (so ALL database)
- Analysis and automatic processing are being added
9 Why Put Everything In Cyberspace?
- Low rent: min $/byte
- Shrinks time: now or later
- Shrinks space: here or there
- Automate processing: knowbots
[Figure: network (point-to-point OR broadcast) and database (immediate OR time-delayed) support locate, process, analyze, summarize]
10 Magnetic Storage Cheaper Than Paper
- File cabinet: cabinet (four drawer) $250, paper (24,000 sheets) $250, space (2x3 @ $10/ft2) $180; total $700, or 3 cents/sheet
- Disk: disk (4 GB) $800; ASCII: 2 mil pages, or 0.04 cents/sheet (80x cheaper)
- Image: 200,000 pages, or 0.4 cents/sheet (8x cheaper)
- Store everything on disk
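The per-sheet costs on this slide can be checked with a few lines of arithmetic. This is only a sketch: the helper name `cents_per_sheet` is hypothetical, and the dollar figures assume the amounts on the slide are US dollars.

```python
def cents_per_sheet(total_dollars, sheets):
    """Cost of storing one sheet, in cents."""
    return 100.0 * total_dollars / sheets

# Figures as given on the slide (assumed to be US dollars).
paper = cents_per_sheet(700, 24_000)          # four-drawer file cabinet
ascii_disk = cents_per_sheet(800, 2_000_000)  # 4 GB disk holding ASCII pages
image_disk = cents_per_sheet(800, 200_000)    # 4 GB disk holding page images

print(f"paper: {paper:.2f} cents/sheet")
print(f"ASCII on disk: {ascii_disk:.2f} cents/sheet ({paper / ascii_disk:.0f}x cheaper)")
print(f"image on disk: {image_disk:.2f} cents/sheet ({paper / image_disk:.0f}x cheaper)")
```

The computed ratios come out a little below the slide's rounded 80x and 8x because the slide rounds paper up to 3 cents/sheet.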
11 Databases: Information at Your Fingertips, Information Network, Knowledge Navigator
- All information will be in an online database (somewhere)
- You might record everything you
  - Read: 10 MB/day, 400 GB/lifetime (eight tapes today)
  - Hear: 400 MB/day, 16 TB/lifetime (three tapes/year today)
  - See: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (maybe someday)
12 Databases Store ALL Data Types
- The new world
- Billions of objects
- Big objects (1 MB)
- Objects have behavior (methods)
- The old world
- Millions of objects
- 100-byte objects
- Paperless office
- Library of Congress online
- All information online
- Entertainment
- Publishing
- Business
- WWW and Internet
[Figure: a People table with columns Name, Address, Papers, Picture, Voice; example rows David (NY), Mike (Berk), Won (Austin)]
13 Billions Of Clients
- Every device will be intelligent
- Doors, rooms, cars
- Computing will be ubiquitous
14 Billions Of Clients Need Millions Of Servers
- All clients networked to servers
- May be nomadic or on-demand
- Fast clients want faster servers
- Servers provide
- Shared Data
- Control
- Coordination
- Communication
[Figure: mobile clients and fixed clients connect to servers: a server and a super server]
15 Thesis: Many Little Beat Few Big
[Figure: price classes ($1 million, $100 K, $10 K) for mainframe, mini, micro, nano, and pico processors; storage sizes of 1 MB, 100 MB, 10 GB, 1 TB, and 100 TB; 10 pico-second RAM; disk form factors of 14", 9", 5.25", 3.5", 2.5", and 1.8"]
1 M SPECmarks, 1 TFLOP; 10^6 clocks to bulk RAM; event horizon on chip; VM reincarnated; multi-program cache, on-chip SMP
- Smoking, hairy golf ball
- How to connect the many little parts?
- How to program the many little parts?
- Fault tolerance?
16 Future Super Server: 4T Machine
- Array of 1,000 4B machines
  - 1 Bips processors
  - 1 BB DRAM
  - 10 BB disks
  - 1 Bbps comm lines
  - 1 TB tape robot
- A few megabucks
- Challenge
- Manageability
- Programmability
- Security
- Availability
- Scaleability
- Affordability
- As easy as a single system
Cyber Brick: a 4B machine
Future servers are CLUSTERS of processors, discs.
Distributed database techniques make clusters work.
17 Performance: Storage Accesses, not Instructions Executed
- In the old days we counted instructions and IOs
- Now we count memory references
- Processors wait most of the time
[Figure: where the time goes: clock ticks used by AlphaSort components (sort, OS, disc wait, memory wait), with memory wait broken into I-cache miss, B-cache data miss, and D-cache miss]
70 MIPS; real apps have worse I-cache misses, so run at 60 MIPS if well tuned, 20 MIPS if not
18 Storage Latency: How Far Away is the Data?
  Storage level         Clock ticks   Distance analogy   Time analogy
  Registers             1             My head            1 min
  On-chip cache         2             This room
  On-board cache        10            This campus        10 min
  Memory                100                              1.5 hr
  Disk                  10^6          Pluto              2 years
  Tape/optical robot    10^9          Andromeda          2,000 years
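The time analogy appears to scale one clock tick to roughly one minute of human time. A minimal sketch of that scaling follows; the one-minute-per-tick factor and the helper name `human_scale` are assumptions read off the slide's "1 min" for a register access, not something the slide states.

```python
# Clock ticks per access, as listed on the slide.
LATENCY_TICKS = {
    "registers": 1,
    "on-chip cache": 2,
    "on-board cache": 10,
    "memory": 100,
    "disk": 10**6,
    "tape/optical robot": 10**9,
}

def human_scale(ticks, minutes_per_tick=1.0):
    """Express a latency in human terms, assuming each tick lasts minutes_per_tick."""
    minutes = ticks * minutes_per_tick
    if minutes < 60:
        return f"{minutes:.0f} min"
    hours = minutes / 60
    if hours < 24:
        return f"{hours:.1f} hr"
    days = hours / 24
    if days < 365:
        return f"{days:.0f} days"
    return f"{days / 365:,.0f} years"

for level, ticks in LATENCY_TICKS.items():
    print(f"{level:20s} {ticks:>13,} ticks  ~ {human_scale(ticks)}")
```

Under this assumption the results land close to the slide's rounded figures: about 1.7 hours for memory, about 2 years for disk, and about 1,900 years for the tape robot.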
19 The Hardware Is In Place... And Then A Miracle Occurs?
- SNAP: scaleable network and platforms
- Commodity-distributed OS built on
  - Commodity platforms
  - Commodity network interconnect
- Enables parallel applications
20 Thesis: Scaleable Servers
- Scaleable Servers
- Commodity hardware allows new applications
- New applications need huge servers
- Clients and servers are built of the same stuff
- Commodity software and
- Commodity hardware
- Servers should be able to
- Scale up (grow node by adding CPUs, disks, networks)
- Scale out (grow by adding nodes)
- Scale down (can start small)
- Key software technologies
- Objects, Transactions, Clusters, Parallelism
21 Scaleable Servers: BOTH SMP And Cluster
- Grow up with SMP: 4xP6 is now standard
- Grow out with cluster: cluster has inexpensive parts
[Figure: SMP path from personal system to departmental server to SMP super server; cluster of PCs]
22 SMPs Have Advantages
- Single system image: easier to manage, easier to program (threads in shared memory, disk, Net)
- 4x SMP is commodity
- Software capable of 16x
- Problems
  - >4 not commodity
  - Scale-down problem (starter systems expensive)
  - There is a BIGGEST one
[Figure: personal system, departmental server, SMP super server]
23 Building the Largest Node
- There is a biggest node (size grows over time)
- Today, with NT, it is probably 1TB
- We are building it (with help from DEC and SPIN2)
- 1 TB GeoSpatial SQL Server database
- (1.4 TB of disks 320 drives).
- 30K BTU, 8 KVA, 1.5 metric tons.
- Will put it on the Web as a demo app.
- 10 meter image of the ENTIRE PLANET.
- 2 meter image of interesting parts (2% of land)
- One pixel per meter = 500 TB uncompressed.
- Better resolution in US (courtesy of USGS).
24 What's a TeraByte?
- 1 Terabyte
  - 1,000,000,000 business letters: 150 miles of book shelf
  - 100,000,000 book pages: 15 miles of book shelf
  - 50,000,000 FAX images: 7 miles of book shelf
  - 10,000,000 TV pictures (mpeg): 10 days of video
  - 4,000 LandSat images: 16 earth images (100m)
  - 100,000,000 web pages: 10 copies of the web HTML
- Library of Congress (in ASCII) is 25 TB
- 1980: $200 million of disc (10,000 discs); $5 million of tape silo (10,000 tapes)
- 1997: $200 k of magnetic disc (48 discs); $30 k nearline tape (20 tapes)
- Terror Byte!
25 TB DB User Interface
26 TPC-C Web-Based Benchmarks
- Client is a Web browser (7,500 of them!)
- Submits
  - Order
  - Invoice
  - Query to server via Web page interface
- Web server translates to DB
- SQL does DB work
- Net:
  - easy to implement
  - performance is GREAT!
[Figure: browser to IIS Web server over HTTP, then to SQL over ODBC]
27 TPC-C Shows How Far SMPs Have Come
- Performance is amazing
  - 2,000 users is the min!
  - 30,000 users on a 4x12 alpha cluster (Oracle)
- Peak performance: 30,390 tpmC @ $305/tpmC (Oracle/DEC)
- Best price/perf: 6,712 tpmC @ $65/tpmC (MS SQL/DEC/Intel)
- Graphs show UNIX high price and diseconomy of scaleup
28 TPC-C SMP Performance
- SMPs do offer speedup
- but 4x P6 is better than some 18x MIPSco
29 The TPC-C Revolution Shows How Far NT and SQL Server Have Come
- Economy of scale on Windows NT
- Recent Microsoft SQL Server benchmarks are Web-based
[Figure: tpmC and $/tpmC. Price ($/tpmC, 0 to 250, lower is better) versus performance (tpmC, 0 to 8,000) for DB2, Informix, Microsoft, Oracle, and Sybase; MS SQL Server shows economy of scale and low price]
30 What Happens To Prices?
- No expensive UNIX front end ($20/tpmC)
- No expensive TP monitor software ($10/tpmC)
- => $65/tpmC
31 Grow UP and OUT
1 Terabyte DB
- Cluster
- a collection of nodes
- as easy to program and manage as a single node
1 billion transactions per day
32 Clusters Have Advantages
- Clients and servers made from the same stuff
- Inexpensive
- Built with commodity components
- Fault tolerance
- Spare modules mask failures
- Modular growth
- Grow by adding small modules
- Unlimited growth: no biggest one
33 Windows NT Clusters
- Key goals
  - Easy: to install, manage, program
  - Reliable: better than a single node
  - Scaleable: added parts add power
- Microsoft & 60 vendors defining NT clusters
  - Almost all big hardware and software vendors involved
- No special hardware needed, but it may help
- Enables
  - Commodity fault-tolerance
  - Commodity parallelism (data mining, virtual reality)
  - Also great for workgroups!
- Initial: two-node failover
  - Beta testing since December 96
  - SAP, Microsoft, Oracle giving demos.
  - File, print, Internet, mail, DB, other services
  - Easy to manage
  - Each node can be 4x (or more) SMP
- Next (NT5): Wolfpack is modest size cluster
  - About 16 nodes (so 64 to 128 CPUs)
  - No hard limit, algorithms designed to go further
34 SQL Server Failover Using Wolfpack Windows NT Clusters
- Each server owns half the database
- When one fails
- The other server takes over the shared disks
- Recovers the database and serves it
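The failover steps above can be sketched as a toy model. This is an illustration only, not the Wolfpack or SQL Server API: the class `FailoverPair`, its methods, and the partition names are hypothetical, and database recovery is reduced to a comment.

```python
class FailoverPair:
    """Toy model: two servers share disks; each normally owns half the database."""

    def __init__(self, node_a, node_b, partitions):
        self.up = {node_a: True, node_b: True}
        half = len(partitions) // 2
        self.owner = {p: node_a for p in partitions[:half]}
        self.owner.update({p: node_b for p in partitions[half:]})

    def fail(self, node):
        """Mark a node as failed and hand its partitions to the survivor."""
        self.up[node] = False
        survivors = [n for n, alive in self.up.items() if alive]
        if not survivors:
            raise RuntimeError("no surviving server")
        survivor = survivors[0]
        for partition, owner in self.owner.items():
            if owner == node:
                # The survivor takes over the shared disks; database
                # recovery would run here but is not modeled.
                self.owner[partition] = survivor

    def route(self, partition):
        """Return the server currently serving a partition."""
        return self.owner[partition]


pair = FailoverPair("A", "B", ["p0", "p1", "p2", "p3"])
before = [pair.route(p) for p in ("p0", "p3")]
pair.fail("A")
after = [pair.route(p) for p in ("p0", "p3")]
print(before, "->", after)
```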
35 Billion Transactions per Day Project
- Building a 20-node Windows NT Cluster (with help from Intel), > 800 disks
- All commodity parts
- Using SQL Server & DTC distributed transactions
- Each node has 1/20th of the DB
- Each node does 1/20th of the work
- 15% of the transactions are distributed
36 How Much Is 1 Billion Transactions Per Day?
- 1 Btpd = 11,574 tps (transactions per second) ~ 700,000 tpm (transactions/minute)
- AT&T
  - 185 million calls (peak day worldwide)
- Visa ~ 20 M tpd
  - 400 M customers
  - 250,000 ATMs worldwide
  - 7 billion transactions / year (card + cheque) in 1994
[Figure: millions of transactions per day (Mtpd), log scale from 0.1 to 1,000, for AT&T, Visa, BofA, NYSE, and 1 Btpd]
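The rate conversions on this slide are simple unit arithmetic. A short sketch, with hypothetical helper names, that reproduces them:

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60

def per_second(per_day):
    """Average transactions per second for a daily total."""
    return per_day / SECONDS_PER_DAY

def per_minute(per_day):
    """Average transactions per minute for a daily total."""
    return per_day / MINUTES_PER_DAY

one_btpd = 1_000_000_000
visa_per_day = 7_000_000_000 / 365   # 7 billion transactions per year

print(f"1 Btpd = {per_second(one_btpd):,.0f} tps = {per_minute(one_btpd):,.0f} tpm")
print(f"Visa ~ {visa_per_day / 1e6:.1f} M tpd")
```

These are daily averages; peak rates would be higher, which the slide does not quantify.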
37 Parallelism: The OTHER Aspect Of Clusters
- Clusters of machines allow two kinds of parallelism
  - Many little jobs: online transaction processing (TPC-A, B, C)
  - A few big jobs: data search and analysis (TPC-D, DSS, OLAP)
- Both give automatic parallelism
38 Kinds of Parallel Execution
- Pipeline: any sequential program feeds any sequential program
- Partition: outputs split N ways, inputs merge M ways; each partition runs any sequential program
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
39 Data Rivers: Split + Merge Streams
[Figure: N producers feed a river of N x M data streams, which feeds M consumers]
- Producers add records to the river; consumers consume records from the river
- Purely sequential programming
- River does flow control and buffering
- River does partition and merge of data records
- River = Split/Merge in Gamma = Exchange operator in Volcano
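A data river can be sketched with bounded queues. The sketch below is not the Gamma or Volcano implementation: the function `run_river`, its parameters, and the use of Python threads are assumptions chosen for illustration. Producer and consumer bodies stay purely sequential, and the bounded queues supply buffering and flow control.

```python
import queue
import threading

SENTINEL = object()  # marks the end of a consumer's stream

def run_river(producers, consumer_count, partition_key, buffer_size=4):
    """Route records from N producer iterables to M consumers.

    Returns one list of received records per consumer.
    """
    streams = [queue.Queue(maxsize=buffer_size) for _ in range(consumer_count)]
    results = [[] for _ in range(consumer_count)]

    def produce(records):
        for record in records:
            # Split: the record's key picks the consumer stream.
            # put() blocks when the stream is full (flow control).
            streams[partition_key(record) % consumer_count].put(record)

    def consume(index):
        while True:
            record = streams[index].get()
            if record is SENTINEL:
                break
            # Merge: records from every producer arrive on this one stream.
            results[index].append(record)

    consumers = [threading.Thread(target=consume, args=(i,))
                 for i in range(consumer_count)]
    producer_threads = [threading.Thread(target=produce, args=(p,))
                        for p in producers]
    for thread in consumers + producer_threads:
        thread.start()
    for thread in producer_threads:
        thread.join()
    for stream in streams:
        stream.put(SENTINEL)      # all producers are done; close each stream
    for thread in consumers:
        thread.join()
    return results


producers = [range(0, 10), range(10, 20), range(20, 30)]   # N = 3 producers
parts = run_river(producers, consumer_count=2, partition_key=lambda r: r)
print([len(p) for p in parts])
```

Arrival order within each consumer depends on thread scheduling, so only the set of records per consumer is deterministic.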
40 Partitioned Execution
Spreads computation and IO among processors.
Partitioned data gives NATURAL parallelism.
41 N x M Way Parallelism
N inputs, M outputs, no bottlenecks.
Partitioned data; partitioned and pipelined data flows.
42 The Parallel Law Of Computing
- Grosch's Law: 2x $ is 4x performance
- Parallel Law: 2x $ is 2x performance (1 MIPS for $1; 1,000 MIPS for $1,000)
- Parallel Law needs linear speedup and linear scale-up
- Not always possible
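The two laws can be compared numerically. This is a sketch under one stated assumption: both laws are normalized so that 1 MIPS costs $1, which matches the Parallel Law figures on the slide. The function names are hypothetical.

```python
import math

def grosch_performance(dollars):
    """Grosch's Law: performance grows as the square of cost (2x $ gives 4x)."""
    return dollars ** 2

def grosch_cost(mips):
    """Cost of a given performance if Grosch's Law held."""
    return math.sqrt(mips)

def parallel_cost(mips):
    """Parallel Law: cost grows linearly with performance (2x $ gives 2x)."""
    return float(mips)

for mips in (1, 1_000):
    print(f"{mips:>5} MIPS: Grosch ${grosch_cost(mips):,.0f}, "
          f"parallel ${parallel_cost(mips):,.0f}")
```

Under Grosch's Law a 1,000 MIPS machine would cost about $32 on this scale, so big machines would always win; under the Parallel Law it costs $1,000, the same as 1,000 small ones.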
43 Thesis: Scaleable Servers
- Scaleable Servers
- Commodity hardware allows new applications
- New applications need huge servers
- Clients and servers are built of the same stuff
- Commodity software and
- Commodity hardware
- Servers should be able to
- Scale up (grow node by adding CPUs, disks, networks)
- Scale out (grow by adding nodes)
- Scale down (can start small)
- Key software technologies
- Objects, Transactions, Clusters, Parallelism
44 The BIG Picture: Components and Transactions
- Software modules are objects
- Object Request Broker (a.k.a., Transaction Processing Monitor) connects objects (clients to servers)
- Standard interfaces allow software plug-ins
- Transaction ties execution of a job into an atomic unit: all-or-nothing, durable, isolated
Object Request Broker
45 ActiveX and COM
- COM is the Microsoft model, the engine inside OLE; ALL Microsoft software is based on COM (ActiveX)
- CORBA + OpenDoc is equivalent
- Heated debate over which is best
- Both share same key goals
  - Encapsulation: hide implementation
  - Polymorphism: generic operations, key to GUI and reuse
  - Versioning: allow upgrades
  - Transparency: local/remote
  - Security: invocation can be remote
  - Shrink-wrap: minimal inheritance
  - Automation: easy
- COM now managed by the Open Group
46 Linking And Embedding: Objects are data modules; transactions are execution modules
- Link: pointer to object somewhere else
  - Think URL in Internet
- Embed: bytes are here
- Objects may be active; can call back to subscribers
47 Commodity Software Components: Inexpensive OS, DBMS, and plug-ins
- Recent TPC-C prices
  - Oracle on DEC UNIX: 30.4 k tpmC @ $305/tpmC
  - Informix on DEC UNIX: 13.6 k tpmC @ $277/tpmC
  - DB2 on Solaris: 6.4 k tpmC @ $200/tpmC
  - SQL Server on Compaq, Windows NT: 7.3 k tpmC @ $65/tpmC (using Web, no TP monitor!)
  - Oracle on Windows NT: 3.1 k tpmC @ $198/tpmC
- Net: open solutions can do even biggest jobs; thousands of online users per node of cluster
- ActiveX, VBX, and Java plug-ins
  - Spreadsheets, GeoQuery, FAX, voice, image libraries, commodity component market
48 Objects Meet Databases: The basis for universal data servers, access, integration
- Object-oriented (COM oriented) programming interface to data
- Breaks DBMS into components
- Anything can be a data source
- Optimization/navigation on top of other data sources
- A way to componentize a DBMS
- Makes an RDBMS an O-RDBMS (assumes optimizer understands objects)
DBMS engine
49 The Pattern: Three Tier Computing
- Clients do presentation, gather input
- Clients do some workflow (Xscript)
- Clients send high-level requests to ORB (Object Request Broker)
- ORB dispatches workflows and business objects: proxies for client, orchestrate flows & queues
- Server-side workflow scripts call on distributed business objects to execute task
[Figure: tiers from top to bottom: presentation, workflow, business objects, database]
50 The Three Tiers
Object Data server.
51 Why Did Everyone Go To Three-Tier?
- Manageability
  - Business rules must be with data
  - Middleware operations tools
- Performance (scaleability)
  - Server resources are precious
  - ORB dispatches requests to server pools
- Technology & Physics
  - Put UI processing near user
  - Put shared data processing near shared data
[Figure: tiers from top to bottom: presentation, workflow, business objects, database]
52 Why Put Business Objects at Server?
53 What Middleware Does: ORB, TP Monitor, Workflow Mgr, Web Server
- Registers transaction programs: workflow and business objects (DLLs)
- Pre-allocates server pools
- Provides server execution environment
- Dynamically checks authority (request-level security)
- Does parameter binding
- Dispatches requests to servers
  - parameter binding
  - load balancing
- Provides queues
- Operator interface
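A few of the duties listed above (registration, request-level authority checks, parameter binding, and load-balanced dispatch) can be sketched in a handful of lines. This is a toy under stated assumptions: the class `Middleware`, its methods, the role names, and the round-robin policy are all hypothetical and do not describe any real ORB or TP monitor.

```python
import itertools

class Middleware:
    """Toy request dispatcher: register programs, check authority, dispatch."""

    def __init__(self):
        self._programs = {}   # name -> (callable, allowed roles)
        self._pools = {}      # name -> round-robin cycle over server ids

    def register(self, name, program, allowed_roles, pool_size=2):
        """Register a program and pre-allocate its server pool."""
        self._programs[name] = (program, set(allowed_roles))
        self._pools[name] = itertools.cycle(range(pool_size))

    def dispatch(self, name, role, **params):
        """Check authority, pick a server, bind parameters, run the program."""
        program, allowed = self._programs[name]
        if role not in allowed:
            raise PermissionError(f"{role} may not call {name}")
        server = next(self._pools[name])      # round-robin load balancing
        return server, program(**params)      # keyword parameter binding


mw = Middleware()
mw.register("new_order", lambda item, qty: f"{qty} x {item}", {"clerk"})
first = mw.dispatch("new_order", "clerk", item="widget", qty=2)
second = mw.dispatch("new_order", "clerk", item="gadget", qty=1)
print(first, second)
```

Queues and the operator interface from the list are left out to keep the sketch short.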
54 Server Side Objects: Easy Server-Side Execution
- Give simple execution environment
- Object gets
  - start
  - invoke
  - shutdown
- Everything else is automatic
- Drag & Drop Business Objects
[Figure: a server: network receiver, queue management, connections, context, security, configuration, thread pool, service logic, synchronization, shared data]
55 A New Programming Paradigm
- Develop object on the desktop
- Better yet download them from the Net
- Script work flows as method invocations
- All on desktop
- Then, move work flows and objects to server(s)
- Gives
- desktop development
- three-tier deployment
- Software Cyberbricks
56 Transactions Coordinate Components (ACID)
- Transaction properties
  - Atomic: all or nothing
  - Consistent: old and new values
  - Isolated: automatic locking or versioning
  - Durable: once committed, effects survive
- Transactions are built into modern OSs
  - MVS/TM, Tandem TMF, VMS DEC-DTM, NT-DTC
57 Transactions & Objects
- Application requests transaction identifier (XID)
- XID flows with method invocations
- Object Managers join (enlist) in transaction
- Distributed Transaction Manager coordinates commit/abort
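The flow above can be sketched with a toy transaction manager. The slide says only that the manager "coordinates commit/abort"; the two-phase commit protocol used here is an assumption, and the class and method names (`ObjectManager`, `TransactionManager`, `enlist`, `prepare`) are hypothetical rather than the DTC interface.

```python
import itertools

class ObjectManager:
    """A participant that votes on commit and records the final outcome."""

    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.outcome = {}          # xid -> "committed" or "aborted"

    def prepare(self, xid):
        return self.will_commit    # phase 1 vote

    def commit(self, xid):
        self.outcome[xid] = "committed"

    def abort(self, xid):
        self.outcome[xid] = "aborted"


class TransactionManager:
    """Hands out XIDs, tracks enlisted managers, and decides the outcome."""

    def __init__(self):
        self._next_xid = itertools.count(1)
        self._enlisted = {}

    def begin(self):
        xid = next(self._next_xid)
        self._enlisted[xid] = []
        return xid

    def enlist(self, xid, manager):
        if manager not in self._enlisted[xid]:
            self._enlisted[xid].append(manager)

    def commit(self, xid):
        participants = self._enlisted.pop(xid)
        votes = [m.prepare(xid) for m in participants]    # phase 1: collect votes
        decision = all(votes)
        for m in participants:                            # phase 2: tell everyone
            (m.commit if decision else m.abort)(xid)
        return decision


tm = TransactionManager()
orders, billing = ObjectManager("orders"), ObjectManager("billing")
stock = ObjectManager("stock", will_commit=False)

xid1 = tm.begin()
tm.enlist(xid1, orders)
tm.enlist(xid1, billing)
ok1 = tm.commit(xid1)

xid2 = tm.begin()
tm.enlist(xid2, orders)
tm.enlist(xid2, stock)
ok2 = tm.commit(xid2)
print(ok1, ok2, orders.outcome)
```

A real coordinator would also log its decision durably and handle participant timeouts, which this sketch omits.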
58 Transactions Coordinate Components (ACID)
- Programmer's view: bracket a collection of actions
- A simple failure model
- Only two outcomes
  Begin() action action action action Commit()      Success!
  Begin() action action action Rollback()           Failure!
  Begin() action action action Fail! Rollback()     Failure!
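The bracketing pattern can be tried out with Python's built-in `sqlite3` module, used here purely as a stand-in for the transaction systems named in these slides. The table, account names, and `transfer` function are hypothetical. Using the connection as a context manager commits if the block finishes and rolls back if it raises, which gives exactly the two outcomes above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts as one all-or-nothing unit."""
    with conn:  # commit on success, rollback if an exception escapes
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        (balance,) = conn.execute("SELECT balance FROM account WHERE name = ?",
                                  (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")   # triggers the rollback
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                     (amount, dst))

transfer(conn, "a", "b", 60)        # succeeds and commits
try:
    transfer(conn, "a", "b", 60)    # fails midway and is rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)
```

The failed transfer leaves no trace: the debit made before the error is undone along with everything else in the bracket.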
59 Distributed Transactions Enable Huge Throughput
- Each node capable of 7 KtpmC (7,000 active users!)
- Can add nodes to cluster (to support 100,000 users)
- Transactions coordinate nodes
- ORB / TP monitor spreads work among nodes
60 Distributed Transactions Enable Huge DBs
- Distributed database technology spreads data among nodes
- Transaction processing technology manages nodes
61 Thesis: Scaleable Servers
- Scaleable Servers Built from Cyberbricks
  - Allow new applications
- Servers should be able to
  - Scale up, out, down
- Key software technologies
  - Clusters (tie the hardware together)
  - Parallelism (uses the independent CPUs, stores, wires)
  - Objects (software CyberBricks)
  - Transactions (mask errors)
62 Computer Industry Laws (Rules of thumb)
- Metcalfe's law
- Moore's first law
- Bell's computer classes (7 price tiers)
- Bell's platform evolution
- Bell's platform economics
- Bill's law
- Software economics
- Grove's law
- Moore's second law
- Is info-demand infinite?
- The death of Grosch's law
63 Metcalfe's Law: Network Utility = Users^2
- How many connections can it make?
  - 1 user: no utility
  - 100,000 users: a few contacts
  - 1 million users: many on Net
  - 1 billion users: everyone on Net
- That is why the Internet is so hot
- Exponential benefit
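One common way to make "Utility = Users^2" concrete is to count the distinct pairs of users who could connect. The sketch below does that; treating utility as the pair count is an interpretation, and the function name is hypothetical.

```python
def possible_connections(users):
    """Distinct user pairs: n * (n - 1) / 2, which grows as users squared."""
    return users * (users - 1) // 2

for users in (1, 100_000, 1_000_000, 1_000_000_000):
    print(f"{users:>13,} users: {possible_connections(users):,} possible connections")
```

A single user has zero possible connections, matching "1 user: no utility", and each 10x growth in users gives roughly 100x more pairs.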
64 Moore's First Law
- XXX doubles every 18 months: 60% increase per year
  - Microprocessor speeds
  - Chip density
  - Magnetic disk density
  - Communications bandwidth: WAN bandwidth approaching LANs
- Exponential growth
  - The past does not matter
  - 10x here, 10x there, soon you're talking REAL change
- PC costs decline faster than any other platform
  - Volume and learning curves
  - PCs will be the building bricks of all future systems
65 Bumps In The Moore's Law Road
- DRAM
  - 1988: United States anti-dumping rules
  - 1993-1995: ? price flat
- Magnetic disk
  - 1965-1989: 10x/decade
  - 1989-1996: 4x/3 year! 100x/decade
66 Gordon Bell's 1975 VAX Planning Model... He Didn't Believe It!
System Price = 5 x 3 x $.04 x memory size / 1.26^(t-1972) K$
- 5x: memory is 20% of cost
- 3x: DEC markup
- $.04: per byte
- He didn't believe the projection: $500 machine
- He couldn't comprehend the implications
67 Gordon Bell's Processing, Memories, And Comm: 100 Years
[Figure: 100-year trends for processing, primary memory, secondary memory, backbone, and POTS (bps)]
68 Gordon Bell's Seven Price Tiers
- $10: wrist watch computers
- $100: pocket/palm computers
- $1,000: portable computers
- $10,000: personal computers (desktop)
- $100,000: departmental computers (closet)
- $1,000,000: site computers (glass house)
- $10,000,000: regional computers (glass castle)
Super server: costs more than $100,000. Mainframe: costs more than $1 million. Must be an array of processors, disks, tapes, comm ports.
69 Bell's Evolution Of Computer Classes
Technology enables two evolutionary paths:
1. constant performance, decreasing cost
2. constant price, increasing performance
- 1.26/year = 2x/3 yrs = 10x/decade; 1/1.26 = .8
- 1.6/year = 4x/3 yrs = 100x/decade; 1/1.6 = .62
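The growth factors on this slide follow from compounding. A brief sketch, with hypothetical helper names, that derives the yearly factor from "Nx every 3 years" and then compounds it over a decade:

```python
def annual_factor(multiple, years):
    """Yearly growth factor that compounds to `multiple` over `years`."""
    return multiple ** (1.0 / years)

def growth_over(factor_per_year, years):
    """Total growth after compounding a yearly factor for `years`."""
    return factor_per_year ** years

for multiple in (2, 4):
    yearly = annual_factor(multiple, 3)
    print(f"{multiple}x/3 yrs: {yearly:.2f}x per year, "
          f"{growth_over(yearly, 10):.0f}x per decade, "
          f"constant-performance price falls to {1 / yearly:.2f} each year")
```

The results agree with the slide's rounded values: about 1.26 and 1.59 per year, about 10x and 100x per decade, and yearly price ratios near .8 and .63.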
70 Gordon Bell's Platform Economics
- Traditional computers: custom or semi-custom, high-tech and high-touch
- New computers: high-tech and no-touch
[Figure: price (K$), volume (K), and application price on a log scale from 0.01 to 100,000, by computer type: mainframe, WS, browser]
71 Software Economics
- An engineer costs about $150,000/year
- R&D gets 5%...15% of budget
- Need $3 million...$1 million revenue per engineer
Microsoft ($9 billion): Profit 24%, R&D 16%, SG&A 34%, Tax 13%, Product and Service 13%
[Figure: pie charts of revenue shares (profit, R&D, SG&A, tax, P&S) for Intel ($16 billion), IBM ($72 billion), and Oracle ($3 billion)]
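The revenue-per-engineer range on this slide follows from dividing the cost of an engineer by the share of revenue spent on R&D. A short sketch, assuming the whole R&D budget pays for engineers (the slide does not say so explicitly) and using a hypothetical function name:

```python
def revenue_per_engineer(engineer_cost, rd_share):
    """Revenue needed to fund one engineer when R&D gets rd_share of revenue."""
    return engineer_cost / rd_share

ENGINEER_COST = 150_000   # dollars per year, from the slide

for share in (0.05, 0.15):
    needed = revenue_per_engineer(ENGINEER_COST, share)
    print(f"R&D at {share:.0%} of revenue: ${needed:,.0f} revenue per engineer")
```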
72 Software Economics: Bill's Law
[Figure: fixed cost, marginal cost, and price versus units]
- Bill Joy's law (Sun): don't write software for less than 100,000 platforms @ $10 million engineering expense, $1,000 price
- Bill Gates's law: don't write software for less than 1,000,000 platforms @ $10 M engineering expense, $100 price
- Examples
  - UNIX versus Windows NT: $3,500 versus $500
  - Oracle versus SQL-Server: $100,000 versus $6,000
  - No spreadsheet or presentation pack on UNIX/VMS/...
- Commoditization of base software and hardware
73 Grove's Law: The New Computer Industry
- Horizontal integration is new structure
- Each layer picks best from lower layer
- Desktop (C/S) market
  - 1991: 50%
  - 1995: 75%
  Function           Example
  Operation          AT&T
  Integration        EDS
  Applications       SAP
  Middleware         Oracle
  Baseware           Microsoft
  Systems            Compaq
  Silicon & Oxide    Intel & Seagate
74 Moore's Second Law
- The cost of fab lines doubles every generation (three years)
- Money limit: hard to imagine
  - $10-billion line
  - $20-billion line
  - $40-billion line
- Physical limit
  - Quantum effects at 0.25 micron now; 0.05 micron seems hard (12 years, three generations)
  - Lithography: need X-ray below 0.13 micron
75 Constant Dollars Versus Constant Work
- Constant work
  - One SuperServer can do all the world's computations
- Constant dollars
  - The world spends 10% on information processing
  - Computers are moving from 5% penetration to 50%
  - $300 billion to $3 trillion
  - We have the patent on the byte and algorithm
76 Crossing The Chasm
                    Old market                                 New market
  New technology    Hard: product finds customers              Very hard: no product, no customers
  Old technology    Customers find product: boring,            Hard
                    competitive, slow growth