1
CyberBricks: The Future of Database and Storage Engines
Jim Gray
http://research.Microsoft.com/~Gray

2
Outline
  • What storage things are coming from Microsoft?
  • TerraServer: a 1 TB DB on the Web
  • Storage Metrics: Kaps, Maps, Gaps, Scans
  • The future of storage: ActiveDisks

3
New Storage Software From Microsoft
  • SQL Server 7.0
  • Simplicity: auto-most-things
  • Scalability: Win95 to Enterprise
  • Data warehousing: built-in OLAP, VLDB
  • NT 5
  • Better volume management (from Veritas)
  • HSM architecture
  • Intellimirror
  • Active directory for transparency

4
Thin Client Support: TSO comes to NT
  • Lower Per-Client cost
  • Huge centralized data stores.

Hydra Server
5
Windows NT 5.0 Intelli-Mirror
  • Files and settings mirrored on client and server
  • Great for mobile users
  • Facilitates roaming
  • Easy to replace PCs
  • Optimizes network performance
  • Means HUGE data stores

6
Outline
  • What storage things are coming from Microsoft?
  • TerraServer: a 1 TB DB on the Web
  • Storage Metrics: Kaps, Maps, Gaps, Scans
  • The future of storage: ActiveDisks

7
Microsoft TerraServer: Scaleup to Big Databases
  • Build a 1 TB SQL Server database
  • Data must be:
  • 1 TB
  • unencumbered
  • interesting to everyone everywhere
  • and not offensive to anyone anywhere
  • Loaded with:
  • 1.5 M place names from Encarta World Atlas
  • 3 M sq km from USGS (1-meter resolution)
  • 1 M sq km from the Russian Space Agency (2 m)
  • On the web (world's largest atlas)
  • Sell images with commerce server.

8
Microsoft TerraServer Background
  • The earth is 500 tera-square-meters (tm²)
  • The USA is 10 tm²
  • 100 tm² of land lies between 70°N and 70°S
  • We have pictures of 6% of it:
  • 3 tm² from USGS
  • 2 tm² from the Russian Space Agency
  • Compress 5:1 (JPEG) to 1.5 TB.
  • Slice into 10 KB chunks
  • Store the chunks in the DB
  • Navigate with:
  • Encarta Atlas
  • globe
  • gazetteer
  • StreetsPlus in the USA
  • Someday:
  • multi-spectral imagery
  • of everywhere
  • once a day / hour

9
USGS Digital Ortho Quads (DOQ)
  • US Geological Survey
  • 4 terabytes
  • Most data not yet published
  • Based on a CRADA (Cooperative Research And Development Agreement)
  • Microsoft TerraServer makes the data available.

10
Russian Space Agency (SovInformSputnik) SPIN-2
(Aerial Images is the worldwide distributor)
  • 1.5-meter geo-rectified imagery of (almost) anywhere
  • Almost equal-area projection
  • De-classified satellite photos (from 200 km)
  • More data coming (1 m)
  • Selling imagery on the Internet.
  • Putting 2 tm² onto Microsoft TerraServer.

11
Demo
http://www.TerraServer.Microsoft.com/
12
Demo
  • Navigate by coverage map to the White House
  • Download the image
  • Buy imagery from USGS
  • Navigate by name to Venice
  • Buy a SPIN-2 image & Kodak photo
  • Pop out to the Expedia street map of Venice
  • Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2)

13
Hardware
[Diagram: map site server and Internet web servers on a 100 Mbps Ethernet switch, in front of the database server (8 x 440 MHz Alpha cpus, 10 GB DRAM, Enterprise Storage Array) and an STK 9710 DLT tape library]
1 TB Database Server: AlphaServer 8400, 10 GB RAM, 324 StorageWorks disks, 10-drive tape library (STK TimberWolf DLT7000).
14
The Microsoft TerraServer Hardware
  • Compaq AlphaServer 8400
  • 8 x 400 MHz Alpha cpus
  • 10 GB DRAM
  • 324 9.2 GB StorageWorks disks
  • 3 TB raw, 2.4 TB of RAID5
  • STK 9710 tape robot (14 TB)
  • Windows NT 4 EE, SQL Server 7.0

15
Software
[Diagram: web clients (HTML browser, Java viewer) → the Internet → Internet Information Server 4.0 running the Image Server Active Server Pages, Microsoft Site Server EE, and the Microsoft Automap ActiveX Server; MTS TerraServer stored procedures run against SQL Server 7 (the TerraServer DB and the Automap Server); an Image Delivery Application at the image provider site(s) feeds the DB]
16
System Management & Maintenance
  • Backup and recovery:
  • STK 9710 tape robot
  • Legato NetWorker
  • SQL Server 7 Backup & Restore
  • clocked at 80 MBps peak (~200 GB/hr)
  • SQL Server Enterprise Mgr:
  • DBA maintenance
  • SQL Performance Monitor

17
Microsoft TerraServer File Group Layout
  • Convert 324 disks to 28 RAID5 sets plus 28 spare drives
  • Make 4 WinNT volumes (RAID 50), 595 GB per volume
  • Build 30 20-GB files on each volume
  • The DB is a file group of 120 files

18
Image Delivery and Load: incremental load of 4 more TB in the next 18 months
[Diagram: DLT tape arrives (tar, NTBackup) → \DropN → cutting machines running ImgCutter, coordinated by the LoadMgr DB (DoJob, Wait 4 Load) → 100 Mbit Ether switch → \DropN, \Images on the TerraServer (AlphaServer 8400, Enterprise Storage Array of 3 x 108 9.1 GB drives, STK DLT tape library). Load steps: 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg, 45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta, 80 Update Place]
19
Technical Challenge: Key Idea
  • Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server).
  • Solution: a geo-spatial search key:
  • Divide the earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y)
  • Z-transform X & Y into a single Z value; build a B-tree on Z
  • Adjacent images are stored next to each other
  • Search method:
  • latitude and longitude → X, Y, then Z
  • select on matching Z value (see the sketch below)
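As a concrete illustration (not the TerraServer source), here is a minimal sketch of such a Z-transform: interleave the bits of the X and Y cell numbers (a Morton key) so that cells near each other on the earth tend to be near each other in the B-tree. The function names and 32-bit cell sizes are assumptions.

```cpp
#include <cstdint>

// Interleave the bits of x (even positions) and y (odd positions)
// to form a single Z value suitable for a B-tree key.
uint64_t ZTransform(uint32_t x, uint32_t y) {
    uint64_t z = 0;
    for (int i = 0; i < 32; ++i) {
        z |= (uint64_t)((x >> i) & 1) << (2 * i);      // even bits from X
        z |= (uint64_t)((y >> i) & 1) << (2 * i + 1);  // odd bits from Y
    }
    return z;
}

// Cell numbers on the 1/48-degree by 1/96-degree grid from the slide.
uint64_t GeoKey(double lonDeg, double latDeg) {
    uint32_t x = (uint32_t)((lonDeg + 180.0) * 48.0);  // 1/48th-degree columns
    uint32_t y = (uint32_t)((latDeg + 90.0) * 96.0);   // 1/96th-degree rows
    return ZTransform(x, y);
}
```
A range of Z values then covers a compact patch of the earth, so a plain B-tree range scan retrieves adjacent tiles together.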

20
Some Terabyte Databases
  • The Web: 1 TB of HTML
  • TerraServer: 1 TB of images
  • Several other 1 TB (file) servers
  • Hotmail: 7 TB of email
  • Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
  • EOS/DIS (a picture of the planet each week): 15 PB by 2007
  • Federal clearing house (images of checks): 15 PB by 2006 (7-year history)
  • Nuclear Stockpile Stewardship Program: 10 exabytes (???!!)

21
Info Capture
  • You can record everything you see or hear or read.
  • What would you do with it?
  • How would you organize & analyze it?

Video: 8 PB per lifetime (10 GB/h). Audio: 30 TB (10 KB/s). Read or write: 8 GB (words). See http://www.lesk.com/mlesk/ksg97/ksg.html
22
[Chart, on the kilo-mega-giga-tera-peta-exa-zetta-yotta scale: a letter, a novel, a movie, Library of Congress (text), LoC (image), LoC (sound & cinema), all photos, all disks, all tapes, all information!]
23
Michael Lesk's Points (www.lesk.com/mlesk/ksg97/ksg.html)
  • Soon everything can be recorded and kept
  • Most data will never be seen by humans
  • Precious resource: human attention. Auto-summarization & auto-search will be a key enabling technology.

24
Outline
  • What storage things are coming from Microsoft?
  • TerraServer: a 1 TB DB on the Web
  • Storage Metrics: Kaps, Maps, Gaps, Scans
  • The future of storage: ActiveDisks

25
Storage Latency: How Far Away is the Data?
[Pyramid, distance in clock ticks: registers 1; on-chip cache 2; on-board cache 10; memory 100; disk 10^6; tape/optical robot 10^9]
26
DataFlow Programming: Prefetch & Postwrite Hide Latency
Can't wait for the data to arrive (2,000 years!). Need a memory that gets the data in advance (~100 MB/s). Solution: pipeline data to/from the processor; pipe data from the source (tape, disc, ram...) to the cpu cache. A double-buffering sketch follows.
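A hypothetical double-buffering sketch of that pipeline (the file name, block size, and checksum stand-in for real work are all assumptions): the next block is fetched in the background while the cpu consumes the current one, so the processor never sits through the full device latency.

```cpp
#include <cstdio>
#include <future>
#include <vector>

int main() {
    std::FILE* f = std::fopen("input.dat", "rb");  // assumed input file
    if (!f) return 1;
    const size_t kBlock = 1 << 20;                 // 1 MB pipeline blocks
    std::vector<char> buf[2] = {std::vector<char>(kBlock),
                                std::vector<char>(kBlock)};
    size_t n = std::fread(buf[0].data(), 1, kBlock, f);   // prime the pipe
    for (int cur = 0; n > 0; cur ^= 1) {
        // Prefetch: start reading the NEXT block in the background...
        auto next = std::async(std::launch::async, [&, cur] {
            return std::fread(buf[cur ^ 1].data(), 1, kBlock, f);
        });
        // ...while the cpu works on the CURRENT block (checksum as stand-in).
        unsigned long sum = 0;
        for (size_t i = 0; i < n; ++i) sum += (unsigned char)buf[cur][i];
        std::printf("block checksum %lu\n", sum);
        n = next.get();                            // wait for the prefetch
    }
    std::fclose(f);
    return 0;
}
```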
27
MetaMessage: Technology Ratios Are Important
  • If everything gets faster & cheaper at the same rate, THEN nothing really changes.
  • Things getting MUCH BETTER:
  • communication speed & cost: 1,000x
  • processor speed & cost: 100x
  • storage size & cost: 100x
  • Things staying about the same:
  • speed of light (more or less constant)
  • people (10x more expensive)
  • storage speed (only 10x better)

28
Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
29
Storage Ratios Changed in Last 20 Years
  • Media price: 4000X; bandwidth: 10X; accesses/s: 10X
  • DRAM:DISK $/MB: 100:1 → 25:1
  • TAPE:DISK $/GB: 100:1 → 5:1

30
Storage Ratios Changed
  • DRAM/disk media price ratio:
  • 1970-1990: 100:1
  • 1990-1995: 10:1
  • 1995-1997: 50:1
  • today: $.15/MB disk, $5/MB DRAM
  • 4,000x lower media price
  • Capacity 100X, bandwidth 10X, accesses/s 10X
  • DRAM:DISK $/MB: 100:1 → 25:1
  • TAPE:DISK $/GB: 100:1 → 5:1

31
Disk Access Time
  • Access time = SeekTime + RotateTime + ReadTime: seek ~6 ms (improving ~5%/y), rotate ~3 ms (~5%/y), read ~1 ms (~25%/y)
  • Other useful facts:
  • Power rises faster than size^3 (so small is indeed beautiful)
  • Small devices are more rugged
  • Small devices can use plastics (forces are much smaller), e.g. bugs fall without breaking anything

32
Standard Storage Metrics
  • Capacity:
  • RAM: MB and $/MB; today ~100 MB at $1/MB
  • Disk: GB and $/GB; today ~10 GB at $50/GB
  • Tape: TB and $/TB; today ~0.1 TB at $10/GB (nearline)
  • Access time (latency):
  • RAM: 100 ns
  • Disk: 10 ms
  • Tape: 30-second pick, 30-second position
  • Transfer rate:
  • RAM: 1 GB/s
  • Disk: 5 MB/s (arrays can go to 1 GB/s)
  • Tape: 3 MB/s (not clear that striping works)

33
New Storage Metrics: Kaps, Maps, Gaps, SCANs
  • Kaps: how many kilobyte objects served per second
  • the file server, transaction processing metric
  • Maps: how many megabyte objects served per second
  • the Mosaic metric
  • Gaps: how many gigabyte objects served per hour
  • the video & EOSDIS metric
  • SCANs: how many scans of all the data per day
  • the data mining and utility metric
  • And $/Kaps, $/Maps, $/Gaps, $/SCAN (a back-of-envelope sketch follows)
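As a back-of-envelope illustration (the simple time-per-object model is an assumption; the device numbers come from the standard-metrics slide above):

```cpp
#include <cstdio>

// Model: time per object = access latency + object size / transfer rate.
struct Device { double latencySec, mbPerSec, capacityGB; };

void PrintMetrics(const Device& d, const char* name) {
    double kaps  = 1.0 / (d.latencySec + 0.001 / d.mbPerSec);      // 1 KB objects per second
    double maps  = 1.0 / (d.latencySec + 1.0 / d.mbPerSec);        // 1 MB objects per second
    double gaps  = 3600.0 / (d.latencySec + 1024.0 / d.mbPerSec);  // 1 GB objects per hour
    double scans = 86400.0 / (d.capacityGB * 1024.0 / d.mbPerSec); // full scans per day
    std::printf("%s: %.0f Kaps, %.1f Maps, %.1f Gaps, %.1f SCANs/day\n",
                name, kaps, maps, gaps, scans);
}

int main() {
    PrintMetrics({0.010, 5.0, 10.0}, "disk (10 ms, 5 MB/s, 10 GB)");
    PrintMetrics({60.0, 3.0, 100.0}, "tape (60 s pick+position, 3 MB/s, 100 GB)");
    return 0;
}
```
The model makes the next slide's point directly: big objects and scans are bandwidth-limited, so many small devices in parallel beat one big one.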

34
How To Get Lots of Maps, Gaps, SCANs
  • Parallelism: use many little devices in parallel

At 10 MB/s, 1 TB takes 1.2 days to scan; 1,000x parallel gives 100 seconds/scan.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
35
Tape & Optical: Beware of the Media Myth
Optical is cheap: $200/platter, 2 GB/platter → $100/GB (5x cheaper than disc).
Tape is cheap: $100/tape, 40 GB/tape → $2.50/GB (100x cheaper than disc).
36
Tape & Optical Reality: Media is 10% of System Cost
Tape needs a robot ($10k ... $3M): 10 ... 1,000 tapes (at 40 GB each) → $20/GB ... $200/GB (1x-10x cheaper than disc).
Optical needs a robot ($50k): 100 platters = 200 GB (TODAY) → $250/GB (more expensive than disc).
Robots have poor access times. Not good for Library of Congress (25 TB). Data motel: data checks in, but it never checks out!
37
The Access Time Myth
  • The myth: seek or pick time dominates
  • The reality: (1) queuing dominates,
  • (2) transfer dominates BLOBs,
  • (3) disk seeks are often short
  • Implication: many cheap servers are better than one fast expensive server
  • shorter queues
  • parallel transfer
  • lower cost/access and cost/byte
  • This is obvious for disk & tape arrays

38
My Solution to Tertiary Storage: Tape Farms, Not Mainframe Silos
[Diagram: 100 independent tape robots (like a disc farm): $1M total, 40 TB, $25/GB, 3K Maps, 1.5K Gaps, 2 SCANs, scan in 12 hours. Each $10k robot holds 10 tapes: 400 GB, 6 MB/s, $25/GB, 30 Maps, 15 Gaps, 2 SCANs]
39
The Metrics: Disk and Tape Farms Win
Data motel: data checks in, but it never checks out.
[Log-scale chart of Kaps, Maps, SCANs/day, and GB/$K for a 1,000x disc farm, a 100x DLT tape farm, and an STK tape robot (6,000 tapes, 8 readers): the farms win]
40
Cost Per Access (3-Year)
[Chart of Kaps/$, Maps/$, Gaps/$, and SCANs/$K for a 1,000x disc farm, an STK tape robot, and a 100x DLT tape farm (6,000 tapes, 16 readers); the disc farm wins on every metric]
41
Storage Ratios: Impact on Software
  • Gone from 512 B pages to 8192 B pages (will go to 64 KB pages in 2006)
  • Treat disks as tape:
  • increased use of sequential access
  • use disks for backup copies
  • Use tape for:
  • VERY COLD data, or
  • offsite archive
  • data interchange

42
Summary
  • Storage accesses are the bottleneck
  • Accesses are getting larger (Maps, Gaps, SCANs)
  • Capacity and cost are improving, BUT
  • latencies and bandwidth are not improving much, SO:
  • use parallel access (disk and tape farms)
  • use sequential access (scans)

43
The Memory Hierarchy
  • Measuring & modeling sequential IO
  • Where is the bottleneck?
  • How does it scale with SMP, RAID, new interconnects?

Goals: balanced bottlenecks, low overhead, scale to many processors (10s), scale to many disks (100s).
[Diagram: app address space and file cache in memory; memory bus → PCI → adapter → SCSI controller → disks]
44
Sequential IO: Your Mileage Will Vary
  • Measuring hardware & software
  • Looking for software fixes...
  • Aiming for the out-of-the-box half-power point: 50% of peak power out of the box
  • 40 MB/sec: advertised UW SCSI
  • 35r-23w MB/sec: actual disk transfer
  • 29r-17w MB/sec: 64 KB requests (NTFS)
  • 9 MB/sec: single-disk media
  • 3 MB/sec: 2 KB requests (SQL Server)

45
PAP (Peak Advertised Performance) vs RAP (Real Application Performance)
  • Goal: RAP = PAP / 2 (the half-power point)
[Diagram: application data (10-15 MBps) → file system buffers → PCI (133 MBps) → SCSI (40 MBps) → disk, with a measured 7.2 MB/s at each stage; system bus 422 MBps]
46
The Best Case: Temp File, NO IO
  • Temp file read/write = file-system cache
  • The program uses a small (in-cpu-cache) buffer.
  • So write/read time is the bus move time (3x better than copy).
  • Paradox: the fastest way to move data is to write it, then read it.
  • This hardware is limited to 150 MBps per processor.

47
Bottleneck Analysis
  • Drawn to linear scale:
Theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits
Memory read/write: 150 MBps
MemCopy: 50 MBps
Disk R/W: 9 MBps
48
3 Stripes and You're Out!
  • CPU time goes down with request size
  • Ftdisk (striping) is cheap
  • 3 disks can saturate an adapter
  • Similar story with UltraWide

49
Parallel SCSI Busses Help
  • A second SCSI bus nearly doubles (~2x) read and WCE (write-cache-enabled) throughput
  • Write needs deeper buffers
  • The experiment is unbuffered (3-deep WCE)
50
File System Buffering & Stripes (UltraWide Drives)
  • FS buffering helps small reads
  • FS buffered writes peak at 12 MBps
  • 3-deep async helps
  • write peaks at 20 MBps
  • read peaks at 30 MBps

51
PAP vs RAP
  • Reads are easy, writes are hard
  • Async write can match WCE.
[Diagram: the same pipeline annotated with measured rates: system bus 422 MBps (142 MBps achieved), PCI 133 MBps (72 MBps), SCSI 40 MBps (31 MBps), disks 9 MBps, application data 10-15 MBps]
52
Bottleneck Analysis
  • NTFS read/write with 9 disks, 2 SCSI busses, 1 PCI:
  • 65 MBps unbuffered read
  • 43 MBps unbuffered write
  • 40 MBps buffered read
  • 35 MBps buffered write
[Diagram: memory read/write 150 MBps; PCI 70 MBps; each adapter 30 MBps]
53
Peak Throughput on Intel/NT
  • NTFS read/write with 24 disks, 4 SCSI busses, 2 PCI (64-bit):
  • 190 MBps unbuffered read
  • 95 MBps unbuffered write
  • so 0.8 TB/hr read, 0.4 TB/hr write
  • on a $25k server.
54
Penny Sort Ground Rules (http://research.microsoft.com/barc/SortBenchmark)
  • How much can you sort for a penny?
  • Hardware and software cost,
  • depreciated over 3 years:
  • a $1M system gets about 1 second,
  • a $1K system gets about 1,000 seconds.
  • Time (seconds) = 946,080 / SystemPrice ($), since 3 years is ~94,608,000 seconds and a penny is 1/100 of a dollar.
  • Input and output are disk resident
  • Input is:
  • 100-byte records (random data),
  • key is the first 10 bytes.
  • Must create the output file and fill it with a sorted version of the input file.
  • Daytona (product) and Indy (special) categories. (A worked example follows.)
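The budget arithmetic as a tiny program (the system price is an assumed example):

```cpp
#include <cstdio>

int main() {
    // 3 years = 94,608,000 seconds; a penny buys 1/100 of a dollar of that,
    // so a system priced at P dollars gets 946,080 / P seconds per penny.
    double priceDollars = 1500.0;              // assumed system price
    double seconds = 946080.0 / priceDollars;  // time budget per penny
    std::printf("a $%.0f system gets %.0f seconds per penny\n",
                priceDollars, seconds);
    return 0;
}
```
At $1,500 this gives about 630 seconds, the same scale as the 820-second PennySort run on the next slide.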

55
PennySort
  • Hardware:
  • 266 MHz Intel PPro
  • 64 MB SDRAM (10 ns)
  • dual Fujitsu DMA 3.2 GB EIDE disks
  • Software:
  • NT Workstation 4.3
  • NT 5 sort
  • Performance:
  • sorts 15 M 100-byte records (1.5 GB)
  • disk to disk
  • elapsed time: 820 sec
  • cpu time: 404 sec

56
Cluster Sort: Conceptual Model
  • Multiple data sources
  • Multiple data destinations
  • Multiple nodes
  • Disks → Sockets → Disk → Disk
[Diagram: nodes A, B, and C each start with a mix of records (AAA BBB CCC) and exchange over sockets so each node ends up with only its own key range; a partitioning sketch follows]
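A sketch of the partitioning step implied by the picture; the node-selection rule (and the assumption of uniformly distributed keys) is illustrative, not the actual cluster-sort code:

```cpp
// Route a 100-byte PennySort record to the node that owns its key range.
// The key is the first 10 bytes; with uniform random keys, the leading
// byte alone gives an even range partition across the nodes.
int DestinationNode(const unsigned char* record100, int nodes) {
    return record100[0] * nodes / 256;  // byte 0 picks the key range
}
```
Each source node streams every record to its DestinationNode over a socket; each destination sorts its slice locally, and the concatenation of the slices is the sorted output.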
57
Cluster Install & Execute
  • If this is to be used by others, it must be:
  • easy to install
  • easy to execute
  • Installations of distributed systems take time and can be tedious (AM2, GluGuard).
  • Parallel remote execution is non-trivial (GLUnix, LSF).
  • How do we keep this simple and built in to NTClusterSort?

58
Remote Install
  • Add a Registry entry to each remote node, using
RegConnectRegistry() and RegCreateKeyEx(), as sketched below.
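A minimal sketch of that step with the two Win32 calls named above; the machine name and key path are hypothetical, and error handling is trimmed:

```cpp
#include <windows.h>

bool AddRemoteRegistryEntry(const wchar_t* machine) {  // e.g. L"\\\\node01"
    HKEY remoteRoot = nullptr;
    // Connect to HKEY_LOCAL_MACHINE on the remote node.
    if (RegConnectRegistryW(machine, HKEY_LOCAL_MACHINE, &remoteRoot) !=
        ERROR_SUCCESS)
        return false;
    HKEY key = nullptr;
    // Create (or open) an assumed ClusterSort key on that node.
    LONG rc = RegCreateKeyExW(remoteRoot, L"SOFTWARE\\ClusterSort", 0, nullptr,
                              REG_OPTION_NON_VOLATILE, KEY_WRITE, nullptr,
                              &key, nullptr);
    if (rc == ERROR_SUCCESS) RegCloseKey(key);
    RegCloseKey(remoteRoot);
    return rc == ERROR_SUCCESS;
}
```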
59
Cluster Execution
  • Setup:
  • a MULTI_QI struct
  • a COSERVERINFO struct
  • CoCreateInstanceEx()
  • Retrieve the remote object handle from the MULTI_QI struct
  • Invoke methods as usual (see the sketch below)
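Putting those steps together, a hypothetical sketch (the CLSID/IID and node name are placeholders; CoInitializeEx and error paths are elided):

```cpp
#include <objbase.h>

HRESULT CreateRemoteSorter(const wchar_t* node, REFCLSID clsid, REFIID iid,
                           IUnknown** out) {
    COSERVERINFO server = {};                // which machine to activate on
    server.pwszName = const_cast<wchar_t*>(node);
    MULTI_QI qi = {};                        // which interface we want back
    qi.pIID = &iid;
    HRESULT hr = CoCreateInstanceEx(clsid, nullptr, CLSCTX_REMOTE_SERVER,
                                    &server, 1, &qi);
    if (SUCCEEDED(hr) && SUCCEEDED(qi.hr))
        *out = qi.pItf;                      // the remote object handle
    return FAILED(hr) ? hr : qi.hr;
}
// After this, methods on *out are invoked as usual; DCOM marshals the calls.
```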

60
Outline
  • What storage things are coming from Microsoft?
  • TerraServer a 1 TB DB on the Web
  • Storage Metrics Kaps, Maps, Gaps, Scans
  • The future of storage ActiveDisks

61
Crazy Disk Ideas
  • Disk farm on a card: surface-mount disks
  • Disk (magnetic store) on a chip (micro-machines in silicon)
  • NT and BackOffice in the disk controller (a processor with 100 MB DRAM)
62
Remember Your Roots
63
Year 2002 Disks
  • Big disk ($10/GB):
  • 3"
  • 100 GB
  • 150 kaps (k accesses per second)
  • 20 MBps sequential
  • Small disk ($20/GB):
  • 3"
  • 4 GB
  • 100 kaps
  • 10 MBps sequential
  • Both running Windows NT 7.0? (see below for why)

64
The Disk Farm On a Card
  • The 1 TB disc card (14"):
  • an array of discs
  • Can be used as:
  • 100 discs
  • 1 striped disc
  • 10 fault-tolerant discs
  • ...etc
  • LOTS of accesses/second and bandwidth

Life is cheap, it's the accessories that cost ya. Processors are cheap, it's the peripherals that cost ya (a $10k disc card).
65
Put Everything in Future (Disk) Controllers (it's not if, it's when?)
Acknowledgements: Dave Patterson explained this to me a year ago. Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.
66
Technology Drivers: Disks
  • Disks are on track for 100x in 10 years: a 2 TB 3.5" drive
  • Shrunk to 1" that is 200 GB
  • Disk replaces tape?
  • The disk is a super-computer!

67
Data Gravity: Processing Moves to Transducers (to the data sources & sinks)
  • Move processing to the data sources
  • Move to where the power (and sheet metal) is
  • Processor in:
  • the modem
  • the display
  • microphones (speech recognition) & cameras (vision)
  • storage: data storage and analysis

68
It's Already True of Printers: Peripheral = CyberBrick
  • You buy a printer
  • You get:
  • several network interfaces
  • a PostScript engine:
  • cpu,
  • memory,
  • software,
  • a spooler (soon)
  • and a print engine.

69
Functionally Specialized Cards
  • Storage
  • Network
  • Display
(P = mips of processor, M = MB of DRAM, plus an ASIC.) Today: P = 50 mips, M = 2 MB. In a few years: P = 200 mips, M = 64 MB.
70
All Device Controllers will be Cray 1s
  • TODAY:
  • the disk controller is a 10 mips risc engine with 2 MB DRAM
  • the NIC has similar power
  • SOON:
  • they will become 100 mips systems with 100 MB DRAM.
  • They are nodes in a federation (can run Oracle on NT in the disk controller).
  • Advantages:
  • uniform programming model
  • great tools
  • security
  • economics (CyberBricks)
  • move computation to data (minimize traffic)
[Diagram: central processor & memory on a terabyte backplane of device controllers]
71
Basic Argument for x-Disks
  • The future disk controller is a super-computer:
  • a 1 bips processor
  • 128 MB DRAM
  • 100 GB of disk plus one arm
  • Connects to the SAN via high-level protocols:
  • RPC, HTTP, DCOM, Kerberos, Directory Services, ...
  • Commands are RPCs:
  • management, security, ...
  • Services file/web/db/... requests
  • Managed by a general-purpose OS with a good dev environment
  • Apps in the disk save data movement
  • but need a programming environment in the controller

72
The Slippery Slope
  • If you add function to the server,
  • then you add more function to the server.
  • Function gravitates to data.
[Spectrum: Nothing = sector server → Something = fixed-function app server → Everything = app server]
73
Why Not a Sector Server? (let's get physical!)
  • Good idea, that's what we have today.
  • But:
  • a cache was added for performance
  • sector remap was added for fault tolerance
  • error reporting and diagnostics were added
  • SCSI commands (reserve, ...) are growing
  • sharing is problematic (space mgmt, security, ...)
  • Slipping down the slope to a 1-D block server

74
Why Not a 1-D Block Server? Put A LITTLE on the Disk Server
  • Tried and true design:
  • HSC - VAX cluster
  • EMC
  • IBM Sysplex (3980?)
  • But look inside:
  • has a cache
  • has space management
  • has error reporting & management
  • has RAID 0, 1, 2, 3, 4, 5, 10, 50, ...
  • has locking
  • has remote replication
  • has an OS
  • Security is problematic
  • The low-level interface moves too many bytes

75
Why Not a 2-D Block Server? Put A LITTLE on the Disk Server
  • Tried and true design:
  • Cedar → NFS:
  • file server, cache, space, ...
  • open-file is many fewer msgs
  • Grows to have:
  • directories & naming
  • authentication & access control
  • RAID 0, 1, 2, 3, 4, 5, 10, 50, ...
  • locking
  • backup/restore/admin
  • cooperative caching with the client
  • File servers are a BIG hit: NetWare
  • SNAP! is my favorite today

76
Why Not a File Server? Put a Little on the Disk Server
  • Tried and true design:
  • Auspex, NetApp, ...
  • NetWare
  • Yes, but look at NetWare:
  • the file interface gives you an app-invocation interface
  • it became an app server:
  • mail, DB, web, ...
  • NetWare had a primitive OS:
  • hard to program, so it optimized the wrong thing

77
Why Not Everything? Allow Everything on the Disk Server (thin clients)
  • Tried and true design:
  • mainframes, minis, ...
  • web servers, ...
  • Encapsulates data
  • Minimizes data moves
  • Scaleable
  • It is where everyone ends up.
  • All the arguments against are short-term.

78
The Slippery Slope
  • If you add function to the server,
  • then you add more function to the server.
  • Function gravitates to data.
[Spectrum: Nothing = sector server → Something = fixed-function app server → Everything = app server]
79
Disk Node
  • has magnetic storage (100 GB?)
  • has a processor & DRAM
  • has a SAN attachment
  • has an execution environment
[Software stack: Applications / Services / DBMS / File System / RPC, ... / SAN driver / Disk driver / OS Kernel]
80
Technology Drivers: System on a Chip
  • Integrate processing with memory on one chip:
  • the chip is 75% memory now
  • 1 MB cache >> 1960s supercomputers
  • a 256 Mb memory chip is 32 MB!
  • IRAM, CRAM, PIM projects abound
  • Integrate networking with processing on the chip:
  • the system bus is a kind of network
  • ATM, FiberChannel, Ethernet, ... logic on chip
  • direct IO (no intermediate bus)
  • Functionally specialized cards shrink to a chip.

81
How Do They Talk to Each Other?
  • Each node has an OS
  • Each node has local resources: a federation.
  • Each node does not completely trust the others.
  • Nodes use RPC to talk to each other:
  • CORBA? DCOM? IIOP? RMI?
  • One or all of the above.
  • Huge leverage in high-level interfaces.
  • Same old distributed-system story.
[Diagram: two application stacks, each speaking datagrams, streams, and RPC over VIAL/VIPL, joined by the wire(s)]
82
Technology Drivers: What if Networking Were as Cheap as Disk IO?
  • Disk IO: Unix/NT: 8% of a cpu @ 40 MBps
  • TCP/IP: Unix/NT: 100% of a cpu @ 40 MBps

83
Technology Drivers: The Promise of SAN/VIA: 10x in 2 years (http://www.ViArch.org/)
  • Today:
  • wires are 10 MBps (100 Mbps Ethernet)
  • 20 MBps tcp/ip saturates 2 cpus
  • round-trip latency is ~300 µs
  • In the lab:
  • wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, ...
  • fast user-level communication:
  • tcp/ip at 100 MBps uses 10% of each processor
  • round-trip latency is ~15 µs

84
SAN: Standard Interconnect
  • A LAN faster than the memory bus?
  • 1 GBps links in the lab.
  • $100 port cost soon.
  • The port is the computer.
[Bandwidth ladder: Gbps Ethernet 110 MBps > PCI 70 MBps > UW SCSI 40 MBps > FW SCSI 20 MBps > SCSI 5 MBps]
85
Technology Drivers: Gbps Ethernet replaces SCSI
  • Why I love SCSI:
  • it's fast (30 MBps (ultra) to 100 MBps (ultra3))
  • the protocol uses little processor power
  • Why I hate SCSI:
  • wires must be short
  • cables are pricey
  • pins bend

86
Technology Drivers: Plug & Play Software
  • RPC is standardizing (DCOM, IIOP, HTTP):
  • gives huge TOOL LEVERAGE
  • solves the hard problems for you:
  • naming,
  • security,
  • directory service,
  • operations, ...
  • Commoditized programming environments:
  • FreeBSD, Linux, Solaris, + tools
  • NetWare + tools
  • WinCE, WinNT, + tools
  • JavaOS + tools
  • Apps gravitate to data.
  • A general-purpose OS on the controller runs the apps.

87
Basic Argument for x-Disks
  • The future disk controller is a super-computer:
  • a 1 bips processor
  • 128 MB DRAM
  • 100 GB of disk plus one arm
  • Connects to the SAN via high-level protocols:
  • RPC, HTTP, DCOM, Kerberos, Directory Services, ...
  • Commands are RPCs:
  • management, security, ...
  • Services file/web/db/... requests
  • Managed by a general-purpose OS with a good dev environment
  • Move apps to the disk to save data movement
  • but need a programming environment in the controller

88
Outline
  • What storage things are coming from Microsoft?
  • TerraServer: a 1 TB DB on the Web
  • Storage Metrics: Kaps, Maps, Gaps, Scans
  • The future of storage: ActiveDisks
  • Papers and talks at http://research.Microsoft.com/~Gray