Storage Bricks
Jim Gray, Microsoft Research
http://Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA, 29 Jan 2002

Acknowledgements: Dave Patterson explained this to me long ago.
Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen
helped me sharpen these arguments.
2. First Disk: 1956
- IBM 305 RAMAC
- 4 MB
- 50 x 24" disks
- 1200 rpm
- 100 ms access
- $35k/y rent
- Included computer and accounting software (tubes, not transistors)
3. 10 Years Later
[Photo: disk drive, 1.6 meters]
4. Disk Evolution
Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
- Capacity: 100x in 10 years; 1 TB 3.5" drive in 2005; 20 GB 1" micro-drive
- System on a chip
- High-speed SAN
- Disk replacing tape
- Disk is a supercomputer!
5. Disks Are Becoming Computers
- Smart drives
- Camera with micro-drive
- Replay / TiVo / Ultimate TV
- Phone with micro-drive
- MP3 players
- Tablet
- Xbox
- Many more
[Diagram: Applications (Web, DBMS, Files) / OS / Disk Ctlr with 1 GHz CPU and 1 GB RAM / Comm: Infiniband, Ethernet, radio]
6. Data Gravity: Processing Moves to Transducers
- Smart displays, microphones, printers, NICs, disks
- Processing decentralized
  - Moving to data sources
  - Moving to power sources
  - Moving to sheet metal?
- The end of computers?
7. It's Already True of Printers: Peripheral = CyberBrick
- You buy a printer
- You get:
  - several network interfaces
  - a PostScript engine
  - CPU, memory, software
  - a spooler (soon)
  - and a print engine.
8. The Absurd Design?
- Segregate processing from storage
- Poor locality
- Much useless data movement
- Amdahl's laws: bus = 10 B/ips, I/O = 1 b/ips
[Diagram: Processors (1 Tips) connected by a 10 TBps bus and 100 GBps I/O to Disks (100 TB)]
9. The Absurd Disk
- 2.5 hr scan time (poor sequential access)
- 1 aps / 5 GB (VERY cold data)
- It's a tape!
- Optimizations:
  - Reduce management costs
  - Caching
  - Sequential access 100x faster than random
[Diagram: disk: 1 TB, 100 MB/s, 200 Kaps]
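The slide's figures are easy to check with a back-of-envelope calculation. A sketch (my arithmetic, not from the slide), treating the 200 Kaps label as roughly 200 random accesses/s, consistent with the "1 aps / 5 GB" bullet:

```python
# Back-of-envelope check of the "absurd disk" figures.
capacity_gb = 1000        # 1 TB drive
seq_rate_mb_s = 100       # sequential transfer rate
accesses_per_s = 200      # random accesses the arm can deliver

scan_hours = capacity_gb * 1000 / seq_rate_mb_s / 3600  # full-scan time
gb_per_access = capacity_gb / accesses_per_s            # access density

print(f"scan: {scan_hours:.1f} h, 1 aps per {gb_per_access:.0f} GB")
```

The scan works out to about 2.8 hours (the slide rounds to 2.5), and each access-per-second has to cover 5 GB of capacity: the data really is cold.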
10. Disk Node
- Magnetic storage (1 TB)
- Processor, RAM, LAN
- Management interface (HTTP, SOAP)
- Application execution environment
- Applications:
  - File
  - DB2/Oracle/SQL
  - Notes/Exchange/TeamServer
  - SAP/Siebel/...
  - QuickBooks / TiVo / PC...
[Diagram: software stack on the node: Applications / Services / DBMS / File System / RPC, ... / LAN driver / Disk driver / OS Kernel]
11. Implications
Conventional:
- Offload device handling to NIC/HBA
- Higher-level protocols: I2O, NASD, VIA, IP, TCP
- SMP and cluster parallelism is important.
Radical:
- Move the app to the NIC/device controller
- Higher-higher level protocols: SOAP/DCOM/RMI...
- Cluster parallelism is VERY important.
12. Intermediate Step: Shared Logic
- Brick with 8-12 disk drives
- 200 mips/arm (or more)
- 2 x Gbps Ethernet
- General-purpose OS
- $10k/TB to $50k/TB
- Shared:
  - Sheet metal
  - Power
  - Support/Config
  - Security
  - Network ports
- These bricks could run applications (e.g. SQL or Mail or...)
Examples: Snap 1 TB (12 x 80 GB) NAS; NetApp 0.5 TB (8 x 70 GB) NAS; Maxtor 2 TB (12 x 160 GB) NAS
13. Example
- Homogeneous machines lead to quick response through reallocation
- HP desktop machines, 320 MB RAM, 3U high, 4 x 100 GB IDE drives
- $4k/TB (street)
- 2.5 processors/TB, 1 GB RAM/TB
- JIT storage and processing: 3 weeks from order to deploy
Slide courtesy of Brewster Kahle, at Archive.org
14. What if Disk Replaces Tape? How Does It Work?
- Backup/Restore
  - RAID (among the federation)
  - Snapshot copies (in most OSs)
  - Remote replicas (standard in DBMS and FS)
- Archive
  - Use the cold 95% of disk space
- Interchange
  - Send computers, not disks.
15. It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
- At 1 GBps it takes 12 days!
- Store it in two (or more) places online: a geo-plex
- Scrub it continuously (look for errors)
- On failure:
  - use the other copy until the failure is repaired,
  - refresh the lost copy from the safe copy.
- Can organize the two copies differently (e.g. one by time, one by space)
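The restore-time arithmetic behind the first bullet, as a one-line calculation (mine, not the slide's):

```python
# A petabyte restored over a 1 GB/s pipe.
petabyte_bytes = 10**15
rate_bytes_s = 10**9      # 1 GBps
days = petabyte_bytes / rate_bytes_s / 86400

print(f"{days:.1f} days")  # ~11.6, which the slide rounds to 12
```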
16. Archive to Disk: 100 TB for $0.5M, 1.5 Free Petabytes
- If you have 100 TB active you need 10,000 mirrored disk arms (see TPC-C)
- So you have 1.6 PB of (mirrored) storage (160 GB drives)
- Use the empty 95% for archive storage.
- No extra space or extra power cost.
- Very fast access (milliseconds vs. hours).
- Snapshot is read-only (software enforced?)
- Makes admin easy (saves people costs)
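The capacity arithmetic above, spelled out (my calculation from the slide's figures):

```python
# 10,000 arms are needed for performance, not capacity, so most bytes are free.
arms = 10_000
drive_gb = 160
raw_pb = arms * drive_gb / 1_000_000   # 1.6 PB of raw (mirrored) storage
active_pb = 0.1                        # the 100 TB of hot data
free_pb = raw_pb - active_pb           # ~1.5 PB left over for archive
used_pct = 100 * active_pb / raw_pb    # ~6% used, hence "the empty 95%"

print(f"{raw_pb} PB raw, {free_pb:.1f} PB free, {used_pct:.1f}% used")
```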
17. Disk as Tape: Archive
Slide courtesy of Brewster Kahle, at Archive.org
- Tape is unreliable, specialized, slow, low density, not improving fast, and expensive
- Using removable hard drives to replace tape's function has been successful
- When a "tape" is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.
- Portable, durable, fast, media cost comparable to raw tapes, dense. Longevity unknown, but suspected good.
18. Disk as Tape: Interchange
- Tape interchange is frustrating (often unreadable)
- Beyond 1-10 GB, send media, not data
  - FTP takes too long (an hour/GB)
  - Bandwidth is still very expensive ($1/GB)
  - Writing a DVD is not much faster than the Internet
- New technology could change this
  - A 100 GB DVD at 10 MBps would be competitive.
- Write a 1 TB disk in 2.5 hrs (at 100 MBps)
- But how does interchange work?
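Putting the slide's rates side by side makes the media-vs-network argument concrete (a sketch; the helper name is mine):

```python
# Media vs. network, using the rates quoted on the slide.
def transfer_hours(gb, mb_per_s):
    """Hours to move `gb` gigabytes at `mb_per_s` megabytes/second."""
    return gb * 1000 / mb_per_s / 3600

ftp_hours_per_tb = 1000 * 1.0                   # "an hour per GB" over FTP
disk_hours_per_tb = transfer_hours(1000, 100)   # fill a 1 TB disk at 100 MBps
dvd_hours = transfer_hours(100, 10)             # 100 GB DVD at 10 MBps

print(f"FTP: {ftp_hours_per_tb:.0f} h/TB, disk: {disk_hours_per_tb:.1f} h/TB, "
      f"100 GB DVD: {dvd_hours:.1f} h")
```

A terabyte over FTP is about six weeks; writing the disk locally is under three hours, so shipping the written disk wins by orders of magnitude.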
19. Disk as Tape Interchange: What Format?
- Today I send 160 GB NTFS/SQL disks.
- But that is not a good format for Linux/DB2 users.
- Solution: ship NFS/CIFS/ODBC servers (not disks)
  - Plug the disk into the LAN.
  - DHCP, then it is a file or DB server via a standard interface.
  - Pull data from the server.
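The "ship a server, pull the data" flow can be sketched in miniature. This is purely illustrative: HTTP stands in for NFS/CIFS/ODBC, a temp directory stands in for the shipped disk, and all names here are my own, not a real appliance API:

```python
import functools
import http.server
import os
import socketserver
import tempfile
import threading
import urllib.request

def serve_appliance(directory):
    """Appliance side: export a directory over HTTP; return (server, port)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    server = socketserver.TCPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]

def pull(port, name):
    """Receiver side: pull one object through the standard interface."""
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/{name}") as resp:
        return resp.read()

# Demo: the "appliance" is a temp directory holding one file.
with tempfile.TemporaryDirectory() as disk:
    with open(os.path.join(disk, "payload.bin"), "wb") as f:
        f.write(b"archived data")
    server, port = serve_appliance(disk)
    data = pull(port, "payload.bin")
    server.shutdown()
```

The point of the design is that the receiver never has to understand NTFS or SQL on-disk formats; it only speaks the wire protocol the appliance exposes.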
20. Some Questions
- What is the product?
- How do I manage 10,000 nodes (disks)?
- How do I program 10,000 nodes (disks)?
- How does RAID work?
- How do I backup a PB?
- How do I restore a PB?
21. What Is the Product?
- Concept: Plug it in and it works!
- Music/Video/Photo appliance (home)
- Game appliance
- PC
- File server appliance
- Data archive/interchange appliance
- Web server appliance
- DB server
- eMail appliance
- Application appliance
[Diagram: the appliance's only connections are network and power]
22. How Does Scale-Out Work?
- Files: well-known designs
  - Rooted tree partitioned across nodes
  - Automatic cooling (migration)
  - Mirrors or chained declustering
  - Snapshots for backup/archive
- Databases: well-known designs
  - Partitioning, remote replication similar to files
  - Distributed query processing
- Applications (hypothetical)
  - Must be designed as mobile objects
  - Middleware provides an object migration system
  - Objects externalize methods to migrate (backup/restore/archive)
  - Web services seem to have the key ideas (XML representation)
  - Example: the eMail object is a mailbox
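Chained declustering, one of the mirroring schemes named above, can be sketched as a placement rule: partition i's primary copy sits on node i mod n and its backup on the next node in the chain, so a single failure spreads the extra read load over the survivors instead of doubling one mirror's work. A minimal sketch (function names are mine):

```python
def placement(partition, n_nodes):
    """(primary, backup) nodes for a partition under chained declustering."""
    primary = partition % n_nodes
    backup = (primary + 1) % n_nodes
    return primary, backup

def route_read(partition, n_nodes, failed=frozenset()):
    """Send a read to the primary copy, falling back to the chained backup."""
    primary, backup = placement(partition, n_nodes)
    if primary not in failed:
        return primary
    if backup not in failed:
        return backup
    raise RuntimeError("both replicas unavailable")
```

For example, with 4 nodes, partition 3's primary is node 3 and its backup is node 0; if node 3 fails, reads for partition 3 shift to node 0 while the other nodes keep serving their own primaries.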
23. Auto-Manage Storage
- 1980 rule of thumb:
  - A DataAdmin per 10 GB, a SysAdmin per mips
- 2000 rule of thumb:
  - A DataAdmin per 5 TB
  - A SysAdmin per 100 clones (varies with app)
- Problem:
  - 5 TB is $50k today, $5k in a few years.
  - Admin cost >> storage cost!!!!
- Challenge:
  - Automate ALL storage admin tasks
24. Admin $/TB, Guessed (does not include application cost; overhead, not substance)
- Google: ~100 TB, $5k/TB/y
- Yahoo!: ~50 TB, $20k/TB/y
- DB: ~5 TB, $60k/TB/y
- Wall St.: ~1 TB, $400k/TB/y (reported)
- Hardware is the dominant cost only at Google.
- How can we waste hardware to save people cost?
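A quick check of the "hardware dominant only at Google" claim, assuming a brick hardware price of $10k/TB (the low end quoted earlier in the deck); the dictionary just restates the slide's figures:

```python
# Yearly admin cost per TB (from the slide) vs. an assumed $10k/TB brick price.
admin_per_tb_year = {
    "Google": 5_000, "Yahoo!": 20_000, "DB": 60_000, "Wall St.": 400_000}
hardware_per_tb = 10_000  # assumption: low end of the $10k-$50k/TB range

hw_dominant = [site for site, admin in admin_per_tb_year.items()
               if admin < hardware_per_tb]

print(hw_dominant)  # only Google's admin cost is below the hardware cost
```

Everywhere else, a year of administration costs more than the hardware itself, which is what makes "waste hardware to save people" rational.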
25. How Do I Manage 10,000 Nodes?
- You can't manage 10,000 x (for any x).
- They manage themselves. You manage exceptional exceptions.
- Auto-manage:
  - Plug & Play hardware
  - Auto load-balance placement of storage and processing
  - Simple parallel programming model
  - Fault masking
26. How Do I Program 10,000 Nodes?
- You can't program 10,000 x (for any x).
- They program themselves. You write embarrassingly parallel programs.
- Examples: SQL, Web, Google, Inktomi, HotMail, ...
- PVM and MPI prove it must be automatic (unless you have a PhD)!
- Auto-parallelism is ESSENTIAL
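"Embarrassingly parallel" in miniature: the same scan function runs independently against each node's partition with no cross-talk, and the per-node results fold together at the end. A thread pool stands in for the cluster fabric here; the function and partitioning are illustrative, not from the slide:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition):
    """Per-node work: count qualifying records in the local partition."""
    return sum(1 for record in partition if record % 2 == 0)

# Ten "nodes", each owning a disjoint slice of the keyspace 0..999.
partitions = [range(i, i + 100) for i in range(0, 1000, 100)]

with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(scan_partition, partitions))

total = sum(counts)  # fold the per-node results: 500 even numbers in 0..999
```

Because no partition ever reads another's data, adding nodes just adds slices to the map; this is the structure SQL, web serving, and HotMail-style workloads share.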
27. Summary
- Disks will become supercomputers, so:
  - Lots of computing to optimize the arm
  - Can put the app close to the data (better modularity, locality)
  - Storage appliances (self-organizing)
- The arm/capacity tradeoff: waste space to save accesses.
  - Compression (saves bandwidth)
  - Mirrors
  - Online backup/restore
  - Online archive (vault to other drives or a geo-plex if possible)
- Not "disks replace tapes": storage appliances replace tapes.
- Self-organizing storage servers (file systems) (prototypes of this software exist)