1
Storage Bricks
Jim Gray, Microsoft Research
http://Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA, 29 Jan 2002
Acknowledgements: Dave Patterson explained this to me long ago.
Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.
2
First Disk 1956
  • IBM 305 RAMAC
  • 4 MB
  • 50 x 24" disks
  • 1200 rpm
  • 100 ms access
  • $35k/y rent
  • Included computer and accounting software (tubes,
    not transistors)

3
10 years later
1.6 meters
4
Disk Evolution
Kilo Mega Giga Tera Peta Exa Zetta Yotta
  • Capacity: 100x in 10 years; 1 TB 3.5" drive in
    2005; 20 GB 1" micro-drive
  • System on a chip
  • High-speed SAN
  • Disk replacing tape
  • Disk is super computer!

5
Disks are becoming computers
  • Smart drives
  • Camera with micro-drive
  • Replay / Tivo / Ultimate TV
  • Phone with micro-drive
  • MP3 players
  • Tablet
  • Xbox
  • Many more

[Diagram: the smart drive's stack: Applications (Web, DBMS, Files), OS, Disk Ctlr (1 GHz CPU, 1 GB RAM), Comm (Infiniband, Ethernet, radio)]
6
Data Gravity: Processing Moves to Transducers
(smart displays, microphones, printers, NICs, disks)
Processing is decentralizing: moving to the data
sources, moving to the power sources, moving into the
sheet metal. The end of computers?
  • Storage
  • Network
  • Display

7
It's Already True of Printers: Peripheral =
CyberBrick
  • You buy a printer
  • You get:
  • several network interfaces
  • a PostScript engine
  • cpu,
  • memory,
  • software,
  • a spooler (soon)
  • and a print engine.

8
The Absurd Design?
  • Segregate processing from storage
  • Poor locality
  • Much useless data movement
  • Amdahl's laws: bus = 10 B/ips, I/O = 1 b/ips (checked in the sketch below)

[Diagram: processors (1 Tips) segregated from disks (100 TB); Amdahl's laws call for 10 TBps of bus bandwidth and roughly 100 GBps of I/O]
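A back-of-the-envelope check of those Amdahl ratios, sketched in Python; the 1 Tips aggregate processing rate is the diagram's number, not an independent figure:

    # Amdahl's rules of thumb from the slide: 10 bytes of bus bandwidth and
    # 1 bit of I/O per instruction per second.
    ips = 1e12                    # 1 Tips of aggregate processing (from the diagram)
    bus_bytes_per_s = 10 * ips    # 10 B/ips -> 1e13 B/s = 10 TBps of bus bandwidth
    io_bytes_per_s = ips / 8      # 1 b/ips  -> ~125 GBps, which the diagram rounds to 100 GBps
    print(f"bus: {bus_bytes_per_s/1e12:.0f} TBps, io: {io_bytes_per_s/1e9:.0f} GBps")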
9
The Absurd Disk
  • 2.5 hr scan time (poor sequential access)
  • 1 aps / 5 GB (VERY cold data)
  • It's a tape!
  • Optimizations
  • Reduce management costs
  • Caching
  • Sequential access is 100x faster than random

[Drive pictured, labeled: 200, 1 TB, 100 MB/s, 200 Kaps]
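A small sketch of the arithmetic behind the scan-time and access-density claims, using the drive figures above:

    # A ~2005 drive: 1 TB capacity, 100 MB/s sequential, ~200 accesses per second.
    capacity_bytes = 1e12
    seq_bw = 100e6                  # bytes/second
    aps = 200                       # random accesses per second
    scan_hours = capacity_bytes / seq_bw / 3600
    gb_per_access = (capacity_bytes / 1e9) / aps
    print(f"full scan: {scan_hours:.1f} hours")      # ~2.8 h (the slide rounds to 2.5)
    print(f"1 access/s per {gb_per_access:.0f} GB")  # 1 aps per 5 GB: very cold data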
10
Disk Node
  • magnetic storage (1TB)
  • processor, RAM, LAN
  • Management interface (HTTP SOAP)
  • Application execution environment
  • Application
  • File
  • DB2/Oracle/SQL
  • Notes/Exchange/TeamServer
  • SAP/Siebel/...
  • QuickBooks / TiVo / PC...

[Disk-node software stack: Applications; Services; DBMS; File System; RPC, ...; LAN driver; Disk driver; OS Kernel]
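As an illustration of the "Management interface (HTTP SOAP)" bullet, here is a minimal sketch of what a per-node status endpoint might look like; the /status path and the fields it returns are hypothetical, not part of the talk:

    # Minimal sketch of a per-disk-node management endpoint (hypothetical /status path).
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class MgmtHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/status":
                body = json.dumps({"capacity_tb": 1, "free_pct": 95, "state": "ok"}).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("", 8080), MgmtHandler).serve_forever()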
11
Implications
Conventional
  • Offload device handling to NIC/HBA
  • higher-level protocols: I2O, NASD, VIA, IP, TCP
  • SMP and cluster parallelism is important.
Radical
  • Move the app to the NIC/device controller
  • even higher-level protocols: SOAP/DCOM/RMI...
  • Cluster parallelism is VERY important.

12
Intermediate Step Shared Logic
  • Brick with 8-12 disk drives
  • 200 mips/arm (or more)
  • 2 x Gbps Ethernet
  • General purpose OS
  • $10k/TB to $50k/TB
  • Shared
  • Sheet metal
  • Power
  • Support/Config
  • Security
  • Network ports
  • These bricks could run applications (e.g. SQL or
    Mail or..)

Snap: 1 TB, 12 x 80 GB NAS
NetApp: 0.5 TB, 8 x 70 GB NAS
Maxtor: 2 TB, 12 x 160 GB NAS
13
Example
  • Homogeneous machines lead to quick response
    through reallocation
  • HP desktop machines, 320 MB RAM, 3U high, 4 x 100 GB
    IDE drives
  • $4k/TB (street),
  • 2.5 processors/TB, 1 GB RAM/TB
  • JIT storage and processing: 3 weeks from order to
    deploy

Slide courtesy of Brewster Kahle @ Archive.org
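A quick sketch of how the per-TB figures follow from the box configuration on this slide:

    # One 3U box: 4 x 100 GB IDE drives, one processor, 320 MB RAM.
    tb_per_box = 4 * 100 / 1000           # 0.4 TB of disk per box
    boxes_per_tb = 1 / tb_per_box         # 2.5 boxes, hence 2.5 processors per TB
    ram_gb_per_tb = boxes_per_tb * 0.320  # ~0.8 GB, roughly the 1 GB RAM/TB quoted
    print(f"{boxes_per_tb} processors/TB, {ram_gb_per_tb:.1f} GB RAM/TB")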
14
What if Disk Replaces Tape? How does it work?
  • Backup/Restore
  • RAID (among the federation)
  • Snapshot copies (in most OSs)
  • remote replicas (standard in DBMS and FS)
  • Archive
  • Use the cold 95% of disk space
  • Interchange
  • Send computers not disks.

15
It's Hard to Archive a Petabyte. It takes a LONG
time to restore it.
  • At 1 GBps it takes 12 days! (arithmetic sketched below)
  • Store it in two (or more) places online: a
    geo-plex
  • Scrub it continuously (look for errors)
  • On failure,
  • use other copy until failure repaired,
  • refresh lost copy from safe copy.
  • Can organize the two copies differently
    (e.g. one by time, one by space)
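The 12-day figure is straightforward arithmetic; a sketch:

    # Restoring a petabyte over a 1 GBps pipe.
    petabyte = 1e15              # bytes
    rate = 1e9                   # bytes/second
    days = petabyte / rate / 86400
    print(f"{days:.1f} days")    # ~11.6 days, i.e. about 12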

16
Archive to Disk: 100 TB for $0.5M, plus 1.5 free
petabytes
  • If you have 100 TB active you need 10,000
    mirrored disk arms (see tpcC)
  • So you have 1.6 PB of (mirrored) storage (160GB
    drives)
  • Use the empty 95% for archive storage.
  • No extra space or extra power cost.
  • Very fast access (milliseconds vs hours).
  • Snapshot is read-only (software enforced ?)
  • Makes Admin easy (saves people costs)
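A sketch of the capacity arithmetic behind the free-space claim, counting the active 100 TB once against the raw 1.6 PB as the slide appears to:

    # 10,000 mirrored arms of 160 GB drives serving 100 TB of active data.
    arms = 10_000
    drive_tb = 0.160
    raw_pb = arms * drive_tb / 1000       # 1.6 PB of raw capacity
    active_pb = 0.100                     # 100 TB of active data
    empty_frac = 1 - active_pb / raw_pb   # ~94%, which the slide rounds to 95%
    free_pb = raw_pb - active_pb          # ~1.5 PB left over for archive
    print(f"{raw_pb:.1f} PB raw, {free_pb:.1f} PB free, {empty_frac:.0%} empty")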

17
Disk as Tape Archive
Slide courtesy of Brewster Kahle @ Archive.org
  • Tape is unreliable, specialized, slow, low
    density, not improving fast, and expensive
  • Using removable hard drives to replace tape's
    function has been successful
  • When a tape is needed, the drive is put in a
    machine and it is online. No need to copy from
    tape before it is used.
  • Portable, durable, fast, media cost comparable to
    raw tapes, dense. Unknown longevity, suspected good.

18
Disk as Tape Interchange
  • Tape interchange is frustrating (often
    unreadable)
  • Beyond 1-10 GB send media not data
  • FTP takes too long (hour/GB)
  • Bandwidth still very expensive ($1/GB)
  • Writing DVD not much faster than Internet
  • New technology could change this
  • 100 GB DVD @ 10 MBps would be competitive.
  • Write 1TB disk in 2.5 hrs (at 100MBps)
  • But, how does interchange work?
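A sketch comparing shipping the disk with pushing the bytes over the network, using the slide's rates and the $1/GB bandwidth price:

    # Moving 1 TB: write to a disk at 100 MBps vs FTP at ~1 hour/GB vs WAN at ~$1/GB.
    tb_bytes = 1e12
    write_hours = tb_bytes / 100e6 / 3600  # ~2.8 h to fill the disk (slide rounds to 2.5)
    ftp_days = (tb_bytes / 1e9) / 24       # ~42 days at an hour per GB
    wan_cost = (tb_bytes / 1e9) * 1        # ~$1,000 at $1/GB
    print(f"write {write_hours:.1f} h, ftp {ftp_days:.0f} days, WAN ${wan_cost:,.0f}")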

19
Disk As Tape Interchange What format?
  • Today I send 160GB NTFS/SQL disks.
  • But that is not a good format for Linux/DB2
    users.
  • Solution: Ship NFS/CIFS/ODBC servers (not disks)
  • Plug disk into LAN.
  • DHCP, then it is a file or DB server via a standard
    interface.
  • Pull data from the server.
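One way the receiving side of "plug in, DHCP, pull through a standard interface" might look, sketched with ODBC; the hostname brick01, the driver name, the credentials, and the documents table are all hypothetical:

    # Sketch: the shipped brick boots, gets an address via DHCP, and serves its data
    # through a standard interface (here ODBC); the recipient pulls rows instead of
    # reading the raw on-disk format.
    import pyodbc  # third-party ODBC bindings

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=brick01;DATABASE=archive;"
        "UID=reader;PWD=secret"
    )
    cur = conn.cursor()
    cur.execute("SELECT doc_id, payload FROM documents")
    while True:
        rows = cur.fetchmany(1000)      # stream in batches rather than all at once
        if not rows:
            break
        for doc_id, payload in rows:
            pass                        # write into the local store
    conn.close()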

20
Some Questions
  • What is the product?
  • How do I manage 10,000 nodes (disks)?
  • How do I program 10,000 nodes (disks)?
  • How does RAID work?
  • How do I backup a PB?
  • How do I restore a PB?

21
What is the Product?
  • Concept: Plug it in and it works!
  • Music/Video/Photo appliance (home)
  • Game appliance
  • PC
  • File server appliance
  • Data archive/interchange appliance
  • Web server appliance
  • DB server
  • eMail appliance
  • Application appliance

[The appliance plugs into just the network and power]
22
How Does Scale Out Work?
  • Files: well-known designs
  • rooted tree partitioned across nodes
  • Automatic cooling (migration)
  • Mirrors or Chained declustering
  • Snapshots for backup/archive
  • Databases: well-known designs
  • Partitioning, remote replication similar to files
  • distributed query processing.
  • Applications (hypothetical)
  • Must be designed as mobile objects
  • Middleware provides object migration system
  • Objects externalize methods to migrate
    (backup/restore/archive)
  • Web services seem to have key ideas (xml
    representation)
  • Example: the eMail object is a mailbox
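A sketch of the "Mirrors or Chained declustering" placement mentioned above: partition i's primary lives on node i and its backup on node (i+1) mod N, so a failed node's read load spreads across its neighbors; the node and partition counts here are illustrative:

    # Chained-declustering placement: primary on node i, backup on node (i+1) mod N.
    def place(partition_id: int, n_nodes: int) -> tuple[int, int]:
        primary = partition_id % n_nodes
        backup = (primary + 1) % n_nodes
        return primary, backup

    def readable_copies(partition_id: int, n_nodes: int, failed: set[int]) -> list[int]:
        """Nodes that can still serve this partition after failures."""
        return [n for n in place(partition_id, n_nodes) if n not in failed]

    if __name__ == "__main__":
        N = 8
        for p in range(N):
            print(p, place(p, N), readable_copies(p, N, failed={3}))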

23
Auto Manage Storage
  • 1980 rule of thumb
  • a DataAdmin per 10 GB, a SysAdmin per MIPS
  • 2000 rule of thumb
  • A DataAdmin per 5TB
  • SysAdmin per 100 clones (varies with app).
  • Problem
  • 5 TB is $50k today, $5k in a few years.
  • Admin cost >> storage cost!!!! (see the sketch below)
  • Challenge
  • Automate ALL storage admin tasks
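To see why admin cost swamps storage cost, a sketch that assumes a fully burdened admin cost of $100k/year; the salary is an assumption, not a number from the slide:

    # One DataAdmin per 5 TB vs the price of the 5 TB itself.
    admin_cost_per_year = 100_000   # assumed fully burdened cost of one admin
    storage_cost_today = 50_000     # 5 TB at today's prices (from the slide)
    storage_cost_soon = 5_000       # 5 TB in a few years (from the slide)
    print(admin_cost_per_year / storage_cost_today)  # ~2x storage cost today
    print(admin_cost_per_year / storage_cost_soon)   # ~20x in a few years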

24
Admins/TB and guessed $/TB (does not include
cost of application; overhead, not substance)
  • Google: 1 admin per 100 TB, $5k/TB/y
  • Yahoo!: 1 admin per 50 TB, $20k/TB/y
  • DB: 1 admin per 5 TB, $60k/TB/y
  • Wall St.: 1 admin per 1 TB, $400k/TB/y (reported)
  • Hardware is the dominant cost only @ Google.
  • How can we waste hardware to save people cost?

25
How do I manage 10,000 nodes?
  • You can't manage 10,000 x (for any x).
  • They manage themselves.
  • You manage exceptional exceptions.
  • Auto Manage
  • Plug & Play hardware
  • Auto load-balance placement of storage and
    processing
  • Simple parallel programming model
  • Fault masking

26
How do I program 10,000 nodes?
  • You can't program 10,000 x (for any x).
  • They program themselves.
  • You write embarrassingly parallel programs
  • Examples: SQL, Web, Google, Inktomi, HotMail, ...
  • PVM and MPI prove it must be automatic (unless
    you have a PhD)!
  • Auto Parallelism is ESSENTIAL
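A minimal example of the embarrassingly parallel style the slide calls for: independent partitions fanned out to workers with no cross-talk, results merged at the end; the word-count task itself is illustrative:

    # Embarrassingly parallel: each partition is processed independently.
    from multiprocessing import Pool

    def count_words(lines):
        return sum(len(line.split()) for line in lines)

    if __name__ == "__main__":
        partitions = [["the quick brown fox"] * 1000 for _ in range(16)]
        with Pool() as pool:
            counts = pool.map(count_words, partitions)  # no communication between tasks
        print(sum(counts))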

27
Summary
  • Disks will become supercomputers, so:
  • Lots of computing to optimize the arm
  • Can put app close to the data (better modularity,
    locality)
  • Storage appliances (self-organizing)
  • The arm/capacity tradeoff: waste space to save
    accesses.
  • Compression (saves bandwidth)
  • Mirrors
  • Online backup/restore
  • Online archive (vault to other drives or geoplex
    if possible)
  • Not "disks replace tapes": storage appliances
    replace tapes.
  • Self-organizing storage servers (file systems)
    (prototypes of this software exist)